Implementation of a GPT agent with vision using an image-captioning model
In the era before large vision-language models like GPT-4o, I gave an LLM agent a form of vision by inserting the output of an image-captioning model (BLIP) into its dialogue context. To make the model's response latency feel natural, I adopted an interface that mimics a Zoom video call.
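The caption-injection idea can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name, message format, and bracket convention are my assumptions. In practice the caption string would come from a BLIP model (e.g. via Hugging Face `transformers`) run on the current camera frame, and the resulting messages would be sent to the LLM's chat API.

```python
def build_vision_messages(caption: str, user_text: str, history=None):
    """Wrap a BLIP caption and the user's utterance into chat messages.

    Illustrative sketch: the caption of the current camera frame is
    prepended to the user's words so a text-only LLM can "see".
    """
    messages = list(history or [])
    if not messages:
        messages.append({
            "role": "system",
            "content": (
                "You are a conversational agent on a video call. A caption "
                "describing what your camera currently sees is given in "
                "brackets before each user message."
            ),
        })
    messages.append({
        "role": "user",
        "content": f"[camera sees: {caption}] {user_text}",
    })
    return messages


# Example: the caption would normally be produced by BLIP each time
# the user speaks, then the messages passed to the LLM.
msgs = build_vision_messages("a person waving at a laptop", "Hi, can you see me?")
```

Because the LLM only ever receives text, no model changes are needed; the trade-off is that the agent can only "see" what the captioning model chooses to describe.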