Google Gemini vs YouTube Real-Time GPT-4 Vision Demo


After Google released its impressive hands-on demo video for Gemini, the demo turned out to be a little too good to be true. Now someone has recreated it with GPT-4 Vision, accomplishing in real time what Gemini only appeared to do in its video.

Google’s Gemini large language model (LLM) is the company’s most powerful suite of AI models to date, and its biggest shot yet at OpenAI’s GPT-4. In an attempt to show off just how capable its multimodal LLM is, Google released a hands-on video of Gemini supposedly responding to voice prompts in real time. Initially, the demo was pretty impressive, but viewers eventually discovered a disclaimer stating that latency had been reduced and Gemini’s outputs shortened for brevity.

While those disclosures make the demo a little less impressive, it was the realization that Gemini wasn’t actually responding to speech in real time, as Google had implied, that turned it into a real egg-on-the-face moment for the company. Google admitted to Bloomberg that Gemini wasn’t responding to voice prompts in real time but to text prompts. To address the criticism, Gemini co-lead Oriol Vinyals later explained that Gemini has all the capabilities needed for this kind of interaction, and that the video was meant to show what “multimodal user experiences built with Gemini could look like.”

While the damage has been done, a YouTuber has now added insult to injury. The YouTube channel Greg Technology has published a video recreating Gemini’s demo with GPT-4 Vision. Unlike Google’s hands-on video, this one was actually done in real time with voice prompts.

In the video, GPT-4 is asked to recognize hand signs, identify a game the host is playing with his hands, and describe a drawing. While not as polished or as quick as what was shown in the Gemini demo, it genuinely responds in real time.
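The video itself doesn’t show its code, but a demo like this boils down to repeatedly capturing a webcam frame, pairing it with a transcribed voice prompt, and sending both to GPT-4 Vision as one multimodal chat message. Below is a minimal sketch of that packaging step, assuming the OpenAI Chat Completions message format for image inputs; the function name is illustrative, not from the demo.

```python
import base64


def build_vision_message(frame_jpeg: bytes, question: str) -> dict:
    """Package one webcam frame (JPEG bytes) plus a transcribed voice
    prompt into a single multimodal chat message. The resulting dict
    follows the Chat Completions format for mixed text/image content,
    with the frame inlined as a base64 data URL."""
    b64 = base64.b64encode(frame_jpeg).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
            },
        ],
    }
```

In a live loop, each such message would be passed to the chat completions endpoint (e.g. via the OpenAI Python SDK) and the reply read aloud with text-to-speech, which is roughly the shape of what the Greg Technology video demonstrates.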