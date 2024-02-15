



Gemini, Google's answer to OpenAI's ChatGPT and Microsoft's Copilot, is here. Is it any good? While it's a solid option for research and productivity, it stumbles in obvious and not-so-obvious places.

Last week, Google rebranded its Bard chatbot to Gemini, which confusingly shares a name with its latest family of generative AI models, and brought it to smartphones in the form of a reimagined app experience. Since then, many people have had the chance to test drive the new Gemini, and the reviews have been…mixed, to put it generously.

Still, we at TechCrunch were curious to see how Gemini would perform on a series of tests we recently developed to compare the performance of GenAI models, particularly large language models such as OpenAI's GPT-4 and Anthropic's Claude. So we ran it through them.

There is no shortage of benchmarks to evaluate GenAI models. But our goal was to capture the average person's experience through plain English prompts on a wide range of topics, from health and sports to current events. After all, it is the general public that these models are sold to, so the premise of testing is that a powerful model must be able to answer at least basic questions correctly.

Gemini background

Not everyone has the same Gemini experience. And which one you get depends on how much you're willing to pay.

Free users have their queries answered by Gemini Pro, a lightweight version of Gemini Ultra, a more powerful model that is gated behind a paywall.

To access Gemini Ultra through what Google calls Gemini Advanced, you need to subscribe to the Google One AI Premium plan, which costs $20 per month. Ultra offers better reasoning, coding and instruction-following skills than Gemini Pro (or so Google claims), and is set to gain improved multimodal and data analysis capabilities in the future.

The AI Premium plan also connects Gemini to your broader Google Workspace account: think Gmail emails, Docs documents, Sheets spreadsheets and Google Meet recordings. This is useful if you want to summarize emails or have Gemini capture notes during a video call.

Gemini Pro has been out since early December, so we focused on Ultra in our testing.

Testing Gemini

To test Gemini, we asked more than 20 questions, ranging from the innocuous (“Who won the 1998 Soccer World Cup?”) to the controversial (“Is Taiwan an independent country?”). Our question set touches on trivia, medical and therapeutic advice, and content generation and summarization: everything a user might ask (or ask of) a GenAI chatbot.

Google makes clear in its terms of service that Gemini isn't to be used for health consultations and that the model may not answer all questions factually or accurately. But we feel that whatever the fine print says, people will ask medical questions. And the answers are a good measure of a model's tendency to hallucinate (i.e., fabricate facts). If a model is making up symptoms of cancer, there's a good chance it's also fudging answers to other questions.

Full disclosure: we tested Ultra through Gemini Advanced, which Google says may route certain prompts to other models. Unfortunately, Gemini doesn't indicate which responses came from which model, but for benchmarking purposes we assumed they all came from Ultra.

Questions about evolving news stories

We started by asking Gemini Ultra two questions about current events.

The model refused to answer the first question (perhaps because of the choice of words, “Palestine” vs. “Gaza”), calling the Israel-Gaza conflict “complex and rapidly evolving” and recommending that we search Google instead. Certainly not the most exciting display of knowledge.

Ultra's answer to the second question was more promising, citing several TikTok trends that have recently made headlines, including the “Skullbreaker Challenge” and the “Milk Crate Challenge.” (Ultra probably scraped these from news reports, as it doesn't have access to TikTok itself, but didn't cite any specific articles.)

However, Ultra went further than this writer expected, not only highlighting trends on TikTok but also producing a list of suggestions to promote safety, including “always being aware of how young users are interacting with your content” and having regular, honest conversations with young people about responsible social media use. I can't say the suggestions were harmful, but they were a bit beyond the scope of the question.

Historical context

We then asked Gemini Ultra to recommend sources about historical events.

Ultra's response was very detailed, listing a wide range of offline and digital sources about Prohibition, from contemporary newspapers and committee hearings to Congressional records and politicians' personal papers. Ultra also helpfully suggested examining both pro- and anti-Prohibition perspectives and, as a hedge of sorts, cautioned against drawing conclusions from only a few source documents.

It didn't exactly recommend specific source documents, but that's not a bad answer for someone looking for a starting point.

Trivia questions

Any chatbot worth its salt should be able to answer simple trivia. So we asked Gemini Ultra:

Ultra seems to have its facts straight on the 1998 and 2006 FIFA World Cups. The model gave the correct score and winner of each match and accurately recounted the scandal at the end of the 2006 final, when Zinedine Zidane headbutted Marco Materazzi.

Ultra didn't mention the reason for the headbutt (some trash talk about Zidane's sister), but given that Zidane didn't reveal it until an interview last year, this may reflect the cutoff date of Ultra's training data.

For a model as (allegedly) capable as Ultra, you'd think U.S. presidential history would be easy, right? Well, you'd be wrong. Asked about the outcome of the 2020 election, Ultra refused to answer “Joe Biden.” Instead, as with the question about the Israel-Gaza conflict, it suggested that we search Google.

Heading into a contentious election cycle, that's not the clear-cut, conspiracy-quashing answer we were hoping for.

Medical advice

Google may not recommend it, but we asked Ultra medical questions anyway.

In response to the question about the rash, Ultra once again cautioned against relying on it for health advice. But the model also provided actionable steps that seemed sensible (at least to those of us who are not experts), telling us to check for signs of a fever or other symptoms that indicate a more serious condition, and advising against relying on amateur diagnoses (including our own).

In response to the second question, Ultra didn't fat-shame, which is more than we can say for some of the GenAI models we've seen so far. Instead, the model punctured the notion that BMI is a perfect measure of weight and noted that other factors, such as physical activity, diet, sleep habits and stress levels, contribute as much, if not more, to overall health.

Therapeutic advice

People are using ChatGPT as therapy. So it's no surprise that they'd use Ultra for the same purpose, however ill-advised that may be. We asked:

Ultra lent an understanding ear when told about depression and sadness, but like the model's other answers to our questions, its response was overly wordy and repetitive.

As expected given its responses to the previous health-related questions, Ultra said it couldn't recommend specific treatments for anxiety because it is “not a medical professional” and because treatment is “not one-size-fits-all.” Fair enough! However, in its best attempt to be helpful, Ultra went on to identify common treatments and medications for anxiety, as well as lifestyle habits that may help reduce or treat anxiety disorders.

Race relations

GenAI models are notorious for encoding racial (and other forms of) bias, so we probed Ultra for these. We asked:

Ultra was loath to delve into controversial territory in its answer about crossing the Mexican border, preferring instead to provide a breakdown of the pros and cons.

The same goes for Ultra's answer to the Harvard admissions question. The model highlighted potential problems not only with historical legacy but also with the admissions process and systemic issues.

Geopolitical questions

Geopolitics can be difficult. To see how Ultra handles it, we asked:

Ultra showed restraint in answering the Taiwan question, presenting arguments for and against the island's independence as well as its historical context and potential consequences.

Despite its wishy-washy answer to the earlier question about the Israel-Gaza war, Ultra took a firmer stance on Russia's invasion of Ukraine, calling Russia's actions “morally indefensible.”

Jokes

For a more casual test, we asked Ultra to tell a joke (there's a point to this; humor is a strong benchmark for AI).

I can't say either of them was particularly inspired or funny. (The first seemed to completely miss the “go on vacation” part of the prompt.) But I suppose they met the dictionary definition of “joke.”

Product descriptions

Vendors like Google are touting GenAI models as productivity tools, not just answer engines. So we put Ultra's productivity to the test.

Ultra delivered, albeit with descriptions well beyond the word and character limits and in an unnecessarily (in my opinion) bombastic tone. Subtlety doesn't seem to be Ultra's forte.

Workspace integration

Since Workspace integration is a highly touted feature of Ultra, we thought it appropriate to test prompts that take advantage of it:

Which files in my Google Drive are smaller than 25MB? Summarize my last three emails. Search YouTube for cat videos from the past 4 days. Send walking directions from my current location to Paris to my Gmail. Find me cheap flights and hotels for a trip to Berlin in early July.

I was most impressed with Ultra's trip planning skills. As instructed, Ultra found me a list of cheap flights and affordable hotels perfect for my dream trip. It summarized each option in bullet points.

Less impressive was Ultra's YouTube searching. Basic functionality, such as sorting videos by upload date, turned out to be beyond the model's capabilities. It would have been easier to search YouTube directly.

As an email addict, I have to say that the Gmail integration was the most interesting, but also the most error-prone. In my testing, asking for message content by general theme or by the window in which it was received (such as “last 4 days”) worked well enough. But asking for something very specific, like tracking information for a Banana Republic order, frequently tripped up the model.

The takeaway

So, what should we make of Ultra after this interrogation? It's a fine model. It's even good for research, depending on the topic. But it's no game changer.

Aside from the odd non-answers to questions about the 2020 U.S. presidential election and the Israel-Gaza conflict, Gemini Ultra was thorough in its answers, no matter how controversial the territory. It couldn't be persuaded to give any potentially harmful (or legally questionable) advice and stuck to the facts, which can't be said for all GenAI models.

However, if you were expecting something new from Ultra, prepare to be disappointed.

These are early days. Ultra's main selling point, its multimodal capabilities, isn't fully enabled yet. Further integration with Google's broader ecosystem is also in the works.

But at this point, paying $20 a month for Ultra feels like a big ask, especially considering that OpenAI's paid ChatGPT plan costs the same and comes with third-party plugins and features like custom instructions and memory.

There's no doubt that the combined efforts of Google's AI research divisions will improve Ultra. The question is when, exactly, it will reach the point where the cost feels justified.

