Chatbot blunders reveal the limitations of this confounding new technology

Credit: Tinpixels / iStock / Getty Images Plus.

February 10, 2023

Evrim Yazgin

Evrim Yazgin has a Bachelor of Science majoring in mathematical physics and a Master of Science in physics, both from the University of Melbourne.

It’s taken just a few days for Google’s AI chatbot Bard to make headlines for the wrong reasons.

Google shared a GIF showing Bard answering the question: “What new discoveries from the James Webb Space Telescope can I tell my 9 year old about?” One of Bard’s answers – that the telescope “took the very first pictures of a planet outside of our own solar system” – is more artificial than intelligent.

A number of astronomers have taken to Twitter to point out that the first exoplanet image was taken in 2004 – 18 years before Webb began taking its first snaps of the universe.

Google’s embarrassment over this mistake is compounded by the fact that it was Bard’s first answer ever… and it was wrong! Bard is Google’s rushed answer to Microsoft-backed ChatGPT.

Both Bard and ChatGPT are powered by large language models (LLMs) – deep learning algorithms that can recognise and generate content based on huge amounts of data. The problem is that, sometimes, these chatbots simply make stuff up. There have even been reports that ChatGPT has produced made-up references.

The AI itself is not conscious, so these fabrications are not deliberate; nevertheless, they are called “hallucinations.” They are the result of the software trying to fill in gaps and trying to make things sound natural and accurate. It’s a well-known problem for LLMs and was even acknowledged by ChatGPT developer OpenAI in its release statement on November 30, 2022: “ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers.”

Experts say even the responses to the “successes” of artificial intelligence chatbots need to be tempered by an element of restraint.

In a paper published last week, University of Minnesota Law School researchers subjected ChatGPT to four real exams at the university. The exams were then graded blind. After answering nearly 100 multiple-choice questions and 12 essay questions, ChatGPT received an average score of C+ – a low but passing grade.

Lead author Professor Jonathan Choi wrote on Twitter that two of the three examiners caught on that the paper was written by a bot.

Another team of researchers put ChatGPT through the United States Medical Licensing Exam (USMLE) – a notoriously difficult series of three exams.

A pass grade for the USMLE is usually around 60 percent. The researchers found that ChatGPT, tested on 350 of the 376 public questions available from the June 2022 USMLE release, scored between 52.4 and 75.0 percent.

The authors claim in their research, published in PLOS Digital Health, that “ChatGPT produced at least one significant insight in 88.9% of all responses.” In this case, “significant insight” refers to something in the chatbot’s responses that is new, non-obvious, and clinically valid.

But Dr Simon McCallum, a senior lecturer in software engineering at New Zealand’s Victoria University of Wellington, says that ChatGPT’s performance isn’t even the most impressive among AI systems trained in medical settings.

Google’s Med-PaLM, a specialist arm of the chat tool Flan-PaLM, is another LLM focused on medical texts and conversations. “ChatGPT may pass the exam, but Med-PaLM is able to give advice to patients that is as good as a professional GP. And both of these systems are improving.”

Dr Collin Bjork, a senior lecturer in science communication and podcasting at Massey University, is much more circumspect, saying the “claim that ChatGPT can pass US medical exams is overblown and should come with a lengthy series of asterisks.”

Among these, Bjork notes that the authors’ claim that ChatGPT showed “insight” in its USMLE responses rests on a definition of “insight” that “is too vague to be useful.”

“The authors’ claims about ChatGPT’s insights and teaching potential are misleading and naïve,” Bjork adds.

Bjork and McCallum’s full reactions to the PLOS Digital Health paper can be found here.

Cosmos did not include comment from ChatGPT or Bard!