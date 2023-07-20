



A paper published Monday in the scientific journal Nature by Google and its DeepMind division highlights that most applications of artificial intelligence in medicine are, broadly speaking, language-based.

Their program, MedPaLM, is a ChatGPT-like large language model tuned to answer questions from a variety of medical datasets. Those include a new dataset created by Google that represents the questions consumers ask about their health on the Internet. That dataset, HealthSearchQA, consists of “3,173 frequently searched consumer questions” “generated by search engines,” such as “How serious is atrial fibrillation?”

The researchers used prompt engineering, an increasingly important area of AI research. In prompt engineering, the program is given carefully selected examples of inputs paired with the desired outputs.
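As a rough illustration of the idea, and not the prompt format used in the paper, the following Python sketch assembles a few-shot prompt from example input/output pairs. The function name, the "Question:"/"Answer:" layout, and the placeholder demonstrations are all assumptions made for this example.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Concatenate worked question/answer pairs, then append the new question.

    The "Question:/Answer:" layout is a hypothetical convention for this
    sketch, not the format described in the paper.
    """
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # The final block is left open so the model completes the answer.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Placeholder demonstrations (hypothetical; real ones would be written by clinicians).
examples = [
    ("<example consumer question 1>", "<clinician-written answer 1>"),
    ("<example consumer question 2>", "<clinician-written answer 2>"),
]

prompt = build_few_shot_prompt(examples, "How serious is atrial fibrillation?")
print(prompt)
```

The resulting string would be sent to a language model as its input; the model sees the pattern in the demonstrations and continues it for the final question.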

Notably, the MedPaLM paper follows the recent trend by Google and OpenAI of hiding the technical details of a program rather than specifying them, as has been standard practice in machine learning research.

Google’s MedPaLM is built on top of Flan-PaLM, a version of the company’s PaLM language model, with the help of human prompt engineering.

The MedPaLM program has made significant progress in answering HealthSearchQA questions, as judged by a panel of human clinicians. The share of its answers that matched the medical consensus reached 92.6%, well above the 61.9% score for a variant of Google’s PaLM language model and just short of the 92.9% average for human clinicians.

However, when a group of laypeople with a medical background was asked to rate how well MedPaLM answered each question, that is, whether the answer helps consumers draw conclusions, MedPaLM was judged helpful 80.3% of the time, compared with 91.1% for human physician responses. The researchers interpret this to mean that “significant work is needed to approach the quality of outcomes provided by human clinicians.”

The paper, “Large language models encode clinical knowledge,” by lead author Karan Singhal of Google and colleagues, focuses on using so-called prompt engineering to make MedPaLM superior to other large language models.

MedPaLM was derived from PaLM using question-answer pairs provided by five clinicians in the US and UK. Those question-answer pairs, just 65 examples, were used to train MedPaLM through a series of prompt engineering strategies.

A typical way to improve a large language model such as PaLM or OpenAI’s GPT-3 is to feed it “massive in-domain data,” but “this approach is difficult given the paucity of medical data,” Singhal and team note. Instead, MedPaLM relies on three prompting strategies.

MedPaLM significantly outperforms Flan-PaLM in human evaluation, but still falls short of the capabilities of human clinicians.

Google/DeepMind

Prompting is the practice of improving model performance “through a small number of demonstration samples encoded as prompt text in the input context.” The three prompting approaches are as follows. The first is few-shot prompting, which serves to “explain the task through a text-based demonstration.” The second is so-called chain-of-thought prompting, which involves “enhancing each of the few shots of examples in the prompt with step-by-step breakdowns and a consistent series of intermediate reasoning steps toward the final answer.” The third is self-consistency prompting, in which several outputs from the program are sampled and the final answer is chosen by majority vote.
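The majority-vote step of self-consistency can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample` stands in for a call to a language model, the "Answer:" marker is a hypothetical output convention, and the canned completions are invented so the example runs without a model.

```python
import itertools
from collections import Counter
from typing import Callable


def extract_answer(completion: str) -> str:
    """Return the text after the last 'Answer:' marker (hypothetical convention)."""
    return completion.rsplit("Answer:", 1)[-1].strip()


def self_consistent_answer(sample: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Sample n completions and return the most frequent final answer."""
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


# Stand-in for a stochastic model: cycles through canned chain-of-thought
# completions (invented for this sketch) that disagree on the final answer.
_canned = itertools.cycle([
    "Step 1: ... Step 2: ... Answer: B",
    "Step 1: ... Step 2: ... Answer: B",
    "Step 1: ... Step 2: ... Answer: C",
])


def fake_sample(prompt: str) -> str:
    return next(_canned)


result = self_consistent_answer(fake_sample, "<few-shot chain-of-thought prompt>", n=5)
print(result)
```

The design relies on each sampled completion reaching its answer through a possibly different reasoning path; voting over the final answers discards paths that went astray, at the cost of several model calls per question.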

The improvement in MedPaLM’s scores indicates that “instruction prompt tuning is a data- and parameter-efficient alignment technique that helps improve factors related to accuracy, factuality, consistency, safety, harm, and bias, helps bridge the gap with clinical experts, and brings these models closer to real-world clinical application,” they write.

However, “these models fall short of the clinician’s expert level for many clinically important axes,” they conclude. Singhal and team suggest expanding the participation of human experts.

“Our results were based on only a single clinician or layperson who assessed each response, which limited the number of model responses evaluated and the clinicians and laypersons who assessed them,” they observe. “This could potentially be mitigated by including a fairly large and deliberately diverse set of human raters.”

Despite MedPaLM’s shortfall, Singhal and team conclude, “Our results suggest that the strong performance in answering medical questions may be an emergent ability of LLMs combined with effective instruction prompt tuning.”

