Foreword:
Since everyone seemed to like the new “Papers of the Week” section, it will be continued this week. Other than that, I’ll keep the foreword short and we’ll jump straight into it!
Have a great day all and don’t forget to share the newsletter if you liked it,
Pascal
In this issue:
Summarization has been declared dead
The first LLM to be considered literate
Teaching LLMs to fact-check their own answers
1. Summarization is (Almost) Dead
Watching: Summarization (paper)
What problem does it solve? In a world with a seemingly endless flow of information, summarization is crucial for reducing complexity and avoiding information overload. Researchers, experts and decision makers of all kinds rely on some form of summarization on a regular basis, be it review papers, executive summaries or market reports. This paper explores a thought-provoking question: “Do we need human summaries?”
How does it solve the problem? The researchers recruited graduate students and had them write summaries of news articles, dialogues, cross-lingual texts and code. They then compared these human-written outputs to summaries generated by GPT-4. One pain point of GPT-4, and of every other generative LLM to date, is the risk of hallucinations, and the technology has been rightfully criticized for it. But a question that’s rarely asked is how prone humans are to the same kind of errors. The paper offers the perspective that humans sometimes struggle more with factuality than well-performing LLMs do.
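For a feel of how the GPT-4 side of such a comparison can be set up, here’s a minimal sketch, not the authors’ code: it assumes the openai Python client (v1+), an API key in the environment, and a hypothetical `articles` list standing in for the news articles.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(article: str) -> str:
    """Ask GPT-4 for a short abstractive summary of a single article."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep outputs stable for a fairer comparison
        messages=[
            {"role": "system", "content": "Summarize the article in 3-4 factual sentences."},
            {"role": "user", "content": article},
        ],
    )
    return response.choices[0].message.content

articles = ["<article text 1>", "<article text 2>"]  # placeholder inputs
llm_summaries = [summarize(a) for a in articles]
# These would then be judged for factuality side by side with the human-written summaries.
```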
What’s next? One factor that muddies the results for me, among many others, is that they didn’t recruit an expert group. I would’ve liked to see a comparison to professional writers or people familiar with the subjects they were writing about. The texts weren’t that complicated, so it’s fair to assume graduate students would do decently well on the task. Nevertheless, they are not in a position to write professional summaries.
2. Kosmos-2.5: A Multimodal Literate Model
Watching: Kosmos-2.5 (paper/code)
What problem does it solve? Multimodal LLMs have come far in just a few months. But one thing all LLMs still struggle with is dealing with structured formats that sit somewhere between text and image, such as markdown, LaTeX and tables. Earlier this year, Meta pushed the door open with their Nougat model. Now Microsoft storms into the room with Kosmos-2.5, leaving every other LLM behind when it comes to LaTeX and table understanding.
How does it solve the problem? Kosmos-2.5 uses a 2-step process in which the model first creates text blocks and assigns image coordinates to each of them. It then translates the text block images into markdown in order to capture the style and structure of the text. And all of this can be applied to multiple tasks and datasets without any further fine-tuning.
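To make the two representations concrete, here’s an illustrative toy example, not actual model output; the coordinate values and the document content are invented.

```python
# Step 1: text blocks anchored to image coordinates (x1, y1, x2, y2 bounding boxes)
text_blocks = [
    {"bbox": (40, 30, 580, 62), "text": "Quarterly Results"},
    {"bbox": (40, 90, 320, 118), "text": "Revenue  $1.2M"},
    {"bbox": (40, 122, 320, 150), "text": "Costs    $0.8M"},
]

# Step 2: the same content rendered as structure-preserving markdown
markdown_output = """\
# Quarterly Results

| Item    | Value |
|---------|-------|
| Revenue | $1.2M |
| Costs   | $0.8M |
"""
```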
What’s next? The results shown in the paper are impressive. The authors also compared their model to a commercial OCR solution on a popular OCR benchmark and found it to be on par with the vendor. Microsoft hasn’t released the code yet, but once that happens, the open-source community will try to validate the results. It will be important to evaluate the method thoroughly to better understand its limits.
3. Chain-of-Verification Reduces Hallucination in Large Language Models
Watching: CoVe (paper)
What problem does it solve? I hope you’re not tired of this term by now, but yet again, this paper addresses the problem of hallucinations. Or, more specifically, the problem that a model could often answer a question correctly in principle, but depending on the exact prompt used, it simply doesn’t. Researchers are trying out various systematic prompting schemes in order to reduce this output variability.
How does it solve the problem? Chain-of-Verification, or CoVe, tackles this with a multi-step process. First, a baseline response is generated; this draft can contain factually incorrect statements. Next, verification questions are derived from the query and the baseline response. These questions are then answered independently, and the answers are compared against the baseline response. Wherever there’s a mismatch, that part of the generation is excluded, so that only verified information is passed on to the final response.
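As a rough illustration of the loop, here’s a minimal sketch in Python; the prompts are mine, not the paper’s, and it assumes the openai client with GPT-4 as the underlying model.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single prompt to the model and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def chain_of_verification(query: str) -> str:
    # 1. Draft a baseline response (may contain hallucinations).
    baseline = ask(query)

    # 2. Plan verification questions that probe individual facts in the draft.
    plan = ask(
        f"Question: {query}\nDraft answer: {baseline}\n"
        "List short verification questions, one per line, that check each fact in the draft."
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # 3. Answer each verification question independently of the draft.
    verifications = "\n".join(f"Q: {q}\nA: {ask(q)}" for q in questions)

    # 4. Keep only the information that survived verification in the final answer.
    return ask(
        f"Question: {query}\nDraft answer: {baseline}\n"
        f"Verification Q&A:\n{verifications}\n"
        "Rewrite the draft answer, dropping anything contradicted by the verification Q&A."
    )
```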
What's next? The ideas of self-correction and self-analysis aren’t new, and there has been a lot of work going in this direction. Future studies will have to compare these different approaches and distill what’s actually working and what isn’t. There’s also always the risk of response degeneration, meaning that answers get worse when additional techniques are applied. This risk hasn’t been thoroughly investigated yet.
Papers of the Week:
Thanks for reading LLM Watch! Subscribe for free to receive new posts and support my work.
Best,
Pascal