LLM Accuracy: My Experiment with Five Models

Venkatarangan Thirumalai
3 min read · May 24, 2024


Comparing AI capabilities on spotting a currency conversion error in a Taylor Swift article. Spoiler: ChatGPT-4o shines, while others lag behind for now.

I recently put five AI models to a simple but practical test: ChatGPT-4o, Gemini, Copilot, Llama 3, and ChatGPT-4. I had spotted a currency conversion error in a newspaper article about Taylor Swift, and I wanted to see which model could identify it. The error was a significant one: GBP 1 billion had been converted to INR 100 crores. At roughly 100 rupees to the pound, the correct figure is about INR 10,000 crores, so the printed number was off by a factor of about 100. I was curious to see how the different models would handle it.
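The arithmetic behind the error can be checked in a few lines. The sketch below is only an illustration and was not part of the experiment; the exchange rate of 105 INR per GBP is an assumed round figure for May 2024, and the function name `gbp_to_inr_crore` is hypothetical.

```python
# Sanity check of the conversion discussed above.
# ASSUMPTION: roughly 105 INR per GBP (an approximate rate, not taken from the article).
ASSUMED_INR_PER_GBP = 105.0
RUPEES_PER_CRORE = 10_000_000  # 1 crore = 10 million


def gbp_to_inr_crore(amount_gbp: float, inr_per_gbp: float = ASSUMED_INR_PER_GBP) -> float:
    """Convert an amount in GBP to INR, expressed in crores."""
    return amount_gbp * inr_per_gbp / RUPEES_PER_CRORE


correct_crore = gbp_to_inr_crore(1_000_000_000)  # GBP 1 billion
printed_crore = 100                              # figure printed in the newspaper

print(f"Expected: about INR {correct_crore:,.0f} crore")
print(f"Printed figure is off by a factor of about {correct_crore / printed_crore:,.0f}")
```

Whatever exact rate is used, the result lands near INR 10,000 crores, two orders of magnitude above the printed figure.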

The source news article I had scanned. Courtesy: Times of India, Chennai edition, 23 May 2024

The Results

ChatGPT-4o stood out by correctly identifying the mistake, which makes it the best of the five models I tested. What amazed me was that it caught the important mistake and left out the inconsequential ones: it pointed out the conversion error but ignored minor discrepancies elsewhere.

Google’s Gemini has the technical capabilities to catch up, but Google seems to be playing it safe for now. Llama 3, being a text-only model, required a bit of extra effort: I had to OCR the scanned image of the article and feed the resulting text into the model. Llama 3 showed promise, but it didn’t catch the important error. Microsoft Copilot, though based on OpenAI’s ChatGPT-4, and ChatGPT-4 itself both focused on less important errors.

ChatGPT-4o nailed it!
Google Gemini, Copilot, ChatGPT-4, and Llama 3 missed the error

Why ChatGPT-4o’s Performance is Awesome

The accuracy and efficiency of ChatGPT-4o in identifying the currency conversion error are impressive. Such capabilities make the model more useful for everyday tasks and can save a lot of time and effort.

The Future with AI Models and Fact-Checking Agents

Looking ahead, attaching these AI models to agents for more rigorous fact-checking could be incredibly beneficial. Such agents could help automate the verification of information and reduce the likelihood of errors in published content, leading to greater accuracy and reliability in a range of applications.

It’s fascinating to see how differently these AI models perform and how the tech giants are approaching AI. While LLMs have been handling such tasks fairly well over the past year and a half, my practical experience shows that there’s still room for improvement and innovation. And for now, ChatGPT-4o seems to be leading the pack.
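As a toy illustration of the automated checking mentioned above, here is a minimal rule-based sketch. It is not an LLM agent and not something I ran in this experiment; the regular expression, the assumed exchange rate, the tolerance, the function name `find_suspect_conversions`, and the sample sentences are all hypothetical choices made for the example.

```python
import re

ASSUMED_INR_PER_GBP = 105.0    # ASSUMPTION: approximate rate, not taken from the article
RUPEES_PER_CRORE = 10_000_000  # 1 crore = 10 million

# Illustrative pattern for phrases such as "GBP 1 billion (INR 100 crores)".
PAIR_PATTERN = re.compile(
    r"GBP\s*(\d[\d,]*(?:\.\d+)?)\s*billion\D{0,40}?INR\s*(\d[\d,]*(?:\.\d+)?)\s*crores?",
    re.IGNORECASE,
)


def find_suspect_conversions(text, inr_per_gbp=ASSUMED_INR_PER_GBP, tolerance=0.25):
    """Return (matched text, stated crores, expected crores) for doubtful conversions."""
    suspects = []
    for match in PAIR_PATTERN.finditer(text):
        gbp_billion = float(match.group(1).replace(",", ""))
        stated_crore = float(match.group(2).replace(",", ""))
        expected_crore = gbp_billion * 1e9 * inr_per_gbp / RUPEES_PER_CRORE
        # Flag the claim only if it deviates from the expected value by more than the tolerance.
        if abs(stated_crore - expected_crore) > tolerance * expected_crore:
            suspects.append((match.group(0), stated_crore, expected_crore))
    return suspects


# Invented sample sentences, not quotes from the newspaper.
wrong = "The tour added GBP 1 billion (INR 100 crores) to the economy."
right = "The tour added GBP 1 billion (INR 10,000 crores) to the economy."

print(find_suspect_conversions(wrong))  # one flagged conversion
print(find_suspect_conversions(right))  # nothing flagged
```

A check like this only covers one narrow pattern; the appeal of pairing it with an LLM is that the model can find the claims and the deterministic code can verify the numbers.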

Originally published at https://venkatarangan.com on May 24, 2024.



Venkatarangan Thirumalai

A Founder Catalyst and a Microsoft Regional Director (Honorary).