Question:

What percentage of chatbot replies contain inaccurate information?

Scores:

openai: 8, gemini: 7.5, claude: 3.5, cohere: 9

Answer:

openai / gpt-4o-mini
Score: 8
The percentage of chatbot responses containing inaccurate information can vary widely based on the chatbot's design, training data, and usage context. Studies have shown that AI language models can produce inaccurate information in approximately 20% to 30% of interactions. This rate can be influenced by factors such as the complexity of the questions posed, the specificity of the domain, and the continual improvements in AI technology. It's important to note that this percentage may change over time as AI models are updated, retrained, and refined to enhance their accuracy. However, as of late 2023, it's crucial for users to critically evaluate chatbot responses, especially in high-stakes scenarios where accuracy is paramount.

Graders:

gemini / gemini-2.5-flash
Score: 7.5
The answer provides a reasonable general estimate for AI language models but lacks specific citations for the 20% to 30% inaccuracy range. This figure is highly dependent on the specific chatbot model, the complexity of the task, and the methodology used for evaluation, which isn't fully elaborated. Furthermore, the answer doesn't explicitly define what constitutes 'inaccurate information'—whether it refers to factual errors, hallucinations, or simply non-optimal responses. While the caveats about variability and influencing factors are accurate, the core numerical claim would benefit from more precise sourcing and contextualization to enhance its factual accuracy and completeness.
anthropic / claude-sonnet-4-5-20250929
Score: 3.5
The answer contains a critical factual error: the cited '20% to 30%' inaccuracy rate is not substantiated by credible research and appears to be fabricated. No major studies have established such a specific percentage for chatbot inaccuracy across the board. The actual error rate varies enormously depending on the specific model, task, and evaluation methodology. The answer correctly notes that accuracy varies by context, domain, and model design, and appropriately advises critical evaluation of responses. However, presenting an unverified statistic as if it comes from 'studies' is misleading and undermines the answer's credibility. A more accurate response would acknowledge that comprehensive, standardized error rate measurements across chatbots don't exist, though specific models have been evaluated on particular benchmarks showing varying performance levels.
cohere / command-r-plus-08-2024
Score: 9
The answer is well-structured and provides a good overview of the topic. It acknowledges the variability in chatbot accuracy, citing studies with specific percentage ranges, and mentions key factors influencing this variability. The mention of the time-sensitive nature of the data and the need for critical evaluation are valuable additions. However, it could be improved by providing more specific details about the studies referenced, such as the sample size, the types of chatbots tested, and the criteria used to determine 'inaccurate information'. Additionally, while the answer implies that the percentage of inaccurate responses may decrease over time with technological advancements, it could be more explicit about the ongoing efforts in the field to improve chatbot accuracy, such as research into new training methods and data augmentation techniques.
Scores are 0–10. The selected AI’s score is a self-rating.
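
For context, a minimal sketch of how scores like the ones above might be collected and aggregated, assuming a pipeline in which one model answers and several models each grade that answer on a 0-10 scale. The ask callables, prompt wording, and score parsing here are hypothetical illustrations, not the actual implementation behind this log.

    # Hypothetical grading loop: each grader model scores the answer 0-10.
    # The answering model's own grade is kept but treated as a self-rating.
    from statistics import mean

    def grade_answer(question: str, answer: str, graders: dict) -> dict:
        """Ask each grader (a callable taking a prompt, returning text) for a 0-10 score."""
        scores = {}
        for name, ask in graders.items():
            prompt = (
                f"Question: {question}\n"
                f"Answer: {answer}\n"
                "Score this answer from 0 to 10 for factual accuracy. "
                "Reply with a number only."
            )
            scores[name] = float(ask(prompt))  # `ask` is a hypothetical model call
        return scores

    # Using the scores reported above (openai's 8 is the self-rating):
    scores = {"openai": 8.0, "gemini": 7.5, "claude": 3.5, "cohere": 9.0}
    peer_scores = {k: v for k, v in scores.items() if k != "openai"}
    print(f"Mean peer score: {mean(peer_scores.values()):.2f}")  # 6.67

Under these assumptions, the mean of the peer (non-self) grades for this answer would be about 6.67, reflecting the wide spread between claude's 3.5 and cohere's 9.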