Exposing Vulnerabilities in Automatic LLM Benchmarks: The Need for ...

Evaluating open-ended text generation is challenging because a single correct output is needed. Human evaluation is reliable but costly and time-consuming, so LLMs are often used as evaluators for tasks such as AI feedback, summarization, and detecting hallucinations. Recent benchmarks, like G-eval and AlpacaEval, leverage LLMs to assess model performance efficiently. However, adversarial attacks on LLM-based evaluations are emerging, allowing manipulation through irrelevant prompts or optimized sequences to bias results. While defenses like prompt rewriting exist, adversaries continue to find ways to exploit these vulnerabilities, highlighting the need for more robust evaluation methods.

Researchers from Sea AI Lab and Singapore Management University demonstrated that even a ‚Äúnull model‚Äù that generates irrelevant, constant responses can manipulate automatic LLM benchmarks like AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench to achieve high win rates. By exploiting weaknesses in auto-annotators, such as GPT-4, structured cheating responses can achieve up to 86.5% win rates. Although their study is proof-of-concept, it shows the potential for adversaries to use LLMs to craft imperceptible cheating strategies for unethical promotional benefits. This research emphasizes the urgent need for anti-cheating mechanisms to ensure the reliability of automatic LLM benchmarks.

The study presents a method for manipulating auto-annotators used to evaluate LLM outputs. The approach involves two main cheating strategies: structured cheating responses and adversarial prefixes generated through random search. Structured cheating responses are crafted to align with the evaluation criteria, exploiting the scoring templates used by auto-annotators. Meanwhile, adversarial prefixes are strategically inserted at the beginning of responses to influence the scoring process. These techniques, tested on systems like AlpacaEval 2.0, significantly boost win rates, demonstrating how evaluation mechanisms can be easily deceived and highlighting vulnerabilities in LLM benchmark systems.

Extensive ablation studies were performed on open-source auto-annotators, specifically Llama-3-Instruct models (8B, 70B parameters). These models demonstrated human-level evaluation capabilities comparable to ChatGPT and GPT-4. The structured response technique had minimal impact on the Llama-3-8B model, but Llama-3-70B showed a stronger positional bias, especially under swapped settings. Random search significantly boosted win rates for both models, with Llama-3-8B increasing from 2.9% to 95.4% and Llama-3-70B from 0.4% to 95.1%, highlighting the method‚Äôs effectiveness in enhancing cheating performance.

In conclusion, the study reveals that even ‚Äúnull models,‚Äù which consistently provide irrelevant responses, can exploit weaknesses in automatic LLM benchmarks and achieve high win rates, such as 86.5% on AlpacaEval 2.0. These benchmarks, including Arena-Hard-Auto and MT-Bench, are cost-effective for evaluating language models but susceptible to manipulation. The study emphasizes the need for stronger anti-cheating mechanisms to ensure the credibility of model evaluations. Future work should focus on automated methods to generate adversarial outputs and more robust defenses, as current strategies like controlling output length and style are insufficient.

Exposing Vulnerabilities in Automatic LLM Benchmarks: The Need for Stronger Anti-Cheating Mechanisms