Exposing Vulnerabilities in Automatic LLM Benchmarks: The Need for ...