Teaching to the Test: The Hidden Flaws in AI Rankings