As AI systems began acing traditional tests, researchers realized those benchmarks were no longer tough enough. In response, ...