I don't like common AI benchmarks: they are usually academic and don't reflect real-life use. My favorite large language model benchmark:
"Here is an e-mail writing to staff that they are fired because of late delivery of the project and higher costs:
Dear staff,
it is with utmost "