I don't like common AI benchmarks; they are usually academic, not real-life. My favorite large language model benchmark is this prompt:


"Here is an e-mail writing to staff that they are fired because of late delivery of the project and higher costs:


Dear staff,


it is with utmost "