Question: People often look at the number of trainable parameters (weights) as a proxy for LLM complexity (and memory requirements).

Isn't the number of attention heads also an important factor to consider when evaluating language models?
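
For concreteness, here is a minimal sketch (under the common assumption of standard multi-head attention where `d_model` is split evenly across heads, i.e. `head_dim = d_model // n_heads`; sizes are hypothetical). It suggests the head count doesn't even show up in the parameter total, which is part of why I'm wondering whether it needs to be considered separately:

```python
def attention_params(d_model: int, n_heads: int) -> int:
    """Count weights in one standard multi-head attention block (biases omitted)."""
    head_dim = d_model // n_heads
    # Q, K, V projections: each maps d_model -> n_heads * head_dim (= d_model)
    qkv = 3 * d_model * (n_heads * head_dim)
    # Output projection maps the concatenated heads back to d_model
    out = (n_heads * head_dim) * d_model
    return qkv + out

for n_heads in (8, 16, 32):
    print(n_heads, attention_params(d_model=4096, n_heads=n_heads))
# All three print the same total: under this parameterization the head count
# changes how the projections are split, not how many weights they contain.
```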