Question: People often look at the number of trainable parameters (weights) as a proxy for LLM complexity (and memory requirements).

Isn't the number of attention heads also an important hyperparameter to consider when evaluating language models?
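For context on the parameter-counting side, here is a minimal sketch (plain Python, hypothetical d_model = 4096, and assuming the common convention that each head has dimension d_model / n_heads): under that convention the Q/K/V/output projection matrices keep the same shape no matter how many heads they are split into, so the head count changes how the computation is organized rather than the weight total.

```python
# Rough parameter count for a standard multi-head attention block.
# Assumes the usual convention d_head = d_model // n_heads; biases omitted.

def attention_params(d_model: int, n_heads: int) -> int:
    assert d_model % n_heads == 0, "heads must evenly divide d_model"
    d_head = d_model // n_heads
    # Q, K, V projections: each maps d_model -> n_heads * d_head (= d_model)
    qkv = 3 * d_model * (n_heads * d_head)
    # Output projection maps the concatenated heads back to d_model
    out = (n_heads * d_head) * d_model
    return qkv + out

if __name__ == "__main__":
    # Hypothetical d_model = 4096; the total is identical for every head count.
    for heads in (8, 16, 32, 64):
        print(f"{heads:>2} heads -> {attention_params(4096, heads):,} parameters")
```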
