Wow, this looks amazing:
It comes as a 2B and a 4B model (though the 2B in q8 is over 4 GB, so I’m not sure exactly what they mean by "2B"). Anyway, it’s a multimodal model for end-user devices like mobile phones. It understands images, text, audio, and video!
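A quick back-of-the-envelope check on that size puzzle. This is only a rough sketch: it assumes q8 means about 8 bits per weight, ignores quantization scales and file metadata, and the function names are made up for illustration.

```python
def expected_size_gb(params_billions: float, bits_per_weight: float = 8.0) -> float:
    """Approximate file size in GB for a given parameter count."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


def implied_params_billions(file_size_gb: float, bits_per_weight: float = 8.0) -> float:
    """Approximate parameter count (in billions) implied by a file size."""
    total_bits = file_size_gb * 1e9 * 8  # GB -> bytes -> bits
    return total_bits / bits_per_weight / 1e9
```

Under those assumptions, `expected_size_gb(2)` comes out at about 2 GB, while a file over 4 GB implies more than 4B stored weights, so the "2B" presumably counts only part of what is in the file.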
If you have little RAM and you only want text, then you don’t load the vision and audio parameters.
You can also keep the PLE (per-layer embedding) parameters out of RAM on fast storage, meaning the phone’s flash storage.
(Thanks to stick for the explanation.)
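To make the "keep it on storage instead of RAM" idea concrete, here is a minimal sketch using a memory-mapped file. It is not how the actual runtime does it; the table sizes, file layout, and function names are all hypothetical. The point is just that an embedding table can sit on disk and only the rows for the current tokens get read.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
VOCAB_SIZE = 1000
EMBED_DIM = 64


def write_embedding_table(path: str) -> np.ndarray:
    """Write a random float16 embedding table to disk and return it."""
    rng = np.random.default_rng(0)
    table = rng.standard_normal((VOCAB_SIZE, EMBED_DIM)).astype(np.float16)
    table.tofile(path)  # raw row-major dump, no header
    return table


def lookup_rows(path: str, token_ids: list[int]) -> np.ndarray:
    """Fetch embedding rows for token_ids without loading the whole table."""
    # np.memmap maps the file into the address space; the OS pages data in
    # from storage only when those parts of the file are actually touched.
    table = np.memmap(path, dtype=np.float16, mode="r",
                      shape=(VOCAB_SIZE, EMBED_DIM))
    # Indexing with an integer array copies just the requested rows into RAM.
    return np.array(table[np.asarray(token_ids)])
```

Calling `lookup_rows(path, [3, 7, 42])` returns a `(3, 64)` array, and the rest of the table stays on storage unless something else touches it.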
