Perhaps count vision and text tokens separately; they can then be unified in a single measure.