Search results for: 'From clip to dino: Visual encoders shout in multi-modal large language models openreview'