🔥 Microsoft drops Florence-2: Vision foundation model that slays! 🚀 All models are released on the Hugging Face Hub. Learn more 👉
- 230M & 770M param models crush specialists in captioning, detection & more 💪
- 230M model beats Flamingo 80B (~350x bigger!) in zero-shot 🤯
- Trained on FLD-5B: 5.4B annotations, 126M images 📊
- Fine-tuned: SOTA in captioning, VQA, referring expressions 🏆
- Excel in captioning, object detection, segmentation, VQA & more 🎨🔍❓
- Leverage multi-task learning on massive FLD-5B dataset 💡
- Beat larger models like PaLI, PaLI-X in specialist tasks 🥊
- Available in 230M & 770M param versions for all 🤗
🌟 Florence-2 is clearly a unified vision representation powerhouse! 🦾
🙌 Kudos to Microsoft for advancing vision foundation models, and 👏 for open-sourcing them!