ClipTagger-12b Playground
Upload or paste an image, then annotate using Inference.net
Max 4.5MB
JPEG · PNG · WebP · GIF
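The limits above (4.5MB, four image formats) can be checked locally before uploading. The following is a minimal sketch, not the playground's own validation: the function names `sniff_format` and `check_upload` are hypothetical, and reading "4.5MB" as 4.5 × 1024 × 1024 bytes is an assumption (the page does not say whether it means MB or MiB). Formats are detected from the standard file signatures rather than the file extension.

```python
from typing import Optional

# Assumption: the page's "4.5MB" is treated as 4.5 MiB.
MAX_BYTES = int(4.5 * 1024 * 1024)


def sniff_format(data: bytes) -> Optional[str]:
    """Return 'jpeg', 'png', 'gif', or 'webp' based on the file signature, else None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    # WebP is a RIFF container with the 'WEBP' tag at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def check_upload(data: bytes) -> str:
    """Raise ValueError if the image is too large or not an accepted format."""
    if len(data) > MAX_BYTES:
        raise ValueError(f"image is {len(data)} bytes; limit is {MAX_BYTES}")
    fmt = sniff_format(data)
    if fmt is None:
        raise ValueError("expected JPEG, PNG, WebP, or GIF")
    return fmt
```

Typical use would be `check_upload(open(path, "rb").read())` before sending the file anywhere.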
ClipTagger-12b is a 12B-parameter vision-language model for scalable video understanding. It outputs schema-consistent JSON per frame and matches the accuracy of frontier closed models at roughly 17x lower cost.
Blog post
Docs: Video understanding
Model card (HF)
Serverless API
GitHub
Drop an image here, click Upload, or press ⌘/Ctrl+V to paste
Video (5-frame) Annotator
Drop a video here or click Upload
MP4 · WebM · MOV
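The page does not say how the five frames are chosen from an uploaded video. One plausible approach is even spacing, sketched below under that assumption: the video is split into five equal segments and the middle frame of each is taken. The function name `sample_frame_indices` is hypothetical, and the total frame count is assumed to be known already (for example from a video library of your choice).

```python
from typing import List


def sample_frame_indices(total_frames: int, count: int = 5) -> List[int]:
    """Pick `count` evenly spaced frame indices from a video of `total_frames` frames.

    Assumption: even spacing at segment midpoints; the playground may sample differently.
    """
    if total_frames < 0 or count <= 0:
        raise ValueError("total_frames must be >= 0 and count must be > 0")
    if total_frames <= count:
        # Short clips: use every available frame.
        return list(range(total_frames))
    # Midpoint of the i-th of `count` equal segments, in integer arithmetic.
    return [((2 * i + 1) * total_frames) // (2 * count) for i in range(count)]


print(sample_frame_indices(100))  # [10, 30, 50, 70, 90]
```

Sampling at segment midpoints avoids the first and last frames, which are often black or transitional in real footage.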