ClipTagger-12b Playground

Upload or paste an image, then annotate it using Inference.net

Grass × Inference
ClipTagger-12b is a 12B-parameter vision-language model for scalable video understanding. It outputs schema-consistent JSON for every frame and matches the accuracy of frontier closed models at roughly 17x lower cost.
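
Outside the playground UI, the same per-frame annotation can be requested programmatically. The sketch below assumes an OpenAI-compatible chat-completions endpoint; the base URL, model id, environment variable name, and prompt are illustrative assumptions, not values documented on this page.

```python
# Minimal sketch: annotate a single image with ClipTagger-12b over an
# OpenAI-compatible chat-completions API. Base URL, model id, and prompt
# are assumptions for illustration, not documented values.
import base64
import json
import os

import requests

API_BASE = "https://api.inference.net/v1"    # assumed endpoint
MODEL_ID = "inference-net/cliptagger-12b"    # assumed model id
API_KEY = os.environ["INFERENCE_API_KEY"]    # assumed env var

def annotate_image(path: str) -> dict:
    """Send one image frame and return the model's JSON annotation."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Annotate this frame and respond with JSON only."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }
        ],
        "response_format": {"type": "json_object"},
    }
    resp = requests.post(
        f"{API_BASE}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["choices"][0]["message"]["content"])

if __name__ == "__main__":
    print(json.dumps(annotate_image("frame.jpg"), indent=2))
```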
Drop an image here, or press ⌘/Ctrl+V to paste.

Video (5-frame) Annotator

Drop a video here (MP4 · WebM · MOV).
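
A minimal sketch of the 5-frame workflow: sample five evenly spaced frames from an uploaded video with OpenCV, then annotate each one with the helper sketched above. The sampling logic and the module name imported below are assumptions for illustration; the playground's own frame-selection strategy is not described on this page.

```python
# Minimal sketch of the 5-frame annotator: sample five evenly spaced
# frames with OpenCV and annotate each one.
import json

import cv2  # pip install opencv-python

# Assumes the image sketch above is saved as cliptagger_sketch.py.
from cliptagger_sketch import annotate_image

def sample_frames(video_path: str, n_frames: int = 5) -> list[str]:
    """Write n_frames evenly spaced JPEGs and return their file paths."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    paths = []
    for i in range(n_frames):
        # Seek to an evenly spaced frame index across the whole clip.
        index = int(i * (total - 1) / max(n_frames - 1, 1))
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = cap.read()
        if not ok:
            continue
        out = f"frame_{i}.jpg"
        cv2.imwrite(out, frame)
        paths.append(out)
    cap.release()
    return paths

if __name__ == "__main__":
    annotations = [annotate_image(p) for p in sample_frames("clip.mp4")]
    print(json.dumps(annotations, indent=2))
```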