ClipTagger-12b Playground

Upload or paste an image, then annotate it using Inference.net

Grass × Inference
ClipTagger-12b is a 12B-parameter vision-language model for scalable video understanding. It outputs schema-consistent JSON for every frame and matches the accuracy of frontier closed models at roughly 17x lower cost.
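
Outside the playground UI, the same per-frame annotation can be requested programmatically. The sketch below assumes an OpenAI-compatible chat-completions endpoint; the base URL, model id, environment variable name, and prompt are illustrative assumptions, not values documented on this page.

```python
# Minimal sketch: annotate a single image with ClipTagger-12b over an
# OpenAI-compatible chat-completions API. Base URL, model id, and prompt
# are assumptions for illustration, not documented values.
import base64
import json
import os

import requests

API_BASE = "https://api.inference.net/v1"    # assumed endpoint
MODEL_ID = "inference-net/cliptagger-12b"    # assumed model id
API_KEY = os.environ["INFERENCE_API_KEY"]    # assumed env var

def annotate_image(path: str) -> dict:
    """Send one image frame and return the model's JSON annotation."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Annotate this frame and respond with JSON only."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }
        ],
        "response_format": {"type": "json_object"},
    }
    resp = requests.post(
        f"{API_BASE}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["choices"][0]["message"]["content"])

if __name__ == "__main__":
    print(json.dumps(annotate_image("frame.jpg"), indent=2))
```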
Drop an image here, or press ⌘/Ctrl+V to paste.

Video (5-frame) Annotator

Drop a video here (MP4 · WebM · MOV).
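
A minimal sketch of the 5-frame workflow: sample five evenly spaced frames from an uploaded video with OpenCV, then annotate each one with the helper sketched above. The sampling logic and the module name imported below are assumptions for illustration; the playground's own frame-selection strategy is not described on this page.

```python
# Minimal sketch of the 5-frame annotator: sample five evenly spaced
# frames with OpenCV and annotate each one.
import json

import cv2  # pip install opencv-python

# Assumes the image sketch above is saved as cliptagger_sketch.py.
from cliptagger_sketch import annotate_image

def sample_frames(video_path: str, n_frames: int = 5) -> list[str]:
    """Write n_frames evenly spaced JPEGs and return their file paths."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    paths = []
    for i in range(n_frames):
        # Seek to an evenly spaced frame index across the whole clip.
        index = int(i * (total - 1) / max(n_frames - 1, 1))
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = cap.read()
        if not ok:
            continue
        out = f"frame_{i}.jpg"
        cv2.imwrite(out, frame)
        paths.append(out)
    cap.release()
    return paths

if __name__ == "__main__":
    annotations = [annotate_image(p) for p in sample_frames("clip.mp4")]
    print(json.dumps(annotations, indent=2))
```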