Multimodal-Outpost-Notebooks

Multimodal-Outpost is a collection of Colab notebooks for image inference and experimentation with multimodal vision-language models (VLMs). It covers OCR, image captioning, video understanding, and generating DOCX or PDF documents that contain both the images and the extracted text.

Notebooks List 📘

The table below lists the notebooks in this repository. Each one runs inference with a specific multimodal vision-language model (VLM).

Notebook Name	Link ↗
Aya-Vision-8B-VideoUnderstanding	Link
Behemoth-3B-070225-post.1	Link
Camel-Doc-OCR-080125	Link
Florence-2-Models-Image-Caption	Link
Gemma3-VL-VideoUnderstanding	Link
Imgscope-OCR-2B-0527-VideoUnderstanding	Link
Inkscope-Captions-2B-0526-VideoUnderstanding	Link
LFM2-VL-1.6B-LiquidAI	Link
LFM2-VL-450M-LiquidAI	Link
Lumian-VLR-7B-Thinking-Demo-Notebook	Link
Lumian2-VLR-7B-Thinking-Demo-Notebook	Link
Megalodon-OCR-Sync-0713-ColabNotebook	Link
MiMo-VL-7B-RL-VideoUnderstanding	Link
MiMo-VL-7B-SFT-VideoUnderstanding	Link
MonkeyOCR-0709	Link
OCRFlux3B	Link
Qwen2-VL-MessyOCR-VideoUnderstanding	Link
Qwen2-VL-OCR-2B-Instruct	Link
Qwen2-VL-VideoUnderstanding	Link
Qwen2.5-VL-3B-Abliterated-Caption-it(caption)	Link
Qwen2.5-VL-3B-Instruct	Link
Qwen2.5-VL-7B-Abliterated-Caption-it	Link
Qwen2.5-VL-VideoUnderstanding	Link
RolmOCR-Qwen2.5-VL-VideoUnderstanding	Link
SmolDocling-256M-preview	Link
monkey-OCR	Link
moondream2-2025-06-21	Link
nanonets-OCR	Link
olmOCR-Qwen2-VL-VideoUnderstanding	Link
typhoon-OCR	Link
typhoon-ocr-7b-Qwen2.5VL-VideoUnderstanding	Link

Features

Extracts text from images using various OCR models
Supports image captioning and multimodal inference
Embeds images and extracted text into DOCX or PDF formats
Designed for quick deployment via Google Colab
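
The step of embedding an image and its extracted text into a PDF can be sketched roughly as follows. This is a minimal illustration built on ReportLab (listed under Dependencies), not the code used in the notebooks. The function names `wrap_ocr_text` and `build_pdf`, the layout constants, and the file paths in the usage comment are all hypothetical.

```python
import textwrap


def wrap_ocr_text(text, max_chars=85):
    """Split extracted text into lines short enough to fit the page width."""
    lines = []
    for paragraph in text.splitlines():
        # textwrap.wrap returns [] for an empty string, so keep blank lines explicitly.
        lines.extend(textwrap.wrap(paragraph, width=max_chars) or [""])
    return lines


def build_pdf(image_path, text, out_path, max_chars=85):
    """Write a PDF with the image at the top and the extracted text below it."""
    # Imported lazily so wrap_ocr_text works even without ReportLab installed.
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    page_w, page_h = A4
    margin, line_h, box_h = 50, 14, 300  # illustrative layout values, in points
    pdf = canvas.Canvas(out_path, pagesize=A4)

    # Scale the image into a fixed box on the first page, keeping its aspect ratio.
    pdf.drawImage(image_path, margin, page_h - margin - box_h,
                  width=page_w - 2 * margin, height=box_h,
                  preserveAspectRatio=True, anchor="n")

    y = page_h - margin - box_h - 2 * line_h
    pdf.setFont("Helvetica", 11)
    for line in wrap_ocr_text(text, max_chars):
        if y < margin:
            # The current page is full: start a new one and reset the cursor.
            pdf.showPage()
            pdf.setFont("Helvetica", 11)
            y = page_h - margin
        pdf.drawString(margin, y, line)
        y -= line_h
    pdf.save()
    return out_path


# Hypothetical usage, with the text coming from any OCR model:
# build_pdf("page.png", extracted_text, "result.pdf")
```

Wrapping by character count is a simplification. A layout that measures the rendered string width would handle mixed fonts and wide characters more reliably.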

Inference Image (Demo)

[Image: screenshot of the Gradio demo interface, captured 2025-08-22]

Dependencies

Python
PyTorch
Hugging Face Transformers
ReportLab
Gradio (for UI)
Model-specific packages (Qwen2.5-VL based and others)

All dependencies are automatically installed in the Colab environment.
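
When running a notebook outside Colab, it may help to check which of the packages above are importable before loading a model. The sketch below is one way to do that with the standard library. The mapping from package names to import names is an assumption based on the list above, and `missing_packages` is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Assumed import names for the dependencies listed above.
REQUIRED = {
    "PyTorch": "torch",
    "Hugging Face Transformers": "transformers",
    "ReportLab": "reportlab",
    "Gradio": "gradio",
}


def missing_packages(required=REQUIRED):
    """Return the display names of packages that cannot be found for import."""
    return [name for name, module in required.items()
            if importlib.util.find_spec(module) is None]


# Print whatever is missing in the current environment (an empty list if none).
print(missing_packages())
```

`importlib.util.find_spec` only locates the module without importing it, so the check stays fast even for large packages such as PyTorch.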

Author

Created and maintained by PRITHIVSAKTHIUR
