PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks


Multimodal-Outpost-Notebooks

Multimodal-Outpost is a collection of Colab notebooks for image inference and experimentation with multimodal vision-language models (VLMs). The notebooks cover OCR, image captioning, video understanding, and generation of DOCX or PDF documents that contain both the input images and the extracted text.

Notebooks List 📘

This repository contains a curated collection of notebooks for running multimodal vision-language models (VLMs). Each entry below is named after the model or task it covers.

Notebook Name Link ↗
Aya-Vision-8B-VideoUnderstanding Link
Behemoth-3B-070225-post.1 Link
Camel-Doc-OCR-080125 Link
Florence-2-Models-Image-Caption Link
Gemma3-VL-VideoUnderstanding Link
Imgscope-OCR-2B-0527-VideoUnderstanding Link
Inkscope-Captions-2B-0526-VideoUnderstanding Link
LFM2-VL-1.6B-LiquidAI Link
LFM2-VL-450M-LiquidAI Link
Lumian-VLR-7B-Thinking-Demo-Notebook Link
Lumian2-VLR-7B-Thinking-Demo-Notebook Link
Megalodon-OCR-Sync-0713-ColabNotebook Link
MiMo-VL-7B-RL-VideoUnderstanding Link
MiMo-VL-7B-SFT-VideoUnderstanding Link
MonkeyOCR-0709 Link
OCRFlux3B Link
Qwen2-VL-MessyOCR-VideoUnderstanding Link
Qwen2-VL-OCR-2B-Instruct Link
Qwen2-VL-VideoUnderstanding Link
Qwen2.5-VL-3B-Abliterated-Caption-it(caption) Link
Qwen2.5-VL-3B-Instruct Link
Qwen2.5-VL-7B-Abliterated-Caption-it Link
Qwen2.5-VL-VideoUnderstanding Link
RolmOCR-Qwen2.5-VL-VideoUnderstanding Link
SmolDocling-256M-preview Link
monkey-OCR Link
moondream2-2025-06-21 Link
nanonets-OCR Link
olmOCR-Qwen2-VL-VideoUnderstanding Link
typhoon-OCR Link
typhoon-ocr-7b-Qwen2.5VL-VideoUnderstanding Link
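
Many of the notebooks above carry "VideoUnderstanding" in their names. A common way to pass a video to a VLM is to hand it a small number of still frames spread across the clip. The sketch below shows one way to pick those frame indices uniformly. It is an illustration only: the function name `sample_frame_indices` and the midpoint strategy are assumptions, not code taken from these notebooks, and each notebook may sample frames differently.

```python
def sample_frame_indices(total_frames, num_samples):
    """Return up to num_samples frame indices spread evenly over a clip.

    Hypothetical helper for illustration; not taken from the notebooks.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    # A short clip has fewer frames than requested: use all of them.
    if num_samples >= total_frames:
        return list(range(total_frames))
    # Split the clip into num_samples equal segments and take the
    # frame nearest the middle of each segment.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    print(sample_frame_indices(100, 4))
```

The returned indices can then be used with whatever video reader a notebook relies on to extract the frames that are sent to the model.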

Features

  • Extracts text from images using various OCR models
  • Supports image captioning and multimodal inference
  • Embeds images and extracted text into DOCX or PDF formats
  • Designed for quick deployment via Google Colab
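
As an illustration of the third feature, here is a minimal sketch of embedding an image and its extracted text into a PDF with ReportLab, which is listed under Dependencies. The function names `fit_within` and `build_pdf`, the default output file name, and the layout choices are assumptions made for this example and are not taken from the notebooks. Reading image formats other than JPEG with ReportLab also generally requires Pillow.

```python
from xml.sax.saxutils import escape


def fit_within(width, height, max_width, max_height):
    """Scale (width, height) down to fit a box, keeping the aspect ratio."""
    scale = min(max_width / width, max_height / height, 1.0)
    return width * scale, height * scale


def build_pdf(image_path, extracted_text, out_path="ocr_result.pdf"):
    """Write a PDF with the image followed by the extracted text.

    Hypothetical helper for illustration; not taken from the notebooks.
    """
    # Imported lazily so fit_within works without ReportLab installed.
    from reportlab.lib.pagesizes import A4
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.lib.utils import ImageReader
    from reportlab.platypus import Image, Paragraph, SimpleDocTemplate, Spacer

    doc = SimpleDocTemplate(out_path, pagesize=A4)
    styles = getSampleStyleSheet()

    # Read the pixel size, then shrink the image to fit inside the
    # printable area (minus the frame padding) and half the page height.
    px_w, px_h = ImageReader(image_path).getSize()
    img_w, img_h = fit_within(px_w, px_h, doc.width - 12, doc.height * 0.5)

    story = [Image(image_path, width=img_w, height=img_h), Spacer(1, 12)]
    # Paragraph parses XML-like markup, so escape the raw OCR text first.
    for line in extracted_text.splitlines():
        if line.strip():
            story.append(Paragraph(escape(line), styles["Normal"]))

    doc.build(story)
    return out_path
```

A call such as `build_pdf("page.png", ocr_text)` would, under these assumptions, write the image and one paragraph per non-empty line of `ocr_text` to `ocr_result.pdf`. Escaping matters because OCR output often contains `<` or `&`, which `Paragraph` would otherwise try to parse as markup.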

Inference Image (Demo)

[Screenshot of the Gradio demo interface, 2025-08-22]

Dependencies

  • Python
  • PyTorch
  • Hugging Face Transformers
  • ReportLab
  • Gradio (for UI)
  • Model-specific packages (for the Qwen2.5-VL based notebooks and others)

All dependencies are automatically installed in the Colab environment.

Author

Created and maintained by PRITHIVSAKTHIUR