Multimodal Extractor

The multimodal extractor processes video, audio, image, text, and GIF content through a unified pipeline. Videos and audio are decomposed into segments with transcription (Whisper), visual embeddings, OCR, and descriptions. Images and text are embedded directly without decomposition. Two versions are available:

Version	Embedding Model	Dimensions	Key Difference
v1	Vertex Multimodal Embedding	1408	Established, lower dimensionality
v2	Gemini Embedding 2	3072 (configurable: 1536, 768)	Higher dimensionality, Matryoshka support, native multimodal

Both versions share the same pipeline (FFmpeg chunking, Whisper, thumbnails, Gemini vision) and differ only in the multimodal embedding step.

View extractor details at api.mixpeek.com/v1/collections/features/extractors/multimodal_extractor_v1 or multimodal_extractor_v2. You can also fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
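
As a rough sketch of the programmatic lookup, the snippet below builds a request for the GET endpoint above using only Python's standard library. The `https://` scheme, the bearer-token `Authorization` header, and both function names are assumptions made for illustration; confirm the authentication scheme for your account before relying on it.

```python
import json
import urllib.request

# Assumption: the host named above is served over HTTPS.
BASE_URL = "https://api.mixpeek.com"


def extractor_details_url(feature_extractor_id: str) -> str:
    """Build the URL for GET /v1/collections/features/extractors/{feature_extractor_id}."""
    return f"{BASE_URL}/v1/collections/features/extractors/{feature_extractor_id}"


def fetch_extractor_details(feature_extractor_id: str, api_key: str) -> dict:
    """Fetch extractor details as a dict. The bearer header is an assumed auth scheme."""
    request = urllib.request.Request(
        extractor_details_url(feature_extractor_id),
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_extractor_details("multimodal_extractor_v2", api_key)` would request the v2 details; only the URL construction is exercised without network access.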

Pipeline Steps

1. Filter Dataset (if collection_id provided)
   - Filter to specified collection
2. Apply Input Mappings
3. Detect Content Types (sample 100 rows)
   - Identify: video, audio, image, text, or mixed
4. Content Routing
   - Video: FFmpeg chunking (time/scene/silence) → Steps 5-10
   - Audio: FFmpeg audio chunking (time/silence) → Steps 5-8
   - Image: Skip to Step 8
   - Text: Skip to Step 8
   - Mixed: Branch by type, process separately, union results
5. Transcription (conditional: if run_transcription=true, video/audio only)
   - Whisper API or Local GPU speech-to-text
6. Transcription Embeddings (conditional: if run_transcription_embedding=true)
   - E5-Large text embeddings (1024D) from transcribed audio
7. Multimodal Embeddings (conditional: if run_multimodal_embedding=true)
   - v1: Vertex AI embeddings (1408D)
   - v2: Gemini Embedding 2 (3072D, configurable)
   - Unified embedding space enables cross-modal search
8. Thumbnail Generation (conditional: if enable_thumbnails=true, visual content only)
   - 640px width at 85% quality, S3 upload with optional CDN
9. Visual Analysis (conditional: if run_video_description OR run_ocr=true, visual content only)
   - Gemini-based descriptions and/or OCR text extraction
10. Output
   - Segment/document records with embeddings, transcriptions, descriptions, OCR, thumbnails

When to Use

Use Case	Description
Video content libraries	Search and navigate video segments by content
Media platforms	Search across spoken and visual content
Educational content	Find moments in lectures and tutorials
Surveillance/security	Event detection in footage
Social media	Process user-generated video content
Broadcasting/streaming	Large video catalog management
Marketing analytics	Analyze video campaigns
Cross-modal search	Find videos/images using text queries

When NOT to Use

Scenario	Recommended Alternative
Static image collections only	`image_extractor`
Audio-only content	`audio_extractor`
Very short videos (< 5 seconds)	Processing overhead not worth it
Real-time live streams	Specialized streaming extractors
8K+ resolution video	Consider downsampling first
Embed all files in one object as one vector	`gemini_multifile_extractor`

Supported Input Types

Input	Type	Description	Processing
`video`	string	URL or S3 path	Decomposed into segments
`image`	string	URL or S3 path	Direct embedding (no decomposition)
`text`	string	Plain text content	Direct embedding
`gif`	string	URL or S3 path	Treated as video, frame-by-frame

Supported formats:

Video: MP4, MOV, AVI, MKV, WebM, FLV
Image: JPG, PNG, WebP, BMP
GIF: Animated GIF
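
When preparing objects for ingestion, it can help to pick the input field from the file extension. The sketch below maps the formats listed above to the `video`, `image`, and `gif` input keys. It is a client-side convenience only: the function name is hypothetical, `.jpeg` is an assumed alias of JPG, and the extractor's own content-type detection may work differently.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extensions from the supported-format list; ".jpeg" is an assumed alias of JPG.
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv", ".webm", ".flv"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}
GIF_EXTS = {".gif"}


def guess_input_key(url: str) -> Optional[str]:
    """Return 'video', 'image', or 'gif' for a URL/S3 path, or None if unrecognized."""
    ext = PurePosixPath(urlparse(url).path).suffix.lower()
    if ext in VIDEO_EXTS:
        return "video"
    if ext in IMAGE_EXTS:
        return "image"
    if ext in GIF_EXTS:
        return "gif"
    return None
```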

Input Schema

Provide one of the following inputs:

{
  "video": "s3://bucket/videos/lecture.mp4"
}

{
  "image": "https://cdn.example.com/products/laptop.jpg"
}

{
  "text": "High-performance laptop with M3 chip, perfect for developers"
}

Field	Type	Description
`video`	string	URL/S3 path to video file. Recommended: 720p-1080p, < 2 hours
`image`	string	URL/S3 path to image file. Recommended: < 10MB
`text`	string	Plain text for cross-modal embedding
`gif`	string	URL/S3 path to GIF file
`custom_thumbnail`	string	Optional custom thumbnail URL instead of auto-generated
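
A minimal client-side check of the "provide one of the following inputs" rule might look like the sketch below. It assumes exactly one of `video`, `image`, `text`, or `gif` is expected per object and that `custom_thumbnail` is always optional; the function name is hypothetical and the server may validate differently.

```python
PRIMARY_INPUTS = ("video", "image", "text", "gif")


def validate_input(payload: dict) -> str:
    """Return the single primary input key present in payload, or raise ValueError."""
    present = [key for key in PRIMARY_INPUTS if payload.get(key)]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one of {PRIMARY_INPUTS}, found {present or 'none'}"
        )
    return present[0]
```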

Output Schema

Each video segment produces one document. Images and text produce one document each without segmentation.

Segment & Timing Fields

Field	Type	Description
`start_time`	number	Segment start time in seconds
`end_time`	number	Segment end time in seconds
`start_frame`	integer	Start frame number (`start_time × fps`)
`end_frame`	integer	End frame number (`end_time × fps`)
`fps`	number	Frame rate of the preprocessed video used for chunking
`source_fps`	number	Original source video frame rate before any preprocessing (e.g. 29.97, 30, 23.976). Use this for precise frame-level calculations against the source video
`duration`	number	Total duration of the entire source video in seconds (not the segment duration)

Content Fields

Field	Type	Description
`transcription`	string	Transcribed audio content (requires `run_transcription`)
`description`	string	AI-generated segment description (requires `run_video_description`)
`ocr_text`	string	Text extracted from video frames (requires `run_ocr`)

URL Fields

Field	Type	Description
`thumbnail_url`	string	S3/CDN URL of the thumbnail image
`source_video_url`	string	URL of the original source video
`video_segment_url`	string	S3 URL of this specific segment file. Enables collection-to-collection decomposition

Embedding Fields

v1:

Field	Type	Description
`multimodal_extractor_v1_multimodal_embedding`	float[1408]	Vertex AI multimodal embedding
`multimodal_extractor_v1_transcription_embedding`	float[1024]	E5-Large transcription embedding

v2:

Field	Type	Description
`multimodal_extractor_v2_multimodal_embedding`	float[3072]	Gemini Embedding 2 multimodal embedding (configurable: 1536, 768)
`multimodal_extractor_v2_transcription_embedding`	float[1024]	E5-Large transcription embedding

Example Output

{
  "start_time": 10.0,
  "end_time": 20.0,
  "start_frame": 20,
  "end_frame": 40,
  "fps": 2.0,
  "source_fps": 29.97,
  "duration": 120.5,
  "transcription": "Welcome to today's lecture on machine learning fundamentals...",
  "description": "Instructor standing at whiteboard, introducing ML concepts",
  "ocr_text": "Machine Learning 101",
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/thumb_1.jpg",
  "source_video_url": "s3://mixpeek-storage/ns_123/obj_456/original.mp4",
  "video_segment_url": "s3://mixpeek-storage/ns_123/obj_456/segments/segment_001.mp4",
  "multimodal_extractor_v1_multimodal_embedding": [0.023, -0.041, "...1408 floats"],
  "multimodal_extractor_v1_transcription_embedding": [0.018, -0.032, "...1024 floats"]
}

{
  "start_time": 10.0,
  "end_time": 20.0,
  "start_frame": 20,
  "end_frame": 40,
  "fps": 2.0,
  "source_fps": 29.97,
  "duration": 120.5,
  "transcription": "Welcome to today's lecture on machine learning fundamentals...",
  "description": "Instructor standing at whiteboard, introducing ML concepts",
  "ocr_text": "Machine Learning 101",
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/thumb_1.jpg",
  "source_video_url": "s3://mixpeek-storage/ns_123/obj_456/original.mp4",
  "video_segment_url": "s3://mixpeek-storage/ns_123/obj_456/segments/segment_001.mp4",
  "multimodal_extractor_v2_multimodal_embedding": [0.015, -0.038, "...3072 floats"],
  "multimodal_extractor_v2_transcription_embedding": [0.018, -0.032, "...1024 floats"]
}

fps reflects the preprocessed video frame rate (e.g. 2.0 fps after downsampling). source_fps is the original video’s native frame rate (e.g. 29.97). Use source_fps when you need to map timestamps back to exact frame numbers in the original source file.
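
The frame arithmetic can be sketched as follows. `segment_frames` applies the `start_time × fps` rule from the timing table, and `source_frame` maps a timestamp onto the original file using `source_fps`. Rounding to the nearest frame is an assumption; the pipeline may floor instead, which matters only when the product is not a whole number.

```python
from typing import Tuple


def segment_frames(start_time: float, end_time: float, fps: float) -> Tuple[int, int]:
    """Frame numbers in the preprocessed video (time x fps, rounded to nearest)."""
    return round(start_time * fps), round(end_time * fps)


def source_frame(timestamp: float, source_fps: float) -> int:
    """Nearest frame number in the original source video for a timestamp in seconds."""
    return round(timestamp * source_fps)
```

For the example output above, `segment_frames(10.0, 20.0, 2.0)` gives `(20, 40)`, while the same 10-second mark falls on frame 300 of the 29.97 fps source.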

Parameters

Video Splitting

Parameter	Type	Default	Description
`split_method`	string	`"time"`	Primary video splitting strategy: `time`, `scene`, or `silence`
`max_segment_duration`	float	`30.0`	Maximum seconds per segment. Scene/silence segments longer than this are subdivided. Set to `null` to disable
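
To illustrate how `max_segment_duration` caps scene and silence segments, here is a minimal sketch that subdivides an over-long segment. Splitting into equal-length pieces is an assumption made for the example; the extractor may instead cut fixed-length chunks and leave a shorter remainder.

```python
import math
from typing import List, Optional, Tuple


def subdivide(
    start: float, end: float, max_segment_duration: Optional[float] = 30.0
) -> List[Tuple[float, float]]:
    """Split (start, end) into equal pieces no longer than max_segment_duration.

    Passing None disables subdivision, mirroring the `null` setting.
    """
    length = end - start
    if max_segment_duration is None or length <= max_segment_duration:
        return [(start, end)]
    pieces = math.ceil(length / max_segment_duration)
    step = length / pieces
    bounds = [start + i * step for i in range(pieces)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))
```

A 75-second scene with the default 30-second cap becomes three 25-second pieces under this assumption.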

Split Methods

Three split methods are available: `time`, `scene`, and `silence`.

`time`: Fixed interval splitting. Splits the video into segments of equal duration.

Parameter	Type	Default	Description
`time_split_interval`	integer	`10`	Interval in seconds for each segment

Characteristics:

Predictable segment count: video_duration / interval
Consistent chunk sizes for uniform processing
May cut mid-sentence or mid-scene

Best for: General purpose, consistent chunking, when you need predictable segment counts

{
  "split_method": "time",
  "time_split_interval": 10
}
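
The segment count and boundaries for time-based splitting can be estimated ahead of ingestion. This sketch assumes a trailing partial interval is kept as its own shorter segment, so the count is `video_duration / interval` rounded up; the function name is hypothetical.

```python
import math
from typing import List, Tuple


def time_segments(
    video_duration: float, time_split_interval: int = 10
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for fixed-interval splitting."""
    if time_split_interval <= 0:
        raise ValueError("time_split_interval must be positive")
    count = math.ceil(video_duration / time_split_interval)
    return [
        (i * time_split_interval, min((i + 1) * time_split_interval, video_duration))
        for i in range(count)
    ]
```

For a 120.5-second video at the default 10-second interval this yields 13 segments, the last covering 120 to 120.5 seconds.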

`scene`: Visual change detection. Splits the video when significant visual changes occur (shot changes, transitions).

Parameter	Type	Default	Description
`scene_detection_threshold`	float	`0.5`	Sensitivity threshold (0.0-1.0)

Threshold guide:

0.3 - High sensitivity, detects subtle changes (more segments)
0.5 - Balanced (default)
0.7 - Low sensitivity, only major scene changes (fewer segments)

Characteristics:

Variable segment count (typically 2-20 per minute)
Segments align with visual content boundaries
Better for content with distinct shots/scenes

Best for: Movies, dynamic content, shot changes, music videos, advertisements

{
  "split_method": "scene",
  "scene_detection_threshold": 0.5
}
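
To preview where scene cuts might land before ingesting, one option is FFmpeg's own scene score. The sketch below builds a command using FFmpeg's `select` and `showinfo` filters and parses the `pts_time` values that `showinfo` logs to stderr. Whether the extractor uses this exact filter, and whether `scene_detection_threshold` maps one-to-one onto FFmpeg's scene score, are assumptions; the function names are hypothetical.

```python
import re
from typing import List


def scene_detect_command(video_path: str, scene_detection_threshold: float = 0.5) -> List[str]:
    """Build an FFmpeg command that logs frames whose scene score exceeds the threshold."""
    if not 0.0 <= scene_detection_threshold <= 1.0:
        raise ValueError("scene_detection_threshold must be between 0.0 and 1.0")
    video_filter = f"select='gt(scene,{scene_detection_threshold})',showinfo"
    return ["ffmpeg", "-hide_banner", "-i", video_path, "-vf", video_filter, "-f", "null", "-"]


PTS_TIME = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")


def scene_change_times(ffmpeg_stderr: str) -> List[float]:
    """Extract the timestamps (seconds) of selected frames from showinfo log output."""
    return [float(value) for value in PTS_TIME.findall(ffmpeg_stderr)]
```

Run the command with `subprocess.run(cmd, capture_output=True, text=True)` and pass the captured `stderr` to `scene_change_times`.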

`silence`: Audio pause detection. Splits the video at moments of silence or low audio.

Parameter	Type	Default	Description
`silence_db_threshold`	integer	`-40`	Decibel level below which audio is considered silent

Threshold guide:

-50 dB - Stricter: only very quiet audio counts as silence (fewer segments)
-40 dB - Balanced (default)
-30 dB - More permissive: quiet speech or background noise can also count as silence (more segments)

Characteristics:

Variable segment count (typically 5-30 per minute)
Segments align with natural speech pauses
Preserves complete sentences/thoughts

Best for: Lectures, presentations, conversations, podcasts, interviews

{
  "split_method": "silence",
  "silence_db_threshold": -40
}
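
A comparable preview for silence-based splitting can use FFmpeg's `silencedetect` audio filter, which logs `silence_start` and `silence_end` markers to stderr. The sketch below builds the command and places a candidate split point at the midpoint of each detected pause. The minimum pause length (`min_silence_seconds`) is a hypothetical parameter that the extractor does not expose, and the midpoint rule is an assumption rather than the extractor's documented behavior.

```python
import re
from typing import List


def silence_detect_command(
    media_path: str, silence_db_threshold: int = -40, min_silence_seconds: float = 0.5
) -> List[str]:
    """Build an FFmpeg command that logs silent intervals below the dB threshold."""
    audio_filter = f"silencedetect=noise={silence_db_threshold}dB:d={min_silence_seconds}"
    return ["ffmpeg", "-hide_banner", "-i", media_path, "-af", audio_filter, "-f", "null", "-"]


SILENCE_START = re.compile(r"silence_start:\s*(-?[0-9.]+)")
SILENCE_END = re.compile(r"silence_end:\s*(-?[0-9.]+)")


def split_points(ffmpeg_stderr: str) -> List[float]:
    """Return the midpoint (seconds) of each complete silent interval in the log."""
    starts = [float(v) for v in SILENCE_START.findall(ffmpeg_stderr)]
    ends = [float(v) for v in SILENCE_END.findall(ffmpeg_stderr)]
    return [(start + end) / 2 for start, end in zip(starts, ends)]
```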

Split Methods Comparison

Method	Segments/Min	Predictability	Best For
`time`	60 / interval_sec	High	General purpose, batch processing
`scene`	Variable (2-20)	Low	Movies, ads, dynamic visual content
`silence`	Variable (5-30)	Medium	Lectures, podcasts, spoken content

Feature Extraction Parameters

Parameter	Type	Default	Description
`run_transcription`	boolean	`true` (v1) / `false` (v2)	Run Whisper transcription on audio
`transcription_language`	string	`"en"`	Language for transcription
`run_transcription_embedding`	boolean	`true` (v1) / `false` (v2)	Generate E5 embeddings for transcriptions
`run_multimodal_embedding`	boolean	`true`	Generate multimodal embeddings
`run_video_description`	boolean	`false`	Generate AI descriptions (adds 1-2s per segment)
`run_ocr`	boolean	`false`	Extract text from video frames

Thumbnail Parameters

Parameter	Type	Default	Description
`enable_thumbnails`	boolean	`true`	Generate thumbnail images
`use_cdn`	boolean	`false`	Use CloudFront CDN for thumbnails

CDN benefits: Faster global delivery, permanent URLs, reduced bandwidth costs.

v2-Only Parameters

These parameters are only available on multimodal_extractor v2:

Parameter	Type	Default	Description
`output_dimensionality`	integer	`3072`	Embedding dimensions. Gemini Embedding 2 supports Matryoshka reduction: `3072` (full), `1536`, or `768`
`task_type`	string	`"RETRIEVAL_DOCUMENT"`	Embedding task hint: `RETRIEVAL_DOCUMENT`, `RETRIEVAL_QUERY`, `SEMANTIC_SIMILARITY`, `CLASSIFICATION`

At query time, Mixpeek automatically uses RETRIEVAL_QUERY — you only need to set task_type at index time. The default RETRIEVAL_DOCUMENT is correct for most use cases.
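
A small client-side guard can catch unsupported values for the v2-only parameters before a request is sent. The allowed values below are copied from the table above; the function name is hypothetical, and the API remains the source of truth for what it accepts.

```python
ALLOWED_DIMENSIONS = (3072, 1536, 768)
ALLOWED_TASK_TYPES = (
    "RETRIEVAL_DOCUMENT",
    "RETRIEVAL_QUERY",
    "SEMANTIC_SIMILARITY",
    "CLASSIFICATION",
)


def v2_parameters(output_dimensionality: int = 3072, task_type: str = "RETRIEVAL_DOCUMENT") -> dict:
    """Return a parameters fragment for v2 after checking both values."""
    if output_dimensionality not in ALLOWED_DIMENSIONS:
        raise ValueError(f"output_dimensionality must be one of {ALLOWED_DIMENSIONS}")
    if task_type not in ALLOWED_TASK_TYPES:
        raise ValueError(f"task_type must be one of {ALLOWED_TASK_TYPES}")
    return {"output_dimensionality": output_dimensionality, "task_type": task_type}
```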

Embedding Task

When run_transcription_embedding is enabled, the E5 model generates text embeddings from transcribed audio. By default, these use retrieval_document for asymmetric search.

Set embedding_task at the collection level, not on the extractor. See Collection Embedding Task for full details and examples.

This only affects the E5 transcription embeddings. Vertex AI multimodal embeddings (v1) and Gemini Embedding 2 (v2) are not instruction-aware and ignore this parameter.

Description Generation Parameters

Parameter	Type	Default	Description
`description_prompt`	string	`"Describe the video segment in detail."`	Prompt for Gemini
`generation_config.temperature`	float	`0.7`	Randomness (higher = more creative)
`generation_config.max_output_tokens`	integer	`1024`	Maximum description length
`generation_config.top_p`	float	`0.8`	Nucleus sampling

LLM Structured Extraction

Parameter	Type	Default	Description
`response_shape`	string \| object	`null`	Custom structured output schema

Natural Language Mode:

{
  "response_shape": "Extract product names, colors, materials, and aesthetic style labels from this fashion segment"
}

JSON Schema Mode:

{
  "response_shape": {
    "type": "object",
    "properties": {
      "products": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "category": { "type": "string" },
            "visibility_percentage": { "type": "integer", "minimum": 0, "maximum": 100 }
          }
        }
      },
      "aesthetic": { "type": "string" }
    }
  }
}

Configuration Examples

{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": {
      "video": "payload.video_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.video_id" }
    ],
    "parameters": {
      "split_method": "time",
      "time_split_interval": 10,
      "run_transcription": true,
      "run_multimodal_embedding": true,
      "enable_thumbnails": true
    }
  }
}

Performance & Costs

Processing Speed

Content Type	Speed
Video	0.5-2x realtime (depends on features enabled)
Image	< 1 second
Text	< 100ms

Example: 10-minute video → 5-20 minutes processing time

Feature	Latency per Segment
Transcription	~200ms per second of audio
Visual embedding	~50ms
OCR	~300ms
Description	~2s
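
These figures can be turned into a back-of-the-envelope estimate. The sketch below applies the 0.5-2x range to a video's duration and sums the per-segment latencies for the enabled features. Treating the latencies as additive (features run one after another) is an assumption; actual processing may overlap steps, so read the result as a rough upper bound.

```python
from typing import Tuple

TRANSCRIPTION_MS_PER_AUDIO_SECOND = 200
VISUAL_EMBEDDING_MS = 50
OCR_MS = 300
DESCRIPTION_MS = 2000


def video_processing_minutes(duration_minutes: float) -> Tuple[float, float]:
    """Return the (low, high) processing-time range at 0.5x to 2x the duration."""
    return 0.5 * duration_minutes, 2.0 * duration_minutes


def segment_latency_ms(
    segment_seconds: float,
    transcription: bool = True,
    ocr: bool = False,
    description: bool = False,
) -> float:
    """Sum the approximate per-segment latencies for the enabled features."""
    total = VISUAL_EMBEDDING_MS
    if transcription:
        total += TRANSCRIPTION_MS_PER_AUDIO_SECOND * segment_seconds
    if ocr:
        total += OCR_MS
    if description:
        total += DESCRIPTION_MS
    return total
```

A 10-second segment with every feature enabled comes to about 4.35 seconds under this additive assumption.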

Cost Estimates (per minute of video)

Configuration	Cost
Minimal (transcription + embeddings)	$0.01
Standard (+ OCR)	$0.05
Full (+ descriptions)	$0.15

Images: $0.001 per image

Text: $0.0001 per query
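
For budgeting, the estimates above can be combined into a simple calculator. This sketch assumes costs scale linearly with volume and that the image and text figures are in the same currency as the per-minute table; the function name is hypothetical and the numbers are the document's estimates, not a price quote.

```python
from decimal import Decimal

VIDEO_COST_PER_MINUTE = {
    "minimal": Decimal("0.01"),   # transcription + embeddings
    "standard": Decimal("0.05"),  # + OCR
    "full": Decimal("0.15"),      # + descriptions
}
IMAGE_COST = Decimal("0.001")
TEXT_QUERY_COST = Decimal("0.0001")


def estimate_cost(
    video_minutes: float = 0,
    images: int = 0,
    text_queries: int = 0,
    configuration: str = "standard",
) -> Decimal:
    """Estimate total cost; Decimal avoids binary floating-point drift in the sums."""
    video = VIDEO_COST_PER_MINUTE[configuration] * Decimal(str(video_minutes))
    return video + IMAGE_COST * images + TEXT_QUERY_COST * text_queries
```

For instance, 600 minutes of video on the minimal configuration plus 1,000 images comes to 7.00 under these estimates.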

Vector Indexes

Multimodal Embedding (v1)

Property	Value
Index name	`multimodal_extractor_v1_multimodal_embedding`
Dimensions	1408
Type	Dense
Distance metric	Cosine
Inference model	`vertex_multimodal_embedding`
Supported inputs	video, text, image

Transcription Embedding (v1)

Property	Value
Index name	`multimodal_extractor_v1_transcription_embedding`
Dimensions	1024
Type	Dense
Distance metric	Cosine
Inference model	`multilingual_e5_large_instruct_v1`
Supported inputs	text, string

Multimodal Embedding (v2)

Property	Value
Index name	`multimodal_extractor_v2_multimodal_embedding`
Dimensions	3072 (configurable: 1536, 768)
Type	Dense
Distance metric	Cosine
Inference model	`google/gemini-embedding-2`
Supported inputs	video, text, image, audio

Transcription Embedding (v2)

Property	Value
Index name	`multimodal_extractor_v2_transcription_embedding`
Dimensions	1024
Type	Dense
Distance metric	Cosine
Inference model	`intfloat/multilingual-e5-large-instruct`
Supported inputs	text, string
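
All four indexes use cosine distance. As a reminder of what that metric computes, here is a plain-Python sketch of cosine similarity (cosine distance is one minus this value). It is illustrative only; the vector store performs this comparison internally.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The dimension check reflects a practical constraint: a 1408-dimension v1 embedding cannot be compared with a 3072-dimension v2 embedding, so queries must target the index that matches the embedding that produced them.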

Choosing v1 vs v2

Consideration	v1	v2
Embedding quality	Good	Better (natively multimodal)
Dimensions	1408 (fixed)	3072, 1536, or 768 (configurable)
Storage per vector	5.5 KB	12 KB (3072D), 6 KB (1536D), 3 KB (768D)
Audio input support	Via transcription only	Native audio embedding
Matryoshka support	No	Yes — reduce dimensions without reindexing
Stability	Production-proven	Newer

Recommendation: Use v2 for new projects. Use v1 if you have existing collections and don’t need higher dimensions or native audio embedding.
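
The storage figures in the table follow from 4 bytes per dimension, and Matryoshka-style reduction generally means keeping the leading components of a vector. The sketch below shows both. It assumes float32 storage with no index overhead, and the re-normalization step after truncation is a common convention for Matryoshka embeddings rather than documented Mixpeek behavior; both function names are hypothetical.

```python
import math
from typing import List, Sequence

BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats


def storage_kib(dimensions: int) -> float:
    """Raw size of one vector in KiB, excluding any index overhead."""
    return dimensions * BYTES_PER_FLOAT32 / 1024


def truncate_embedding(vector: Sequence[float], dimensions: int) -> List[float]:
    """Keep the first `dimensions` components and rescale to unit length."""
    head = list(vector[:dimensions])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]
```

`storage_kib(1408)` gives 5.5 and `storage_kib(3072)` gives 12.0, matching the table; halving the dimensions halves the storage.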

Limitations

Video duration: Recommend < 2 hours for optimal processing
Resolution: 8K+ videos should be downsampled
Real-time: Not suitable for live streaming
Short videos: < 5 second videos have disproportionate overhead
Audio quality: Transcription accuracy depends on audio clarity
OCR/Description: Add significant processing time, enable only when needed

Collection-to-Collection Pipelines

The video_segment_url output enables decomposition chains:

Initial collection: Time-based segments (5s intervals)
Downstream collection: Scene detection within each segment
Final collection: Enhanced processing with different models

{
  "input_mappings": {
    "video": "video_segment_url"
  }
}
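
As a sketch of the chain described above, the snippet below builds the extractor configuration for an initial time-based collection and for a downstream scene-based collection that reads `video_segment_url`. It reuses only the keys shown in the configuration example; how the downstream collection is pointed at the initial one is not covered here, and the helper name and the `payload.video_url` source field are illustrative assumptions.

```python
import json


def extractor_config(video_field: str, parameters: dict, version: str = "v1") -> dict:
    """Build a feature_extractor block that maps `video_field` to the video input."""
    return {
        "feature_extractor": {
            "feature_extractor_name": "multimodal_extractor",
            "version": version,
            "input_mappings": {"video": video_field},
            "parameters": parameters,
        }
    }


# Initial collection: 5-second time-based segments from the original video.
initial = extractor_config(
    "payload.video_url", {"split_method": "time", "time_split_interval": 5}
)

# Downstream collection: scene detection within each segment file.
downstream = extractor_config(
    "video_segment_url", {"split_method": "scene", "scene_detection_threshold": 0.5}
)

if __name__ == "__main__":
    print(json.dumps([initial, downstream], indent=2))
```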

Related

Feature Extractors Overview
Gemini Multifile Extractor — Embed multiple files per object into one vector
Passthrough Extractor
Text Extractor
