[Figure: Multimodal extractor pipeline showing video splitting, parallel processing with Whisper and embedding models, and output features]
The multimodal extractor processes video, audio, image, text, and GIF content through a unified pipeline. Videos and audio are decomposed into segments with transcription (Whisper), visual embeddings, OCR, and descriptions. Images and text are embedded directly without decomposition. Two versions are available:
| Version | Embedding Model | Dimensions | Key Difference |
| --- | --- | --- | --- |
| v1 | Vertex Multimodal Embedding | 1408 | Established, lower dimensionality |
| v2 | Gemini Embedding 2 | 3072 (configurable: 1536, 768) | Higher dimensionality, Matryoshka support, native multimodal |
Both versions share the same pipeline (FFmpeg chunking, Whisper, thumbnails, Gemini vision) and differ only in the multimodal embedding step.
View extractor details at api.mixpeek.com/v1/collections/features/extractors/multimodal_extractor_v1 (v1) or api.mixpeek.com/v1/collections/features/extractors/multimodal_extractor_v2 (v2). You can also fetch them programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
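As a minimal sketch of the programmatic fetch, the snippet below builds the GET request with Python's standard library. The `https://` scheme, the bearer-token `Authorization` header, and the helper names are assumptions made for illustration; only the path comes from this page.

```python
import json
import urllib.request

# Assumption: the host named on this page is served over HTTPS.
BASE_URL = "https://api.mixpeek.com"


def extractor_details_request(feature_extractor_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for one extractor's details."""
    url = f"{BASE_URL}/v1/collections/features/extractors/{feature_extractor_id}"
    # Assumption: bearer-token auth. Check your account's actual auth scheme.
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_extractor_details(feature_extractor_id: str, api_key: str) -> dict:
    """Send the request and decode the JSON response body."""
    request = extractor_details_request(feature_extractor_id, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Building the request performs no network I/O, so this is safe to run offline.
req = extractor_details_request("multimodal_extractor_v2", "YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Call `fetch_extractor_details(...)` with a real key to perform the request.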

Pipeline Steps

  1. Filter Dataset (if collection_id provided)
    • Filter to specified collection
  2. Apply Input Mappings
  3. Detect Content Types (sample 100 rows)
    • Identify: video, audio, image, text, or mixed
  4. Content Routing
    • Video: FFmpeg chunking (time/scene/silence) → Steps 5-10
    • Audio: FFmpeg audio chunking (time/silence) → Steps 5-8
    • Image: Skip to Step 7 (embedded directly, no chunking)
    • Text: Skip to Step 7 (embedded directly, no chunking)
    • Mixed: Branch by type, process separately, union results
  5. Transcription (conditional: if run_transcription=true, video/audio only)
    • Whisper API or Local GPU speech-to-text
  6. Transcription Embeddings (conditional: if run_transcription_embedding=true)
    • E5-Large text embeddings (1024D) from transcribed audio
  7. Multimodal Embeddings (conditional: if run_multimodal_embedding=true)
    • v1: Vertex AI embeddings (1408D)
    • v2: Gemini Embedding 2 (3072D, configurable)
    • Unified embedding space enables cross-modal search
  8. Thumbnail Generation (conditional: if enable_thumbnails=true, visual content only)
    • 640px width at 85% quality, S3 upload with optional CDN
  9. Visual Analysis (conditional: if run_video_description OR run_ocr=true, visual content only)
    • Gemini-based descriptions and/or OCR text extraction
  10. Output
    • Segment/document records with embeddings, transcriptions, descriptions, OCR, thumbnails
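
The conditional steps above can be read as a small decision function. The sketch below is illustrative only and does not reflect the service's internals: the stage names, the set of "visual" content types, and the rule that transcription embeddings need a transcript are assumptions.

```python
# Assumption: which content types count as "visual content" and which carry speech.
VISUAL_TYPES = {"video", "image", "gif"}
SPOKEN_TYPES = {"video", "audio"}


def planned_stages(content_type: str, params: dict) -> list[str]:
    """Return the conditional stages that would run for one content type."""
    stages = []
    # run_transcription has a version-dependent default, so it must be passed explicitly here.
    transcribing = params.get("run_transcription", False) and content_type in SPOKEN_TYPES
    if transcribing:
        stages.append("transcription")
    # Assumption: transcription embeddings are only produced when a transcript exists.
    if transcribing and params.get("run_transcription_embedding", False):
        stages.append("transcription_embedding")
    if params.get("run_multimodal_embedding", True):
        stages.append("multimodal_embedding")
    visual = content_type in VISUAL_TYPES
    if visual and params.get("enable_thumbnails", True):
        stages.append("thumbnails")
    if visual and (params.get("run_video_description", False) or params.get("run_ocr", False)):
        stages.append("visual_analysis")
    stages.append("output")
    return stages


image_plan = planned_stages("image", {"run_transcription": True, "run_ocr": True})
audio_plan = planned_stages("audio", {"run_transcription": True, "run_transcription_embedding": True})
print(image_plan)
print(audio_plan)
```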

When to Use

| Use Case | Description |
| --- | --- |
| Video content libraries | Search and navigate video segments by content |
| Media platforms | Search across spoken and visual content |
| Educational content | Find moments in lectures and tutorials |
| Surveillance/security | Event detection in footage |
| Social media | Process user-generated video content |
| Broadcasting/streaming | Large video catalog management |
| Marketing analytics | Analyze video campaigns |
| Cross-modal search | Find videos/images using text queries |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Static image collections only | image_extractor |
| Audio-only content | audio_extractor |
| Very short videos (< 5 seconds) | Processing overhead not worth it |
| Real-time live streams | Specialized streaming extractors |
| 8K+ resolution video | Consider downsampling first |
| Embed all files in one object as one vector | gemini_multifile_extractor |

Supported Input Types

| Input | Type | Description | Processing |
| --- | --- | --- | --- |
| video | string | URL or S3 path | Decomposed into segments |
| image | string | URL or S3 path | Direct embedding (no decomposition) |
| text | string | Plain text content | Direct embedding |
| gif | string | URL or S3 path | Treated as video, frame-by-frame |
Supported formats:
  • Video: MP4, MOV, AVI, MKV, WebM, FLV
  • Image: JPG, PNG, WebP, BMP
  • GIF: Animated GIF

Input Schema

Provide one of the following inputs:
{
  "video": "s3://bucket/videos/lecture.mp4"
}
{
  "image": "https://cdn.example.com/products/laptop.jpg"
}
{
  "text": "High-performance laptop with M3 chip, perfect for developers"
}
| Field | Type | Description |
| --- | --- | --- |
| video | string | URL/S3 path to video file. Recommended: 720p-1080p, < 2 hours |
| image | string | URL/S3 path to image file. Recommended: < 10MB |
| text | string | Plain text for cross-modal embedding |
| gif | string | URL/S3 path to GIF file |
| custom_thumbnail | string | Optional custom thumbnail URL instead of auto-generated |

Output Schema

Each video segment produces one document. Images and text produce one document each without segmentation.

Segment & Timing Fields

| Field | Type | Description |
| --- | --- | --- |
| start_time | number | Segment start time in seconds |
| end_time | number | Segment end time in seconds |
| start_frame | integer | Start frame number (start_time × fps) |
| end_frame | integer | End frame number (end_time × fps) |
| fps | number | Frame rate of the preprocessed video used for chunking |
| source_fps | number | Original source video frame rate before any preprocessing (e.g. 29.97, 30, 23.976). Use this for precise frame-level calculations against the source video |
| duration | number | Total duration of the entire source video in seconds (not the segment duration) |

Content Fields

| Field | Type | Description |
| --- | --- | --- |
| transcription | string | Transcribed audio content (requires run_transcription) |
| description | string | AI-generated segment description (requires run_video_description) |
| ocr_text | string | Text extracted from video frames (requires run_ocr) |

URL Fields

| Field | Type | Description |
| --- | --- | --- |
| thumbnail_url | string | S3/CDN URL of the thumbnail image |
| source_video_url | string | URL of the original source video |
| video_segment_url | string | S3 URL of this specific segment file. Enables collection-to-collection decomposition |

Embedding Fields

| Field | Type | Description |
| --- | --- | --- |
| multimodal_extractor_v1_multimodal_embedding | float[1408] | Vertex AI multimodal embedding |
| multimodal_extractor_v1_transcription_embedding | float[1024] | E5-Large transcription embedding |

Example Output

{
  "start_time": 10.0,
  "end_time": 20.0,
  "start_frame": 20,
  "end_frame": 40,
  "fps": 2.0,
  "source_fps": 29.97,
  "duration": 120.5,
  "transcription": "Welcome to today's lecture on machine learning fundamentals...",
  "description": "Instructor standing at whiteboard, introducing ML concepts",
  "ocr_text": "Machine Learning 101",
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/thumb_1.jpg",
  "source_video_url": "s3://mixpeek-storage/ns_123/obj_456/original.mp4",
  "video_segment_url": "s3://mixpeek-storage/ns_123/obj_456/segments/segment_001.mp4",
  "multimodal_extractor_v1_multimodal_embedding": [0.023, -0.041, "...1408 floats"],
  "multimodal_extractor_v1_transcription_embedding": [0.018, -0.032, "...1024 floats"]
}
fps reflects the preprocessed video frame rate (e.g. 2.0 fps after downsampling). source_fps is the original video’s native frame rate (e.g. 29.97). Use source_fps when you need to map timestamps back to exact frame numbers in the original source file.
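
As a worked example of the frame arithmetic, the sketch below recomputes the frame numbers from the example output and maps the same timestamps onto the source video. Rounding to the nearest frame is an assumption; this page only states the products start_time × fps and end_time × fps.

```python
def to_frame(time_seconds: float, frame_rate: float) -> int:
    """Convert a timestamp to a frame index (assumption: round to the nearest frame)."""
    return round(time_seconds * frame_rate)


segment = {"start_time": 10.0, "end_time": 20.0, "fps": 2.0, "source_fps": 29.97}

# Frame numbers in the preprocessed (downsampled) video, as reported in the output.
start_frame = to_frame(segment["start_time"], segment["fps"])
end_frame = to_frame(segment["end_time"], segment["fps"])

# Frame numbers in the original source file, using the native frame rate.
source_start_frame = to_frame(segment["start_time"], segment["source_fps"])
source_end_frame = to_frame(segment["end_time"], segment["source_fps"])

print(start_frame, end_frame)                # 20 40
print(source_start_frame, source_end_frame)  # 300 599
```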

Parameters

Video Splitting

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_method | string | "time" | Primary video splitting strategy: time, scene, or silence |
| max_segment_duration | float | 30.0 | Maximum seconds per segment. Scene/silence segments longer than this are subdivided. Set to null to disable |

Split Methods

Fixed interval splitting - Splits video into segments of equal duration.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| time_split_interval | integer | 10 | Interval in seconds for each segment |
Characteristics:
  • Predictable segment count: video_duration / interval
  • Consistent chunk sizes for uniform processing
  • May cut mid-sentence or mid-scene
Best for: General purpose, consistent chunking, when you need predictable segment counts
{
  "split_method": "time",
  "time_split_interval": 10
}
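
To make the "predictable segment count" concrete, here is a small sketch that derives segment boundaries for time-based splitting. It assumes a trailing partial segment is kept as its own shorter segment, which this page does not state; the function name is illustrative.

```python
import math


def time_segments(video_duration: float, interval: float) -> list[tuple[float, float]]:
    """Return (start, end) boundaries in seconds for fixed-interval splitting."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    # Assumption: a final partial interval still yields one (shorter) segment.
    count = math.ceil(video_duration / interval)
    return [
        (i * interval, min((i + 1) * interval, video_duration))
        for i in range(count)
    ]


# The 120.5-second video from the example output, split with time_split_interval=10.
segments = time_segments(120.5, 10)
print(len(segments))   # 13
print(segments[0])     # (0, 10)
print(segments[-1])    # (120, 120.5)
```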

Split Methods Comparison

| Method | Segments/Min | Predictability | Best For |
| --- | --- | --- | --- |
| time | 60 / interval_sec | High | General purpose, batch processing |
| scene | Variable (2-20) | Low | Movies, ads, dynamic visual content |
| silence | Variable (5-30) | Medium | Lectures, podcasts, spoken content |

Feature Extraction Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run_transcription | boolean | true (v1) / false (v2) | Run Whisper transcription on audio |
| transcription_language | string | "en" | Language for transcription |
| run_transcription_embedding | boolean | true (v1) / false (v2) | Generate E5 embeddings for transcriptions |
| run_multimodal_embedding | boolean | true | Generate multimodal embeddings |
| run_video_description | boolean | false | Generate AI descriptions (adds 1-2s per segment) |
| run_ocr | boolean | false | Extract text from video frames |

Thumbnail Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_thumbnails | boolean | true | Generate thumbnail images |
| use_cdn | boolean | false | Use CloudFront CDN for thumbnails |
CDN benefits: Faster global delivery, permanent URLs, reduced bandwidth costs.

v2-Only Parameters

These parameters are only available on multimodal_extractor v2:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| output_dimensionality | integer | 3072 | Embedding dimensions. Gemini Embedding 2 supports Matryoshka reduction: 3072 (full), 1536, or 768 |
| task_type | string | "RETRIEVAL_DOCUMENT" | Embedding task hint: RETRIEVAL_DOCUMENT, RETRIEVAL_QUERY, SEMANTIC_SIMILARITY, CLASSIFICATION |
At query time, Mixpeek automatically uses RETRIEVAL_QUERY — you only need to set task_type at index time. The default RETRIEVAL_DOCUMENT is correct for most use cases.
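
For intuition about Matryoshka reduction, the sketch below shows the general technique: keep the leading dimensions of a full-size vector and re-normalize to unit length. This illustrates the concept only. Whether the service reduces vectors this way is not stated here, and setting output_dimensionality remains the documented way to request smaller embeddings. The function name is hypothetical.

```python
import math

# Sizes listed for output_dimensionality on this page.
ALLOWED_DIMS = (3072, 1536, 768)


def truncate_embedding(vector: list[float], output_dimensionality: int) -> list[float]:
    """Keep the first N dimensions and L2-normalize the result."""
    if output_dimensionality not in ALLOWED_DIMS:
        raise ValueError(f"output_dimensionality must be one of {ALLOWED_DIMS}")
    if len(vector) < output_dimensionality:
        raise ValueError("vector is shorter than the requested dimensionality")
    head = vector[:output_dimensionality]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]


# Stand-in for a real 3072-dimensional embedding (synthetic values).
full = [math.sin(i) for i in range(3072)]
small = truncate_embedding(full, 768)
print(len(small))
```

Re-normalizing matters when the index uses cosine or dot-product distance, because a truncated vector no longer has unit length.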

Embedding Task

When run_transcription_embedding is enabled, the E5 model generates text embeddings from transcribed audio. By default, these use retrieval_document for asymmetric search.
Set embedding_task at the collection level, not on the extractor. See Collection Embedding Task for full details and examples.
This only affects the E5 transcription embeddings. Vertex AI multimodal embeddings (v1) and Gemini Embedding 2 (v2) are not instruction-aware and ignore this parameter.

Description Generation Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| description_prompt | string | "Describe the video segment in detail." | Prompt for Gemini |
| generation_config.temperature | float | 0.7 | Randomness (higher = more creative) |
| generation_config.max_output_tokens | integer | 1024 | Maximum description length |
| generation_config.top_p | float | 0.8 | Nucleus sampling |

LLM Structured Extraction

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| response_shape | string or object | null | Custom structured output schema |
Natural Language Mode:
{
  "response_shape": "Extract product names, colors, materials, and aesthetic style labels from this fashion segment"
}
JSON Schema Mode:
{
  "response_shape": {
    "type": "object",
    "properties": {
      "products": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "category": { "type": "string" },
            "visibility_percentage": { "type": "integer", "minimum": 0, "maximum": 100 }
          }
        }
      },
      "aesthetic": { "type": "string" }
    }
  }
}

Configuration Examples

{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": {
      "video": "payload.video_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.video_id" }
    ],
    "parameters": {
      "split_method": "time",
      "time_split_interval": 10,
      "run_transcription": true,
      "run_multimodal_embedding": true,
      "enable_thumbnails": true
    }
  }
}
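
If you generate configurations in code, a small builder can catch invalid combinations before they reach the API. The sketch below is a client-side convenience only: the helper name and its checks are assumptions, while the field names, split methods, and v2 dimension options come from this page.

```python
import json

SPLIT_METHODS = {"time", "scene", "silence"}
V2_DIMENSIONS = {3072, 1536, 768}


def build_extractor_config(version: str, video_field: str,
                           split_method: str = "time", **parameters) -> dict:
    """Assemble a feature_extractor config and validate a few documented constraints."""
    if version not in {"v1", "v2"}:
        raise ValueError("version must be 'v1' or 'v2'")
    if split_method not in SPLIT_METHODS:
        raise ValueError(f"split_method must be one of {sorted(SPLIT_METHODS)}")
    dims = parameters.get("output_dimensionality")
    if dims is not None:
        if version != "v2":
            raise ValueError("output_dimensionality is a v2-only parameter")
        if dims not in V2_DIMENSIONS:
            raise ValueError(f"output_dimensionality must be one of {sorted(V2_DIMENSIONS)}")
    return {
        "feature_extractor": {
            "feature_extractor_name": "multimodal_extractor",
            "version": version,
            "input_mappings": {"video": video_field},
            "parameters": {"split_method": split_method, **parameters},
        }
    }


config = build_extractor_config(
    "v2", "payload.video_url",
    split_method="silence", run_transcription=True, output_dimensionality=1536,
)
print(json.dumps(config, indent=2))
```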

Performance & Costs

Processing Speed

| Content Type | Speed |
| --- | --- |
| Video | 0.5-2x realtime (depends on features enabled) |
| Image | < 1 second |
| Text | < 100ms |
Example: 10-minute video → 5-20 minutes processing time
| Feature | Latency per Segment |
| --- | --- |
| Transcription | ~200ms per second of audio |
| Visual embedding | ~50ms |
| OCR | ~300ms |
| Description | ~2s |
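
The figures above translate into rough planning estimates. The sketch below reproduces the 10-minute example from the 0.5-2x realtime range and sums the per-feature latencies for one segment. Treat the per-segment sum as an upper bound: it assumes the features run back to back, whereas the pipeline may run some of them in parallel. The function names are illustrative.

```python
def processing_time_range(video_minutes: float,
                          slowest_speed: float = 0.5,
                          fastest_speed: float = 2.0) -> tuple[float, float]:
    """Return (best case, worst case) processing time in minutes."""
    return (video_minutes / fastest_speed, video_minutes / slowest_speed)


def segment_latency_upper_bound(segment_seconds: float, transcription: bool = True,
                                ocr: bool = False, description: bool = False) -> float:
    """Sum approximate per-feature latencies in seconds, assuming sequential execution."""
    total = 0.05  # visual embedding, ~50ms
    if transcription:
        total += 0.2 * segment_seconds  # ~200ms per second of audio
    if ocr:
        total += 0.3  # ~300ms
    if description:
        total += 2.0  # ~2s
    return total


best, worst = processing_time_range(10)
print(best, worst)  # 5.0 20.0
full_latency = segment_latency_upper_bound(10, ocr=True, description=True)
print(round(full_latency, 2))
```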

Cost Estimates (per minute of video)

| Configuration | Cost |
| --- | --- |
| Minimal (transcription + embeddings) | $0.01 |
| Standard (+ OCR) | $0.05 |
| Full (+ descriptions) | $0.15 |
Images: $0.001 per image. Text: $0.0001 per query.
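
For budgeting, the per-unit estimates can be combined into a simple calculator. This is a sketch built on the estimates listed in this section, not a billing formula; the function and constant names are illustrative.

```python
# Estimated USD cost per minute of video, by configuration.
COST_PER_VIDEO_MINUTE = {"minimal": 0.01, "standard": 0.05, "full": 0.15}
COST_PER_IMAGE = 0.001
COST_PER_TEXT_QUERY = 0.0001


def estimate_cost(video_minutes: float = 0.0, images: int = 0, text_queries: int = 0,
                  configuration: str = "standard") -> float:
    """Estimate total cost in USD from the per-unit figures above."""
    if configuration not in COST_PER_VIDEO_MINUTE:
        raise ValueError(f"configuration must be one of {sorted(COST_PER_VIDEO_MINUTE)}")
    return (
        video_minutes * COST_PER_VIDEO_MINUTE[configuration]
        + images * COST_PER_IMAGE
        + text_queries * COST_PER_TEXT_QUERY
    )


# Example: 100 ten-minute videos plus 5,000 images on the standard configuration.
total = estimate_cost(video_minutes=1000, images=5000, configuration="standard")
print(f"${total:.2f}")
```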

Vector Indexes

Multimodal Embedding

| Property | Value |
| --- | --- |
| Index name | multimodal_extractor_v1_multimodal_embedding |
| Dimensions | 1408 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | vertex_multimodal_embedding |
| Supported inputs | video, text, image |

Transcription Embedding

| Property | Value |
| --- | --- |
| Index name | multimodal_extractor_v1_transcription_embedding |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | multilingual_e5_large_instruct_v1 |
| Supported inputs | text, string |

Choosing v1 vs v2

| Consideration | v1 | v2 |
| --- | --- | --- |
| Embedding quality | Good | Better (natively multimodal) |
| Dimensions | 1408 (fixed) | 3072, 1536, or 768 (configurable) |
| Storage per vector | 5.5 KB | 12 KB (3072D), 6 KB (1536D), 3 KB (768D) |
| Audio input support | Via transcription only | Native audio embedding |
| Matryoshka support | No | Yes: reduce dimensions without reindexing |
| Stability | Production-proven | Newer |
Recommendation: Use v2 for new projects. Use v1 if you have existing collections and don’t need higher dimensions or native audio embedding.
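
The storage figures in the comparison are consistent with 4 bytes per dimension. The sketch below reproduces them and scales to a whole index, assuming float32 vectors and ignoring index overhead and payload fields (both assumptions).

```python
BYTES_PER_DIMENSION = 4  # assumption: vectors are stored as float32


def vector_storage_kb(dimensions: int) -> float:
    """Raw storage for one vector, in KB (1 KB = 1024 bytes)."""
    return dimensions * BYTES_PER_DIMENSION / 1024


def index_storage_gb(dimensions: int, vector_count: int) -> float:
    """Raw vector storage for a whole index, in GB, excluding index overhead."""
    return dimensions * BYTES_PER_DIMENSION * vector_count / 1024 ** 3


for dims in (1408, 3072, 1536, 768):
    print(dims, vector_storage_kb(dims))

# One million segments at full v2 dimensionality versus 768 dimensions.
print(round(index_storage_gb(3072, 1_000_000), 2))
print(round(index_storage_gb(768, 1_000_000), 2))
```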

Limitations

  • Video duration: Recommend < 2 hours for optimal processing
  • Resolution: 8K+ videos should be downsampled
  • Real-time: Not suitable for live streaming
  • Short videos: < 5 second videos have disproportionate overhead
  • Audio quality: Transcription accuracy depends on audio clarity
  • OCR/Description: Add significant processing time, enable only when needed

Collection-to-Collection Pipelines

The video_segment_url output enables decomposition chains:
  1. Initial collection: Time-based segments (5s intervals)
  2. Downstream collection: Scene detection within each segment
  3. Final collection: Enhanced processing with different models
{
  "input_mappings": {
    "video": "video_segment_url"
  }
}