Images
Agentset supports images in two ways:

| Method | Description |
|---|---|
| Images in documents | When you upload PDFs, Word docs, or presentations containing images, Agentset automatically extracts and processes them. |
| Standalone images | Upload image files directly (.png, .jpg, .jpeg, .webp, .gif, .tiff) for processing. |
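
Before uploading standalone images, it can help to filter local files against the extensions listed above. The following is a minimal client-side sketch, not part of Agentset itself: the helper names `is_supported_image` and `split_by_support` are hypothetical, and matching extensions case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extensions listed in the table above for standalone image uploads.
SUPPORTED_IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".gif", ".tiff"}


def is_supported_image(filename: str) -> bool:
    """Return True if the file's extension is one of the listed image types.

    Assumption: extensions are compared case-insensitively.
    """
    return Path(filename).suffix.lower() in SUPPORTED_IMAGE_EXTENSIONS


def split_by_support(filenames: list[str]) -> tuple[list[str], list[str]]:
    """Partition filenames into (supported, unsupported), preserving order."""
    supported = [name for name in filenames if is_supported_image(name)]
    unsupported = [name for name in filenames if not is_supported_image(name)]
    return supported, unsupported
```

Calling `split_by_support(["chart.png", "scan.bmp"])` would place `chart.png` in the first list and `scan.bmp` in the second, so unsupported files can be reported before any upload is attempted.
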
How image processing works
During generation, images are preserved and returned with their context, allowing your LLM to reference the original visuals when answering questions. For example, if your document contains this image:
Native image embedding
For use cases requiring direct visual similarity search, Agentset supports multimodal embedding models that encode images natively rather than converting them to text descriptions. This is useful for product catalogs, visual search, and design asset retrieval. Contact us for access to native image understanding.
Audio and video
YouTube
Ingest YouTube videos, playlists, and channels by providing their URLs. Agentset extracts transcripts and metadata, making video content searchable.
Multiple videos
Pass multiple URLs to ingest several videos, playlists, or channels in a single request.
YouTube options
Configure transcript language and metadata extraction.

| Option | Type | Default | Description |
|---|---|---|---|
| transcriptLanguages | string[] | ["en"] | Preferred transcript languages. Agentset fetches the first available transcript matching these language codes. |
| includeMetadata | boolean | false | Include video metadata (description, tags, category, duration) in the ingestion. |
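
As a rough illustration of how several URLs and the options above might be combined into one request body, here is a sketch that only builds a dictionary and sends nothing. The function name `build_youtube_ingest_payload` and the top-level keys `type`, `urls`, and `options` are assumptions made for this example; only `transcriptLanguages`, `includeMetadata`, and their defaults come from the table. Consult the API Reference for the actual request shape.

```python
import json
from typing import Any, Optional, Sequence


def build_youtube_ingest_payload(
    urls: Sequence[str],
    transcript_languages: Optional[Sequence[str]] = None,
    include_metadata: bool = False,
) -> dict[str, Any]:
    """Assemble a request body for ingesting one or more YouTube URLs.

    The top-level keys ("type", "urls", "options") are hypothetical; only
    the two option names and their defaults come from the options table.
    """
    if not urls:
        raise ValueError("at least one YouTube URL is required")
    languages = list(transcript_languages) if transcript_languages is not None else ["en"]
    return {
        "type": "YOUTUBE",  # hypothetical discriminator, not from the docs
        "urls": list(urls),  # videos, playlists, or channels
        "options": {
            "transcriptLanguages": languages,  # documented default: ["en"]
            "includeMetadata": include_metadata,  # documented default: false
        },
    }


if __name__ == "__main__":
    payload = build_youtube_ingest_payload(
        [
            "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder ID
            "https://www.youtube.com/playlist?list=PLAYLIST_ID",  # placeholder ID
        ],
        transcript_languages=["es", "en"],
        include_metadata=True,
    )
    print(json.dumps(payload, indent=2))
```

Listing `"es"` before `"en"` expresses a preference for a Spanish transcript with English as the fallback, in line with the "first available transcript" behavior described in the table.
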
YouTube ingestion is processed asynchronously. Learn how to check upload status.
Other video and audio formats
Contact us for early access to additional video and audio formats.
Next steps
- API Reference — Multimodal ingestion parameters and options
- Document Metadata — Attach metadata for filtering
- Search — Query your multimodal content