Ingest content from the web by providing specific URLs or crawling entire sites.

Single URL

Ingest a specific web page by providing its URL and setting maxDepth to 1, which restricts the crawl to the initial page.
import { Agentset } from "agentset";

const agentset = new Agentset({
  apiKey: process.env.AGENTSET_API_KEY,
});

const ns = agentset.namespace("YOUR_NAMESPACE_ID");

const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://agentset.ai/blog/intro-to-rag",
    maxDepth: 1, // depth 1 crawls only this page; links are not followed
  },
});

console.log(`Crawl started: ${job.id}`);

Crawling

Crawl a website to ingest multiple pages automatically. Agentset follows links from a starting URL and processes each page it discovers.

Basic crawl

Provide a starting URL to crawl a website.
import { Agentset } from "agentset";

const agentset = new Agentset({
  apiKey: process.env.AGENTSET_API_KEY,
});

const ns = agentset.namespace("YOUR_NAMESPACE_ID");

const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://docs.agentset.ai",
  },
});

console.log(`Crawl started: ${job.id}`);

Crawl options

Control how the crawler navigates your site with the following payload options:

maxDepth (number, default: 5): How many links deep to follow from the starting URL. Depth 1 crawls only the initial page.
limit (number, default: 50): Maximum number of pages to crawl.
includePaths (string[]): Only crawl URLs matching these path prefixes.
excludePaths (string[]): Skip URLs matching these path prefixes.
headers (object): Custom HTTP headers to send with requests.
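
These options can be combined in a single request. For example, a scoped crawl of one docs section might look like this (the depth, limit, and path values are illustrative):

const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://docs.agentset.ai",
    maxDepth: 2, // follow links up to two levels from the start URL
    limit: 25, // stop after 25 pages
    includePaths: ["/guides"], // stay inside the guides section
  },
});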

Limiting depth and pages

Set maxDepth and limit to control the scope of your crawl.
const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://docs.agentset.ai",
    maxDepth: 3,
    limit: 100,
  },
});

Filtering paths

Use includePaths to crawl only specific sections, or excludePaths to skip certain areas.
const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://docs.agentset.ai",
    includePaths: ["/guides", "/api-reference"],
    excludePaths: ["/blog", "/changelog"],
  },
});

Authenticated crawling

Pass custom headers to crawl pages that require authentication.
const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://internal.agentset.ai",
    headers: {
      Authorization: "Bearer your-token",
    },
  },
});
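
In practice, avoid hardcoding tokens in source. Here is the same request with the credential read from an environment variable (CRAWL_AUTH_TOKEN is an illustrative name):

const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://internal.agentset.ai",
    headers: {
      // Token supplied via the environment rather than committed to source
      Authorization: `Bearer ${process.env.CRAWL_AUTH_TOKEN}`,
    },
  },
});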

With metadata

Attach metadata to ingested pages for filtering during search.
const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://docs.agentset.ai",
  },
  config: {
    metadata: {
      source: "documentation",
      domain: "agentset.ai",
    },
  },
});
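
Once ingested, that metadata can scope retrieval. A minimal sketch, assuming the namespace search method accepts a metadata filter; check the search API reference for the exact parameter shape:

// Hypothetical filter shape; consult the search API reference
const results = await ns.search("How do I configure crawling?", {
  filter: {
    source: "documentation",
  },
});
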
URL ingestion and crawls are processed asynchronously. Learn how to check upload status.
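
For example, you could poll the job until it completes. The get() lookup and the "PROCESSING" status value below are assumptions, so confirm both against the SDK reference:

// Assumed lookup method and status value; verify in the SDK docs
let result = await ns.ingestion.get(job.id);
while (result.status === "PROCESSING") {
  // Wait five seconds between checks
  await new Promise((resolve) => setTimeout(resolve, 5000));
  result = await ns.ingestion.get(job.id);
}
console.log(`Ingestion finished with status: ${result.status}`);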
