Ingest content from the web by providing specific URLs or crawling entire sites.

Single URL

Ingest a specific web page by providing its URL and setting maxDepth to 1, which restricts the crawl to the initial page.
import { Agentset } from "agentset";

const agentset = new Agentset({
  apiKey: process.env.AGENTSET_API_KEY,
});

const ns = agentset.namespace("YOUR_NAMESPACE_ID");

const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://agentset.ai/blog/intro-to-rag",
    maxDepth: 1, // depth 1 crawls only this page; links are not followed
  },
});

console.log(`Crawl started: ${job.id}`);

Crawling

Crawl a website to ingest multiple pages automatically. Agentset follows links from a starting URL and processes each page it discovers.

Basic crawl

Provide a starting URL to crawl a website.
import { Agentset } from "agentset";

const agentset = new Agentset({
  apiKey: process.env.AGENTSET_API_KEY,
});

const ns = agentset.namespace("YOUR_NAMESPACE_ID");

const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://docs.agentset.ai",
  },
});

console.log(`Crawl started: ${job.id}`);

Crawl options

Control how the crawler navigates your site with the following payload options:

maxDepth (number, default: 5): How many links deep to follow from the starting URL. Depth 1 crawls only the initial page.
limit (number, default: 50): Maximum number of pages to crawl.
includePaths (string[]): Only crawl URLs matching these path prefixes.
excludePaths (string[]): Skip URLs matching these path prefixes.
headers (object): Custom HTTP headers to send with requests.
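
These options can be combined in a single request. For example, a scoped crawl of one docs section might look like this (the depth, limit, and path values are illustrative):

const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://docs.agentset.ai",
    maxDepth: 2, // follow links up to two levels from the start URL
    limit: 25, // stop after 25 pages
    includePaths: ["/guides"], // stay inside the guides section
  },
});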

Limiting depth and pages

Set maxDepth and limit to control the scope of your crawl.
const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://docs.agentset.ai",
    maxDepth: 3,
    limit: 100,
  },
});

Filtering paths

Use includePaths to crawl only specific sections, or excludePaths to skip certain areas.
const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://docs.agentset.ai",
    includePaths: ["/guides", "/api-reference"],
    excludePaths: ["/blog", "/changelog"],
  },
});

Authenticated crawling

Pass custom headers to crawl pages that require authentication.
const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://internal.agentset.ai",
    headers: {
      Authorization: "Bearer your-token",
    },
  },
});
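
In practice, avoid hardcoding tokens in source. Here is the same request with the credential read from an environment variable (CRAWL_AUTH_TOKEN is an illustrative name):

const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://internal.agentset.ai",
    headers: {
      // Token supplied via the environment rather than committed to source
      Authorization: `Bearer ${process.env.CRAWL_AUTH_TOKEN}`,
    },
  },
});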

With metadata

Attach metadata to ingested pages for filtering during search.
const job = await ns.ingestion.create({
  payload: {
    type: "CRAWL",
    url: "https://docs.agentset.ai",
  },
  config: {
    metadata: {
      source: "documentation",
      domain: "agentset.ai",
    },
  },
});
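
Once ingested, that metadata can scope retrieval. A minimal sketch, assuming the namespace search method accepts a metadata filter; check the search API reference for the exact parameter shape:

// Hypothetical filter shape; consult the search API reference
const results = await ns.search("How do I configure crawling?", {
  filter: {
    source: "documentation",
  },
});
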
URL ingestion and crawls are processed asynchronously. Learn how to check upload status.
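
For example, you could poll the job until it completes. The get() lookup and the "PROCESSING" status value below are assumptions, so confirm both against the SDK reference:

// Assumed lookup method and status value; verify in the SDK docs
let result = await ns.ingestion.get(job.id);
while (result.status === "PROCESSING") {
  // Wait five seconds between checks
  await new Promise((resolve) => setTimeout(resolve, 5000));
  result = await ns.ingestion.get(job.id);
}
console.log(`Ingestion finished with status: ${result.status}`);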
