URL Discovery

Last updated Jul 22, 2025 by Harlan Wilton in doc: clean up.

Unlighthouse discovers the URLs to scan using several methods:

The site specified with --site or in the config
URLs provided manually via the --urls flag or urls in your configuration
robotsTxt - Reads robots.txt, if it exists. Provides sitemap URLs and disallowed paths.
sitemap - Reads sitemap.xml, if it exists
crawler - Inspects internal links on scanned pages
Static route definitions, when provided

Robots.txt

When a robots.txt is found, Unlighthouse will attempt to read the sitemap URLs and disallowed paths from it.
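As a rough illustration of what reading a robots.txt involves, the sketch below pulls Sitemap and Disallow entries out of the file contents. It is not Unlighthouse's actual parser: parseRobotsTxt and RobotsInfo are hypothetical names, and the sketch ignores User-agent groups for brevity.

```typescript
// Illustrative sketch only; not Unlighthouse's implementation.
interface RobotsInfo {
  sitemaps: string[]
  disallows: string[]
}

// Hypothetical helper: collects Sitemap and Disallow values from robots.txt text.
// Simplification: rules from every User-agent group are merged together.
function parseRobotsTxt(content: string): RobotsInfo {
  const sitemaps: string[] = []
  const disallows: string[] = []
  for (const rawLine of content.split(/\r?\n/)) {
    // Drop trailing comments and surrounding whitespace.
    const line = (rawLine.split('#')[0] ?? '').trim()
    const sep = line.indexOf(':')
    if (sep === -1)
      continue
    const field = line.slice(0, sep).trim().toLowerCase()
    const value = line.slice(sep + 1).trim()
    // An empty value (for example "Disallow:") carries no path, so skip it.
    if (!value)
      continue
    if (field === 'sitemap')
      sitemaps.push(value)
    else if (field === 'disallow')
      disallows.push(value)
  }
  return { sitemaps, disallows }
}
```

Only the first colon on a line separates the field from its value, so absolute sitemap URLs such as https://example.com/sitemap.xml are kept intact.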

Disabling robots

You may not always want to respect robots.txt, for example when you want to scan URLs that are disallowed.

import { defineUnlighthouseConfig } from 'unlighthouse/config'

export default defineUnlighthouseConfig({
  scanner: {
    // disable robots.txt scanning
    robotsTxt: false,
  },
})

Sitemap.xml

By default, sitemap locations are read from your /robots.txt. Otherwise, Unlighthouse falls back to /sitemap.xml.

Note: When a sitemap contains more than 50 paths, the crawler is disabled.
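The two rules above can be sketched as follows. This is an illustration rather than Unlighthouse's implementation: resolveSitemapUrls, shouldCrawl, and CRAWLER_CUTOFF are hypothetical names, and treating exactly 50 paths as still crawlable is an assumption based on the wording of the note.

```typescript
// Illustrative sketch only; names and structure are assumptions.
const CRAWLER_CUTOFF = 50

// Prefer sitemap URLs declared in robots.txt, else fall back to /sitemap.xml.
function resolveSitemapUrls(site: string, robotsSitemaps: string[]): string[] {
  if (robotsSitemaps.length > 0)
    return robotsSitemaps
  return [new URL('/sitemap.xml', site).toString()]
}

// The crawler stays enabled only while the sitemap has at most 50 paths.
function shouldCrawl(sitemapPaths: string[]): boolean {
  return sitemapPaths.length <= CRAWLER_CUTOFF
}
```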

Manual sitemap paths

You may provide an array of sitemap paths to scan.

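import { defineUnlighthouseConfig } from 'unlighthouse/config'

export default defineUnlighthouseConfig({
  scanner: {
    // scan these sitemaps instead of the auto-discovered one
    sitemap: [
      '/sitemap.xml',
      '/sitemap2.xml',
    ],
  },
})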

Disabling sitemap scanning

If you know your site doesn't have a sitemap, it may make sense to disable sitemap scanning.

import { defineUnlighthouseConfig } from 'unlighthouse/config'

export default defineUnlighthouseConfig({
  scanner: {
    // disable sitemap scanning
    sitemap: false,
  },
})

Crawler

When enabled, the crawler will inspect the HTML payload of a page and extract internal links. These internal links will be queued up and scanned if they haven't already been scanned.
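The idea can be sketched with two small helpers: one that extracts same-origin links from an HTML string, and one that queues only paths not seen before. This is a simplified illustration, not Unlighthouse's crawler. extractInternalPaths and enqueueNew are hypothetical names, and a regular expression stands in for a real HTML parser.

```typescript
// Illustrative sketch only; a real crawler would use a proper HTML parser.
function extractInternalPaths(html: string, pageUrl: string): string[] {
  const origin = new URL(pageUrl).origin
  const paths = new Set<string>()
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi
  let match: RegExpExecArray | null
  while ((match = hrefPattern.exec(html)) !== null) {
    const href = match[1]
    if (!href)
      continue
    let resolved: URL
    try {
      // Resolve relative links against the page they were found on.
      resolved = new URL(href, pageUrl)
    }
    catch {
      continue // skip hrefs that cannot be parsed as URLs
    }
    // Keep only links that stay on the same origin.
    if (resolved.origin === origin)
      paths.add(resolved.pathname)
  }
  return Array.from(paths)
}

// Queue each discovered path once; paths already seen are ignored.
function enqueueNew(found: string[], seen: Set<string>, queue: string[]): void {
  for (const path of found) {
    if (seen.has(path))
      continue
    seen.add(path)
    queue.push(path)
  }
}
```

Tracking seen paths in a Set is what keeps pages that link to each other from being scanned repeatedly.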

Disable crawling

If you have many pages with many internal links, it may be a good idea to disable crawling.

import { defineUnlighthouseConfig } from 'unlighthouse/config'

export default defineUnlighthouseConfig({
  scanner: {
    // disable internal link crawling
    crawler: false,
  },
})

Manually Providing URLs

While not recommended for most use cases, you may provide relative URLs within your configuration file, or use the --urls flag.

This will disable the crawler and sitemap scanning.

URLs can be provided statically.

import { defineUnlighthouseConfig } from 'unlighthouse/config'

export default defineUnlighthouseConfig({
  // relative URLs to scan
  urls: [
    '/about',
    '/other-page',
  ],
})

Or you can provide a function, which may return a promise.

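import { defineUnlighthouseConfig } from 'unlighthouse/config'

export default defineUnlighthouseConfig({
  // getUrls is your own helper that resolves to an array of relative URLs
  urls: async () => await getUrls(),
})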
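Unlighthouse does not provide getUrls; it stands for whatever helper you write yourself. A minimal sketch of such a helper, assuming your page slugs come from some asynchronous source (faked here with a resolved promise), might look like this. toRelativePath is likewise a hypothetical name.

```typescript
// Hypothetical helpers for illustration; not part of Unlighthouse.

// Normalise a slug into a relative path with a single leading slash.
function toRelativePath(slug: string): string {
  return '/' + slug.replace(/^\/+/, '')
}

// Stand-in for a CMS or database lookup that yields page slugs.
async function getUrls(): Promise<string[]> {
  const slugs = await Promise.resolve(['about', '/other-page'])
  return slugs.map(toRelativePath)
}
```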

On the command line, specify relative URLs as a comma-separated list.

unlighthouse --site https://example.com --urls /about,/other-page
