Building a Custom Crawler

Identity Atlas crawlers are self-contained folders. Drop one into tools/crawlers/<type>/ and the system picks it up automatically — no changes to the dispatcher, module loader, or CI pipelines needed.

For how the system works internally, see docs/architecture/crawler-architecture.md.


Folder Structure

tools/crawlers/my-source/
├── crawler.json               ← required manifest
└── Start-MySourceCrawler.ps1  ← entry point declared in manifest

Restart the worker container after adding the folder. Once it is back up, the new crawler appears in the UI under Admin → Crawlers → Add Crawler.


The Manifest (crawler.json)

{
  "type": "my-source",
  "displayName": "My Source System",
  "entryPoint": "Start-MySourceCrawler.ps1",
  "dependsOn": [],
  "postSyncHooks": ["buildContexts"],
  "configSchema": {
    "type": "object",
    "required": ["apiUrl", "apiKey"],
    "properties": {
      "apiUrl": { "type": "string", "minLength": 1, "description": "Base URL of the source API" },
      "apiKey": { "type": "string", "description": "API key for authentication" }
    }
  }
}

Field          Required  Description
type                     Unique key; becomes the jobType in CrawlerJobs. Use the folder name.
displayName              Name shown in the UI.
entryPoint               Entry point filename, relative to the crawler folder.
dependsOn                Other crawler types whose library files are dot-sourced before this one runs.
configSchema             JSON Schema object. The UI renders a form from it; the API validates configs against it before queueing.
postSyncHooks            "buildContexts" derives org-unit contexts after a sync. Most user-syncing crawlers should include it. The historical "accountCorrelation" hook is now a no-op (account-to-identity matching moved to the scheduler-driven Account Linking engine), so new crawlers should omit it.
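
Two of the conventions in the table can be checked before restarting the worker: that type matches the folder name, and that entryPoint resolves to a file inside the crawler folder. The sketch below is one way to do that. Test-CrawlerManifest is a hypothetical helper, not part of Identity Atlas, and it checks only those two conventions.

```powershell
# Hypothetical helper (not part of Identity Atlas): checks a crawler folder
# against two conventions from the manifest table and returns a list of problems.
function Test-CrawlerManifest {
    param([Parameter(Mandatory)] [string]$CrawlerFolder)

    $problems = [System.Collections.Generic.List[string]]::new()
    $manifestPath = Join-Path $CrawlerFolder 'crawler.json'

    if (-not (Test-Path -LiteralPath $manifestPath -PathType Leaf)) {
        $problems.Add('crawler.json not found')
        return ,$problems.ToArray()
    }

    $manifest = Get-Content -LiteralPath $manifestPath -Raw | ConvertFrom-Json

    # "type" should match the folder name.
    $folderName = Split-Path -Path $CrawlerFolder -Leaf
    if ($manifest.type -ne $folderName) {
        $problems.Add("type '$($manifest.type)' does not match folder name '$folderName'")
    }

    # "entryPoint" is resolved relative to the crawler folder.
    $entryPath = if ($manifest.entryPoint) { Join-Path $CrawlerFolder $manifest.entryPoint }
    if (-not $entryPath -or -not (Test-Path -LiteralPath $entryPath -PathType Leaf)) {
        $problems.Add("entryPoint '$($manifest.entryPoint)' not found in crawler folder")
    }

    # The unary comma keeps the result an array even when it is empty.
    return ,$problems.ToArray()
}
```

Calling Test-CrawlerManifest -CrawlerFolder ./tools/crawlers/my-source returns an empty array when both checks pass, and one message per failed check otherwise.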

The Entry Point Interface

Every entry point must accept exactly these four parameters:

[CmdletBinding()]
Param(
    [Parameter(Mandatory)] [string]$ApiBaseUrl,   # Identity Atlas API root, e.g. http://web:3001/api
    [Parameter(Mandatory)] [string]$ApiKey,       # Built-in crawler API key (fgc_...)
    [Parameter(Mandatory)] [int]$JobId,           # CrawlerJobs.id for live progress reporting
    [Parameter(Mandatory)] [string]$ConfigPath    # Path to temp JSON file written by the dispatcher
)

The dispatcher writes the operator-supplied config to a temp JSON file and passes the path. Read it at the top:

$Cfg = Get-Content $ConfigPath -Raw | ConvertFrom-Json

The temp file is deleted after the entry point exits, whether it succeeds or fails.
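
To run an entry point by hand, the hand-off described above can be approximated as follows. This is a sketch and not the dispatcher's actual code: Invoke-CrawlerLocally is a hypothetical helper, the temp file name is arbitrary, and the API URL, key, and job id you pass must be valid for your environment.

```powershell
# Hypothetical helper (not part of Identity Atlas): approximates the dispatcher's
# hand-off so an entry point can be run by hand.
function Invoke-CrawlerLocally {
    param(
        [Parameter(Mandatory)] [string]$EntryPoint,   # path to the entry point script
        [Parameter(Mandatory)] [hashtable]$Config,    # values an operator would supply
        [Parameter(Mandatory)] [string]$ApiBaseUrl,
        [Parameter(Mandatory)] [string]$ApiKey,
        [Parameter(Mandatory)] [int]$JobId
    )

    # Write the config to a temp JSON file and pass its path to the entry point.
    $fileName = "crawler-config-$([System.Guid]::NewGuid()).json"
    $configPath = Join-Path ([System.IO.Path]::GetTempPath()) $fileName
    $Config | ConvertTo-Json -Depth 10 | Set-Content -LiteralPath $configPath -Encoding utf8

    try {
        & $EntryPoint -ApiBaseUrl $ApiBaseUrl -ApiKey $ApiKey -JobId $JobId -ConfigPath $configPath
    }
    finally {
        # Delete the temp file whether the entry point succeeded or failed.
        Remove-Item -LiteralPath $configPath -ErrorAction SilentlyContinue
    }
}
```

The try/finally mirrors the cleanup guarantee stated above: the file is removed even when the entry point throws.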

Reserved config keys (injected by the dispatcher)

Key        Values            Description
_syncMode  "full" | "delta"  Sync mode selected by the operator. Honour it where practical.
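
A crawler can branch on _syncMode with a small helper like the one sketched here. Resolve-SyncMode is a hypothetical name, and falling back to "full" when the key is missing or unrecognised is an assumption, not documented dispatcher behaviour.

```powershell
# Hypothetical helper: normalises the reserved _syncMode key.
# Assumption: anything other than "delta" (including a missing key) means a full sync.
function Resolve-SyncMode {
    param([Parameter(Mandatory)] $Cfg)

    if ($Cfg._syncMode -eq 'delta') { return 'delta' }
    return 'full'
}
```

An entry point could then test (Resolve-SyncMode -Cfg $Cfg) -eq 'delta' and fetch only changed records when the source system supports that, falling through to a full fetch otherwise.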

Conventional config keys (set by operators)

Key                      Used by          Purpose
selectedObjects          entra-id, omada  Map of phase → bool to toggle individual sync phases
contextObjectTypes       omada            OData entity sets to sync as Contexts
resourceCategoryMapping  omada            Maps source category labels to resourceType values
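
A crawler that adopts the selectedObjects convention could gate each phase with a helper along these lines. Test-PhaseEnabled is a hypothetical name, the phase names in the usage note are illustrative, and treating a phase that is absent from the map as enabled is an assumption rather than documented behaviour of the entra-id or omada crawlers.

```powershell
# Hypothetical helper: returns $true when a sync phase should run.
# Assumption: a phase missing from selectedObjects (or no map at all) counts as enabled.
function Test-PhaseEnabled {
    param(
        [Parameter(Mandatory)] $Cfg,
        [Parameter(Mandatory)] [string]$Phase
    )

    $map = $Cfg.selectedObjects
    if ($null -eq $map) { return $true }

    # Look the phase up by name; the indexer yields $null when the property is absent.
    $prop = $map.PSObject.Properties[$Phase]
    if ($null -eq $prop) { return $true }

    return [bool]$prop.Value
}
```

With a config of { "selectedObjects": { "users": true, "groups": false } }, the helper returns $true for users, $false for groups, and $true for any phase not listed.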

Minimal Entry Point

[CmdletBinding()]
Param(
    [Parameter(Mandatory)] [string]$ApiBaseUrl,
    [Parameter(Mandatory)] [string]$ApiKey,
    [Parameter(Mandatory)] [int]$JobId,
    [Parameter(Mandatory)] [string]$ConfigPath
)

$ErrorActionPreference = 'Stop'
$Cfg = Get-Content $ConfigPath -Raw | ConvertFrom-Json

$headers = @{ Authorization = "Bearer $ApiKey"; 'Content-Type' = 'application/json' }

function Write-CrawlerProgress ([string]$Step, [int]$Pct) {
    Invoke-RestMethod -Uri "$ApiBaseUrl/crawlers/job-progress" -Method Post -Headers $headers `
        -Body (@{ jobId = $JobId; step = $Step; pct = $Pct } | ConvertTo-Json -Compress)
}

Write-CrawlerProgress 'Fetching data' 10

# Fetch from source system
$items = Invoke-RestMethod -Uri "$($Cfg.apiUrl)/items" -Headers @{ 'X-Api-Key' = $Cfg.apiKey }

Write-CrawlerProgress 'Pushing to Identity Atlas' 50

# Push to Identity Atlas ingest API
$body = @{ records = $items; syncMode = $Cfg._syncMode; systemId = 1 } | ConvertTo-Json -Depth 10
Invoke-RestMethod -Uri "$ApiBaseUrl/ingest/principals" -Method Post -Headers $headers -Body $body

Write-CrawlerProgress 'Complete' 100

Building on the OData Base Layer

If your source exposes an OData 4.0 API, declare "dependsOn": ["odata"] in the manifest. The dispatcher will dot-source the OData library before your entry point runs, making Connect-ODataAPI, Invoke-ODataPagedRequest, Invoke-ODataGetRequest, and Get-ODataAuthRoot available without any imports.

crawler.json:

{
  "type": "my-odata-source",
  "displayName": "My OData Source",
  "entryPoint": "Start-MyODataCrawler.ps1",
  "dependsOn": ["odata"],
  "postSyncHooks": ["buildContexts"],
  "configSchema": {
    "type": "object",
    "required": ["baseUrl", "authMethod"],
    "properties": {
      "baseUrl":    { "type": "string" },
      "authMethod": { "enum": ["ApiToken", "BasicAuth", "OAuth2CC", "OAuth2ROPC", "CookieString", "FormCookie"] },
      "apiToken":   { "type": "string" }
    },
    "allOf": [
      { "if": { "properties": { "authMethod": { "const": "ApiToken" } } },
        "then": { "required": ["apiToken"] } }
    ]
  }
}

Start-MyODataCrawler.ps1:

[CmdletBinding()]
Param(
    [Parameter(Mandatory)] [string]$ApiBaseUrl,
    [Parameter(Mandatory)] [string]$ApiKey,
    [Parameter(Mandatory)] [int]$JobId,
    [Parameter(Mandatory)] [string]$ConfigPath
)

$ErrorActionPreference = 'Stop'
$Cfg = Get-Content $ConfigPath -Raw | ConvertFrom-Json

# Connect-ODataAPI is available because "dependsOn": ["odata"] caused the dispatcher
# to dot-source the odata library before this script ran.
Connect-ODataAPI -BaseUrl $Cfg.baseUrl -AuthMethod $Cfg.authMethod -ApiToken $Cfg.apiToken

$items = Invoke-ODataPagedRequest -Path '/Users' -QueryParams @{ '$filter' = 'Active eq true' }

# Push to Identity Atlas ingest API ...
$headers = @{ Authorization = "Bearer $ApiKey"; 'Content-Type' = 'application/json' }
$body = @{ records = $items; syncMode = $Cfg._syncMode; systemId = 1 } | ConvertTo-Json -Depth 10
Invoke-RestMethod -Uri "$ApiBaseUrl/ingest/principals" -Method Post -Headers $headers -Body $body

Note: Never run the odata type as a job — its entry point throws by design. Use it only as a dependsOn base.


Integration Testing

Every crawler should include a Test-<Type>Crawler.ps1 file alongside its crawler.json. The PR integration CI discovers and runs all such files automatically — no YAML changes needed.

See tools/crawlers/CLAUDE.md for the parameter contract, shared mock server usage, and examples (Test-ODataCrawler.ps1, Test-OmadaCrawler.ps1).


See Also