WikiClaw Help
How to crawl a wiki.
Plain-language guide to every part of the app. Start at the top, or jump to whatever you need from the sidebar.
Your first crawl
WikiClaw needs nothing more than a wiki URL. The fastest path:
- Open WikiClaw. The Crawl tab is selected by default.
- Paste any MediaWiki URL into the field, or click one of the suggestions ("Try a wiki").
- Wait ~1 second. WikiClaw checks the URL is actually a MediaWiki and shows the wiki's name and page count.
- Press Start crawl.
That's everything. The default settings produce clean files ready for AI use. If you want a full archive instead, see For AI, or for archive.
Picking a mode
Three modes, each for a different job. The mode is implicit: you reach it through Options rather than picking it upfront.
- Full
- Crawl every page in the selected namespaces. The default. Pick this for a first archive of a wiki.
- Filter
- Crawl only titles that match a pattern, prefix list, or explicit allow-list. Open Options → toggle 'Only crawl some pages'. Use when you want a subset (e.g. all pages starting with "Battle") instead of everything.
- Sync
- Re-crawl only what changed since last time. Appears as a split-button next to Start when WikiClaw detects a prior crawl on disk for the same host. Uses the wiki's recent-changes feed and revision IDs, so it's fast even on huge wikis.
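The "only what changed" decision can be sketched in a few lines. This is an illustrative version, not WikiClaw's actual code: it assumes each recent-changes record carries a title and a revision ID (as the MediaWiki recentchanges API provides), and that the previous crawl stored the newest revision ID seen per title.

```python
def pages_to_resync(changes, last_seen):
    """Titles whose newest revision ID exceeds what the last crawl saw.

    changes:   list of {"title": str, "revid": int} records, e.g. parsed from
               the wiki's recent-changes feed (field names are illustrative).
    last_seen: {title: revid} from the previous run's output.
    """
    stale = {}
    for ch in changes:
        title, revid = ch["title"], ch["revid"]
        if revid > last_seen.get(title, -1):
            stale[title] = max(revid, stale.get(title, -1))
    return sorted(stale)
```

On a huge wiki this touches only the handful of edited pages, which is why Sync stays fast.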
For AI, or for archive
Inside Options, "What's this for?" has two presets. They flip the right combination of toggles so you don't have to think about individual settings.
- For AI / RAG (default)
- Markdown + heading-aware chunks ready to embed. Skips raw HTML. Doesn't download image binaries (the manifest is still produced). Light, fast, the most common use.
- Full archive
- Saves raw HTML for every page and downloads every image and file referenced. Bigger and slower, but produces a complete offline copy. Pick this for archival, migrations, or backups.
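To make "heading-aware chunks" concrete, here is a simplified sketch of splitting Markdown on headings while tracking a breadcrumb trail. It is an illustration of the idea only; WikiClaw's real chunker also has to handle size limits and stable IDs.

```python
def chunk_by_headings(markdown):
    """Split Markdown at headings; each chunk remembers its heading path."""
    chunks, current, crumb = [], [], []

    def flush():
        if current:
            chunks.append({"breadcrumb": " > ".join(crumb),
                           "text": "\n".join(current).strip()})
            current.clear()

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))  # '#' count = depth
            crumb = crumb[:level - 1] + [line.lstrip("# ").strip()]
        else:
            current.append(line)
    flush()
    return chunks
```

The breadcrumb ("Weapons > Swords > Stats") is what lets a retrieved chunk carry its context into a prompt.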
Try one page first
Long crawls are demoralising when the parser is wrong. The Preview sheet runs the whole pipeline against a single page, no writes, in a few seconds — so you can verify your filters and settings before you commit.
Once a wiki is verified, click Preview a page first next to the verified banner. The sheet opens with Main Page pre-filled. Click Fetch. You'll see what a real crawl would emit for that page:
- Markdown — the LLM-ready content
- HTML — what was scraped from #mw-content-text
- Infobox — extracted label/value rows
- Chunks — what would land in chunks.jsonl
If a filter is active, the preview sheet shows whether that title would be kept or skipped — useful for debugging Filter mode patterns.
Output files
Every crawl writes to ~/Library/Containers/WikiClaw/Data/Documents/<host>/. Use Settings → Storage → Reveal output folder to open it in Finder, or click Reveal in Finder on the done state.
| File | What it contains |
|---|---|
| pages.jsonl | One record per page — HTML, Markdown, infobox, content hash, token estimate. |
| chunks.jsonl | RAG-ready chunks with section breadcrumbs and stable IDs. |
| categories.json | Full category tree (category → members map). |
| assets_manifest.json | Image and file index. With binaries on disk in Full archive mode. |
| site_manifest.json | Run metadata: source URL, accessed date, license notice, namespaces, tool version. |
| errors.jsonl | Per-page failure records. Empty file if no errors. |
| html/ | Raw page HTML (Full archive mode only). |
| assets/ | Image binaries (Full archive mode only). |
| .checkpoint.json | Resume state. Don't edit by hand — see Resume. |
Every file is JSON Lines or plain JSON. Open in any editor. Stream into any embedding pipeline. Nothing proprietary.
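Because everything is JSON Lines, streaming it is trivial. A minimal reader, which never loads the whole file into memory (the "text" field name in the usage comment is an assumption, not a documented schema):

```python
import json

def iter_jsonl(path):
    """Yield one record per non-blank line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Typical use with an embedding pipeline, batching chunks as they stream in:
# for record in iter_jsonl("chunks.jsonl"):
#     embed(record["text"])   # "text" is an assumed field name
```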
Pages to include
MediaWiki splits content into namespaces. The default is 0 (Main / articles) plus 14 (Categories) — the right pick for most wikis.
Open Options → Pages to include → Choose. The picker pulls the live namespace list from the wiki's API, so you see exactly what's available. Quick presets:
- Articles only
- Just namespace 0. Cleanest corpus for AI use.
- + Categories
- 0 + 14. The default. Useful when category pages carry their own content (game wikis often do).
- Content only
- Every even-numbered namespace. Excludes Talk pages and discussion.
- All
- Every namespace the wiki exposes. Includes Talk, User, Help, etc.
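The presets map to namespace IDs predictably: MediaWiki numbers subject namespaces even and their Talk counterparts odd, which is why "Content only" means the even-numbered ones. A sketch of the selection logic (the function itself is illustrative, not WikiClaw's code):

```python
def preset_namespaces(all_ns, preset):
    """all_ns: {id: name}, e.g. from the wiki's siteinfo API.

    Preset names mirror the picker's quick presets.
    """
    real = [n for n in all_ns if n >= 0]  # negative IDs (Special, Media) are virtual
    if preset == "articles":
        return [0]
    if preset == "categories":
        return [0, 14]
    if preset == "content":
        return sorted(n for n in real if n % 2 == 0)  # even = subject, odd = Talk
    return sorted(real)  # "all"
```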
Resume
Crawls can take hours on big wikis. WikiClaw saves progress to .checkpoint.json as it goes, so if the network drops or you hit Stop, the next run can pick up from where it left off.
- Cancel a crawl mid-flight (⌘ . or the Stop button).
- Open Options → Advanced → toggle Skip pages already on disk.
- Press Start. Pages already in pages.jsonl are skipped; only the rest are fetched.
Resume is intentionally not enabled by default — leaving it on would silently resume a crawl you cancelled yesterday.
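The skip step amounts to a set difference between what discovery found and what pages.jsonl already holds. Roughly, assuming each on-disk record has a title field (a reasonable guess, not a documented guarantee):

```python
def pages_remaining(discovered_titles, records_on_disk):
    """Titles still to fetch: discovery output minus a prior run's saved pages."""
    done = {rec["title"] for rec in records_on_disk}
    return [t for t in discovered_titles if t not in done]
```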
Asset download
By default, WikiClaw indexes every image and file URL into assets_manifest.json but doesn't download the binaries — image-heavy wikis can be 10–100× the page-content bandwidth.
Enable Download assets in Options when you want the binaries on disk. They land in <output>/assets/, named after the URL filename with collision-safe hash suffixes. Already-downloaded files are skipped on re-run, so this is also resume-friendly.
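"Named after the URL filename with collision-safe hash suffixes" can be illustrated like this: take the URL's basename and append a short digest of the full URL, so two distinct URLs that share a filename never clobber each other. The exact naming scheme here is an assumption, not WikiClaw's documented behaviour:

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

def asset_filename(url):
    """Basename from the URL plus a short hash of the whole URL (illustrative scheme)."""
    name = PurePosixPath(unquote(urlparse(url).path)).name or "asset"
    digest = hashlib.sha1(url.encode()).hexdigest()[:8]
    stem, dot, ext = name.rpartition(".")
    if dot:
        return f"{stem}-{digest}.{ext}"
    return f"{name}-{digest}"
```

Two different "Map.png" URLs thus produce two different files on disk.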
Rate limiting
WikiClaw defaults to 1 request per second per host — polite for almost any wiki. Open Options → Advanced → Rate limit slider to change it.
Bigger isn't always better. Small community wikis can rate-limit aggressively (or block) clients that hit them too fast. Fandom and wiki.gg generally tolerate up to ~3 req/s. Wikipedia has its own guidelines — keep it slow there, or crawl at off-peak hours.
WikiClaw includes a configurable User-Agent string (Settings → Networking → User agent). Identify yourself honestly so wiki operators can reach you if your crawl causes issues.
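A fixed requests-per-second limit is simple to reason about: each request waits until a minimum interval has elapsed since the previous one. As a sketch (not WikiClaw's actual implementation):

```python
import time

class RateLimiter:
    """Per-host pacer: at most `rate` requests per second."""

    def __init__(self, rate=1.0):
        self.min_interval = 1.0 / rate
        self.last = 0.0

    def wait(self):
        """Sleep just long enough to honour the minimum spacing, then record the time."""
        now = time.monotonic()
        sleep_for = self.last + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last = time.monotonic()
```

Call `wait()` before every request; the first call returns immediately and later calls absorb the pacing delay.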
Common errors
- "This doesn't look like a MediaWiki site"
- WikiClaw probed for /w/api.php and /api.php and got nothing. Either the URL isn't a MediaWiki, or the wiki has the API disabled. Try the wiki's Special:Version page to confirm — if MediaWiki version info shows, the API exists; check the URL again.
- "Pick at least one namespace to crawl"
- You opened Options → Pages to include and unchecked everything. Pick at least one (the Articles only preset is a good default).
- Lots of failures in errors.jsonl
- Most often a too-fast rate limit. Drop to 0.5 req/s, then resume from checkpoint to fill in the failed pages.
- Crawl seems stuck on Discovery
- Discovery walks Special:AllPages, which can be slow on huge wikis. Wikipedia at the default rate takes ~30 minutes just to enumerate. Be patient or use Filter mode to skip enumeration entirely.
- The output folder is empty after Start
- Discovery hasn't finished yet. WikiClaw writes pages.jsonl as pages are fetched, so the folder only fills out during Phase 2 ("Saving each page"). Watch the step list.
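The first error above comes from the verification probe. The two endpoints it checks can be derived from any page URL on the wiki; a sketch of that derivation (probe order taken from the error text, the URL construction itself is an assumption):

```python
from urllib.parse import urlsplit, urlunsplit

def api_candidates(wiki_url):
    """The two common MediaWiki API script paths for a given wiki URL."""
    parts = urlsplit(wiki_url)
    base = urlunsplit((parts.scheme, parts.netloc, "", "", ""))
    return [f"{base}/w/api.php", f"{base}/api.php"]
```

If neither endpoint answers with MediaWiki API output, you see the "doesn't look like a MediaWiki site" message.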
Privacy
WikiClaw runs entirely on your Mac. There is no account, no telemetry, no analytics, no cloud sync. The app does not collect, transmit, or share any personal data.
Network traffic is limited to:
- Requests to the wiki you have explicitly chosen to crawl.
- Requests to download asset binaries (images, files), only when you enable Download assets.
Crawled data lives in WikiClaw's sandboxed Documents folder and never leaves your machine. The User-Agent you configure in Settings is the only identifier sent on outbound requests.
Contact
Bug reports, feature requests, or questions: wikiclawsupport@proton.me.
When reporting a crawl problem, include your macOS version, the wiki URL, and (if relevant) attach errors.jsonl from the output folder.