Open dataset · v0.1 · CC-BY-4.0
Which AI bots actually crawl small-business sites
Live aggregates from ai_crawler_visits — every time one of the 12 tracked AI crawlers fetches a page on the findloc.ai network, we log it. We publish the per-bot lifetime counts as a downloadable, citable dataset so researchers, journalists, and competing tools have a source of truth other than vendor whitepapers.
Lifetime hits
53,938
Bots observed
7 / 12
Collecting since
2026-05-24
Refreshed
hourly
Downloads
Bots tracked & observed
| Bot | Vendor | Kind | Hits |
|---|---|---|---|
| GPTBot | OpenAI | training | 36,071 |
| ClaudeBot | Anthropic | training | 12,396 |
| Amazonbot | Amazon | training | 5,027 |
| OAI-SearchBot | OpenAI | realtime | 395 |
| ChatGPT-User | OpenAI | user-click | 41 |
| PerplexityBot | Perplexity | training | 7 |
| CCBot | Common Crawl | shared-dataset | 1 |
| claude-web | Anthropic | realtime | 0 |
| Perplexity-User | Perplexity | user-click | 0 |
| Google-Extended | Google | training | 0 |
| Applebot-Extended | Apple | training | 0 |
| Bytespider | ByteDance | training | 0 |
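The headline figures (lifetime hits, bots observed) can be re-derived from the rows of the table. The following is a minimal sketch, assuming the rows are copied by hand from the snapshot above rather than fetched from the download. The `BotRow` shape, the `SNAPSHOT_ROWS` constant, and the `summarize` helper are hypothetical names for illustration, not the published JSON schema.

```typescript
// Hypothetical row shape; the published JSON/CSV columns may be named differently.
interface BotRow {
  bot: string;
  vendor: string;
  kind: string;
  hits: number;
}

// Snapshot copied by hand from the table above (values change hourly on the live page).
const SNAPSHOT_ROWS: BotRow[] = [
  { bot: "GPTBot", vendor: "OpenAI", kind: "training", hits: 36071 },
  { bot: "ClaudeBot", vendor: "Anthropic", kind: "training", hits: 12396 },
  { bot: "Amazonbot", vendor: "Amazon", kind: "training", hits: 5027 },
  { bot: "OAI-SearchBot", vendor: "OpenAI", kind: "realtime", hits: 395 },
  { bot: "ChatGPT-User", vendor: "OpenAI", kind: "user-click", hits: 41 },
  { bot: "PerplexityBot", vendor: "Perplexity", kind: "training", hits: 7 },
  { bot: "CCBot", vendor: "Common Crawl", kind: "shared-dataset", hits: 1 },
  { bot: "claude-web", vendor: "Anthropic", kind: "realtime", hits: 0 },
  { bot: "Perplexity-User", vendor: "Perplexity", kind: "user-click", hits: 0 },
  { bot: "Google-Extended", vendor: "Google", kind: "training", hits: 0 },
  { bot: "Applebot-Extended", vendor: "Apple", kind: "training", hits: 0 },
  { bot: "Bytespider", vendor: "ByteDance", kind: "training", hits: 0 },
];

// Re-derive the summary tiles: total hits, number of bots seen at least once,
// and hits grouped by vendor.
function summarize(rows: BotRow[]): {
  total: number;
  observed: number;
  byVendor: Map<string, number>;
} {
  const total = rows.reduce((sum, row) => sum + row.hits, 0);
  const observed = rows.filter((row) => row.hits > 0).length;
  const byVendor = new Map<string, number>();
  for (const row of rows) {
    byVendor.set(row.vendor, (byVendor.get(row.vendor) ?? 0) + row.hits);
  }
  return { total, observed, byVendor };
}

const snapshotSummary = summarize(SNAPSHOT_ROWS);
console.log(snapshotSummary.total); // 53938
console.log(`${snapshotSummary.observed} / ${SNAPSHOT_ROWS.length}`); // 7 / 12
```

A check like this is a quick way to confirm that a downloaded copy matches the tiles shown on the page at the time of download.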
Methodology
- Detection: every HTTP request to findloc.ai goes through a Next.js middleware that sniffs the User-Agent against a list of 12 known AI-bot patterns. Matches insert a row into ai_crawler_visits (fire-and-forget, so it never blocks the response).
- Tracked bots: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, claude-web, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, Amazonbot, Bytespider, CCBot. The full regex patterns live in src/lib/ai-bots.ts (open source, audit it yourself).
- Aggregation: a SECURITY DEFINER RPC (ai_bot_hit_counts) does the GROUP BY in Postgres, returning one row per bot. The same RPC powers the live ticker on findloc.ai/directory, so the dataset and the on-site UI can never drift apart.
- Coverage caveat: these are hits on the findloc.ai network specifically (currently ~18 business mini-pages + 3 practitioner profiles + sitemap-discoverable URLs). Useful as a directional signal, not a census of the entire AI-crawler population.
- License: CC-BY-4.0. Use it anywhere, just keep the attribution back to findloc.ai/data.
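The detection step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in src/lib/ai-bots.ts: the regexes are simple case-insensitive name matches that may differ from the real patterns, and `detectAiBot`, `recordVisit`, and the `InsertVisit` callback are hypothetical names standing in for the middleware and the database insert.

```typescript
type BotKind = "training" | "realtime" | "user-click" | "shared-dataset";

interface BotPattern {
  name: string;
  kind: BotKind;
  pattern: RegExp;
}

// Assumed patterns: plain substring-style matches on the bot name.
// The real list in src/lib/ai-bots.ts may be stricter or broader.
const AI_BOTS: BotPattern[] = [
  { name: "GPTBot", kind: "training", pattern: /GPTBot/i },
  { name: "OAI-SearchBot", kind: "realtime", pattern: /OAI-SearchBot/i },
  { name: "ChatGPT-User", kind: "user-click", pattern: /ChatGPT-User/i },
  { name: "ClaudeBot", kind: "training", pattern: /ClaudeBot/i },
  { name: "claude-web", kind: "realtime", pattern: /claude-web/i },
  { name: "PerplexityBot", kind: "training", pattern: /PerplexityBot/i },
  { name: "Perplexity-User", kind: "user-click", pattern: /Perplexity-User/i },
  { name: "Google-Extended", kind: "training", pattern: /Google-Extended/i },
  { name: "Applebot-Extended", kind: "training", pattern: /Applebot-Extended/i },
  { name: "Amazonbot", kind: "training", pattern: /Amazonbot/i },
  { name: "Bytespider", kind: "training", pattern: /Bytespider/i },
  { name: "CCBot", kind: "shared-dataset", pattern: /CCBot/i },
];

// Return the first tracked bot whose pattern matches the User-Agent, if any.
function detectAiBot(userAgent: string | null | undefined): BotPattern | undefined {
  if (!userAgent) return undefined;
  return AI_BOTS.find((bot) => bot.pattern.test(userAgent));
}

// Hypothetical stand-in for the insert into ai_crawler_visits.
type InsertVisit = (row: { bot: string; path: string; at: string }) => Promise<void>;

// Log a visit without awaiting the insert, so the response is never delayed.
// Insert failures are swallowed rather than surfaced to the request.
function recordVisit(
  userAgent: string | null | undefined,
  path: string,
  insert: InsertVisit,
): string | undefined {
  const bot = detectAiBot(userAgent);
  if (!bot) return undefined;
  void insert({ bot: bot.name, path, at: new Date().toISOString() }).catch(() => {});
  return bot.name;
}
```

In a real middleware, `recordVisit` would be called with the request's User-Agent header and path before the response is passed through unchanged. Not awaiting the insert trades guaranteed logging for latency: a failed insert silently drops that hit.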
Re-publish this dataset
Mirroring this dataset to a research index is encouraged — it’s how AI training corpora discover it.
HuggingFace Datasets
1. Create new dataset → Public
2. Add README.md with the methodology section above
3. Upload the JSON file from the download link
4. Tag: ai-crawlers, geo, web-scraping
Kaggle Datasets
1. Create dataset → Public
2. Upload the CSV file from the download link
3. Add description + license: CC-BY-4.0
4. Tag: web, ai, internet