Open dataset · v0.1 · CC-BY-4.0

Which AI bots actually crawl small-business sites

Live aggregates from the ai_crawler_visits table. Every time one of the 12 tracked AI crawlers fetches a page on the findloc.ai network, we log it. We publish the per-bot lifetime counts as a downloadable, citable dataset so that researchers, journalists, and competing tools have a source of truth other than vendor whitepapers.

Lifetime hits

53,938

Bots observed

7 / 12

Collecting since

2026-05-24

Refreshed

hourly

Downloads

Download JSON

findloc-ai-crawler-stats-v0.1.json

Download CSV

findloc-ai-crawler-stats-v0.1.csv

Bots tracked & observed

Bot	Vendor	Kind	Hits
GPTBot	OpenAI	training	36,071
ClaudeBot	Anthropic	training	12,396
Amazonbot	Amazon	training	5,027
OAI-SearchBot	OpenAI	realtime	395
ChatGPT-User	OpenAI	user-click	41
PerplexityBot	Perplexity	training	7
CCBot	Common Crawl	shared-dataset	1
claude-web	Anthropic	realtime	0
Perplexity-User	Perplexity	user-click	0
Google-Extended	Google	training	0
Applebot-Extended	Apple	training	0
Bytespider	ByteDance	training	0
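
As a quick consistency check, the headline figures can be recomputed from the table rows. The sketch below hard-codes the rows shown above (it does not read the downloadable files, whose exact schema is not described here); the `Row` type and the `sumBy` helper are illustrative names, not part of the dataset.

```typescript
// Rows copied by hand from the table above (a snapshot, not a live feed).
type Row = { bot: string; vendor: string; kind: string; hits: number };

const rows: Row[] = [
  { bot: "GPTBot", vendor: "OpenAI", kind: "training", hits: 36071 },
  { bot: "ClaudeBot", vendor: "Anthropic", kind: "training", hits: 12396 },
  { bot: "Amazonbot", vendor: "Amazon", kind: "training", hits: 5027 },
  { bot: "OAI-SearchBot", vendor: "OpenAI", kind: "realtime", hits: 395 },
  { bot: "ChatGPT-User", vendor: "OpenAI", kind: "user-click", hits: 41 },
  { bot: "PerplexityBot", vendor: "Perplexity", kind: "training", hits: 7 },
  { bot: "CCBot", vendor: "Common Crawl", kind: "shared-dataset", hits: 1 },
  { bot: "claude-web", vendor: "Anthropic", kind: "realtime", hits: 0 },
  { bot: "Perplexity-User", vendor: "Perplexity", kind: "user-click", hits: 0 },
  { bot: "Google-Extended", vendor: "Google", kind: "training", hits: 0 },
  { bot: "Applebot-Extended", vendor: "Apple", kind: "training", hits: 0 },
  { bot: "Bytespider", vendor: "ByteDance", kind: "training", hits: 0 },
];

// Hypothetical helper: sum hits grouped by any string key (vendor, kind, ...).
function sumBy(data: Row[], key: (r: Row) => string): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const r of data) {
    const k = key(r);
    totals[k] = (totals[k] ?? 0) + r.hits;
  }
  return totals;
}

const total = rows.reduce((acc, r) => acc + r.hits, 0);
const observed = rows.filter((r) => r.hits > 0).length;
const byKind = sumBy(rows, (r) => r.kind);
const byVendor = sumBy(rows, (r) => r.vendor);

console.log(total); // 53938, matching "Lifetime hits"
console.log(`${observed} / ${rows.length}`); // 7 / 12, matching "Bots observed"
console.log(byKind);
console.log(byVendor);
```

Grouping by kind gives 53,501 training hits, 395 realtime, 41 user-click, and 1 shared-dataset, so in this snapshot almost all traffic comes from crawlers labelled as training.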

Methodology

Detection: every HTTP request to findloc.ai passes through a Next.js middleware that tests the User-Agent header against a list of 12 known AI-bot patterns. Each match inserts a row into ai_crawler_visits (fire-and-forget, so logging never blocks the response).
Tracked bots: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, claude-web, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, Amazonbot, Bytespider, CCBot. The full regex patterns live in src/lib/ai-bots.ts (open source, audit it yourself).
Aggregation: a SECURITY DEFINER RPC (ai_bot_hit_counts) does the GROUP BY in Postgres, returning one row per bot. The same RPC powers the live ticker on findloc.ai/directory — the dataset and the on-site UI can never drift apart.
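Coverage caveat: these are hits on the findloc.ai network specifically (currently ~18 business mini-pages + 3 practitioner profiles + sitemap-discoverable URLs). Useful as a directional signal, not a census of the entire AI-crawler population. A zero count does not always mean a vendor is absent: Google and Apple document Google-Extended and Applebot-Extended as robots.txt control tokens rather than crawler User-Agent strings, so User-Agent matching is expected to record no hits for them.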
License: CC-BY-4.0. Use it anywhere, just keep the attribution back to findloc.ai/data.
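
The detection and aggregation steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/lib/ai-bots.ts: the patterns are simple case-insensitive name matches assumed for the example, `detectAiBot` and `countHits` are hypothetical names, and the in-memory count stands in for the Postgres GROUP BY. The Next.js middleware wiring and the insert into ai_crawler_visits are omitted.

```typescript
// ASSUMPTION: illustrative patterns only; the real regexes live in
// src/lib/ai-bots.ts and may be stricter or ordered differently.
const AI_BOT_PATTERNS: ReadonlyArray<readonly [string, RegExp]> = [
  ["GPTBot", /GPTBot/i],
  ["OAI-SearchBot", /OAI-SearchBot/i],
  ["ChatGPT-User", /ChatGPT-User/i],
  ["ClaudeBot", /ClaudeBot/i],
  ["claude-web", /claude-web/i],
  ["PerplexityBot", /PerplexityBot/i],
  ["Perplexity-User", /Perplexity-User/i],
  ["Google-Extended", /Google-Extended/i],
  ["Applebot-Extended", /Applebot-Extended/i],
  ["Amazonbot", /Amazonbot/i],
  ["Bytespider", /Bytespider/i],
  ["CCBot", /CCBot/i],
];

// Return the tracked bot name for a User-Agent header, or null if none matches.
function detectAiBot(userAgent: string | null): string | null {
  if (!userAgent) return null;
  for (const [name, pattern] of AI_BOT_PATTERNS) {
    if (pattern.test(userAgent)) return name;
  }
  return null;
}

// In-memory stand-in for the per-bot GROUP BY: one counter per matched bot.
function countHits(userAgents: Array<string | null>): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const ua of userAgents) {
    const bot = detectAiBot(ua);
    if (bot !== null) counts[bot] = (counts[bot] ?? 0) + 1;
  }
  return counts;
}

// Illustrative User-Agent strings; real crawler headers vary by version.
const sample: Array<string | null> = [
  "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
  "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
  "CCBot/2.0 (https://commoncrawl.org/faq/)",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
  null, // request with no User-Agent header
];

console.log(countHits(sample)); // { GPTBot: 2, CCBot: 1 }
```

Because matching relies only on the self-reported User-Agent header, a sketch like this counts any client that claims a bot name and misses crawlers that do not identify themselves.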

Re-publish this dataset

Mirroring this dataset to a research index is encouraged, since research indexes are where AI training corpora discover datasets like this one.

HuggingFace Datasets

1. Create new dataset → Public
2. Add README.md with the methodology section above
3. Upload the JSON file from the download link
4. Tag: ai-crawlers, geo, web-scraping

Kaggle Datasets

1. Create dataset → Public
2. Upload the CSV file from the download link
3. Add description + license: CC-BY-4.0
4. Tag: web, ai, internet