findloc.ai
Open dataset · v0.1 · CC-BY-4.0

Which AI bots actually crawl small-business sites

Live aggregates from the ai_crawler_visits table: every time one of the 12 tracked AI crawlers fetches a page on the findloc.ai network, we log the visit. We publish the per-bot lifetime counts as a downloadable, citable dataset so researchers, journalists, and competing tools have a source of truth other than vendor whitepapers.

Lifetime hits: 53,938
Bots observed: 7 / 12
Collecting since: 2026-05-24
Refreshed: hourly

Downloads

Bots tracked & observed

Bot                 Vendor         Kind             Hits
GPTBot              OpenAI         training         36,071
ClaudeBot           Anthropic      training         12,396
Amazonbot           Amazon         training         5,027
OAI-SearchBot       OpenAI         realtime         395
ChatGPT-User        OpenAI         user-click       41
PerplexityBot       Perplexity     training         7
CCBot               Common Crawl   shared-dataset   1
claude-web          Anthropic      realtime         0
Perplexity-User     Perplexity     user-click       0
Google-Extended     Google         training         0
Applebot-Extended   Apple          training         0
Bytespider          ByteDance      training         0
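
The headline figures can be re-derived from the per-bot rows. The sketch below copies the hit counts from the table above and recomputes the lifetime total, the number of bots observed, and each bot's share of traffic. The BotRow shape and the helper names (sharePct, hitsByKind) are illustrative assumptions, not the schema of the downloadable files.

```typescript
// Hypothetical row shape; the published JSON/CSV columns may be named differently.
interface BotRow {
  bot: string;
  vendor: string;
  kind: string;
  hits: number;
}

// Values copied from the table above.
const rows: BotRow[] = [
  { bot: "GPTBot", vendor: "OpenAI", kind: "training", hits: 36071 },
  { bot: "ClaudeBot", vendor: "Anthropic", kind: "training", hits: 12396 },
  { bot: "Amazonbot", vendor: "Amazon", kind: "training", hits: 5027 },
  { bot: "OAI-SearchBot", vendor: "OpenAI", kind: "realtime", hits: 395 },
  { bot: "ChatGPT-User", vendor: "OpenAI", kind: "user-click", hits: 41 },
  { bot: "PerplexityBot", vendor: "Perplexity", kind: "training", hits: 7 },
  { bot: "CCBot", vendor: "Common Crawl", kind: "shared-dataset", hits: 1 },
  { bot: "claude-web", vendor: "Anthropic", kind: "realtime", hits: 0 },
  { bot: "Perplexity-User", vendor: "Perplexity", kind: "user-click", hits: 0 },
  { bot: "Google-Extended", vendor: "Google", kind: "training", hits: 0 },
  { bot: "Applebot-Extended", vendor: "Apple", kind: "training", hits: 0 },
  { bot: "Bytespider", vendor: "ByteDance", kind: "training", hits: 0 },
];

// Lifetime hits across all tracked bots.
const totalHits = rows.reduce((sum, r) => sum + r.hits, 0);

// Number of tracked bots seen at least once.
const botsObserved = rows.filter((r) => r.hits > 0).length;

// Percentage share of lifetime hits for one bot, rounded to one decimal place.
function sharePct(bot: string): number {
  const row = rows.find((r) => r.bot === bot);
  if (!row) throw new Error(`unknown bot: ${bot}`);
  return Math.round((row.hits / totalHits) * 1000) / 10;
}

// Hits summed per crawler kind (training, realtime, user-click, shared-dataset).
function hitsByKind(): Record<string, number> {
  const out: Record<string, number> = {};
  for (const r of rows) out[r.kind] = (out[r.kind] ?? 0) + r.hits;
  return out;
}

console.log(totalHits, botsObserved); // 53938 7
```

The recomputed total and observed count match the summary figures at the top of the page (53,938 hits, 7 of 12 bots).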

Methodology

  • Detection: every HTTP request to findloc.ai goes through a Next.js middleware that sniffs the User-Agent against a list of 12 known AI-bot patterns. Matches insert a row into ai_crawler_visits (fire-and-forget — never blocks the response).
  • Tracked bots: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, claude-web, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, Amazonbot, Bytespider, CCBot. The full regex patterns live in src/lib/ai-bots.ts (open source, audit it yourself).
  • Aggregation: a SECURITY DEFINER RPC (ai_bot_hit_counts) does the GROUP BY in Postgres, returning one row per bot. The same RPC powers the live ticker on findloc.ai/directory, so the dataset and the on-site UI are computed by the same query and cannot diverge in how hits are counted.
  • Coverage caveat: these are hits on the findloc.ai network specifically (currently ~18 business mini-pages + 3 practitioner profiles + sitemap-discoverable URLs). Useful as a directional signal — not a census of the entire AI-crawler population.
  • License: CC-BY-4.0. Use it anywhere, just keep the attribution back to findloc.ai/data.
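
The detection and aggregation steps described above can be sketched as follows. This is a minimal illustration, not the code in src/lib/ai-bots.ts: the BotPattern shape, the detectBot and countHits names, and the simple case-insensitive token patterns are all assumptions. The in-memory countHits only stands in for the GROUP BY that the ai_bot_hit_counts RPC performs in Postgres.

```typescript
// Illustrative only: the real patterns live in src/lib/ai-bots.ts and may differ.
type BotKind = "training" | "realtime" | "user-click" | "shared-dataset";

interface BotPattern {
  name: string;
  vendor: string;
  kind: BotKind;
  pattern: RegExp; // assumed: case-insensitive match on the bot token
}

const AI_BOTS: BotPattern[] = [
  { name: "GPTBot", vendor: "OpenAI", kind: "training", pattern: /GPTBot/i },
  { name: "OAI-SearchBot", vendor: "OpenAI", kind: "realtime", pattern: /OAI-SearchBot/i },
  { name: "ChatGPT-User", vendor: "OpenAI", kind: "user-click", pattern: /ChatGPT-User/i },
  { name: "ClaudeBot", vendor: "Anthropic", kind: "training", pattern: /ClaudeBot/i },
  { name: "claude-web", vendor: "Anthropic", kind: "realtime", pattern: /claude-web/i },
  { name: "PerplexityBot", vendor: "Perplexity", kind: "training", pattern: /PerplexityBot/i },
  { name: "Perplexity-User", vendor: "Perplexity", kind: "user-click", pattern: /Perplexity-User/i },
  { name: "Google-Extended", vendor: "Google", kind: "training", pattern: /Google-Extended/i },
  { name: "Applebot-Extended", vendor: "Apple", kind: "training", pattern: /Applebot-Extended/i },
  { name: "Amazonbot", vendor: "Amazon", kind: "training", pattern: /Amazonbot/i },
  { name: "Bytespider", vendor: "ByteDance", kind: "training", pattern: /Bytespider/i },
  { name: "CCBot", vendor: "Common Crawl", kind: "shared-dataset", pattern: /CCBot/i },
];

// Return the first tracked bot whose pattern matches the User-Agent, or null.
function detectBot(userAgent: string | null): BotPattern | null {
  if (!userAgent) return null;
  return AI_BOTS.find((bot) => bot.pattern.test(userAgent)) ?? null;
}

// In-memory stand-in for the per-bot GROUP BY: one count per tracked bot.
// Seeding every bot with 0 is an assumption made to mirror the published table.
function countHits(userAgents: Array<string | null>): Map<string, number> {
  const counts = new Map<string, number>(
    AI_BOTS.map((b): [string, number] => [b.name, 0]),
  );
  for (const ua of userAgents) {
    const bot = detectBot(ua);
    if (bot) counts.set(bot.name, (counts.get(bot.name) ?? 0) + 1);
  }
  return counts;
}
```

In a Next.js middleware, the string passed to detectBot would typically come from request.headers.get("user-agent"), with the insert into ai_crawler_visits started but not awaited so the response is never delayed.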

Re-publish this dataset

Mirroring this dataset to a research index is encouraged — it’s how AI training corpora discover it.

HuggingFace Datasets
  1. Create new dataset → Public
  2. Add README.md with the methodology section above
  3. Upload the JSON file from the download link
  4. Tag: ai-crawlers, geo, web-scraping
Kaggle Datasets
  1. Create dataset → Public
  2. Upload the CSV file from the download link
  3. Add description + license: CC-BY-4.0
  4. Tag: web, ai, internet