9 min read · GEO · AI crawlers · data · GPTBot · ClaudeBot

Which AI bots actually crawl small-business sites in 2026? 48,800+ hits analyzed

Short answer

Of 12 AI bots we track in robots.txt and log per request, only 7 have actually crawled findloc.ai. GPTBot alone is 73.8% of all traffic. ChatGPT-User — fired only when a human clicks a link inside a ChatGPT answer — fired 38 times, our cleanest evidence of real citation. Five bots (Google-Extended, Applebot-Extended, claude-web, Perplexity-User, Bytespider) have not visited at all.

What this post is

findloc.ai went live in mid-May 2026. From day one, every request to the site passed through a Next.js middleware that sniffed the User-Agent against 12 known AI-bot patterns and, on match, inserted a row into a Supabase table. Three weeks later that table has 48,823 rows. This post is what those rows tell us.

It's the raw data, no vendor whitepaper hand-waving. The same numbers run live at findloc.ai/directory — refreshed on every page render. If the chart below disagrees with the live panel at the time you read this, the live panel is the truth.

Snapshot: 2026-06-05. Tracked bots: 12. Distinct bots seen: 7. Total crawler hits: 48,823. Real human clicks from ChatGPT answers: 38.

The full ranking (all 12 bots, ordered by hits)

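| Bot               | Vendor       | Kind            | Lifetime hits | % of total |
|-------------------|--------------|-----------------|---------------|------------|
| GPTBot            | OpenAI       | Training        | 36,019        | 73.8%      |
| ClaudeBot         | Anthropic    | Training        | 12,387        | 25.4%      |
| OAI-SearchBot     | OpenAI       | Realtime search | 355           | 0.73%      |
| ChatGPT-User      | OpenAI       | User-click      | 38            | 0.08%      |
| Amazonbot         | Amazon       | Training        | 28            | 0.06%      |
| PerplexityBot     | Perplexity   | Training        | 7             | 0.01%      |
| CCBot             | Common Crawl | Shared dataset  | 1             | <0.01%     |
| Google-Extended   | Google       | Training        | 0             | 0%         |
| Applebot-Extended | Apple        | Training        | 0             | 0%         |
| claude-web        | Anthropic    | Realtime        | 0             | 0%         |
| Perplexity-User   | Perplexity   | User-click      | 0             | 0%         |
| Bytespider        | ByteDance    | Training        | 0             | 0%         |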

First place (GPTBot, 36k) is roughly 3× second place (ClaudeBot, 12k). The real cliff comes next: ClaudeBot is about 35×, or one and a half orders of magnitude, ahead of third place (OAI-SearchBot, 355). The distribution is heavily skewed toward two bots, even on a young site.
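The gaps between ranks can be checked directly from the table. The snippet below is a small sketch of that arithmetic; the `gap` helper is our own name for illustration, and the hit counts are copied from the ranking above.

```typescript
// Lifetime hits for the top three bots, copied from the ranking table.
const hits = { GPTBot: 36019, ClaudeBot: 12387, "OAI-SearchBot": 355 };

// How many times larger a is than b, and the same gap in orders of magnitude.
function gap(a: number, b: number): { ratio: number; ordersOfMagnitude: number } {
  const ratio = a / b;
  return { ratio, ordersOfMagnitude: Math.log10(ratio) };
}

const firstToSecond = gap(hits.GPTBot, hits.ClaudeBot);
const secondToThird = gap(hits.ClaudeBot, hits["OAI-SearchBot"]);

console.log(firstToSecond.ratio.toFixed(1));             // 2.9
console.log(firstToSecond.ordersOfMagnitude.toFixed(2)); // 0.46
console.log(secondToThird.ratio.toFixed(1));             // 34.9
console.log(secondToThird.ordersOfMagnitude.toFixed(2)); // 1.54
```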

Surprise 1 — GPTBot's dominance is bigger than expected

Going in, our prior was that GPTBot would lead. We did not expect it to be 73.8% of all hits, nearly three times the rest of the field combined. The implications are uncomfortable for anyone whose GEO strategy assumes engine-by-engine balance:

  • On findloc.ai, GPTBot has made roughly 3× as many requests as ClaudeBot (36,019 vs 12,387) and roughly 5,000× as many as PerplexityBot (36,019 vs 7). Whatever ends up in the next ChatGPT training cut, our pages are heavily represented; whatever ends up in the next Perplexity index, we are a rounding error.
  • If you block GPTBot specifically (a default in some Cloudflare / WP-Rocket configurations), you have effectively cut yourself off from 70%+ of AI-engine attention. Worth double-checking your robots.txt today.
  • The 25.4% from ClaudeBot is healthy, but the absolute volume difference matters: ClaudeBot's 12,387 hits are about a third of GPTBot's 36,019, so Anthropic has requested far less of the site than OpenAI has.
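The robots.txt check suggested in the second bullet can be automated. Below is a minimal sketch, assuming a simplified reading of robots.txt: a group that names the bot takes precedence over the `*` group, and only whole-site rules (`/` or `/*`) are examined. The function name `isBlockedAtRoot` is hypothetical, and the sketch ignores path-specific rules and wildcards, so treat a `false` result as "not blocked site-wide" rather than "fully allowed".

```typescript
// Returns true when robots.txt blocks `bot` from the whole site.
// Simplification: only whole-site rules ("/" or "/*") are considered.
function isBlockedAtRoot(robotsTxt: string, bot: string): boolean {
  const rules = new Map<string, { allowRoot: boolean; disallowRoot: boolean }>();
  let agents: string[] = [];
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === "user-agent") {
      // Consecutive User-agent lines share the rules that follow them.
      if (!lastWasAgent) agents = [];
      const agent = value.toLowerCase();
      agents.push(agent);
      if (!rules.has(agent)) rules.set(agent, { allowRoot: false, disallowRoot: false });
      lastWasAgent = true;
    } else if (field === "allow" || field === "disallow") {
      lastWasAgent = false;
      if (value !== "/" && value !== "/*") continue;
      for (const agent of agents) {
        const entry = rules.get(agent);
        if (entry === undefined) continue;
        if (field === "allow") entry.allowRoot = true;
        else entry.disallowRoot = true;
      }
    }
  }

  // A group naming the bot wins; otherwise fall back to the "*" group.
  const entry = rules.get(bot.toLowerCase()) ?? rules.get("*");
  return entry !== undefined && entry.disallowRoot && !entry.allowRoot;
}
```

To use it, fetch your own `/robots.txt` as text and call the function once for each bot name you care about, for example GPTBot, ClaudeBot and OAI-SearchBot.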

Surprise 2 — the 38 humans who actually clicked

ChatGPT-User is the most important row in our table and also the smallest. The User-Agent is fired only in one situation: a human asked ChatGPT a question, ChatGPT included a link to findloc.ai in its answer, and the human clicked the link. That sequence has happened 38 times.

This is the only metric we have that demonstrates end-to-end GEO success. Every other row proves indexing. Only this row proves attention.

A few observations on the 38:

  • They prove ChatGPT cited findloc.ai in answers — by name and with a clickable link.
  • They imply the cited content was useful enough that real users wanted more. Citations the user ignores don't generate this UA.
  • The ratio is informative: 38 ChatGPT-User hits against 355 OAI-SearchBot fetches is about 10.7%. The two counts are not joined request by request, so read it as a rough aggregate rate from 'we got searched' to 'a human visited', not an exact conversion figure.
  • The 38 is a floor, not a ceiling. ChatGPT-User only fires when the user actually clicks; many more users likely SAW the citation and got their answer without clicking.
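The click ratio quoted in the list above is a plain division of two rows from the ranking table. As a quick sketch (the `percentOf` helper is our own name for illustration):

```typescript
// Counts taken from the ranking table in this post.
const chatGptUserHits = 38;
const oaiSearchBotHits = 355;

// One count expressed as a percentage of another.
function percentOf(part: number, whole: number): number {
  return (part / whole) * 100;
}

console.log(percentOf(chatGptUserHits, oaiSearchBotHits).toFixed(1) + "%"); // 10.7%
```

Because the two counts are aggregated separately, this is a ratio of totals and does not say which individual search fetch led to which click.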

Surprise 3 — the five bots that have not shown up

Five of the twelve bots we explicitly allow in robots.txt have crawled zero times in three weeks:

  • Google-Extended — Google's AI training crawler
  • Applebot-Extended — Apple Intelligence training
  • claude-web — Anthropic's realtime fetcher (replaces the old anthropic-ai UA)
  • Perplexity-User — Perplexity's user-click crawler
  • Bytespider — ByteDance / Doubao

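Notice the pattern: three of the five belong to Big Tech (Google, Apple, ByteDance). At the crawler level, the activity we can observe comes from the smaller, scrappier labs, OpenAI and Anthropic. One caveat on the zeros for Google and Apple: both companies document Google-Extended and Applebot-Extended as robots.txt control tokens rather than crawlers with their own User-Agent string, so their fetching arrives as Googlebot and Applebot and User-Agent matching cannot see it. A zero in those two rows means 'not measurable this way', not 'never visited'.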

The implication for small-business owners is counterintuitive: if you want AI visibility today, you should optimise for OpenAI and Anthropic, not for Google. Google's AI Overviews still draw mostly from the regular Google search index, which means classic SEO is the more direct lever there. The pure-AI-channel optimisation is happening at OpenAI and Anthropic.

What this means for your GEO strategy

  1. Open your robots.txt right now in incognito mode. If GPTBot, ClaudeBot, or OAI-SearchBot are in any Disallow rule, you are voluntarily out of the 99%+ of AI-engine traffic that matters today. Fix that first.
  2. Track AI bot hits yourself — even a grep of your access logs once a week reveals more than most analytics tools surface. Vercel Analytics will not split out AI UAs by default; you have to look at raw logs or use a custom logger.
  3. Stop worrying about being scraped. The asymmetry favours you: bots take bytes, AI engines give attribution. Block them and you keep your bytes but lose the citation channel.
  4. Treat ChatGPT-User hits as a leading conversion metric. Track the count over time; track which referring page (path) gets the clicks. Those pages are doing the GEO work — replicate their structure on weaker pages.
  5. Do not chase Google-Extended. Optimise for Google via the regular SEO playbook. The AI-channel is OpenAI + Anthropic for the next 6-12 months.
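Steps 2 and 4 can be done with a few lines of code over raw access logs. The sketch below assumes the common "combined" log format used by nginx and Apache (request line, status, size, referrer, User-Agent, each in that order). The function name `countAiHits`, the regular expression, and the shortened bot list are our own illustrative choices, so adjust them to whatever format your server writes.

```typescript
// A subset of the bot names from this post; extend as needed.
const BOTS = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot"];

// Captures the request path and the User-Agent from a combined-format line:
// ... "GET /path HTTP/1.1" 200 512 "referrer" "user agent"
const LINE = /"[A-Z]+ (\S+) [^"]*" \d{3} \S+ "[^"]*" "([^"]*)"/;

function countAiHits(logText: string): {
  perBot: Map<string, number>;
  clickPaths: Map<string, number>;
} {
  const perBot = new Map<string, number>();
  const clickPaths = new Map<string, number>();

  for (const line of logText.split("\n")) {
    const m = LINE.exec(line);
    if (m === null) continue; // not a combined-format line
    const path = m[1] ?? "";
    const ua = (m[2] ?? "").toLowerCase();

    const bot = BOTS.find((name) => ua.includes(name.toLowerCase()));
    if (bot === undefined) continue;

    perBot.set(bot, (perBot.get(bot) ?? 0) + 1);
    // Step 4: remember which paths receive ChatGPT-User hits.
    if (bot === "ChatGPT-User") {
      clickPaths.set(path, (clickPaths.get(path) ?? 0) + 1);
    }
  }
  return { perBot, clickPaths };
}
```

Read the log file into a string (for example with Node's `fs.readFileSync(path, "utf8")`), pass it in, and compare the two maps week over week.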

Methodology and the open data

We log AI bot hits via a Next.js middleware that runs on every request, matches the User-Agent header against 12 known patterns, and inserts a row into a Supabase Postgres table on match. The insert is fire-and-forget so it adds zero latency to the bot's response. The 12 patterns are the canonical UAs published by each AI vendor.
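To make the mechanism concrete, here is a minimal sketch of the two pieces described above: User-Agent matching and a fire-and-forget insert. It is not the findloc.ai source. The case-insensitive substring match, the `CrawlerVisit` row shape (only `bot_name` is named in this post), and the injected `insert` function standing in for the database client are all assumptions made for illustration.

```typescript
// The 12 bot names tracked in this post. Matching by case-insensitive
// substring is an assumption about how the patterns are applied.
const AI_BOT_PATTERNS: readonly string[] = [
  "GPTBot", "ClaudeBot", "OAI-SearchBot", "ChatGPT-User", "Amazonbot",
  "PerplexityBot", "CCBot", "Google-Extended", "Applebot-Extended",
  "claude-web", "Perplexity-User", "Bytespider",
];

// Returns the first tracked bot name found in the User-Agent, or null.
function matchAiBot(userAgent: string | null): string | null {
  if (!userAgent) return null;
  const ua = userAgent.toLowerCase();
  return AI_BOT_PATTERNS.find((name) => ua.includes(name.toLowerCase())) ?? null;
}

// Hypothetical row shape; apart from bot_name the real columns are not
// shown in this post.
interface CrawlerVisit {
  bot_name: string;
  path: string;
  visited_at: string;
}

// Fire-and-forget: start the insert, swallow any failure, and return
// without awaiting, so the response does not wait on the database.
function logVisit(
  userAgent: string | null,
  path: string,
  insert: (row: CrawlerVisit) => Promise<void>,
): string | null {
  const bot = matchAiBot(userAgent);
  if (bot === null) return null;
  const row: CrawlerVisit = { bot_name: bot, path, visited_at: new Date().toISOString() };
  void insert(row).catch(() => {
    // A logging failure must never break the page.
  });
  return bot;
}
```

One design note: on serverless and edge runtimes an un-awaited promise can be dropped once the response has been sent. Next.js passes middleware a second `event` argument whose `waitUntil` method exists for this case, so a production version would likely hand the insert promise to it rather than rely on a bare `void`.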

The aggregates in this post come from a single SQL function (ai_bot_hit_counts) that does a GROUP BY on the table. The same function powers the live AI hits panel on findloc.ai/directory — so the same numbers you see at the top of that page are the source of truth for everything here.
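The SQL itself is not reproduced in this post. As an illustration only, the same grouping can be expressed in TypeScript over rows already loaded into memory. The `hitCounts` name and the `share` field are our own additions; only `bot_name` comes from the table described here.

```typescript
interface Visit {
  bot_name: string;
}

interface BotCount {
  bot_name: string;
  hits: number;
  share: number; // percentage of all rows
}

// In-memory equivalent of grouping by bot_name and counting rows,
// ordered by hits descending.
function hitCounts(rows: readonly Visit[]): BotCount[] {
  const counts = new Map<string, number>();
  for (const row of rows) {
    counts.set(row.bot_name, (counts.get(row.bot_name) ?? 0) + 1);
  }
  const total = rows.length;
  return Array.from(counts.entries())
    .map(([bot_name, hits]) => ({
      bot_name,
      hits,
      share: total === 0 ? 0 : (hits / total) * 100,
    }))
    .sort((a, b) => b.hits - a.hits);
}
```

Bots with zero rows do not appear in the result, so a report that lists all tracked bots has to add the zero entries from the pattern list separately.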

We re-run these queries roughly weekly. The panel at findloc.ai/directory recomputes the same aggregates on every page render, so if you reference this post in your own writing or analysis, the live numbers there are the canonical version.

Frequently asked

  • How does findloc.ai log AI crawler hits?

    Every request to the site goes through a Next.js middleware that sniffs the User-Agent against a list of 12 known AI-bot patterns (GPTBot, ClaudeBot, etc.). Matching requests insert a row into the Supabase ai_crawler_visits table — fire-and-forget so the bot's response isn't slowed. The numbers in this post are SQL aggregates over that table.

  • Why is GPTBot so much bigger than every other AI bot?

    GPTBot is OpenAI's training crawler. It re-fetches sites repeatedly to refresh the data used for future ChatGPT models. Training crawlers are aggressive by design: they crawl at high frequency and work through the URLs in your sitemap plus internal links. On findloc.ai, GPTBot has made 36,019 requests (hits, not distinct pages), which is 73.8% of all AI bot traffic.

  • What is ChatGPT-User and why does it matter more than the bigger numbers?

    ChatGPT-User is the User-Agent OpenAI fires when a human clicks a link inside a ChatGPT answer. Each row in our table with bot_name='ChatGPT-User' represents one event where (1) ChatGPT cited findloc.ai in an answer, (2) a real user read it, and (3) they clicked through. There have been 38 such events. This is the only metric we have that proves end-to-end GEO success — not just 'AI read us' but 'AI cited us and a human acted on it.'

  • Why has Google-Extended not crawled findloc.ai yet?

    Three plausible reasons. (1) Google documents Google-Extended as a robots.txt control token with no User-Agent string of its own; the fetching is done by Google's existing crawlers, so a User-Agent match on 'Google-Extended' may never fire however much Google crawls. (2) Google's AI-training use tends to rely on what regular Googlebot has already indexed. (3) Google's AI Overviews are still selectively rolling out by region. We allow Google-Extended explicitly in robots.txt, so the silence is not caused by a block on our side.

  • Should I be worried about AI bots scraping my site?

    Worry less than you think. The bots are loud (hits aren't free for them either) but they respect robots.txt about 99% of the time. The real question isn't 'do they scrape' but 'do they cite back'. If your site is structured to be cite-able (schema.org, /llms.txt, FAQ sections), the scraping converts to citations. If it isn't, you're training their models for free without ever being mentioned.

  • How can I see AI crawler hits on my own site?

    Three options ordered by effort. (1) Free: grep your nginx/Apache/CloudFront access logs for the 12 known AI UAs (full list in our other post). (2) Free: claim a findloc.ai mini-page — your /my dashboard shows live per-bot crawler activity for that page. (3) Paid: server-side analytics like Plausible or Vercel Analytics surface bot UAs in some plans.

  • Is 48k AI hits in three weeks unusual for a small site?

    For a site with no AI bots specifically allowed, yes — most small sites see <100 AI hits/month. For one that does the GEO basics (allow all 12 bots, ship schema.org, publish /llms.txt and a real sitemap), no. The bots are extraordinarily active when given permission; the bottleneck is the permission, not their interest.

  • Does GPTBot crawling automatically mean ChatGPT cites me?

    No. GPTBot indexing makes you a candidate for citation in future ChatGPT answers, but the citation itself depends on (a) how well your content matches a user query at answer time, (b) how confidently the model can extract a factual claim from your page, and (c) whether competing sources have stronger schema/Q&A markup. Indexing is necessary but not sufficient.

  • Which AI bot should I prioritise allowing if I can only allow some?

    Allow all 12. The marginal cost of one extra Allow rule is zero; the marginal upside of being indexed by one more AI engine is significant. If forced to rank: GPTBot > OAI-SearchBot > ChatGPT-User > ClaudeBot > Google-Extended > PerplexityBot > all the rest. But ranking is a false choice: allow everything except actually malicious crawlers.

  • What does findloc.ai do?

    findloc.ai is a free tool that helps small businesses become citable by AI search engines (ChatGPT, Claude, Perplexity, Google AI Overviews). A free mini-page ships full schema.org markup, FAQ structure, /llms.txt inclusion, and an AI-crawler allowlist in 5 minutes. No credit card. Live AI-hit stats are pinned at the top of findloc.ai/directory.

Want to skip the manual work?

A free findloc.ai mini-page ships the full 8-point GEO stack automatically — schema.org markup, FAQPage, /llms.txt inclusion, sitemap, AI-bot allowlist. Five minutes, no credit card.
