
Why AI ignores 96% of the web and what it actually takes to get cited

April 1, 2026

Most websites are invisible to AI. Not because they’re bad, but because they were built for a search engine that no longer runs the show. This blog is about what comes next.

AI citation is not a ranking problem. It’s a selection problem — and the selection criteria are fundamentally different from everything marketers learned in two decades of SEO. The most counter-intuitive findings across all available research: brand search volume now predicts AI citations 3x more strongly than backlinks (The Digital Bloom), keyword stuffing actively hurts AI visibility (arXiv), and 85% of pages that AI retrieves are never actually cited (ALM Corp). Most GEO advice circulating today is recycled SEO dressed in new language. The real mechanisms are stranger, more technical, and more consequential than the industry realizes.

This report synthesizes findings from the Princeton/IIT Delhi GEO study (KDD 2024), Stanford’s SourceCheckup analysis (Nature Communications 2025), Profound’s 680-million-citation dataset, Cloudflare’s crawler analysis, SE Ranking’s 129,000-domain study, and multiple other large-scale analyses. What emerges is a picture of AI citation as a multi-stage filtering pipeline where content must survive at least six distinct gates — and most content fails at the very first one.

AI citation works like a six-stage elimination funnel, not a ranking algorithm

Google ranks pages. AI eliminates them. The distinction matters. When a user asks ChatGPT, Perplexity, or Gemini a question, the answer draws from somewhere between 2 and 7 sources — compared to Google’s 10 blue links (Onely). This compression means AI citation is structurally more competitive than organic search. Understanding the six-stage pipeline explains why.

Stage 1: The search API gate. ChatGPT searches Bing (Search Engine Land); Gemini searches Google’s index; Perplexity uses a proprietary index of 200+ billion URLs (The Digital Bloom). If your content isn’t in the right index, the AI literally cannot see you. Only 6.82% of ChatGPT’s results overlap with Google’s top 10 (Getpassionfruit, The Digital Bloom) — because they’re searching different indexes entirely. A page perfectly optimized for Google may be invisible to ChatGPT simply because Bing doesn’t index it well.

Stage 2: Query fan-out. AI doesn’t search your query as-is. It decomposes one question into multiple sub-queries searched in parallel (iPullRank). Google holds a patent on this system, US20240289407A1 (Search Engine Journal). A Surfer SEO analysis of 173,020 URLs found that pages ranking for fan-out sub-queries are 161% more likely to be cited (Surfer). Critically, 95% of these generated sub-queries have zero traditional search volume — you cannot find or target them through conventional keyword research.
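To make the mechanics concrete, here is a minimal sketch of fan-out in Python. The template list and the `search_fn` stub are purely hypothetical — real engines generate sub-queries with an LLM, not fixed templates — but the union-of-results logic is the part that matters: a page that surfaces for several sub-queries becomes a stronger citation candidate.

```python
def fan_out(query: str) -> list[str]:
    # Hypothetical template expansion standing in for LLM-driven
    # query decomposition; the templates are illustrative only.
    templates = [
        "{q}",
        "{q} comparison",
        "{q} pricing",
        "best {q} 2026",
        "{q} pros and cons",
    ]
    return [t.format(q=query) for t in templates]

def search_all(query: str, search_fn) -> set[str]:
    # Run every sub-query and union the results; pages appearing
    # for multiple sub-queries survive more of the pipeline.
    hits: set[str] = set()
    for sub in fan_out(query):
        hits.update(search_fn(sub))
    return hits
```

A page optimized only for the literal query competes in just one of these parallel searches; a page that also covers pricing, comparisons, and trade-offs competes in several.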

Stage 3: Retrieval and chunking. The top 20–30 search results are fetched (Search Engine Land), broken into chunks, and embedded as vectors. Semantic similarity to the query determines which chunks survive. An AirOps study of 548,534 pages found that 85% of pages ChatGPT retrieves are never cited — retrieval is necessary but wildly insufficient.
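A toy sketch of this stage, using bag-of-words counts in place of real dense embeddings (an assumption made so the example is self-contained; production systems use neural embedding models and semantic chunking):

```python
import math
from collections import Counter

def chunk(text: str, size: int = 40) -> list[str]:
    # Fixed-size word windows — a crude stand-in for semantic chunking.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    # Bag-of-words "embedding"; real systems produce dense vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, pages: list[str], k: int = 3) -> list[str]:
    # Score every chunk from every page against the query;
    # only the top-k chunks survive into later stages.
    q = embed(query)
    chunks = [c for page in pages for c in chunk(page)]
    chunks.sort(key=lambda c: cosine(q, embed(c)), reverse=True)
    return chunks[:k]
```

The key implication survives the simplification: the unit of competition is the chunk, not the page, which is why a page can be retrieved and still contribute nothing quotable.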

Stage 4: Re-ranking. Machine learning models score retrieved chunks for relevance, authority, and freshness. Perplexity evaluates sources on four explicit dimensions: credibility, recency, relevance, and clarity (Business Library, “Am I Cited”). This is where domain authority signals still matter — but less directly than in traditional search.

Stage 5: Context window injection. Only 3–8 sources make it into the LLM’s context. Here, Stanford/UC Berkeley’s “Lost in the Middle” research becomes critical: LLMs exhibit a U-shaped attention curve (MIT Press), with over 30% performance degradation for information placed in the middle of the context (Morph). MIT researchers traced this to causal masking in the transformer architecture itself — it’s an architectural bias, not a learned one. This means 44% of ChatGPT citations come from the first 30% of a page’s content (Victorino Group +2). Front-loading matters enormously.

Stage 6: Generation with RLHF bias. The model synthesizes its answer, choosing which sources to cite. RLHF training — where human raters consistently prefer responses that cite authoritative-seeming sources — creates a systematic bias toward recognized brands, well-known publications, and Wikipedia-style encyclopedic content. A NAACL 2025 study confirmed that LLMs exhibit citation bias toward already highly-cited works, creating a rich-get-richer dynamic.

The Princeton GEO study proved keyword stuffing hurts and statistics win

The only large-scale, peer-reviewed controlled experiment on generative engine optimization — Aggarwal et al., published at KDD 2024 (Victorino Group, Princeton University) — tested nine optimization strategies across 10,000 queries (Frase +4). The results overturn core SEO assumptions.

Adding statistics improved visibility by up to 41% (arXiv). Embedding specific, cited data points was the single most effective strategy across domains (arXiv). This isn’t surprising once you understand the pipeline: LLMs are trained to treat quantitative claims as more citable than qualitative assertions. Content saying “the conversion rate averages 3.2% across B2B SaaS” will be cited over content saying “conversion rates are generally low.”

Adding quotations from credible sources improved visibility by 37% (arXiv). Expert quotes with verifiable attribution give the LLM an extractable, authoritative passage it can reference directly (arXiv). This maps to an RLHF preference — human raters judge quoted expert opinions as more trustworthy.

Citing sources within your content improved visibility by up to 115% for lower-ranked sites (Search Engine Land). This is perhaps the most striking finding: including references to credible sources in your own content dramatically increases the chance that AI will cite your page. It signals to the model that your content is well-researched (Frase). Position-5 sites saw a 115% visibility increase from this strategy alone — while position-1 sites actually saw decreased visibility from the same optimization (The Digital Bloom, Search Engine Land), likely because they were already being cited at their ceiling.

Keyword stuffing had a negative impact. Traditional SEO’s foundational tactic actively reduced AI visibility (arXiv). LLMs process natural language and semantic meaning; keyword density reads as spam to a model trained on human-preference feedback (arXiv).

The combination of fluency optimization plus statistics addition outperformed any single strategy by more than 5.5% (arXiv), suggesting that AI rewards the intersection of readability and data density.

Brand mentions beat backlinks by 3-to-1 for AI citations

The single most disruptive finding for SEO professionals: brand search volume correlates at 0.334 with AI citations — the strongest predictor measured (The Digital Bloom) — while backlinks show only weak correlation at 0.10 in Seer Interactive’s 10,000-question study (Elementor). Brand mentions correlate 3x more strongly with AI citation rates than backlinks do (0.664 vs. 0.218) (Onely).

This inversion makes mechanical sense. LLMs learn entity associations from training data. A brand mentioned frequently across Reddit threads, YouTube transcripts, news articles, and industry publications becomes a strong node in the model’s knowledge graph. Backlinks are invisible to LLMs — they’re structural HTML signals that models don’t process as authority markers during generation (Ekamoira).

Brands appearing on 4+ platforms are 2.8x more likely to appear in ChatGPT responses (The Digital Bloom, Ekamoira). The top 25% of brands by web mentions earn 10x more AI Overview citations than the next quartile (Higoodie, Getpassionfruit). Reddit mentions specifically show an outsized effect: domains with 10M+ Reddit mentions averaged 7 citations versus 1.8 for those with minimal Reddit activity (Search Engine Journal).

This doesn’t mean small brands are hopeless — it means the type of authority that matters has changed. Domain authority’s correlation with AI Overview citations dropped to r = 0.18 (Ziptie), down from 0.43 pre-2024 (Wellows). Meanwhile, topical authority — deep, recognized expertise in a specific domain — correlates at r = 0.4, the strongest AI citation predictor (Ziptie). A niche CRM review site that’s frequently referenced in Reddit discussions and industry forums can outcompete a high-DA generalist publisher. Kevin Indig’s analysis of ~98,000 ChatGPT citation rows found that a single well-structured comparison page on learn.g2.com earned 65 unique prompts and 495 citations — outperforming entire domain portfolios of well-known brands in the CRM vertical (Growth-memo).

Each platform cites a different web – and only 11% of domains overlap

One of the most actionable findings from Profound’s 680-million-citation dataset: only 11% of domains are cited by both ChatGPT and Perplexity. The AI citation landscape isn’t one game – it’s several parallel games with different rules.

ChatGPT leans encyclopedic. Wikipedia captures 7.8% of total citations (Profound), or 47.9% of top-10 source share (Profound, The Digital Bloom). It searches Bing’s index (Euskal Conseil +2), shows the strongest recency bias — 76.4% of its most-cited pages were updated within 30 days (Getpassionfruit) — and cites high-quality sources at a 96.2% rate, the highest among all providers (arXiv). About 60% of queries are answered from parametric memory alone, without any web search (The Digital Bloom).

Perplexity is community-driven. Reddit dominates at 6.6–46.5% of citations depending on measurement methodology (The Digital Bloom +2). It runs real-time search against a proprietary 200-billion-URL index (The Digital Bloom), applies aggressive time decay with a ~30-day freshness sweet spot (AuthorityTech), and always provides inline numbered citations — making it the most transparent platform for verification. Its citation overlap with Google’s top 10 is about 60% (AuthorityTech).

 

Google AI Overviews are the most diversified but self-referential. YouTube (9.5%), Wikipedia (8.4%), and Reddit (7.4%) lead (Ahrefs, Surfer) — with Google properties collectively commanding 43% of citations (Decoding, The Digital Bloom). The overlap between AI Overview citations and top-10 organic results dropped from 76% in mid-2024 to 38% by early 2026 (Frase +3), a stunning decoupling.

Gemini uniquely favors brand-owned content, with 52.15% of citations coming from brand domains — compared to ChatGPT’s reliance on directories and third-party listing platforms at 48.73% (AuthorityTech).

The strategic implication: optimizing for one AI platform may have zero effect on another. Platform-specific content strategies are becoming necessary (The Digital Bloom, Ekamoira).

The seven things most GEO articles get wrong

Philipp Götza’s “ladder of misinference” framework, published in Search Engine Land in January 2026, provides the sharpest critique of current GEO advice: most recommendations fail at the “evidence → proof” transition, confusing correlation with causation. Here are the most pervasive misconceptions.

“Schema markup significantly boosts AI citations.” This is the single most repeated GEO tip — and the evidence doesn’t support it. Götza documents that “there is no evidence that AI chatbots access schema markup” — tokenization during pretraining strips HTML elements (Search Engine Land). SE Ranking’s data actually shows pages with FAQ schema averaged fewer citations (3.6 vs. 4.2 without) (Search Engine Journal). The correlation between schema-rich pages and AI citations likely reflects a confound: well-organized sites implement both schema and quality content. Schema helps crawlers discover and parse content, but it doesn’t directly influence the LLM’s citation decision.

“If you rank #1 on Google, you’re guaranteed AI citations.” Pages ranking #1 see citation rates of only 33% in AI Overviews (Getpassionfruit). ChatGPT draws substantially from sources beyond Google — only 16.61% of its cited pages even appeared in Google results for the same query (ALM Corp). And 28.3% of ChatGPT’s most-cited pages have zero organic visibility in Google (Getpassionfruit, The Digital Bloom).

“High domain authority guarantees AI visibility.” AirOps data shows ~74% of ChatGPT citations went to sites with domain authority under 80. Sites in the DA 80–100 tier had a citation rate of only 15% after retrieval — lower than every other authority tier (ALM Corp). The prestige filter that dominates traditional SEO is weaker in AI citation.

“Optimize for exact AI prompts like keywords.” Average AI prompts run 23 words, versus 3–4 words for search queries (Andreessen Horowitz, Xfunnel). LLMs understand intent and context. Fan-out queries decompose prompts into sub-queries that have zero traditional search volume (Surfer). “Exact-match prompts” don’t exist as a GEO concept.

“LLMs.txt files are essential.” SE Ranking’s 129,000-domain study found “negligible impact” — described as an “AI hack” that barely works.

“Content volume is a GEO strategy.” SE Ranking practitioners reported that publishing AI-generated articles daily initially lifted visibility, but “after 2–3 weeks, visibility began to drop sharply” (Onely) as AI systems detected repetitive structures.

“GEO will destroy your SEO.” The strategies that improve AI citation — adding statistics, citing credible sources, clear structure, removing promotional language — also improve traditional SEO. The Princeton study confirmed top GEO strategies are compatible with SEO best practices (arXiv +2). The real issue isn’t conflict — it’s that GEO requires additional work that SEO doesn’t cover.

Content structure that AI can extract beats content that humans enjoy reading

The optimal content architecture for AI citation looks markedly different from what performs well on social media or even in traditional search. The data converges on a specific structural pattern.

Front-load direct answers in 40–60 word blocks after every heading. This is the single highest-impact structural change (bradleebartlett). AI Overviews pull answers in chunks of 130–160 words, but the extractable “answer capsule” that gets cited is typically 40–60 words — short enough for clean extraction, long enough to be substantive. Pages with a short, direct answer immediately after a question-based heading see 72.4% higher citation rates (Surfer).
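This guideline is easy to audit mechanically. The sketch below is a hypothetical checker, not a tool from any of the cited studies: it maps each markdown heading to the word count of the paragraph immediately following it, then flags headings whose answer capsule falls outside the 40–60 word band.

```python
import re

def answer_capsules(markdown: str) -> dict[str, int]:
    # Map each heading to the word count of the paragraph that
    # immediately follows it (paragraphs = blank-line-separated blocks).
    capsules = {}
    blocks = re.split(r"\n\s*\n", markdown.strip())
    for i, block in enumerate(blocks):
        if block.lstrip().startswith("#") and i + 1 < len(blocks):
            heading = block.lstrip("# ").strip()
            capsules[heading] = len(blocks[i + 1].split())
    return capsules

def flag_headings(markdown: str, lo: int = 40, hi: int = 60) -> list[str]:
    # Headings whose first paragraph misses the 40-60 word target band.
    return [h for h, n in answer_capsules(markdown).items() if not lo <= n <= hi]
```

Running a checker like this over existing pages is a cheap first pass before any rewriting: it surfaces the headings where the direct answer is missing, buried, or bloated.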

Listicles and comparison tables dramatically outperform narrative prose. Listicles account for 50% of top AI citations. Tables increase citation rates by 2.5x (Onely). Comparison tables using proper HTML structure see 47% higher AI citation rates (The Digital Bloom). The reason is mechanical: 78% of AI-generated answers include list formats — if your content doesn’t include extractable lists, the AI has to work harder to synthesize from your prose, and it will prefer a competitor’s content that’s already in list form (bradleebartlett).
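For reference, “proper HTML structure” here means semantic table markup — `table`, `thead`, `th` with a `scope` attribute, `tbody` — rather than a grid built from styled `div`s. A minimal sketch (the rows are illustrative placeholder data, not figures from any study):

```html
<table>
  <caption>CRM pricing comparison (illustrative data)</caption>
  <thead>
    <tr>
      <th scope="col">Tool</th>
      <th scope="col">Starting price</th>
      <th scope="col">Free tier</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>Tool A</td><td>$12/user/mo</td><td>Yes</td></tr>
    <tr><td>Tool B</td><td>$25/user/mo</td><td>No</td></tr>
  </tbody>
</table>
```

A table like this survives chunking cleanly: each row is a self-contained fact with labeled columns, which is exactly the shape an answer engine can lift into a comparison response.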

Long-form, comprehensive content gets cited 3x more than short posts, but length alone isn’t sufficient. Articles over 2,900 words average 5.1 citations versus 3.2 for those under 800 words (The Digital Bloom, SeaRanks). The real driver isn’t word count but semantic completeness — the degree to which content fully addresses all facets of a topic. This metric correlates at r = 0.87 with AI Overview citation, making it the strongest content-level predictor. Content scoring 8.5/10 or higher on semantic completeness is 4.2x more likely to be cited (Wellows).

Multi-modal content — text plus images plus video plus structured data — sees 156% higher AI selection rates (Wellows). YouTube has emerged as the most-cited domain in Google AI Overviews (Decoding +3), and YouTube mentions in titles, transcripts, and descriptions are the strongest correlating factor with AI Overview visibility across an Ahrefs study of 75,000 brands (ALM Corp).

Non-promotional tone is essential. Semrush’s content optimization study found a 26% reduction in citation rates for promotional content (Semrush). AI systems, shaped by RLHF on human preferences, learn that promotional language correlates with lower trustworthiness. Every paragraph should answer one question fully, lead with declarative sentences, and read like a reference document rather than a sales page.

The invisible web: 96.4% of content AI never sees

Perhaps the most sobering data for marketers comes from Cloudflare’s crawler analysis. GPTBot (OpenAI) reaches only 3.6% of web pages — less than a third of Googlebot’s already-limited 11.6%. PerplexityBot crawls just 0.06% of pages, nearly 200 times fewer than Google (Search Engine Journal). The vast majority of web content is structurally invisible to AI.

The causes are layered. Robots.txt blocking is the most visible barrier: 79% of top news sites block AI training bots (BuzzStream), 54.2% of news publishers have opted out of at least one AI crawler (Arc XP), and 88.9% of robots.txt files in DataDome’s analysis explicitly disallow GPTBot (DataDome). But many publishers play a nuanced game — blocking training bots while allowing retrieval bots, hoping to appear in AI search results without surrendering content for model training (BuzzStream).
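That split policy is expressed directly in robots.txt. The fragment below is a sketch using the crawlers’ published user-agent tokens — GPTBot and CCBot fetch content for training, while OAI-SearchBot and PerplexityBot serve search/retrieval — but vendors add and rename bots over time, so verify current tokens against each crawler’s own documentation before relying on this:

```txt
# Block model-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search/retrieval crawlers so pages can still be cited
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```

Note that robots.txt is advisory: it only constrains crawlers that choose to honor it, which is part of why the paywall findings below are possible.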

The Cloudflare default problem is less visible but potentially more consequential. Cloudflare’s one-click AI bot blocking has been overwhelmingly adopted by customers (Cloudflare), meaning millions of sites behind Cloudflare may be invisible to AI crawlers without their operators ever making a conscious choice. Common Crawl’s Stephen Burns documented a case where a premier children’s hospital was effectively invisible to AI because Cloudflare’s default settings blocked CCBot — the hospital’s team had never intentionally opted out.

Paywalled content creates a paradox. Despite access restrictions, 96% of NYT article citations and 99.13% of Washington Post citations in AI Overviews come from behind paywalls (SeaRanks). AI systems reconstruct paywalled content through client-side overlay scraping, fragment reconstruction from social media and cached snippets (Tom’s Guide), and AI browsers like OpenAI’s Atlas and Perplexity’s Comet that appear as normal Chrome sessions (Columbia Journalism Review). Only true server-side paywalls (Wall Street Journal, Bloomberg) effectively block AI access (Columbia Journalism Review).

The crawl-to-refer ratio reveals the extractive economics: Anthropic crawls 25,000–100,000 pages for every single referral it sends back, and OpenAI’s ratio reaches 3,700:1 (Ekamoira). Compare this to Google Search at 3:1 to 30:1 (Search Engine Journal). AI platforms consume enormous amounts of content while returning virtually no traffic — a fundamental break from the traditional search compact that sustained publishers for two decades.

The real-world impact is already visible. Travel blog The Planet D lost half its traffic after AI Overviews launched, then lost another 90% (AdExchanger). Charleston Crafted, a home improvement blog, saw a 70% traffic decline in three months (AdExchanger). News publishers collectively lost 600+ million monthly visits between mid-2024 and May 2025 (The Digital Bloom). Zero-click searches rose from 56% in 2024 to 69% by May 2025 (Trialguides +2). Being cited by AI is increasingly necessary for survival — but even citation often doesn’t translate to clicks. Pew Research found that less than 1% of users click on links within AI Overviews (Position Digital, Devenup).

Conclusion: the real game is entity building, not page optimization

The central insight across all this research is that AI citation operates on entity-level recognition, not page-level optimization. Google asks “is this page the best answer?” AI asks “is this entity trustworthy enough to cite?” (Search Engine Journal). The distinction reshapes strategy entirely.

Three genuinely novel takeaways emerge. First, the information-gain principle: AI preferentially cites content that adds information not already present in its training data. Google’s own patent (US20200349181A1) describes ranking based on “information gain score” — the delta between what a document offers and what the model already knows. Standard LLM output represents the average of existing knowledge, which by definition has near-zero information gain (Yotpo). Original research, proprietary data, and contrarian perspectives backed by evidence signal citation-worthiness because they offer what the model cannot generate itself (Yotpo, LLMGeoKit). Brands publishing thought leadership with unique insights are cited 2.5x more often than those with generic content (Onely).

Second, citation is probabilistic, not deterministic. Rand Fishkin’s research found that asking the same AI engine the same question 100 times yields less than a 1-in-100 chance of getting the same answer twice (Cassie Clark). There is no “position #1” in AI search. There are only probabilities of citation, influenced by real-time retrieval, model temperature, and context window composition. This means the traditional SEO mental model of “ranking” is architecturally wrong for AI optimization.

Third, the 50–90% citation inaccuracy rate undermines the entire premise. Stanford’s SourceCheckup study, published in Nature Communications, found that 50–90% of LLM citations don’t fully support the claims they’re attached to (ResearchGate +2). The Columbia Journalism Review found AI produces incorrect citations more than 60% of the time. The system that marketers are racing to optimize for is itself unreliable — which means the long-term winners will be brands that build entity recognition so strong that AI cites them despite its own limitations, not brands that game the current citation pipeline before it inevitably changes.

 
