Methodology
Everything needed to evaluate or replicate the study design: URLs, prompts, pipeline config, scoring math. Does not include ContentGrapher's internal analyzer code (the Phase 1 scoring rubric for the 8 dimensions). That stays closed.
1. URL set
40 third-party informational pages spanning B2B SaaS, SEO, content marketing, finance, operations, and AI tooling. Selected by hand to cover a range of topic depths and coverage states, then frozen before any measurement ran. Each URL was scraped via Firecrawl, with raw text saved to disk and reused across all three conditions (original, treatment, decoy).
- 01: digicert.com/blog/what-is-a-certificate-authority
- 02: dineachook.com.au/blog/parts-of-a-chicken-you-need-to-know
- 03: mikeginley.com/blog/ai-llm-seo-content-template
- 04: 8am.com/ai-for-law-firms
- 05: intrepidtravel.com/au/japan/best-time-to-visit-japan
- 06: teramind.co/blog/pros-and-cons-of-employee-monitoring
- 07: langchain.com/articles/llm-as-a-judge
- 08: remote100k.com/blog/is-flexjobs-legit
- 09: equitylist.co/blog-post/top-5-cap-table-management-software-for-startups
- 10: quickbooks.intuit.com/r/payments/small-business-payment-methods
- 11: blog.uxtweak.com/best-ux-research-tools
- 12: fitbod.me/blog/leg-and-arm-workout-same-day
- 13: techpp.com/2026/05/09/dreame-l40-ultra-ae-review
- 14: buffer.com/resources/social-media-marketing
- 15: ahrefs.com/blog/largest-contentful-paint-lcp
- 16: entitygarden.com/blog/how_to_audit_your_digital_persona
- 17: business.adobe.com/blog/basics/what-is-content-marketing
- 18: splitmetrics.com/blog/apple-search-ads
- 19: teramind.co/blog/top-user-activity-monitoring-tools
- 20: teramind.co/blog/best-data-loss-prevention-tools
- 21: teramind.co/blog/insider-threats
- 22: zendesk.com/blog/customer-service-skills
- 23: intercom.com/blog/what-is-customer-success
- 24: figma.com/blog/design-systems-101-what-is-a-design-system
- 25: zoho.com/crm/what-is-crm.html
- 26: employmenthero.com/blog/best-recruiting-software
- 27: frontify.com/en/guide/digital-asset-management
- 28: joist.com/blog/best-contractor-invoicing-software-2026
- 29: celtra.com/blog/why-you-need-creative-automation
- 30: business.adobe.com/blog/basics/omnichannel-vs-multichannel-marketing
- 31: telnyx.com/resources/vapi-alternative
- 32: digitalocean.com/community/tutorials/what-is-cloud-computing
- 33: research.ibm.com/blog/what-are-ai-agents-llm
- 34: cloudflare.com/learning/ddos/what-is-a-ddos-attack
- 35: techpp.com/2026/05/06/best-ai-models-to-run-locally-on-phone
- 36: assetpanda.com/resource-center/blog/which-assets-cannot-be-depreciated-key-exceptions
- 37: assetpanda.com/resource-center/blog/a-comprehensive-guide-to-it-asset-disposition
- 38: assetpanda.com/resource-center/blog/how-much-does-asset-tracking-cost
- 39: upperinc.com/blog/how-to-reduce-shipping-costs
- 40: tradebyte.com/en/blog/guide-to-ecommerce-inventory-management
2. Anchor and gap detection
For each page, ContentGrapher's analyzer extracts a primary anchor (the single topic the page is about, expressed as a short noun phrase) and scores the page against eight dimensions on a 0–3 ordinal scale (0 = absent, 3 = thorough). A dimension scored 2 or lower is flagged as a gap; a dimension scored exactly 3 is treated as already covered. The eight dimensions:
- What is it? (definition for a first-time reader)
- How does it work? (mechanism, process, or logic)
- What does it depend on? (prerequisites or requirements)
- What does it affect or produce? (outcomes or outputs)
- Who interacts with it? (users, systems, or roles)
- What constraints matter? (limits, edge cases, conditions)
- What alternatives or distinctions matter? (comparisons)
- What example grounds it? (concrete instance)
The anchor extraction and dimension scoring are the product. This page does not document how the scoring rubric works internally. What the study measures is whether the dimensions the analyzer flags (treatment) outperform the dimensions it does not flag (decoy) when filled with equivalent content.
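The gap rule from this section (a score of 2 or lower is a gap, exactly 3 is covered) can be sketched as follows. This is a minimal illustration, not the analyzer: the scores are taken as given input, and the names `DIMENSIONS`, `GAP_THRESHOLD`, and `split_dimensions` are hypothetical.

```python
# The eight dimensions, in the order listed in this section.
DIMENSIONS = [
    "What is it?",
    "How does it work?",
    "What does it depend on?",
    "What does it affect or produce?",
    "Who interacts with it?",
    "What constraints matter?",
    "What alternatives or distinctions matter?",
    "What example grounds it?",
]

GAP_THRESHOLD = 2  # scores 0-2 are gaps; a score of 3 counts as covered


def split_dimensions(scores: dict[str, int]) -> tuple[list[str], list[str]]:
    """Return (gaps, covered) for one page's 0-3 dimension scores.

    The scores themselves come from the closed analyzer; this function
    only applies the published threshold to them.
    """
    gaps, covered = [], []
    for dim in DIMENSIONS:
        score = scores[dim]
        if not 0 <= score <= 3:
            raise ValueError(f"score out of range for {dim!r}: {score}")
        (gaps if score <= GAP_THRESHOLD else covered).append(dim)
    return gaps, covered
```

The gaps list feeds the treatment condition and the covered list feeds the decoy condition described in Section 4.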
3. Query generation (two personas, anchor-only)
Eight queries per page, generated by two GPT-4o persona passes. Each persona sees only the page's anchor string: not the page content, not the eight dimensions, and no other analyzer output. This breaks circularity between what gets scored and what gets asked. Queries are written to a file with status: "locked" before any measurement runs, so the query set cannot be retuned to fit the result.
Persona 1: naive learner (4 queries)
Persona 2: domain practitioner (4 queries)
User message in both cases is: Topic: "${anchor}"\n\nGenerate the 4 [persona] questions.
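As a rough sketch of the anchor-only setup, the snippet below fills the user-message template and serializes a locked query record. The exact text substituted for [persona], the JSON layout of the lock file, and the function names are assumptions for illustration; only the template wording and the status: "locked" field come from the description above.

```python
import json

PERSONAS = ("naive learner", "domain practitioner")
QUERIES_PER_PERSONA = 4


def build_user_message(anchor: str, persona: str) -> str:
    """Fill the user-message template; the anchor is the only page-specific input."""
    return f'Topic: "{anchor}"\n\nGenerate the {QUERIES_PER_PERSONA} {persona} questions.'


def lock_queries(anchor: str, queries_by_persona: dict[str, list[str]]) -> str:
    """Serialize one page's query set with status "locked" (record layout is assumed)."""
    for persona in PERSONAS:
        if len(queries_by_persona[persona]) != QUERIES_PER_PERSONA:
            raise ValueError(f"expected {QUERIES_PER_PERSONA} queries for {persona}")
    record = {"anchor": anchor, "status": "locked", "queries": queries_by_persona}
    return json.dumps(record, indent=2)
```

Because `build_user_message` takes nothing but the anchor and the persona label, there is no code path by which page content or dimension scores could leak into the queries.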
4. Section generation (treatment and decoy)
Both treatment and decoy sections are produced by the same generator with the same prompt and the same temperature. The only thing that differs is which dimension is passed in as the `question`: treatment uses the dimensions flagged as gaps; decoy uses dimensions the analyzer scored as already covered. The number of sections per page is the same in both conditions, equal to the number of gaps flagged.
The `question` placeholder is one of the eight dimension descriptions from Section 2 above. The `existing page` is the first 1,000 characters of the scraped original, included to anchor tone. Word target is 220 in both conditions.
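The "same generator, different `question`" design can be sketched as a request builder. The `SectionRequest` type and `build_requests` function are hypothetical names; the 220-word target and the 1,000-character excerpt are the values stated above, and the actual generator prompt is not reproduced here.

```python
from dataclasses import dataclass

WORD_TARGET = 220      # same in both conditions
EXCERPT_CHARS = 1000   # first 1,000 characters of the scraped original


@dataclass(frozen=True)
class SectionRequest:
    condition: str      # "treatment" or "decoy"
    question: str       # one of the eight dimension descriptions
    existing_page: str  # excerpt used to anchor tone
    word_target: int


def build_requests(condition: str, dimensions: list[str], page_text: str) -> list[SectionRequest]:
    """Build one generation request per dimension; only `question` varies by condition."""
    excerpt = page_text[:EXCERPT_CHARS]
    return [SectionRequest(condition, dim, excerpt, WORD_TARGET) for dim in dimensions]
```

Calling it once with the flagged gaps and once with an equally long list of covered dimensions yields two request sets that differ only in the `condition` label and the `question` field.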
5. Retrieval pipeline
The original, treatment, and decoy versions of each page are each chunked, embedded, indexed in ChromaDB, and queried with the eight locked queries. Top-K retrieval is set to 5. The same retrieval pipeline runs against all three conditions per page. Only the source content differs.
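To make the top-5 step concrete, here is a minimal in-memory sketch of cosine-similarity retrieval. It is a brute-force stand-in for the ChromaDB index, not the study's pipeline: chunking and embedding are assumed to have already produced the vectors, and `cosine` and `top_k` are hypothetical helper names.

```python
import math

TOP_K = 5  # retrieval depth used in the study


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = TOP_K) -> list[tuple[int, float]]:
    """Return (chunk_index, similarity) pairs for the k most similar chunks."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Running the same `top_k` call against the chunk vectors of the original, treatment, and decoy versions mirrors the constraint that only the source content differs between conditions.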
6. The judge
For each query, the top-5 retrieved chunks are passed to GPT-4o with the prompt below. The judge sees only the chunks, not the rest of the page, not which condition the chunks came from. Its answer is a binary canAnswer field. Per-page judge recall is the fraction of the eight queries the judge answers true on. Per-condition judge recall is averaged across pages.
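The two recall definitions reduce to simple averaging. A minimal sketch, assuming the judge's canAnswer verdicts have already been collected as booleans per query (the function names are illustrative):

```python
from statistics import mean


def page_recall(can_answer: list[bool]) -> float:
    """Per-page judge recall in percent: share of queries with canAnswer = true."""
    if not can_answer:
        raise ValueError("no judge verdicts for this page")
    return 100.0 * sum(can_answer) / len(can_answer)


def condition_recall(pages: list[list[bool]]) -> float:
    """Per-condition judge recall: unweighted mean of the per-page recalls."""
    return mean(page_recall(verdicts) for verdicts in pages)
```

For example, a page where the judge answers true on 5 of its 8 queries has a recall of 62.5%.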
7. Attribution math
Three quantities, all measured in percentage points:
- Treatment lift = treatment recall − original recall
- Decoy lift = decoy recall − original recall
- Attribution delta = treatment lift − decoy lift
All three are reported separately for the judge metric (binary, GPT-4o decision) and the cosine similarity ≥ 0.6 metric (binary, top-5 retrieval). Attribution delta is the headline measure: it answers whether the analyzer's specific picks contributed lift beyond what equal-length structural content of any kind would have produced.
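The three quantities can be computed per page as in the sketch below. Inputs are recalls in percent for one metric; the function name and the returned key names are illustrative, not the study's field names.

```python
def attribution(original: float, treatment: float, decoy: float) -> dict[str, float]:
    """Compute the three lift quantities (percentage points) from recalls in percent."""
    treatment_lift = treatment - original
    decoy_lift = decoy - original
    return {
        "treatment_lift": treatment_lift,
        "decoy_lift": decoy_lift,
        "attribution_delta": treatment_lift - decoy_lift,
    }
```

With an original recall of 50.0, a treatment recall of 75.0, and a decoy recall of 62.5, the lifts are 25.0 and 12.5 and the attribution delta is 12.5. Note that the original recall cancels out algebraically, so the attribution delta always equals treatment recall minus decoy recall.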
Pages are bucketed by gap count: 0–2 gaps (n = 15), 3–4 gaps (n = 15), 5–8 gaps (n = 10). Means are computed within each bucket. pct_treatment_beats_decoy is the share of pages in a bucket where treatment lift exceeded decoy lift directly (not averaged).
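A minimal sketch of the bucketing and the two bucket-level summaries follows. The per-page record layout (`gaps`, `treatment_lift`, `decoy_lift`) is an assumption for illustration, and the name `mean_attribution_delta` is hypothetical; `pct_treatment_beats_decoy` follows the definition above.

```python
from statistics import mean

# (label, lowest gap count, highest gap count) for each bucket
BUCKETS = (("0-2", 0, 2), ("3-4", 3, 4), ("5-8", 5, 8))


def bucket_for(gap_count: int) -> str:
    """Map a page's gap count to its bucket label."""
    for label, low, high in BUCKETS:
        if low <= gap_count <= high:
            return label
    raise ValueError(f"gap count out of range: {gap_count}")


def summarize(pages: list[dict]) -> dict[str, dict[str, float]]:
    """Per-bucket page count, mean attribution delta, and share of pages where treatment wins."""
    grouped: dict[str, list[dict]] = {label: [] for label, _, _ in BUCKETS}
    for page in pages:
        grouped[bucket_for(page["gaps"])].append(page)
    summary = {}
    for label, rows in grouped.items():
        if not rows:
            continue  # skip empty buckets rather than divide by zero
        wins = sum(r["treatment_lift"] > r["decoy_lift"] for r in rows)
        summary[label] = {
            "n": len(rows),
            "mean_attribution_delta": mean(r["treatment_lift"] - r["decoy_lift"] for r in rows),
            "pct_treatment_beats_decoy": 100.0 * wins / len(rows),
        }
    return summary
```

The win share uses a strict comparison per page, so a page where treatment and decoy lift are equal does not count as a win.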
8. Statistical notes and noise floor
Each per-page recall is averaged over only 8 queries, so the smallest non-zero per-page difference is 1/8 = 12.5pp. The judge is stochastic at temperature 0 because of how OpenAI's API resolves ties; rerunning the same chunks against the same query occasionally flips the answer. Practical implication: a single-page difference of 12.5pp on the judge measure corresponds to one query flipping and should not be over-interpreted. Bucket-level means (15 or 10 pages each) and aggregate means (40 pages) are more stable.
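The granularity argument can be checked with a few lines of arithmetic. This sketch only restates the 1/8 step size; the helper names are illustrative.

```python
QUERIES_PER_PAGE = 8
STEP_PP = 100.0 / QUERIES_PER_PAGE  # one query flipping moves per-page recall by this much


def possible_recalls(n: int = QUERIES_PER_PAGE) -> list[float]:
    """Every value a per-page recall can take with n binary verdicts, in percent."""
    return [100.0 * k / n for k in range(n + 1)]


def flips_explaining(diff_pp: float) -> float:
    """How many single-query flips a per-page difference corresponds to."""
    return abs(diff_pp) / STEP_PP
```

A per-page recall can only land on one of nine values (0, 12.5, 25, and so on up to 100), so a 25pp per-page difference corresponds to just two verdicts changing.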
The small-business payment methods page (URL 10, quickbooks.intuit.com) is a control on the noise floor: ContentGrapher flagged 0 gaps, meaning treatment and decoy each added 0 sections (the page was unchanged in both conditions). The judge still reported +12.5pp on that page, which is purely judge stochasticity. That number is the empirical noise floor for any single per-page measurement.
9. What is not in this methodology
ContentGrapher's analyzer code is not published. That includes: the Phase 1 prompt that extracts the primary anchor, the per-dimension scoring rubric, the integration state classifier, and the boundary layer detector. The study measures the analyzer's output, not its internals.
If you want to verify the per-page numbers, see the per-page results table. If you want to verify the headline finding, the study itself describes the split by gap count and the Zendesk outlier.