ContentGrapher

Methodology

This page contains everything needed to evaluate or replicate the study design: URLs, prompts, pipeline config, and scoring math. It does not include ContentGrapher's internal analyzer code (the Phase 1 scoring rubric for the eight dimensions). That stays closed.

1. URL set

40 third-party informational pages spanning B2B SaaS, SEO, content marketing, finance, operations, and AI tooling. Selected by hand to span a range of topic depths and coverage states, then frozen before any measurement ran. Each URL was scraped via Firecrawl, with raw text saved to disk and reused across all three conditions (original, treatment, decoy).

  1. digicert.com/blog/what-is-a-certificate-authority
  2. dineachook.com.au/blog/parts-of-a-chicken-you-need-to-know
  3. mikeginley.com/blog/ai-llm-seo-content-template
  4. 8am.com/ai-for-law-firms
  5. intrepidtravel.com/au/japan/best-time-to-visit-japan
  6. teramind.co/blog/pros-and-cons-of-employee-monitoring
  7. langchain.com/articles/llm-as-a-judge
  8. remote100k.com/blog/is-flexjobs-legit
  9. equitylist.co/blog-post/top-5-cap-table-management-software-for-startups
  10. quickbooks.intuit.com/r/payments/small-business-payment-methods
  11. blog.uxtweak.com/best-ux-research-tools
  12. fitbod.me/blog/leg-and-arm-workout-same-day
  13. techpp.com/2026/05/09/dreame-l40-ultra-ae-review
  14. buffer.com/resources/social-media-marketing
  15. ahrefs.com/blog/largest-contentful-paint-lcp
  16. entitygarden.com/blog/how_to_audit_your_digital_persona
  17. business.adobe.com/blog/basics/what-is-content-marketing
  18. splitmetrics.com/blog/apple-search-ads
  19. teramind.co/blog/top-user-activity-monitoring-tools
  20. teramind.co/blog/best-data-loss-prevention-tools
  21. teramind.co/blog/insider-threats
  22. zendesk.com/blog/customer-service-skills
  23. intercom.com/blog/what-is-customer-success
  24. figma.com/blog/design-systems-101-what-is-a-design-system
  25. zoho.com/crm/what-is-crm.html
  26. employmenthero.com/blog/best-recruiting-software
  27. frontify.com/en/guide/digital-asset-management
  28. joist.com/blog/best-contractor-invoicing-software-2026
  29. celtra.com/blog/why-you-need-creative-automation
  30. business.adobe.com/blog/basics/omnichannel-vs-multichannel-marketing
  31. telnyx.com/resources/vapi-alternative
  32. digitalocean.com/community/tutorials/what-is-cloud-computing
  33. research.ibm.com/blog/what-are-ai-agents-llm
  34. cloudflare.com/learning/ddos/what-is-a-ddos-attack
  35. techpp.com/2026/05/06/best-ai-models-to-run-locally-on-phone
  36. assetpanda.com/resource-center/blog/which-assets-cannot-be-depreciated-key-exceptions
  37. assetpanda.com/resource-center/blog/a-comprehensive-guide-to-it-asset-disposition
  38. assetpanda.com/resource-center/blog/how-much-does-asset-tracking-cost
  39. upperinc.com/blog/how-to-reduce-shipping-costs
  40. tradebyte.com/en/blog/guide-to-ecommerce-inventory-management

2. Anchor and gap detection

For each page, ContentGrapher's analyzer extracts a primary anchor (the single topic the page is about, expressed as a short noun phrase) and scores the page against eight dimensions on a 0–3 ordinal scale (0 = absent, 3 = thorough). A dimension scored 2 or lower is flagged as a gap; a dimension scored exactly 3 is treated as already covered. The eight dimensions:

  1. What is it? (definition for a first-time reader)
  2. How does it work? (mechanism, process, or logic)
  3. What does it depend on? (prerequisites or requirements)
  4. What does it affect or produce? (outcomes or outputs)
  5. Who interacts with it? (users, systems, or roles)
  6. What constraints matter? (limits, edge cases, conditions)
  7. What alternatives or distinctions matter? (comparisons)
  8. What example grounds it? (concrete instance)

The anchor extraction and dimension scoring are the product. This page does not document how the scoring rubric works internally. What the study measures is whether the dimensions the analyzer flags (treatment) outperform the dimensions it does not flag (decoy) when filled with equivalent content.
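
The gap rule from this section (2 or lower is a gap, exactly 3 is covered) can be sketched in a few lines of Python. This is not the analyzer: the scores are made-up inputs, and the function name `split_dimensions` is hypothetical. It only shows how a set of eight scores becomes a treatment pool and a decoy pool.

```python
# Dimension names as listed in this section.
DIMENSIONS = [
    "What is it?",
    "How does it work?",
    "What does it depend on?",
    "What does it affect or produce?",
    "Who interacts with it?",
    "What constraints matter?",
    "What alternatives or distinctions matter?",
    "What example grounds it?",
]


def split_dimensions(scores):
    """Split eight 0-3 ordinal scores into (gaps, covered) dimension names."""
    if len(scores) != len(DIMENSIONS):
        raise ValueError("expected one score per dimension")
    gaps, covered = [], []
    for name, score in zip(DIMENSIONS, scores):
        if score not in (0, 1, 2, 3):
            raise ValueError(f"score out of range: {score}")
        # Exactly 3 counts as covered; anything lower is flagged as a gap.
        (covered if score == 3 else gaps).append(name)
    return gaps, covered


# Hypothetical scores for one page, not real analyzer output.
example_scores = [3, 3, 1, 2, 3, 0, 3, 2]
gaps, covered = split_dimensions(example_scores)
print(len(gaps), "gaps;", len(covered), "covered")
```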

3. Query generation (two personas, anchor-only)

Eight queries per page, generated by two GPT-4o persona passes. Each persona sees only the page's anchor string. Not the page content, not the eight dimensions, not any output from the analyzer. This breaks circularity between what gets scored and what gets asked. Queries are written to a file with status: "locked" before any measurement runs, so the query set cannot be retuned to fit the result.

Persona 1: naive learner (4 queries)

System prompt · gpt-4o · temperature 0.2 · JSON mode
You are a curious person new to a topic. You don't know technical jargon. You ask simple, direct questions to understand what something is and whether it matters to you. You're learning, not implementing.

Generate 4 questions a naive learner would Google about a topic. Each question must:
- Be a complete, well-formed question someone would actually type into a search engine
- Use plain language (no acronyms, no jargon)
- Cover different angles: what it is, why it matters, examples, common confusions
- Be specific enough that a single well-written page could answer it
- Make NO assumptions about a specific product, vendor, or implementation

Return JSON: { "queries": ["question 1", "question 2", "question 3", "question 4"] }

Persona 2: domain practitioner (4 queries)

System prompt · gpt-4o · temperature 0.2 · JSON mode
You are a working professional researching a topic for an active decision or project. You already understand the basics. You have time pressure. You ask questions that resolve open uncertainties standing between you and a decision.

Generate 4 questions a domain practitioner would Google about a topic. Each question must:
- Be a complete, well-formed question someone would actually type into a search engine
- Reflect a real practical concern: trade-offs, edge cases, costs, comparisons, integration, limits
- Be specific enough that a single well-written page could answer it
- Make NO assumptions about a specific product, vendor, or implementation
- Be different in shape from a naive learner's "what is it" questions

Return JSON: { "queries": ["question 1", "question 2", "question 3", "question 4"] }

User message in both cases is: Topic: "${anchor}"\n\nGenerate the 4 [persona] questions.
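
A minimal Python sketch of the anchor-only input and the lock step follows. It makes no API call. Two things are assumptions: that `[persona]` is replaced by the persona name ("naive learner" or "domain practitioner"), and the layout of the locked record beyond the `status: "locked"` field. The function names are hypothetical.

```python
import json


def build_user_message(anchor, persona_label):
    """Build the user message; the anchor string is the only page-derived input."""
    return f'Topic: "{anchor}"\n\nGenerate the 4 {persona_label} questions.'


def lock_queries(anchor, naive_queries, practitioner_queries):
    """Assemble the record to write to disk before any measurement runs."""
    queries = list(naive_queries) + list(practitioner_queries)
    if len(queries) != 8:
        raise ValueError("expected 4 naive + 4 practitioner queries")
    # Only the "status" field comes from the text; the other keys are assumed.
    return {"anchor": anchor, "status": "locked", "queries": queries}


message = build_user_message("certificate authority", "naive learner")
record = lock_queries(
    "certificate authority",
    ["q1", "q2", "q3", "q4"],  # placeholder strings, not real study queries
    ["q5", "q6", "q7", "q8"],
)
print(message)
print(json.dumps(record, indent=2))
```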

4. Section generation (treatment and decoy)

Both treatment and decoy sections are produced by the same generator with the same prompt and the same temperature. The only thing that differs is which dimension is passed in as the `question`: treatment uses the dimensions flagged as gaps; decoy uses dimensions the analyzer scored as already covered. The number of sections per page is the same in both conditions, equal to the number of gaps flagged.

System prompt · gpt-4o · temperature 0.3
You write concise, informative web content. Match the existing page tone. No heading needed.
User prompt template
Write a section for a webpage about "${anchor}" that addresses: ${question}

Match this page's writing tone:
${first 1000 characters of the existing page}

Write roughly 220 words of paragraph content.

The `question` placeholder is one of the eight dimension descriptions from Section 2 above. The `existing page` is the first 1,000 characters of the scraped original, included to anchor tone. Word target is 220 in both conditions.
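
To make the "only the question differs" property concrete, here is a sketch of the user prompt assembly in Python. The function name `build_section_prompt` and its parameter names are hypothetical. The template text, the 1,000-character tone sample, and the 220-word target come from this section.

```python
def build_section_prompt(anchor, question, page_text, tone_chars=1000, words=220):
    """Fill the user prompt template for one generated section."""
    tone_sample = page_text[:tone_chars]  # first 1,000 characters anchor the tone
    return (
        f'Write a section for a webpage about "{anchor}" that addresses: {question}\n\n'
        f"Match this page's writing tone:\n{tone_sample}\n\n"
        f"Write roughly {words} words of paragraph content."
    )


page_text = "Sample scraped text. " * 200  # stand-in for a scraped original page
treatment_prompt = build_section_prompt(
    "certificate authority", "What constraints matter?", page_text
)
decoy_prompt = build_section_prompt(
    "certificate authority", "What is it?", page_text
)
# Everything except the dimension passed as the question is identical.
print(treatment_prompt.replace("What constraints matter?", "What is it?") == decoy_prompt)
```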

5. Retrieval pipeline

The original, treatment, and decoy versions of each page are each chunked, embedded, indexed in ChromaDB, and queried with the eight locked queries. Top-K retrieval is set to 5. The same retrieval pipeline runs against all three conditions per page. Only the source content differs.
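
The chunk-and-retrieve step can be sketched as below, using the 512-token chunk size, 50-token overlap, and top-K of 5 from this section's config. Two stand-ins are assumptions: tokens are plain list items, not the embedding model's tokenizer output, and the search is an in-memory cosine ranking, not a ChromaDB query. The function names are hypothetical.

```python
import math


def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into overlapping windows of at most `size` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k=5):
    """Return (similarity, chunk_index) pairs for the k most similar chunks."""
    scored = sorted(
        ((cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)),
        reverse=True,
    )
    return scored[:k]


chunks = chunk_tokens(list(range(1000)))
print(len(chunks), [len(c) for c in chunks])
print(top_k([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2))
```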

Embedding model: text-embedding-3-large (3072-dim vectors, OpenAI)
Vector store: ChromaDB
Chunk size: 512 tokens (~2048 chars)
Chunk overlap: 50 tokens (~200 chars)
Retrieval depth: top-K = 5
Similarity thresholds: 0.4, 0.5, 0.6 (cosine), reported separately
Primary similarity threshold: 0.6
Judge model: gpt-4o (temperature 0)
Section generator: gpt-4o (temperature 0.3)
Treatment word target: 220 words per section
Decoy word target: 220 words per section
Pipeline config hash: 8875e557 (SHA-256 of locked config, first 8 chars)
Run date: 2026-06-11
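
A truncated SHA-256 over a locked config can be computed as in the sketch below. The serialization is an assumption (JSON with sorted keys and compact separators), as are the key names, so this will not reproduce the published 8875e557 value. It only illustrates why any change to a config value changes the hash.

```python
import hashlib
import json


def config_hash(config):
    """Return the first 8 hex chars of SHA-256 over a canonical JSON dump."""
    # Sorted keys make the hash independent of dict insertion order (assumed scheme).
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


# Key names are hypothetical; values are taken from the config above.
config = {
    "embedding_model": "text-embedding-3-large",
    "chunk_size_tokens": 512,
    "chunk_overlap_tokens": 50,
    "top_k": 5,
    "primary_similarity_threshold": 0.6,
}
print(config_hash(config))
print(config_hash({**config, "top_k": 10}))  # a different config gives a different hash
```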

6. The judge

For each query, the top-5 retrieved chunks are passed to GPT-4o with the prompt below. The judge sees only the chunks, not the rest of the page, not which condition the chunks came from. Its answer is a binary canAnswer field. Per-page judge recall is the fraction of the eight queries the judge answers true on. Per-condition judge recall is averaged across pages.

System prompt · gpt-4o · temperature 0 · JSON mode
You judge whether retrieved content can answer a reader's question. Return JSON: { "canAnswer": true/false, "reason": "one short sentence" }.
User prompt template
Question: ${query}

Retrieved passages (top 5 from a RAG pipeline):

[1] ${chunk 1, first 800 chars}

---

[2] ${chunk 2, first 800 chars}

---

[3] ${chunk 3, first 800 chars}

---

[4] ${chunk 4, first 800 chars}

---

[5] ${chunk 5, first 800 chars}

Using only these passages, can a reader answer the question? Answer "true" only if the passages directly address the question, not just adjacent topics.
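
The judge input and the per-page recall calculation can be sketched as follows. No model call is made: the `canAnswer` values are hypothetical inputs, and the function names are assumptions. The prompt layout, the 800-character truncation, and the "fraction of eight queries" definition come from this section.

```python
def build_judge_prompt(query, chunks, max_chars=800):
    """Fill the judge user prompt with up to five truncated passages."""
    passages = "\n\n---\n\n".join(
        f"[{i}] {chunk[:max_chars]}" for i, chunk in enumerate(chunks[:5], start=1)
    )
    return (
        f"Question: {query}\n\n"
        "Retrieved passages (top 5 from a RAG pipeline):\n\n"
        f"{passages}\n\n"
        "Using only these passages, can a reader answer the question? "
        'Answer "true" only if the passages directly address the question, '
        "not just adjacent topics."
    )


def page_recall(can_answer_flags):
    """Per-page judge recall in percent: share of queries judged answerable."""
    if not can_answer_flags:
        raise ValueError("need at least one judged query")
    return 100.0 * sum(bool(flag) for flag in can_answer_flags) / len(can_answer_flags)


def condition_recall(per_page_recalls):
    """Per-condition recall: the mean of per-page recalls."""
    return sum(per_page_recalls) / len(per_page_recalls)


# Hypothetical judge decisions for one page's eight queries.
flags = [True, True, False, True, False, True, True, False]
print(page_recall(flags))  # 5 of 8 answered -> 62.5
```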

7. Attribution math

Three quantities, all measured in percentage points:

  • Treatment lift = treatment recall − original recall
  • Decoy lift = decoy recall − original recall
  • Attribution delta = treatment lift − decoy lift

Each quantity is reported separately for the judge metric (binary, GPT-4o decision) and the cosine similarity ≥ 0.6 metric (binary, top-5 retrieval). Attribution delta is the headline measure: it answers whether the analyzer's specific picks contributed lift beyond what equal-length structural content of any kind would have produced.
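
A worked example of the three quantities, in Python. The recall values are hypothetical and do not come from the study. Note that the original recall cancels algebraically, so attribution delta equals treatment recall minus decoy recall.

```python
def attribution(original, treatment, decoy):
    """Compute the three quantities, all in percentage points."""
    treatment_lift = treatment - original
    decoy_lift = decoy - original
    return {
        "treatment_lift": treatment_lift,
        "decoy_lift": decoy_lift,
        "attribution_delta": treatment_lift - decoy_lift,
    }


# Hypothetical per-page recalls: 4/8, 6/8, and 5/8 queries answered.
result = attribution(original=50.0, treatment=75.0, decoy=62.5)
print(result)
# treatment_lift = 25.0, decoy_lift = 12.5, attribution_delta = 12.5
```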

Pages are bucketed by gap count: 0–2 gaps (n = 15), 3–4 gaps (n = 15), 5–8 gaps (n = 10). Means are computed within each bucket. pct_treatment_beats_decoy is the share of pages in a bucket where treatment lift exceeded decoy lift directly (not averaged).
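
The bucketing and the per-bucket share can be sketched like this. The input rows are hypothetical, and every field name except `pct_treatment_beats_decoy` is an assumption. A tie between treatment lift and decoy lift does not count as treatment beating decoy, which is how "exceeded" is read here.

```python
def bucket_for(gap_count):
    """Map a page's gap count to its bucket label."""
    if 0 <= gap_count <= 2:
        return "0-2"
    if 3 <= gap_count <= 4:
        return "3-4"
    if 5 <= gap_count <= 8:
        return "5-8"
    raise ValueError(f"gap count out of range: {gap_count}")


def summarize(pages):
    """Compute per-bucket means and the share of pages where treatment wins."""
    buckets = {}
    for page in pages:
        buckets.setdefault(bucket_for(page["gap_count"]), []).append(page)
    summary = {}
    for label, rows in buckets.items():
        n = len(rows)
        deltas = [r["treatment_lift"] - r["decoy_lift"] for r in rows]
        wins = sum(r["treatment_lift"] > r["decoy_lift"] for r in rows)
        summary[label] = {
            "n": n,
            "mean_attribution_delta": sum(deltas) / n,
            # Counted page by page, not derived from the averaged lifts.
            "pct_treatment_beats_decoy": 100.0 * wins / n,
        }
    return summary


# Hypothetical pages, not study data.
pages = [
    {"gap_count": 1, "treatment_lift": 12.5, "decoy_lift": 12.5},
    {"gap_count": 2, "treatment_lift": 25.0, "decoy_lift": 0.0},
    {"gap_count": 6, "treatment_lift": 37.5, "decoy_lift": 12.5},
]
print(summarize(pages))
```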

8. Statistical notes and noise floor

Each per-page recall is averaged over only 8 queries, so per-page recall moves in steps of 1/8 = 12.5pp, and the smallest non-zero per-page difference is a single query. The judge is not fully deterministic even at temperature 0, since OpenAI's API does not guarantee identical outputs across identical requests; rerunning the same chunks against the same query occasionally flips the answer. Practical implication: a single-page difference of one query (12.5pp) on the judge measure should not be over-interpreted. Bucket-level means (15 or 10 pages each) and aggregate means (40 pages) are more stable.

The QuickBooks small-business payment methods page (URL 10 in Section 1) is a control on the noise floor: ContentGrapher flagged 0 gaps, so treatment and decoy each added 0 sections and the page was unchanged in both conditions. The judge still reported +12.5pp on that page, which can only come from judge stochasticity. That number is the empirical noise floor for any single per-page measurement.

9. What is not in this methodology

ContentGrapher's analyzer code is not published. That includes: the Phase 1 prompt that extracts the primary anchor, the per-dimension scoring rubric, the integration state classifier, and the boundary layer detector. The study measures the analyzer's output, not its internals.

If you want to verify the per-page numbers, see the per-page results table. If you want to verify the headline finding, the study itself describes the split by gap count and the Zendesk outlier.