The decoy studyMethodology

Methodology

Everything needed to evaluate or replicate the study design: URLs, prompts, pipeline config, scoring math. Does not include ContentGrapher's internal analyzer code (the Phase 1 scoring rubric for the 8 dimensions). That stays closed.

1. URL set

40 third-party informational pages spanning B2B SaaS, SEO, content marketing, finance, operations, and AI tooling. Selected by hand to span a range of topic depths and coverage states, then frozen before any measurement ran. Each URL was scraped via Firecrawl, with raw text saved to disk and reused across all three conditions (original, treatment, decoy).

2. Anchor and gap detection

For each page, ContentGrapher's analyzer extracts a primary anchor (the single topic the page is about, expressed as a short noun phrase) and scores the page against eight dimensions on a 0–3 ordinal scale (0 = absent, 3 = thorough). A dimension scored 2 or lower is flagged as a gap; a dimension scored exactly 3 is treated as already covered. The eight dimensions:

What is it? (definition for a first-time reader)
How does it work? (mechanism, process, or logic)
What does it depend on? (prerequisites or requirements)
What does it affect or produce? (outcomes or outputs)
Who interacts with it? (users, systems, or roles)
What constraints matter? (limits, edge cases, conditions)
What alternatives or distinctions matter? (comparisons)
What example grounds it? (concrete instance)

The anchor extraction and dimension scoring are the product. This page does not document how the scoring rubric works internally. What the study measures is whether the dimensions the analyzer flags (treatment) outperform the dimensions it does not flag (decoy) when filled with equivalent content.

3. Query generation (two personas, anchor-only)

Eight queries per page, generated by two GPT-4o persona passes. Each persona sees only the page's anchor string. Not the page content, not the eight dimensions, not any output from the analyzer. This breaks circularity between what gets scored and what gets asked. Queries are written to a file with status: "locked" before any measurement runs, so the query set cannot be retuned to fit the result.

Persona 1: naive learner (4 queries)

System prompt · gpt-4o · temperature 0.2 · JSON mode

You are a curious person new to a topic. You don't know technical jargon. You ask simple, direct questions to understand what something is and whether it matters to you. You're learning, not implementing.

Generate 4 questions a naive learner would Google about a topic. Each question must:
- Be a complete, well-formed question someone would actually type into a search engine
- Use plain language (no acronyms, no jargon)
- Cover different angles: what it is, why it matters, examples, common confusions
- Be specific enough that a single well-written page could answer it
- Make NO assumptions about a specific product, vendor, or implementation

Return JSON: { "queries": ["question 1", "question 2", "question 3", "question 4"] }

Persona 2: domain practitioner (4 queries)

System prompt · gpt-4o · temperature 0.2 · JSON mode

You are a working professional researching a topic for an active decision or project. You already understand the basics. You have time pressure. You ask questions that resolve open uncertainties standing between you and a decision.

Generate 4 questions a domain practitioner would Google about a topic. Each question must:
- Be a complete, well-formed question someone would actually type into a search engine
- Reflect a real practical concern: trade-offs, edge cases, costs, comparisons, integration, limits
- Be specific enough that a single well-written page could answer it
- Make NO assumptions about a specific product, vendor, or implementation
- Be different in shape from a naive learner's "what is it" questions

Return JSON: { "queries": ["question 1", "question 2", "question 3", "question 4"] }

User message in both cases is: Topic: "${anchor}"\n\nGenerate the 4 [persona] questions.

4. Section generation (treatment and decoy)

Both treatment and decoy sections are produced by the same generator with the same prompt and the same temperature. The only thing that differs is which dimension is passed in as the `question`: treatment uses the dimensions flagged as gaps; decoy uses dimensions the analyzer scored as already covered. The number of sections per page is the same in both conditions, equal to the number of gaps flagged.

System prompt · gpt-4o · temperature 0.3

You write concise, informative web content. Match the existing page tone. No heading needed.

User prompt template

Write a section for a webpage about "${anchor}" that addresses: ${question}

Match this page's writing tone:
${first 1000 characters of the existing page}

Write roughly 220 words of paragraph content.

The `question` placeholder is one of the eight dimension descriptions from Section 2 above. The `existing page` is the first 1,000 characters of the scraped original, included to anchor tone. Word target is 220 in both conditions.

5. Retrieval pipeline

The original, treatment, and decoy versions of each page are each chunked, embedded, indexed in ChromaDB, and queried with the eight locked queries. Top-K retrieval is set to 5. The same retrieval pipeline runs against all three conditions per page. Only the source content differs.

Embedding model	`text-embedding-3-large`3072-dim vectors, OpenAI
Vector store	`ChromaDB`
Chunk size	`512 tokens (~2048 chars)`
Chunk overlap	`50 tokens (~200 chars)`
Retrieval depth	`top-K = 5`
Similarity thresholds	`0.4, 0.5, 0.6 (cosine)`reported separately
Primary similarity threshold	`0.6`
Judge model	`gpt-4o (temperature 0)`
Section generator	`gpt-4o (temperature 0.3)`
Treatment word target	`220 words per section`
Decoy word target	`220 words per section`
Pipeline config hash	`8875e557`SHA-256 of locked config, first 8 chars
Run date	`2026-06-11`

Embedding model

text-embedding-3-large3072-dim vectors, OpenAI

Vector store

ChromaDB

Chunk size

512 tokens (~2048 chars)

Chunk overlap

50 tokens (~200 chars)

Retrieval depth

top-K = 5

Similarity thresholds

0.4, 0.5, 0.6 (cosine)reported separately

Primary similarity threshold

0.6

Judge model

gpt-4o (temperature 0)

Section generator

gpt-4o (temperature 0.3)

Treatment word target

220 words per section

Decoy word target

220 words per section

Pipeline config hash

8875e557SHA-256 of locked config, first 8 chars

Run date

2026-06-11

6. The judge

For each query, the top-5 retrieved chunks are passed to GPT-4o with the prompt below. The judge sees only the chunks, not the rest of the page, not which condition the chunks came from. Its answer is a binary canAnswer field. Per-page judge recall is the fraction of the eight queries the judge answers true on. Per-condition judge recall is averaged across pages.

System prompt · gpt-4o · temperature 0 · JSON mode

You judge whether retrieved content can answer a reader's question. Return JSON: { "canAnswer": true/false, "reason": "one short sentence" }.

User prompt template

Question: ${query}

Retrieved passages (top 5 from a RAG pipeline):

[1] ${chunk 1, first 800 chars}

---

[2] ${chunk 2, first 800 chars}

---

[3] ${chunk 3, first 800 chars}

---

[4] ${chunk 4, first 800 chars}

---

[5] ${chunk 5, first 800 chars}

Using only these passages, can a reader answer the question? Answer "true" only if the passages directly address the question, not just adjacent topics.

7. Attribution math

Three quantities, all measured in percentage points:

Treatment lift = treatment recall − original recall
Decoy lift = decoy recall − original recall
Attribution delta = treatment lift − decoy lift

Reported separately for the judge metric (binary, GPT-4o decision) and the cosine similarity ≥ 0.6 metric (binary, top-5 retrieval). Attribution delta is the headline measure: it answers whether the analyzer's specific picks contributed lift beyond what equal-length structural content of any kind would have produced.

Pages are bucketed by gap count: 0–2 gaps (n = 15), 3–4 gaps (n = 15), 5–8 gaps (n = 10). Means are computed within each bucket. pct_treatment_beats_decoy is the share of pages in a bucket where treatment lift exceeded decoy lift directly (not averaged).

8. Statistical notes and noise floor

Each per-page recall is averaged over only 8 queries, so the smallest non-zero per-page difference is 1/8 = 12.5pp. The judge is stochastic at temperature 0 because of how OpenAI's API resolves ties; rerunning the same chunks against the same query occasionally flips the answer. Practical implication: single-page differences smaller than ~12pp on the judge measure should not be over-interpreted. Bucket-level means (15 or 10 pages each) and aggregate means (40 pages) are more stable.

The Business payment methods page is a control on the noise floor: ContentGrapher flagged 0 gaps, meaning treatment and decoy each added 0 sections (the page was unchanged in both conditions). The judge still reported +12.5pp on that page, which is purely judge-stochasticity. That number is the empirical noise floor for any single per-page measurement.

9. What is not in this methodology

ContentGrapher's analyzer code is not published. That includes: the Phase 1 prompt that extracts the primary anchor, the per-dimension scoring rubric, the integration state classifier, and the boundary layer detector. The study measures the analyzer's output, not its internals.

If you want to verify the per-page numbers, see the per-page results table. If you want to verify the headline finding, the study itself describes the split by gap count and the Zendesk outlier.