
Methodology

Everything you need to evaluate the study: the audience input under test, the four-condition design, the corpus and how it was repaired, the pre-classification step, the metric we used and the one we had to abandon, the counterbalanced judge panel, a deterministic confound, the statistics, and what this study does not test.

The four conditions

Phase 1 of the analysis, which reads the page and finds its concepts, was run once per page and held fixed. We then ran Phase 2 four times, changing only the audience string handed to it. The two wrong-audience conditions are the heart of the design: each changes exactly one field, so the level effect and the role effect can be read apart.

Condition	Audience sent to Phase 2	What it isolates
A — no audience	Empty string; the tool infers the reader from the page	The baseline output with no audience input
B — correct audience	The pre-assigned role and level, e.g. "Home cook, Intermediate level"	The full audience effect
C1 — wrong level	Correct role, level flipped to the far end	What the level field alone is doing
C2 — wrong role	Correct level, role swapped for an unrelated one	What the role field alone is doing

Each condition ran twice per page, giving 480 second-phase runs in total (60 pages × 4 conditions × 2 runs), plus a matching set of writing-guidance runs. The empty audience in condition A triggers the production fallback: the tool uses the reader it inferred during Phase 1.

The audience input

ContentGrapher's audience survey takes a reader role, a knowledge level, and an optional task. This study uses role and level only, constructed exactly as the dashboard does, in the form "Role, Level level", and leaves the task field out so that role and level are the only variables. The level set is beginner, intermediate, and advanced. Because that string is the entire audience signal Phase 2 receives, the study measures what the audience input can do, not what a longer brief might.
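A minimal sketch of how the four audience strings could be assembled, assuming only the "Role, Level level" format described above. The function names, the example wrong role, and the choice of wrong level are illustrative assumptions, not the study's code.

```python
def audience_string(role: str, level: str) -> str:
    """Format role and level as "Role, Level level"."""
    return f"{role}, {level.capitalize()} level"


def condition_audiences(role: str, level: str, wrong_level: str, wrong_role: str) -> dict:
    """Return the string sent to Phase 2 under each condition.

    wrong_level and wrong_role are supplied by the caller; how the study
    chose them is not reproduced here.
    """
    return {
        "A": "",                                    # empty string: tool falls back to its own inference
        "B": audience_string(role, level),          # correct role and level
        "C1": audience_string(role, wrong_level),   # correct role, wrong level
        "C2": audience_string(wrong_role, level),   # wrong role, correct level
    }


# Hypothetical page: a beginner home cook, with an unrelated role for C2.
audiences = condition_audiences("Home cook", "beginner", "advanced", "Tax accountant")

# Run count implied by the design: 60 pages, 4 conditions, 2 runs each.
total_runs = 60 * len(audiences) * 2
```

Each wrong condition differs from B in exactly one argument, which is what lets the level and role effects be compared separately.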

The page set, and how it was repaired

The corpus is 60 real pages spanning ten everyday categories (recipes, Australian property, career and management, consumer and clinical health, SEO and marketing, personal finance, and travel) across three reader strata.

Stratum	Pages	Who reads it
General population	32	Searchers reading everyday explanatory content (recipes, personal finance, travel, consumer health)
Professional	24	Domain practitioners (SEO and marketing, career and management, small-business finance)
Expert / clinical	4	Clinical and specialist references; the smallest slice because authoritative content is the hardest to fetch

The repair is worth stating plainly. Many of the originally chosen URLs sat on aggressively bot-protected domains the scraper could not read, and a content-quality audit caught several more that returned a site shell or hub instead of the article body. We replaced every one of those with a verified, scrapeable page on the same topic and reader profile, leaning on government, reference, and long-form article sources, and re-audited all 60 so each carries the correct article. The trade was some geographic specificity: a few Australian finance and property pages became global equivalents where no scrapeable local version existed. The reader role and level, which are what the study varies, were preserved.

Pre-classification

The correct audience for each page was assigned before any condition ran, by a separate, cheap classifier (Claude Haiku) reading the page's topic and summary, not its URL. It picked one role from a fixed list of 22, one level, and one stratum. Doing this independently keeps the correct audience from being whatever Phase 1 happened to infer, and the assignments were reviewed before the conditions ran. The resulting mix leaned intermediate (41 pages), with 15 beginner and 4 advanced.

The metric we used, and the one we abandoned

The study was pre-registered around a consistency metric: the run-to-run overlap of the recommended concepts, on the expectation, from earlier work, of a noisy baseline around 0.68 that a good audience might tighten. That metric turned out to be untestable here. The second phase runs at a fixed temperature, so it is effectively deterministic, and the run-to-run overlap is about 0.996 for every condition. There is no noise for the audience to reduce, and the metric cannot tell the conditions apart. We report it for completeness and move the primary analysis to where the effect actually shows up.
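To illustrate why a deterministic second phase leaves this metric with nothing to measure, here is a sketch that treats overlap as Jaccard similarity between the concept sets of two runs. The Jaccard definition is an assumption; the study's exact overlap formula is not stated here, and the concept labels are invented.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two concept sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two repeat runs that return the same concepts: overlap is 1.0.
run_1 = {"searing", "resting", "deglazing", "seasoning"}
run_2 = {"searing", "resting", "deglazing", "seasoning"}
identical = jaccard(run_1, run_2)

# A noisier pair, where one of four concepts differs: 3 shared out of 5 total.
run_3 = {"searing", "resting", "deglazing", "carryover cooking"}
noisy = jaccard(run_1, run_3)
```

When every pair of repeat runs looks like the first case, the metric sits near its ceiling in all conditions and cannot separate them.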

That place is the priority tier. The audience does not change which concepts are recommended (the concept set overlaps 99% between conditions), but it does change how they are ranked: essential, important, or useful. For each page we take the concepts that two conditions share and measure the fraction whose tier differs. The reference point is the tool's own run-to-run priority wobble, the same measure between two repeat runs of the no-audience condition, which is 5.9%. An audience effect has to clear that floor to count.
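The priority-change measure can be sketched as follows, assuming each run is represented as a mapping from concept label to tier. The data structure and the example concepts are assumptions for illustration.

```python
def tier_change_rate(run_a: dict, run_b: dict) -> float:
    """Fraction of concepts present in both runs whose tier differs."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        return 0.0
    changed = sum(1 for concept in shared if run_a[concept] != run_b[concept])
    return changed / len(shared)


no_audience = {
    "mise en place": "essential",
    "knife skills": "important",
    "deglazing": "useful",
    "resting meat": "important",
}
correct_audience = {
    "mise en place": "essential",
    "knife skills": "important",
    "deglazing": "important",   # the only tier that moved
    "resting meat": "important",
}

rate = tier_change_rate(no_audience, correct_audience)   # 1 of 4 shared concepts changed
RUN_TO_RUN_FLOOR = 0.059                                 # the 5.9% wobble reported above
clears_floor = rate > RUN_TO_RUN_FLOOR
```

Restricting the denominator to shared concepts keeps the measure about ranking only, so a concept that appears in one run but not the other does not count as a tier change.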

The judge panel, counterbalanced

To ask whether the audience makes the output better, not just different, three reviewer models (Claude Sonnet) saw the no-audience and correct-audience recommendations for each page, blinded, with no labels, and chose which served the reader better, or a tie. A first pass showed a strong, well-known failure mode: the judges preferred whichever set was shown second 69% of the time. A preference that size swamps the content. So every page was judged in both presentation orders and the verdict tallied across all six votes, which cancels the position bias by construction. We report the counterbalanced result; the raw, order-confounded number is on the data page as a cautionary figure.
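A small sketch of why judging in both orders cancels a position preference. It assumes each vote is recorded as the presentation order plus a choice of "first", "second", or "tie"; that record format is hypothetical, and how the six votes become a page-level verdict is not reproduced here.

```python
from collections import Counter


def resolve_vote(order: tuple, choice: str) -> str:
    """Map a positional choice back to the condition that was shown in that slot."""
    if choice == "tie":
        return "tie"
    return order[0] if choice == "first" else order[1]


def tally_page(votes: list) -> Counter:
    """Count votes per condition across all judges and both presentation orders."""
    return Counter(resolve_vote(order, choice) for order, choice in votes)


# Worst case: three judges who always pick whatever is shown second.
votes = (
    [(("A", "B"), "second")] * 3    # order 1: no-audience first, correct-audience second
    + [(("B", "A"), "second")] * 3  # order 2: the same pair, reversed
)
counts = tally_page(votes)
```

A purely positional judge contributes three votes to each condition, so the bias nets to zero and only a content-driven preference can tip the six-vote total.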

A deterministic confound

One rule inside the analysis is not the model reasoning about the reader. When the audience names a beginner, a depth filter deterministically demotes a few concepts whose role reads as advanced from essential to useful. It fires only on beginner-level strings, so it inflates the priority change on exactly those pages, and it explains part of why the correct audience tends to net-demote rather than promote. The data page reports the priority effect split by whether the depth filter could fire, so the model's own contribution can be read separately from the rule's.
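The rule can be pictured with a sketch like the one below. The field names `tier` and `depth`, and the substring check on the audience string, are assumptions made for illustration; the production filter's actual conditions are not shown on this page.

```python
def apply_depth_filter(concepts: list, audience: str) -> list:
    """Demote advanced-reading essential concepts when the audience is a beginner.

    Returns a new list; concepts that do not match are passed through unchanged.
    """
    if "beginner" not in audience.lower():
        return list(concepts)   # the rule does not fire for other levels
    return [
        {**c, "tier": "useful"}
        if c["depth"] == "advanced" and c["tier"] == "essential"
        else c
        for c in concepts
    ]


concepts = [
    {"label": "Maillard reaction chemistry", "tier": "essential", "depth": "advanced"},
    {"label": "preheating the pan", "tier": "essential", "depth": "basic"},
]

for_beginner = apply_depth_filter(concepts, "Home cook, Beginner level")
for_intermediate = apply_depth_filter(concepts, "Home cook, Intermediate level")
```

Because the demotion depends only on the audience string and the concept's attributes, it produces a tier change on beginner pages even if the model's own ranking is unaffected by the audience.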

The statistics

Priority-change comparisons are paired by page and tested with the Wilcoxon signed-rank test, the non-parametric paired test, since the per-page rates are not normally distributed. The headline comparisons are the correct audience against the run-to-run floor (does the audience move anything real), the correct role against the wrong role (does the role's accuracy matter), and a level change against a role change from the same starting point (which field carries the signal). The judge result is a binomial test on the counterbalanced win rate, with inter-judge agreement reported as Fleiss kappa. All figures and p-values, with their counts, are on the data page.
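For readers who want to check the judge statistics by hand, here is a standard-library sketch of an exact two-sided binomial test against a 50/50 null and of Fleiss' kappa. The example numbers are invented, and how ties and the two presentation orders enter the study's own calculation is not assumed here.

```python
from math import comb


def binomial_two_sided_p(wins: int, n: int) -> float:
    """Exact two-sided p-value for `wins` out of `n` under a fair-coin null."""
    lower = sum(comb(n, k) for k in range(0, wins + 1)) / 2 ** n
    upper = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return min(1.0, 2 * min(lower, upper))


def fleiss_kappa(counts: list) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is how many raters put subject i in category j; every
    subject must have the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement per subject, averaged over subjects.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_subject) / n_subjects
    # Agreement expected by chance from the overall category shares.
    shares = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(s * s for s in shares)
    return (p_bar - p_e) / (1 - p_e)


# Invented example: 8 wins in 10 decisive comparisons.
p_value = binomial_two_sided_p(8, 10)

# Invented example: 4 pages, 3 judges, categories (no audience, correct audience, tie).
kappa = fleiss_kappa([[3, 0, 0], [0, 3, 0], [2, 1, 0], [1, 2, 0]])
```

Kappa near zero means the judges agree no more often than their overall vote shares would predict, which is why it is reported alongside the win rate.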

What this study does not test

It does not test consistency, which was untestable on a deterministic pipeline; a more exploratory temperature is a separate experiment. It does not settle the expert case: the expert and advanced slices are four pages each. It does not test the optional task field, which was held out by design, so it says nothing about a fuller audience brief. It is one corpus and one pipeline; a different page set would move the exact figures, though we expect the shape (a stable concept set under a small, level-driven priority shuffle) to be more durable. And the judges are models, not editors: they remove single-maker bias, not the biases models share.

Artifacts

The corpus with its repair provenance, the per-page pre-classifications, all 480 second-phase outputs across the four conditions, the concept-label embeddings, the per-page priority analysis, the counterbalanced judge verdicts, and the aggregate statistics are persisted as JSON in the project repository.

Every figure on this page is reported on the data page with its counts.
