Suggested Experiments

Directional findings from the current analysis run that show a potential signal but cannot yet support a style recommendation. Each card explains what to test, what is currently missing, and what different results would mean for guidance.

Run: 2026-04 · 8 suggestions · Page auto-regenerated each analytics run

Bonferroni fail: raw p < 0.05 but does not survive α/k correction
Underpowered: too few data points for reliable inference
Directional: 0.05 ≤ p < 0.10 after BH-FDR correction
Untested: data available, analysis not yet run

High Priority

SmartNews Bonferroni fail ↑ High

“Here’s / Here are” Lift — Needs Confirmation

Current signal

Articles using “Here’s / Here are” on SmartNews score a median percentile rank of 0.543 vs. 0.500 for the direct-declarative baseline (n=585, raw p=0.038). Direction is positive — opposite to question and “What to know” on the same platform.

What’s missing

Raw p=0.038 does not survive Bonferroni correction at k=5 formula families (threshold α/k = 0.010). All data is observational; no A/B comparison is available. Topic confounding (“Here’s” may correlate with better-performing story types) has not been controlled for.
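
The threshold arithmetic above can be checked in a few lines. This is a minimal sketch, assuming a family-wise α of 0.05 (implied by α/k = 0.010 at k = 5); the function name is illustrative and is not part of the pipeline.

```python
ALPHA = 0.05      # assumed family-wise error rate (implied by α/k = 0.010 at k = 5)
K_FAMILIES = 5    # formula families tested on SmartNews, per the card
RAW_P = 0.038     # raw p for "Here's / Here are", per the card


def bonferroni_threshold(alpha: float, k: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / k


threshold = bonferroni_threshold(ALPHA, K_FAMILIES)
survives = RAW_P < threshold
print(f"threshold={threshold:.3f}, survives={survives}")
```

With the card's figures, the raw p sits above the corrected threshold, which is why the finding is tagged as a Bonferroni fail rather than confirmed.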

Test question

If T1 editors A/B tested “Here’s / Here are” against a direct-declarative version of the same story on SmartNews, would the formula version consistently outperform? Does the effect hold across topics (sports, crime, weather) or appear only in a topic subset?

How to run it

For 30–50 stories over 4–6 weeks, write two headline versions per story: (A) “Here’s / Here are” format and (B) a direct declarative covering the same facts. If the CMS supports A/B headline testing, use it; otherwise alternate formula by publication day or assign by outlet. Record SmartNews views at 7 days post-publish. Test: Mann-Whitney U on SmartNews views (A vs. B), rank-biserial r as effect size. Minimum n = 30 matched story pairs for reliable inference. Stratify by topic (sports/crime/weather vs. other) to check whether any interaction hides or drives the aggregate result. Do not use the same story on Apple News A/B — keep platforms separate to avoid contamination.
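
The test named above (Mann-Whitney U with rank-biserial r) could be sketched as follows. This is a standard-library-only illustration, not the pipeline's implementation: it uses the normal approximation with tie correction and no continuity correction, and the function name is hypothetical. A statistics library would normally be used for the real analysis.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U via normal approximation (tie-corrected).

    Returns (U for group a, p-value, rank-biserial r).
    r > 0 means values in `a` tend to be larger than values in `b`.
    """
    n1, n2 = len(a), len(b)
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b], key=lambda t: t[0])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2          # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    n = n1 + n2
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        p = 1.0                              # all values identical
    else:
        z = (u_a - n1 * n2 / 2) / sigma
        p = 2 * (1 - NormalDist().cdf(abs(z)))
    r = 2 * u_a / (n1 * n2) - 1              # rank-biserial correlation
    return u_a, p, r
```

Usage would be `mann_whitney_u(views_a, views_b)`, where the two lists hold 7-day SmartNews views for the formula and declarative versions; running it once per topic stratum covers the stratification step.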

What the result unlocks

Confirmed: adds a positive SmartNews formula signal — currently all confirmed SmartNews guidance is avoidance-only. Editors could be told “Here’s works on SmartNews, unlike Apple News where WTK dominates featuring.” Not confirmed: SmartNews playbook stays avoidance-only; the directional positive for “Here’s” is noise or topic-driven.

MSN Underpowered ↑ High

MSN Formula Groups — Insufficient Data for Confirmation

Current signal

Three formula groups show directional patterns on MSN but cannot be confirmed: Question, Number lead, Possessive named entity. 430 total T1 news brand articles after filtering. All three groups have n < 30: Question (n=22) · Number lead (n=8) · Possessive named entity (n=9).

What’s missing

All flagged groups have n < 30, below the minimum for reliable inference (GOVERNOR.md Part 2). Only the quoted-lede group currently has enough data to test, and it is the one confirmed MSN finding. The MSN dataset grows approximately 100 rows/month; most formula groups should cross n=30 within 2–3 months of continued data collection.

Test question

As MSN data accumulates, which formula groups consistently underperform the direct-declarative baseline? Is the underperformance pattern broad (all structured formulas hurt on MSN) or specific to certain formats?

How to run it

Natural experiment: no new data collection needed. The MSN dataset grows approximately 100 rows/month after the T1 brand filter. Re-run the Mann-Whitney formula analysis at each monthly ingest; generate_site.py already does this automatically and the build report surfaces newly significant groups. Threshold: treat any formula group as testable once it crosses n = 30. Expected timeline: most groups should reach n=30 within 2–3 monthly ingest cycles. Analysis: Mann-Whitney U (each formula group vs. untagged baseline), BH-FDR corrected across all tested groups simultaneously, rank-biserial r as effect size. Language tier: significant only if p_adj < 0.05; directional if p_adj < 0.10. Baseline key must be ‘untagged’ (not ‘other’); see GOVERNOR.md Rigor Failures Log.
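
The gating and correction steps described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the input values are placeholders rather than real MSN results, and generate_site.py may implement the same steps differently.

```python
MIN_N = 30  # testability threshold from the card


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):             # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def language_tier(p_adj):
    """Map an adjusted p-value to the language tier used on this page."""
    if p_adj < 0.05:
        return "significant"
    if p_adj < 0.10:
        return "directional"
    return "not significant"


# Placeholder inputs for illustration only; these are NOT real results.
groups = {
    "question": {"n": 41, "p": 0.012},
    "number_lead": {"n": 33, "p": 0.060},
    "possessive_named_entity": {"n": 12, "p": 0.020},  # below MIN_N: held back
}

testable = {name: g for name, g in groups.items() if g["n"] >= MIN_N}
names = list(testable)
adjusted = bh_adjust([testable[name]["p"] for name in names])
tiers = {name: language_tier(p) for name, p in zip(names, adjusted)}
print(tiers)
```

Groups below n = 30 are left out before correction, so they neither receive a tier nor inflate the number of comparisons.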

What the result unlocks

Confirmed broad pattern: extends the MSN rule from ‘avoid quoted lede’ to ‘avoid all structured formulas.’ Gives editors the strongest possible two-headline guidance: Apple News → use formulas; MSN → drop them entirely. Confirmed specific subset: MSN avoidance list grows to the confirmed formula types while others remain neutral. Not confirmed: MSN formula penalty is limited to quoted lede only.

Apple News Directional ↑ High

Character Length × Formula Type Interaction

Current signal

Character length (90–120 chars) and formula type independently predict Apple News views. The interaction analysis has now been run: for each formula type with ≥30 organic articles, it identified the fixed-width character-length bin (<70, 70–89, 90–109, 110–129, 130+) with the highest median percentile rank. 4 formula types qualified. Best-performing formula×length combination: Possessive named entity at 90–109 chars (median at the 71st percentile, n=18 in bin). Most formulas peak in the 90–109 char range. Results are directional (uncorrected for multiple comparisons across formula types).
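
The binning step described above could look like the following sketch. The bin edges come from the card; the function names and the (formula, headline, percentile rank) row shape are assumptions made for illustration, not the pipeline's actual schema.

```python
from collections import defaultdict
from statistics import median


def length_bin(headline: str) -> str:
    """Assign a headline to one of the fixed-width character-length bins."""
    n = len(headline)
    if n < 70:
        return "<70"
    if n < 90:
        return "70-89"
    if n < 110:
        return "90-109"
    if n < 130:
        return "110-129"
    return "130+"


def best_bin_per_formula(rows):
    """rows: iterable of (formula, headline, percentile_rank) tuples.

    Returns {formula: (best_bin, median_rank, n_in_bin)}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for formula, headline, rank in rows:
        grouped[formula][length_bin(headline)].append(rank)
    result = {}
    for formula, bins in grouped.items():
        label, ranks = max(bins.items(), key=lambda kv: median(kv[1]))
        result[formula] = (label, median(ranks), len(ranks))
    return result
```

Returning n alongside the median keeps the small per-bin counts visible, which matters because a best bin with n = 18 is still below the n = 30 threshold.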

What’s missing

Each formula×length bin is underpowered individually (n per bin is often 10–25). Mann-Whitney one-tailed tests are uncorrected across formula types. The ‘best bin’ may reflect noise rather than a stable interaction. Confirming requires a prospective A/B test or a larger dataset per formula.

Test question

Do specific formula types require specific length ranges to achieve their Apple News performance lift? E.g., does “Here’s / Here are” need 90+ chars to work, or does it lift at any length? Does possessive named entity perform best at shorter lengths where the name dominates the headline?

How to run it

The observational analysis is now live in the main site (Finding 5 · Trends Over Time detail panel). To confirm: for each formula type, write at least 30 headlines in the identified best-bin range and at least 30 in an adjacent bin, since the minimum for reliable inference is n = 30 per bin per formula. Track Apple News organic percentile rank at 7 days. Test: Mann-Whitney U within each formula type, BH-FDR corrected across formula types. Formula types to prioritize: those where the best bin deviates from the platform-wide 90–109 guideline; these represent the highest-value formula-specific exceptions.

What the result unlocks

Confirmed interaction: compound guidance (formula + length range) replaces two independent rules. Editors get per-formula character-count targets more precise than the platform-wide 90–120 guideline. More actionable than current guidance. No interaction: the two independent rules are stable and can be applied separately without worrying about their interaction.

Notifications Directional ↑ High

Notification CTR × Character Length

Current signal

Formula choice has a 2–5× effect on notification CTR (confirmed). Character length has now been tested for notification CTR. Notifications truncate at ~80 chars on most devices, making length likely to matter more here than in feed headlines. Analysis now run on 1923 notifications (1124 news brand). Best-performing length bin for news brands: 70–89 chars (median CTR 1.42%). Results are directional — Mann-Whitney comparisons are uncorrected and some bins have n < 30.

What’s missing

Some length bins have n < 30 news brand notifications, so the Mann-Whitney comparisons are underpowered. The analysis does not control for formula type within each bin — length and formula are correlated (longer notifications may use more formula structure), so the length signal may partly proxy formula type. A prospective controlled test is needed to isolate length as an independent variable.

Test question

Do shorter notifications (≤80 chars) outperform longer ones for CTR, controlling for formula type? Is there a character-count range where CTR peaks, or is the relationship monotonic (shorter = better)?

How to run it

The observational analysis is now live in the main site (Finding 2 · Push Notifications detail panel). To confirm: deliberately write notification pairs for the same story — one in the best-performing length bin and one in an adjacent bin — alternating by publication day or outlet. Track CTR at 24 hours. Minimum n = 30 pairs per bin comparison for reliable inference. Hold formula type constant within pairs (both attribution language, or both declarative) so length is the only varying factor. If CMS supports A/B notification testing, use it directly.

What the result unlocks

Confirmed: adds a second actionable lever for push copy editors beyond formula choice. Current guidance (“use attribution language”) would be extended with a specific character-count target. Not confirmed: formula dominates; length doesn’t independently move CTR and editors can focus solely on formula selection for notifications.

Medium Priority

Apple News Underpowered Medium

Number Lead Specificity — Round vs. Exact Figures

Current signal

Specific numbers (e.g. ‘$487M’, ‘13 deaths’) score a median rank of 0.412 vs. 0.277 for round numbers on Apple News (n = 170 specific, 21 round).

What’s missing

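The round group is too small for a reliable test: n=21, below the n=30 minimum. The specific group (n=170) is adequately sized, so the comparison is limited by the round group alone.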

Test question

Do Apple News headlines with precise numeric values (e.g. ‘$487 million,’ ‘13 officers’) consistently outperform rounded equivalents (‘$500 million,’ ‘10+ officers’) for views, controlling for topic and story type?

How to run it

When writing number-lead headlines, deliberately tag each as ‘round’ (e.g. ‘$500M,’ ‘10 people’) or ‘specific’ (e.g. ‘$487M,’ ‘13 people’) in a shared tracking sheet. Collect at least 30 Apple News articles per type before running the test. No CMS change needed — this is a tagging discipline applied during headline writing. Analysis: Mann-Whitney U on Apple News percentile rank at 7 days (specific vs. round), rank-biserial r as effect size. Stratify by topic (financial stories likely have more specificity variance than crime or sports). Existing pipeline: classify_number_lead() already extracts the numeric value — add a roundness tag to the tracking sheet and re-run once 30+ per group are tagged.
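
If the roundness tag is ever pre-filled automatically rather than entered by hand, one possible heuristic is to call a value ‘round’ when it has a single significant digit. This rule is an assumption made for illustration (the card does not define roundness), and roundness_tag() is a hypothetical helper, not part of classify_number_lead().

```python
from decimal import Decimal


def roundness_tag(value) -> str:
    """Return 'round' for values with one significant digit, else 'specific'.

    Assumed rule for illustration: 10, 500 and 2000 are 'round';
    13, 487 and 2.5 are 'specific'. Editors may prefer a different rule
    (for example, treating 25 or 1,500 as round).
    """
    normalized = Decimal(str(value)).normalize()   # strips trailing zeros
    significant_digits = normalized.as_tuple().digits
    return "round" if len(significant_digits) <= 1 else "specific"


for example in (500, 487, 10, 13):
    print(example, roundness_tag(example))
```

The function takes the bare numeric value, so unit suffixes such as ‘M’ or ‘million’ would need to be stripped first. Any automatic tag should be treated as a suggestion that the headline writer can override in the tracking sheet.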

What the result unlocks

Confirmed: adds precision-number guidance to the style guide (“use exact figures in number leads, not rounded approximations”). Editors who default to rounded figures for readability would be asked to reverse that practice. Not confirmed: round vs. specific distinction does not affect views; the number-lead signal is format-driven rather than specificity-driven.

Apple News Untested Medium

“What to Know” — Featured vs. Organic Stability by Year

Current signal

Apple editors select “What to Know” at 1.6× the baseline rate for Featured placement. Organic (non-Featured) articles using WTK show no significant view lift (p_adj=0.16). This editorial/algorithmic split is a key project finding, but whether it is stable across 2025 and 2026 separately is unknown.

What’s missing

The current analysis pools 2025 and 2026 Apple News data. If Apple has updated curation signals, or if T1 editors have changed how they use WTK, the featuring lift or organic penalty could be shifting — which would change whether the two-headline strategy is durable guidance.

Test question

Is the WTK featuring lift consistent when 2025 and 2026 are analyzed separately? Has the gap between editorial selection rate and organic algorithmic performance been stable, or is it narrowing/widening over time?

How to run it

Run on existing Apple News data — no new collection needed. Split the Apple News dataset by year (2025 and 2026 separately). For each year, run: (1) Q2 chi-square or Fisher’s exact test for WTK featuring rate vs. all other formulas; (2) Q1 Mann-Whitney U for WTK organic view rank vs. untagged baseline (non-Featured articles only). Compare the featuring lift ratio (WTK featured rate / baseline featured rate) and the organic p-value across years. A narrowing featured rate ratio or a trending organic p-value signals platform behavior change. Implement as a year-stratified extension of the existing Q1 and Q2 analysis blocks in generate_site.py. Report: a 2×2 table of year×metric (featuring lift and organic p) alongside a directional trend flag. Note: Apple News 2026 covers Jan–Mar only; interpret with caution until Q3 2026 data is available.
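
The per-year featuring comparison in step (1) could be sketched as follows. The counts are placeholders, not real Apple News data, and the function names are hypothetical. Only the Pearson chi-square (without continuity correction) is shown; when any cell is small, Fisher’s exact test, which the card also names, would be the safer choice.

```python
from math import erfc, sqrt


def featuring_lift(wtk_featured, wtk_total, other_featured, other_total):
    """Ratio of the WTK featured rate to the featured rate of all other formulas."""
    return (wtk_featured / wtk_total) / (other_featured / other_total)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]]; returns (chi2, p)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    p = erfc(sqrt(chi2 / 2))   # survival function of chi-square with 1 df
    return chi2, p


# Placeholder counts for illustration only; these are NOT real results.
# Each entry: (WTK featured, WTK not featured, other featured, other not featured)
counts_by_year = {
    2025: (16, 84, 100, 900),
    2026: (6, 44, 30, 270),
}

for year, (wf, wn, of, on) in counts_by_year.items():
    lift = featuring_lift(wf, wf + wn, of, of + on)
    chi2, p = chi_square_2x2(wf, wn, of, on)
    print(year, round(lift, 2), round(chi2, 2), round(p, 4))
```

Comparing the lift column across years gives the featuring half of the 2×2 report; the organic p-value for each year would come from the existing Q1 Mann-Whitney block.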

What the result unlocks

Stable across years: confirms structural platform behavior. The “WTK for Featured campaigns, avoid for organic” rule is durable. Shifting: guidance needs to evolve. If organic performance is catching up to editorial selection, the two-headline distinction may already be outdated.

Apple News Untested Medium

Possessive Named Entity — Topic Concentration

Current signal

Possessive named entity headlines (n=98) show moderate overall performance on Apple News. The signal may be concentrated in sports and crime — topics where named individuals are central to the story — but aggregate analysis cannot confirm this.

What’s missing

Overall n=98 is small enough that splitting by topic reduces per-cell counts below the 30-article threshold for reliable inference. The aggregate analysis masks any strong within-topic signal.

Test question

Do possessive named entity headlines specifically outperform in sports and crime topics on Apple News, where named individuals are central? Is the aggregate signal driven by a strong within-topic effect, or is it distributed broadly across all topics?

How to run it

Run on existing Apple News 2025+2026 data. Filter to possessive named entity headlines. Split by topic: sports+crime (the ‘high named-entity’ group) vs. all other topics. Run Mann-Whitney U (sports+crime PNE articles vs. untagged baseline within the same topics), rank-biserial r as effect size. Repeat for the ‘other topics’ group. Compare lift magnitudes: if the sports+crime lift is substantially larger (r ≥0.1 higher) than the other-topics lift, the signal is topic-concentrated. If lifts are similar, the rule applies broadly. If either per-topic cell has n < 20, flag as preliminary and hold until data grows. Do not split into individual topics at this sample size — aggregate sports+crime as one group. Implement as a topic-stratified extension of the Q1 analysis in generate_site.py.
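
The decision rule above can be written down explicitly. This is a minimal sketch with a hypothetical function name. It assumes the two rank-biserial r values have already been computed, and it treats the case where the other-topics lift is larger as ‘broad’, which the card does not spell out.

```python
def topic_concentration(r_sports_crime, n_sports_crime, r_other, n_other,
                        min_n=20, min_gap=0.1):
    """Classify the possessive-named-entity signal by topic group.

    min_n and min_gap mirror the card: n < 20 in either cell is preliminary,
    and a sports+crime lift at least 0.1 higher counts as concentrated.
    """
    if min(n_sports_crime, n_other) < min_n:
        return "preliminary"
    if r_sports_crime - r_other >= min_gap:
        return "topic-concentrated"
    return "broad"
```

For example, `topic_concentration(0.30, 45, 0.10, 53)` returns "topic-concentrated", while the same lifts with only 15 sports+crime articles would be held as "preliminary".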

What the result unlocks

Confirmed topic-specific: changes guidance from a broad rule to a targeted one (“use possessive named entity for sports/crime stories specifically”). Editors know exactly when to apply it. Evenly distributed: the broad rule stands. Possessive named entity is generally useful, not vertically restricted.

SmartNews Untested Medium

Number Lead Type — Which Numbers Drive the SmartNews Signal?

Current signal

Number leads show a positive directional trend on SmartNews (median rank 0.534 vs. 0.497 baseline; pipeline direction flag: above_dir). They are the only SmartNews formula with a positive directional signal. Whether count/list (‘3 ways’), dollar amounts (‘$2 billion’), or percentages (‘47%’) drive this has not been tested.

What’s missing

Number-type classification (classify_number_lead()) is implemented in the pipeline but per-type SmartNews performance has not been computed. SmartNews 2026 data is domain-aggregated (not article-level), limiting this analysis to the 2025 dataset (n=38,251 articles).

Test question

Which number-lead subtype drives the SmartNews directional signal: count/list, dollar amounts, or percentages? Or is the effect evenly distributed across number types, suggesting format (any number in the lead) matters more than type?

How to run it

Run on SmartNews 2025 number-lead articles (n ≈ 342 total). classify_number_lead() already extracts ntype (‘count_list’, ‘dollar_amount’, ‘percentage’, ‘other’). Group by ntype and run Mann-Whitney U for each group vs. the untagged baseline, BH-FDR corrected across the three tested types (count_list, dollar_amount, percentage). Expected n per type: count_list is likely the largest (list articles are common); dollar_amount and percentage groups may be small. Flag any group with n < 20 as preliminary. If all subtypes have low n, aggregate and report directionally only: “counts/lists trend higher (n=X, p=Y)” without a significance claim. Implement by extending the existing number-lead deep-dive block (classify_number_lead() section) in generate_site.py to add a per-type Mann-Whitney analysis. Note: SmartNews 2026 is domain-aggregated and cannot contribute to this analysis; use 2025 data only.

What the result unlocks

Confirmed specific type: editors can be told “count/list numbers work on SmartNews; dollar amounts and percentages are neutral.” More precise than the current directional number-lead guidance. Equally distributed: the rule is format-driven (“any number in the lead is better than none”). Simpler and more broadly applicable.

Completed A/B Reports

Before/after comparisons and formula tests run against the dataset. Add a spec to experiments/ and run python3 generate_experiment.py experiments/SLUG.md.