Accountability scoring methodology
Methodology version: `accountability-2.0.0` (dataset envelope). Per-dimension stamps: Delivery at `delivery-2.0.0`, all other dimensions at `accountability-2.0.0`.
This document describes the accountability scoring dimensions produced by scripts/compute-scores.mjs and written to data/accountability/scores.json.
Every score carries its own methodologyVersion. The Delivery dimension is versioned independently of the dataset envelope: a bump to Delivery does not silently restamp scores from other dimensions, and historical scores keep the methodology stamp that produced them.
There are four dimensions: Delivery, Diligence, Mandate-fulfilment and Fiscal-stewardship. Each is computed and reported on its own. They are never blended into a single composite number.
No black box, fully decomposable
Every scorer is a deterministic rules function, not a learned model. The same inputs always produce the same outputs, and the output order is stable. Every AccountabilityScore carries a breakdown[] array — one row per underlying record — so any rate can be expanded back into the exact records, weights and values that produced it. No record-level data is hidden inside an aggregate.
Every score also carries a `coverage` figure — data completeness — and it is never omitted. Coverage is reported independently of the headline rate so a high rate built on thin data is always visible.
Delivery dimension
Weighted commitment delivery for an entity. Methodology version `delivery-2.0.0`. The pure delivery-1.0.0 rate is preserved alongside the new outcome-aware rate; the gap between them is the absorbed-reform signal.
Status → delivery value
Each commitment's CommitmentStatus maps to a delivery value in [0,1]:
| Status | Value |
|---|---|
| delivered | 1.0 |
| partially-delivered | 0.5 |
| in-progress | 0.3 |
| stalled | 0.1 |
| broken | 0.0 |
| abandoned | 0.0 |
| promised | excluded (see Time-awareness) |
Significance weighting
Each commitment is weighted by its significance field:
| Significance | Weight |
|---|---|
| flagship | 3 |
| standard | 1 |
| minor | 0.5 |
If significance is absent, the commitment defaults to `standard` (weight 1).
deliveryRate is the weight-weighted mean of the delivery values of all counted (due) commitments:
deliveryRate = Σ(value · weight) / Σ(weight) over counted commitments
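As a minimal sketch of the formula above (not the code in scripts/compute-scores.mjs), the weighted mean could be computed as follows. The function name `deliveryRate`, the `{ status, significance }` object shape, and returning 0 for an empty counted set are assumptions made for illustration.

```javascript
// Status -> delivery value, copied from the status table above.
// `promised` has no entry because it is excluded from the rate.
const STATUS_VALUE = {
  "delivered": 1.0,
  "partially-delivered": 0.5,
  "in-progress": 0.3,
  "stalled": 0.1,
  "broken": 0.0,
  "abandoned": 0.0,
};

// Significance -> weight, copied from the significance table above.
const SIGNIFICANCE_WEIGHT = { flagship: 3, standard: 1, minor: 0.5 };

// Hypothetical helper: weight-weighted mean over counted commitments.
function deliveryRate(commitments) {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const c of commitments) {
    const value = STATUS_VALUE[c.status];
    if (value === undefined) continue; // promised rows are not counted
    // A missing significance defaults to `standard` (weight 1).
    const weight = SIGNIFICANCE_WEIGHT[c.significance ?? "standard"];
    weightedSum += value * weight;
    totalWeight += weight;
  }
  // Assumption: an empty counted set yields 0 rather than NaN.
  return totalWeight === 0 ? 0 : weightedSum / totalWeight;
}

const exampleCommitments = [
  { status: "delivered", significance: "flagship" }, // 1.0 * 3
  { status: "stalled" },                             // 0.1 * 1 (default weight)
  { status: "in-progress", significance: "minor" },  // 0.3 * 0.5
  { status: "promised", significance: "flagship" },  // excluded
];
console.log(deliveryRate(exampleCommitments).toFixed(3)); // (3 + 0.1 + 0.15) / 4.5
```

The flagship commitment dominates the result because it carries three times the weight of a standard one; the promised flagship row changes nothing.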
Time-awareness
A promised commitment that is not yet due is not a failure. It is excluded from deliveryRate (its ScoreContribution has counted: false) and counted instead toward onTrackRate:
onTrackRate = (promised commitments still on track) / (all promised commitments)
A promised commitment is "on track" unless its expectedDeliveryBy date has already passed relative to the score's asOf date. With no expectedDeliveryBy it is treated as on track.
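The on-track rule can be sketched as below. This is an illustration under stated assumptions: dates are `YYYY-MM-DD` strings (so string comparison orders them correctly), a commitment due exactly on the asOf date has not yet passed, and an entity with no promised commitments gets 0. The helper names `isOnTrack` and `onTrackRate` (as a function) are hypothetical.

```javascript
// Hypothetical helper: a promised commitment is on track unless its
// expectedDeliveryBy is strictly earlier than asOf.
function isOnTrack(commitment, asOf) {
  if (!commitment.expectedDeliveryBy) return true; // no date: treated as on track
  return commitment.expectedDeliveryBy >= asOf;    // ISO dates compare as strings
}

// Share of promised commitments that are still on track at asOf.
function onTrackRate(commitments, asOf) {
  const promised = commitments.filter((c) => c.status === "promised");
  if (promised.length === 0) return 0; // assumption: no promised rows -> 0
  const onTrack = promised.filter((c) => isOnTrack(c, asOf));
  return onTrack.length / promised.length;
}

const promisedExample = [
  { status: "promised", expectedDeliveryBy: "2030-12-31" }, // still on track
  { status: "promised", expectedDeliveryBy: "2024-06-30" }, // date has passed
  { status: "promised" },                                   // no date: on track
  { status: "delivered" },                                  // not promised: ignored
];
console.log(onTrackRate(promisedExample, "2025-01-01")); // 2 of 3 promised rows
```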
Delivery coverage
coverage = (commitments backed by ≥1 high/medium-confidence evidence source)
/ (all of the entity's commitments)
Evidence sources with confidence of low (or absent) do not count.
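A minimal sketch of this coverage rule follows. The `sources` array name and the helper names are assumptions; only the `confidence` levels come from the text above.

```javascript
// Hypothetical helper: true when at least one evidence source has
// high or medium confidence. Low or missing confidence does not count.
function isBacked(commitment) {
  return (commitment.sources ?? []).some(
    (s) => s.confidence === "high" || s.confidence === "medium"
  );
}

// Share of all of the entity's commitments that are backed.
function deliveryCoverage(commitments) {
  if (commitments.length === 0) return 0; // assumption: empty entity -> 0
  return commitments.filter(isBacked).length / commitments.length;
}

const coverageExample = [
  { id: "c1", sources: [{ confidence: "high" }] },
  { id: "c2", sources: [{ confidence: "low" }, { confidence: "medium" }] },
  { id: "c3", sources: [{ confidence: "low" }] }, // low only: not backed
  { id: "c4", sources: [{}] },                    // confidence absent: not backed
];
console.log(deliveryCoverage(coverageExample)); // 2 of 4 commitments are backed
```

Note that the denominator is every commitment of the entity, including promised ones that are excluded from deliveryRate.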
Derived fields: numeric target and deadline
hasNumericTarget and hasDeadline are used for downstream analysis. When a commitment file does not set them they are derived in-memory from the commitment text (the committed JSON is not modified):
- `hasNumericTarget`: true when the title/description contains a number followed by a unit keyword (homes, units, beds, jobs, MW, staff, …), a percentage, a monetary figure (€/$/£), or a bare 3+ digit number.
- `hasDeadline`: true when expectedDeliveryBy is set, or the text mentions a target year/quarter (by 2030, by the end of 2027, Q3, mid-2026) or a relative window (within 5 years).
These are simple, transparent heuristics; they do not feed the delivery rate.
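The heuristics could look roughly like the sketch below. The regular expressions are illustrative guesses, not the patterns used by the scorer: the unit list contains only the keywords named above (the real list is longer), and the `title`/`description` field names and all function names are assumptions.

```javascript
// Only the unit keywords named in the text; the real list is longer.
const UNIT_KEYWORDS = ["homes", "units", "beds", "jobs", "MW", "staff"];
const NUMBER_WITH_UNIT = new RegExp(
  `\\d[\\d,.]*\\s*(?:${UNIT_KEYWORDS.join("|")})\\b`, "i"
);
const PERCENTAGE = /\d\s*%/;
const MONEY = /[€$£]\s*\d/;
const BARE_NUMBER = /\b\d{3,}\b/; // any standalone number of 3+ digits

const YEAR_OR_QUARTER = /\bby\s+(?:the\s+end\s+of\s+)?\d{4}\b|\bQ[1-4]\b|\bmid-\d{4}\b/i;
const RELATIVE_WINDOW = /\bwithin\s+\d+\s+(?:years?|months?)\b/i;

function textOf(commitment) {
  return `${commitment.title ?? ""} ${commitment.description ?? ""}`;
}

function numericTargetFromText(commitment) {
  const text = textOf(commitment);
  return [NUMBER_WITH_UNIT, PERCENTAGE, MONEY, BARE_NUMBER].some((re) => re.test(text));
}

function deadlineFromText(commitment) {
  if (commitment.expectedDeliveryBy) return true;
  const text = textOf(commitment);
  return YEAR_OR_QUARTER.test(text) || RELATIVE_WINDOW.test(text);
}

// Derive in memory only when the file did not set the field.
// Returns a copy, so the committed record is never modified.
function withDerivedFlags(commitment) {
  return {
    ...commitment,
    hasNumericTarget: commitment.hasNumericTarget ?? numericTargetFromText(commitment),
    hasDeadline: commitment.hasDeadline ?? deadlineFromText(commitment),
  };
}

console.log(withDerivedFlags({ title: "Build 50 homes within 5 years" }));
```

Because a bare 3+ digit number counts as a numeric target, a sketch this literal also flags a year such as 2030; whether the real heuristic guards against that is not stated here.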
Delivery entities scored
One AccountabilityScore (dimension: "delivery") is emitted per:
- `government`: all Programme for Government (PfG) commitments, collectively.
- `body`: grouped by each PfG commitment's responsibleBodyId.
- `officeholder`: grouped by each PfG commitment's responsibleOfficeholderId.
- `party`: grouped by each GE2024 manifesto commitment's partyId.
Delivery 2.0.0 — outcome-aware adjustment
The Delivery dimension exposes two rates side by side:
- deliveryRate: pure delivery-1.0.0 value (no outcome adjustment). Preserved exactly so existing readers stay stable and historical comparison is apples-to-apples.
- outcomeAdjustedRate: delivery-2.0.0 value with the rule below applied.
The gap between the two rates is the absorbed-reform signal. If a government has many commitments with status delivered whose substantive outcomeStatus is outcome-unchanged, the two rates diverge: deliveryRate rewards the paperwork, outcomeAdjustedRate discounts it. This is the "delivered vs delivered-and-worked" distinction from [docs/power-and-blockers.md](../../docs/power-and-blockers.md) section "Absorbed reform — outcome status separate from delivery status".
When the adjustment fires
Only on rows whose CommitmentStatus is delivered or partially-delivered. Every other status is untouched (the adjustment is a no-op for promised, in-progress, stalled, broken, abandoned).
Status × outcomeStatus → value table
| Delivery status | outcomeStatus | Value applied to row | Note on row |
|---|---|---|---|
| delivered | undefined / not-applicable | 1.0 (unchanged) | none |
| delivered | outcome-improved | 1.0 (unchanged) | none |
| delivered | outcome-unchanged | 0.5 (half value) | "outcome-unchanged: value halved per delivery-2.0.0 …" |
| delivered | outcome-worsened | 0.0 (zero) | "outcome-worsened: value zeroed per delivery-2.0.0 …" |
| delivered | contested | 0.5 (half value) | "contested: value halved per delivery-2.0.0 …" |
| partially-delivered | undefined / not-applicable | 0.5 (unchanged) | none |
| partially-delivered | outcome-improved | 0.5 (unchanged) | none |
| partially-delivered | outcome-unchanged | 0.25 (half of 0.5) | "outcome-unchanged: value halved per delivery-2.0.0 …" |
| partially-delivered | outcome-worsened | 0.0 | "outcome-worsened: value zeroed per delivery-2.0.0 …" |
| partially-delivered | contested | 0.25 | "contested: value halved per delivery-2.0.0 …" |
| every other status | (any) | unchanged (delivery-1.0.0 mapping) | none |
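The table reduces to a small rule: halve on outcome-unchanged or contested, zero on outcome-worsened, and otherwise keep the delivery-1.0.0 value. A minimal sketch, with the hypothetical function name `outcomeAdjustedValue` and notes omitted for brevity:

```javascript
// delivery-1.0.0 status -> value mapping (promised rows are not counted).
const PURE_VALUE = {
  "delivered": 1.0,
  "partially-delivered": 0.5,
  "in-progress": 0.3,
  "stalled": 0.1,
  "broken": 0.0,
  "abandoned": 0.0,
};

// Hypothetical helper: per-row value under delivery-2.0.0.
function outcomeAdjustedValue(status, outcomeStatus) {
  const pure = PURE_VALUE[status];
  // The adjustment only fires on delivered / partially-delivered rows.
  if (status !== "delivered" && status !== "partially-delivered") return pure;
  switch (outcomeStatus) {
    case "outcome-unchanged":
    case "contested":
      return pure / 2; // absorbed-reform haircut
    case "outcome-worsened":
      return 0;
    default:
      return pure; // undefined, not-applicable, outcome-improved
  }
}

console.log(outcomeAdjustedValue("delivered", "outcome-unchanged"));     // 0.5
console.log(outcomeAdjustedValue("partially-delivered", "contested"));   // 0.25
console.log(outcomeAdjustedValue("stalled", "outcome-worsened"));        // 0.1
```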
Why undefined is treated as "no adjustment", not as outcome-unchanged
Absence of outcome data is not absence of outcome. We do not assume the world stood still simply because nobody has yet sourced the outcome metric. Penalising unfilled outcome fields would incentivise leaving them blank, and would conflate "we have not measured" with "we measured no change".
Instead, the absence is surfaced honestly through coverage (the share of commitments backed by ≥1 high/medium-confidence source). The Delivery dimension keeps the same coverage definition; a separate outcome-coverage signal can be layered in additively later without changing the rates above.
not-applicable is treated the same as undefined for the same reason: an administrative commitment with no measurable outcome should not be punished for the lack of one.
Relationship between deliveryRate and outcomeAdjustedRate
deliveryRate = Σ(pureValue · weight) / Σ(weight) over counted commitments
outcomeAdjustedRate = Σ(adjustedVal · weight) / Σ(weight) over counted commitments
Both rates use the same counted set (time-aware promised exclusion is identical) and the same significance weights. The only difference is the per-row value: outcomeAdjustedRate uses the table above, deliveryRate uses the pure delivery-1.0.0 mapping.
When no commitment in scope carries an actionable outcomeStatus (everything is undefined or not-applicable), the two rates are identical by construction. The new field activates as commitments are tagged in future PRs.
Decomposability
Every breakdown row reflects the outcome-adjusted (delivery-2.0.0) value, so outcomeAdjustedRate decomposes back to the exact records that produced it. When the adjustment fires, the row records:
- outcomeStatusApplied: the outcomeStatus value the scorer read.
- note: a short human-readable explanation of what changed and why (e.g. "outcome-unchanged: value halved per delivery-2.0.0 (absorbed-reform haircut)").
If a row has no outcomeStatusApplied, no adjustment was considered (the commitment had no outcomeStatus). If a row has outcomeStatusApplied but no adjustment note, the outcome status existed but did not change the value (e.g. outcome-improved or not-applicable).
The pure deliveryRate field is computed in a parallel pass over the same source data and is not decomposed in breakdown[]; the breakdown is the audit trail for the new headline (outcomeAdjustedRate). To reproduce deliveryRate from the breakdown, replace each row's adjusted value with the unadjusted mapping for its status (e.g. delivered → 1.0 regardless of outcomeStatusApplied) and re-weight.
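The reproduction step described above might look like this. The breakdown row shape used here (`status`, `weight`, `value`, `counted`, `outcomeStatusApplied`) is an assumption for illustration; check the actual ScoreContribution fields before relying on it.

```javascript
// delivery-1.0.0 status -> value mapping.
const UNADJUSTED_VALUE = {
  "delivered": 1.0,
  "partially-delivered": 0.5,
  "in-progress": 0.3,
  "stalled": 0.1,
  "broken": 0.0,
  "abandoned": 0.0,
};

// Hypothetical helper: weighted mean over counted rows, with the per-row
// value supplied by the caller.
function rateFromBreakdown(breakdown, valueOf) {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const row of breakdown) {
    if (!row.counted) continue; // e.g. promised rows
    weightedSum += valueOf(row) * row.weight;
    totalWeight += row.weight;
  }
  return totalWeight === 0 ? 0 : weightedSum / totalWeight;
}

const breakdown = [
  // Flagship delivered commitment, halved because the outcome was unchanged.
  { status: "delivered", weight: 3, value: 0.5, counted: true,
    outcomeStatusApplied: "outcome-unchanged" },
  { status: "partially-delivered", weight: 1, value: 0.5, counted: true },
  { status: "promised", weight: 1, value: 0, counted: false },
];

// Adjusted rate: use the row values as recorded.
const adjustedRate = rateFromBreakdown(breakdown, (row) => row.value);
// Pure rate: swap in the unadjusted mapping for each row's status.
const pureRate = rateFromBreakdown(breakdown, (row) => UNADJUSTED_VALUE[row.status]);

console.log(adjustedRate); // (0.5*3 + 0.5*1) / 4 = 0.5
console.log(pureRate);     // (1.0*3 + 0.5*1) / 4 = 0.875
```

In this example the gap of 0.375 between the two rates comes entirely from the single flagship row.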
Diligence dimension
Per officeholder parliamentary participation. Measures how reliably an officeholder turns up to recorded votes (divisions).
Inputs
divisions.json (every recorded division) joined to member-votes.json (one row per officeholder per division they were present for). Absence is represented by the lack of a member-vote row, not by an explicit record.
Term scoping
A division counts toward an officeholder only if its date falls within one of the officeholder's terms (from/to window, an open to meaning still in office). An officeholder is never penalised for divisions held before they took office or after they left.
Participation value
For each in-term division, the officeholder's vote is one of ta, nil, staon or absent. A vote of ta/nil/staon is participation (value 1); absent (no member-vote row) is value 0.
participationRate = (divisions voted in: ta/nil/staon)
/ (divisions held during the officeholder's term)
Every in-term division is counted; the score decomposes by division in breakdown[] (a DivisionContribution per division).
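Term scoping and the participation rate can be sketched together. Assumptions in this sketch: dates are `YYYY-MM-DD` strings, term bounds are inclusive, and the field names `id`, `date`, `officeholderId` and `divisionId` are hypothetical; only `from`/`to` and the vote values come from the text above.

```javascript
// True when the date falls inside any term; a missing `to` means still in office.
function inTerm(date, terms) {
  return terms.some((t) => date >= t.from && (t.to == null || date <= t.to));
}

// Hypothetical helper: share of in-term divisions the officeholder voted in.
// Returns null when no division falls in term (the officeholder is not scored).
function participationRate(officeholder, divisions, memberVotes) {
  // Any member-vote row (ta, nil or staon) is participation;
  // absence is the lack of a row.
  const votedDivisionIds = new Set(
    memberVotes
      .filter((v) => v.officeholderId === officeholder.id)
      .map((v) => v.divisionId)
  );
  const inTermDivisions = divisions.filter((d) => inTerm(d.date, officeholder.terms));
  if (inTermDivisions.length === 0) return null;
  const votedCount = inTermDivisions.filter((d) => votedDivisionIds.has(d.id)).length;
  return votedCount / inTermDivisions.length;
}

const officeholder = { id: "oh-1", terms: [{ from: "2020-07-01" }] };
const divisions = [
  { id: "d1", date: "2019-11-05" }, // before the term: ignored
  { id: "d2", date: "2021-03-10" },
  { id: "d3", date: "2022-06-21" }, // in term, no vote row: absent
  { id: "d4", date: "2023-01-17" },
];
const memberVotes = [
  { officeholderId: "oh-1", divisionId: "d2", vote: "ta" },
  { officeholderId: "oh-1", divisionId: "d4", vote: "staon" },
];
console.log(participationRate(officeholder, divisions, memberVotes)); // 2 of 3
```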
Diligence coverage
coverage = (term divisions with a member-vote row of any kind)
/ (all term divisions)
With the current data every member-vote row is a present vote, so coverage equals participationRate; the field is kept distinct so that if upstream ever records explicit absent rows, coverage and participation diverge correctly.
Diligence entities scored
One score (dimension: "diligence") per officeholder who had at least one in-term division. Officeholders whose terms cover no division are not scored.
Mandate-fulfilment dimension
Per body: how well the commitments linked to that body's statutory mandates are being delivered.
Inputs and join
commitments.json → mandateId → mandates.json → bodyId. Only PfG commitments that carry a non-null mandateId resolving to a known mandate are included. The body of the score is the mandate's bodyId, which may differ from the commitment's responsibleBodyId.
Fulfilment value
fulfilmentRate reuses the Delivery status→value mapping and significance weights exactly (see Delivery above), applied only to the body's mandate-linked commitments:
fulfilmentRate = Σ(value · weight) / Σ(weight) over counted mandate-linked commitments
promised commitments are excluded from the rate (counted: false), consistent with the Delivery dimension.
Mandate-fulfilment coverage
coverage = (mandate-linked commitments backed by ≥1 high/medium-confidence source)
/ (all of the body's mandate-linked commitments)
mandateCount reports how many distinct mandates of the body have at least one linked commitment behind the score.
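The join and grouping step can be sketched as follows. It assumes the input has already been filtered to PfG commitments and that mandates carry an `id` field; the function name `groupByMandateBody` and the group shape are hypothetical.

```javascript
// Hypothetical helper: commitments -> mandateId -> mandate -> bodyId.
// Commitments with a null or unresolvable mandateId are skipped.
function groupByMandateBody(commitments, mandates) {
  const mandateById = new Map(mandates.map((m) => [m.id, m]));
  const byBody = new Map();
  for (const c of commitments) {
    if (c.mandateId == null) continue;
    const mandate = mandateById.get(c.mandateId);
    if (mandate === undefined) continue; // mandateId does not resolve
    // The score belongs to the mandate's body, not c.responsibleBodyId.
    let group = byBody.get(mandate.bodyId);
    if (group === undefined) {
      group = { commitments: [], mandateIds: new Set() };
      byBody.set(mandate.bodyId, group);
    }
    group.commitments.push(c);
    group.mandateIds.add(mandate.id);
  }
  return byBody;
}

const mandates = [
  { id: "m1", bodyId: "body-a" },
  { id: "m2", bodyId: "body-a" },
  { id: "m3", bodyId: "body-b" },
];
const pfgCommitments = [
  { id: "c1", mandateId: "m1", responsibleBodyId: "body-b" }, // scored under body-a
  { id: "c2", mandateId: "m1", responsibleBodyId: "body-a" },
  { id: "c3", mandateId: "m2", responsibleBodyId: "body-a" },
  { id: "c4", mandateId: null, responsibleBodyId: "body-a" },  // not mandate-linked
  { id: "c5", mandateId: "m9", responsibleBodyId: "body-b" },  // unknown mandate
];
const groups = groupByMandateBody(pfgCommitments, mandates);
const bodyA = groups.get("body-a");
console.log(bodyA.commitments.length, bodyA.mandateIds.size); // commitmentCount, mandateCount
```

Each group's commitments would then be fed through the same status and weight mapping as the Delivery dimension to obtain fulfilmentRate.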
Mandate-fulfilment entities scored
One score (dimension: "mandate-fulfilment") per body that has at least one mandate-linked commitment.
Fiscal-stewardship dimension
Per body: how closely actual spend tracks the budget that was allocated.
Inputs
spending.json budget votes ({ votes, programmes, subheads }). Each BudgetVote carries a grossAllocation (always present) and, once a fiscal year closes, an outturn (actual spend — usually absent). Scores decompose by budget vote (BudgetContribution per vote).
Variance → stewardship value
For a vote with a published outturn:
variance = |grossAllocation − outturn| / grossAllocation
value = max(0, 1 − variance / 0.20)
A vote spent exactly to allocation scores 1.0; a vote 20% or more off allocation (over or under) scores 0.0; in between the value falls linearly. The 20% tolerance band is the single tunable parameter of this dimension.
stewardshipRate is the allocation-weighted mean of the values of votes that have an outturn, so larger votes dominate the body's score:
stewardshipRate = Σ(value · allocation) / Σ(allocation) over votes with an outturn
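A minimal sketch of the two formulas above. The function names are hypothetical, and the sketch assumes grossAllocation is always positive; returning 0 when no vote has an outturn matches the behaviour described for the current data.

```javascript
// The single tunable parameter of this dimension.
const TOLERANCE = 0.20;

// Value falls linearly from 1.0 (on allocation) to 0.0 (20% or more off).
function stewardshipValue(grossAllocation, outturn) {
  const variance = Math.abs(grossAllocation - outturn) / grossAllocation;
  return Math.max(0, 1 - variance / TOLERANCE);
}

// Allocation-weighted mean over votes that have a published outturn.
function stewardshipRate(votes) {
  let weightedSum = 0;
  let totalAllocation = 0;
  for (const v of votes) {
    if (v.outturn == null) continue; // no outturn: excluded from the rate
    weightedSum += stewardshipValue(v.grossAllocation, v.outturn) * v.grossAllocation;
    totalAllocation += v.grossAllocation;
  }
  return totalAllocation === 0 ? 0 : weightedSum / totalAllocation;
}

const budgetVotes = [
  { grossAllocation: 100, outturn: 110 }, // 10% over: value 0.5
  { grossAllocation: 300, outturn: 300 }, // on allocation: value 1.0
  { grossAllocation: 50, outturn: 80 },   // 60% over: value 0.0
  { grossAllocation: 500 },               // no outturn yet: ignored here
];
console.log(stewardshipRate(budgetVotes).toFixed(3)); // (50 + 300 + 0) / 450
```

Underspending is treated the same as overspending: an outturn of 90 against an allocation of 100 scores the same 0.5 as an outturn of 110.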
Honest handling of missing outturn
Most votes have no `outturn` yet. Outturn is never fabricated. A vote with no published outturn is counted: false, contributes value: 0, carries variance: null / outturn: null, and is excluded from stewardshipRate. Instead it lowers coverage, so a body whose budget is mostly unverifiable shows a low coverage rather than a misleadingly confident rate.
Fiscal-stewardship coverage
Coverage is the allocation share of the body's budget that is backed by a published outturn:
coverage = Σ(allocation of votes with an outturn) / Σ(allocation of all votes)
outturnVoteCount reports how many of the body's votes have an outturn. With the current data no outturn is published, so every fiscal-stewardship score has coverage: 0, stewardshipRate: 0 and outturnVoteCount: 0 — an honest "not yet verifiable" signal, not a zero-performance verdict.
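The missing-outturn handling and the coverage figure can be sketched as below. The contribution shape mirrors the fields named above (`counted`, `value`, `variance`, `outturn`); the `voteId` field, the `id` field on votes and the function names are assumptions.

```javascript
// Hypothetical helper: one breakdown row per budget vote.
// A vote without an outturn is recorded honestly, never estimated.
function toBudgetContribution(vote) {
  if (vote.outturn == null) {
    return { voteId: vote.id, counted: false, value: 0, variance: null, outturn: null };
  }
  const variance = Math.abs(vote.grossAllocation - vote.outturn) / vote.grossAllocation;
  const value = Math.max(0, 1 - variance / 0.20);
  return { voteId: vote.id, counted: true, value, variance, outturn: vote.outturn };
}

// Allocation share of the budget that is backed by a published outturn.
function fiscalCoverage(votes) {
  let coveredAllocation = 0;
  let totalAllocation = 0;
  let outturnVoteCount = 0;
  for (const v of votes) {
    totalAllocation += v.grossAllocation;
    if (v.outturn != null) {
      coveredAllocation += v.grossAllocation;
      outturnVoteCount += 1;
    }
  }
  const coverage = totalAllocation === 0 ? 0 : coveredAllocation / totalAllocation;
  return { coverage, outturnVoteCount };
}

const votesWithGaps = [
  { id: "v1", grossAllocation: 100, outturn: 100 },
  { id: "v2", grossAllocation: 300 }, // outturn not yet published
];
console.log(fiscalCoverage(votesWithGaps));          // coverage 0.25, one vote with outturn
console.log(toBudgetContribution(votesWithGaps[1])); // counted: false, variance: null
```

Here the body's one verified vote was spent exactly to allocation, but coverage of 0.25 shows that three quarters of its budget cannot yet be checked.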
Fiscal-stewardship entities scored
One score (dimension: "fiscal-stewardship") per body that has at least one budget vote in spending.json.
Score shape
AccountabilityScore is a discriminated union on dimension. Every member shares a common envelope (entityType, entityId, coverage, methodologyVersion, asOf); each dimension then adds its own metric fields and its own typed breakdown[]:
- delivery: deliveryRate, onTrackRate, commitmentCount, ScoreContribution[]
- diligence: participationRate, divisionCount, votedCount, DivisionContribution[]
- mandate-fulfilment: fulfilmentRate, mandateCount, commitmentCount, ScoreContribution[]
- fiscal-stewardship: stewardshipRate, voteCount, outturnVoteCount, BudgetContribution[]
The delivery member is byte-compatible with methodology version delivery-1.0.0, so existing Delivery scores and their consumers are unaffected.
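A consumer of scores.json would typically branch on `dimension` before reading any metric field. The sketch below shows one hypothetical way to do that; `describeScore` and the output format are illustrative and not part of the dataset tooling.

```javascript
// Rate field per dimension, taken from the score shape described above.
const RATE_FIELD = {
  "delivery": "deliveryRate",
  "diligence": "participationRate",
  "mandate-fulfilment": "fulfilmentRate",
  "fiscal-stewardship": "stewardshipRate",
};

// Hypothetical consumer: one line per score, always showing coverage
// next to the rate and never blending dimensions.
function describeScore(score) {
  const field = RATE_FIELD[score.dimension];
  if (field === undefined) {
    throw new Error(`Unknown dimension: ${score.dimension}`);
  }
  const pct = (x) => `${(x * 100).toFixed(1)}%`;
  return `${score.entityType}/${score.entityId} ${score.dimension}: ` +
    `${field}=${pct(score[field])}, coverage=${pct(score.coverage)}`;
}

const sampleScore = {
  dimension: "diligence",
  entityType: "officeholder",
  entityId: "oh-1",
  participationRate: 0.875,
  coverage: 0.875,
  methodologyVersion: "accountability-2.0.0",
  asOf: "2025-01-01",
};
console.log(describeScore(sampleScore));
```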
Versioning
methodologyVersion is stamped on the dataset envelope (accountability-2.0.0) and on every score (delivery-2.0.0 for Delivery, accountability-2.0.0 for the other dimensions). Bump the relevant version whenever a dimension's status/variance/participation mapping, weights, scoping rule or coverage definition changes, so historical scores remain interpretable against the rules that produced them.
Summaries layer
Methodology version: summaries-1.0.0 (independent of the scoring methodology above). Lives in data/accountability/summaries.json, regenerated by pnpm data:summaries. Read at build time and embedded into every prerendered entity page.
What a summary is
A small set of plain-language bullet points (3 to 7) that describe one entity plus three structured impact analyses:
- Direct impact: one-hop relationships derived from the entity's own foreign keys (commitment.responsibleBodyId, officeholder terms[].bodyId, edges on the typed-edge layer, etc.).
- Indirect impact: two-hop traversal from each direct target, deduped and ranked, capped at ten entries to keep the panel readable.
- Leverage points: cross-references into the systems layer, surfacing the Meadows leverage level for every system step the entity participates in.
How bullets are produced
Two backends, depending on entity kind:
- Officeholders (1202 records): template-based. Bullets are assembled deterministically from structured fields (current role, party, level, constituency, civil-service grade, term count, Diligence score, owned commitments). No model is involved.
- Bills, parties, bodies, mandates, commitments, divisions: generated from the source records by a language model called during a local build pass (a free-tier OpenRouter model, default google/gemma-4-31b-it:free). This pass runs once on a contributor's machine; the resulting JSON is committed and Vercel deployments read the static file. There is no per-visit model call.
Sourcing rule (anti-invention)
Every bullet must cite at least one Source whose URL appears verbatim on the entity's underlying records. The generator validates this on the way out and the data validator (pnpm data:validate) rechecks it on the way in. A bullet that cites a URL outside the entity's record set fails validation and the dataset is rejected. There is no path by which a fabricated citation can land in production.
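The sourcing rule amounts to a set-membership check. A minimal sketch, assuming bullets and records both carry a `sources` array of `{ url }` objects (the field names and function names are hypothetical):

```javascript
// Collect every source URL that appears on the entity's underlying records.
function collectRecordUrls(records) {
  const urls = new Set();
  for (const record of records) {
    for (const source of record.sources ?? []) urls.add(source.url);
  }
  return urls;
}

// Hypothetical validator: returns the bullets that break the rule.
// A bullet fails when it cites nothing, or cites any URL that is not
// found verbatim on the entity's records.
function findInvalidBullets(bullets, records) {
  const allowed = collectRecordUrls(records);
  return bullets.filter((bullet) => {
    const cited = bullet.sources ?? [];
    return cited.length === 0 || !cited.every((s) => allowed.has(s.url));
  });
}

const entityRecords = [
  { id: "c1", sources: [{ url: "https://example.org/report-a" }] },
  { id: "c2", sources: [{ url: "https://example.org/report-b" }] },
];
const bullets = [
  { text: "Backed by report A.", sources: [{ url: "https://example.org/report-a" }] },
  { text: "Cites something else.", sources: [{ url: "https://example.org/unknown" }] },
  { text: "Cites nothing.", sources: [] },
];
const invalid = findInvalidBullets(bullets, entityRecords);
console.log(invalid.map((b) => b.text)); // the second and third bullets fail
```

The comparison is exact string equality, so a URL that differs only by a trailing slash or query string would fail under this sketch.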
Caching
The generator hashes each entity's record plus its direct-impact set. On a re-run, an entity whose hash is unchanged and whose previous summary is at the current methodology version is reused verbatim from the existing summaries.json. Changes to the records trigger a regeneration.
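The reuse decision can be expressed as a small predicate. This sketch assumes each stored summary keeps the hash it was generated from in a field called `inputHash`; that field name, the function name and the example hash strings are hypothetical, and the hashing itself is left out.

```javascript
const SUMMARIES_VERSION = "summaries-1.0.0";

// Hypothetical helper: reuse a previous summary only when both the input
// hash and the methodology version still match.
function canReuseSummary(previous, currentHash) {
  return (
    previous !== undefined &&
    previous.inputHash === currentHash &&
    previous.methodologyVersion === SUMMARIES_VERSION
  );
}

const previousSummary = {
  entityId: "body-a",
  inputHash: "abc123",
  methodologyVersion: "summaries-1.0.0",
  bullets: [],
};

console.log(canReuseSummary(previousSummary, "abc123")); // true: nothing changed
console.log(canReuseSummary(previousSummary, "def456")); // false: records changed
console.log(canReuseSummary(undefined, "abc123"));       // false: no previous summary
```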
Source of truth
Summaries are derivative. The underlying records are the ground truth. A summary may be incomplete or out of date relative to the records it cites; in all cases, follow the [source] links on each bullet to the primary documents. If a summary contradicts the records, the records win.