Methodology

How Mandat Johor works — and where it stops

Transparent methodology is a feature, not an appendix. This page states exactly what the system knows, how it knows it, and what it refuses to claim.

01 Data sources and hierarchy

Tier 1 (official): SPR portal, MySPR Semak, official election notices, OpenDOSM. Tier 2 (structured): ElectionData.MY and comparable open election datasets. Tier 3: verified Malaysian news organisations. Tier 4: legally accessible public web and social pages only.

Official sources always override secondary sources for candidate, seat, party and result facts. Important facts are cross-checked across at least two independent sources where possible; when sources disagree, the item is marked 'Source conflict detected' and routed to the Verification Desk instead of being published.
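The precedence and conflict rule can be sketched in a few lines. This is a minimal illustration, not the production resolver: the `Claim` type, the `resolve` function and the status strings other than 'Source conflict detected' are hypothetical, and it assumes one possible reading of the rule, namely that any disagreement holds the item for review while the most authoritative value is offered to the reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    value: str   # the asserted fact, e.g. a candidate name
    source: str  # hypothetical source label
    tier: int    # 1 = official ... 4 = public web


def resolve(claims: list[Claim]) -> dict:
    """Return a publish decision for one fact given its source claims."""
    if not claims:
        return {"status": "no source", "publish": False, "value": None}

    # The lowest tier number is the most authoritative source present.
    best = min(claims, key=lambda c: c.tier)

    if len({c.value for c in claims}) > 1:
        # Assumed reading: any disagreement holds the item for review,
        # with the most authoritative value offered to the reviewer.
        return {"status": "Source conflict detected", "publish": False,
                "value": best.value}

    # Agreement: note whether at least two independent sources back it.
    independent = {c.source for c in claims}
    status = "cross-checked" if len(independent) >= 2 else "single source"
    return {"status": status, "publish": True, "value": best.value}
```

Under this reading a Tier 1 value is never silently overwritten by a lower tier, but it is also not published while a contradicting report is still unresolved.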

02 Live data methodology

The live data layer ingests from four channels, in strict priority order:

(1) Official sources: public SPR portal pages are fetched only where robots.txt permits, are never auto-committed as fact, and route through the Verification Desk. Interactive services like MySPR Semak are never scraped; analysts transcribe lookups via manual import templates that require the official source URL on every row.

(2) ElectionData.MY: provides the historical result baseline via an authenticated API adapter whose endpoints are operator-confirmed, never guessed. Imported rows stay 'partially verified' until cross-checked against Tier 1.

(3) OpenDOSM/data.gov.my: supplies demographic context (population by state and district, DUN-level where published) with an explicit freshness caveat whenever the data predates the current year. It is context, not a current-day measurement.

(4) Trusted Malaysian news sources: read RSS-first under robots.txt and per-source rate limits. Paywalled content is excluded, syndicated copies are deduplicated by canonical URL and content hash, and sentiment is computed only after entity-resolution confidence clears its threshold.
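As an illustration of the deduplication step for news items, the sketch below drops an article when either its canonical URL or its content hash has already been seen. The function names and the normalisation choices (dropping the query string and fragment, case-folding and collapsing whitespace before hashing) are assumptions made for the example; a real pipeline may need to keep query parameters that identify an article.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Lowercase scheme and host; drop query, fragment and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def content_hash(text: str) -> str:
    """SHA-256 of the whitespace-collapsed, case-folded text."""
    normalised = " ".join(text.split()).casefold()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def deduplicate(articles: list[dict]) -> list[dict]:
    """Keep the first article seen for each canonical URL and content hash."""
    seen_urls: set[str] = set()
    seen_hashes: set[str] = set()
    kept = []
    for article in articles:
        url = canonical_url(article["url"])
        digest = content_hash(article["text"])
        if url in seen_urls or digest in seen_hashes:
            continue  # syndicated or repeated copy
        seen_urls.add(url)
        seen_hashes.add(digest)
        kept.append(article)
    return kept
```

Exact hashing only catches verbatim copies after normalisation; lightly edited syndication would need a fuzzier comparison, which this sketch does not attempt.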

Data modes are never blended silently. Sample mode serves only the labelled deterministic seed. Live mode renders only database-backed, receipted data and shows honest empty states for gaps. Hybrid mode may fill gaps with sample values only while their Sample badge stays visible. The no-source-no-metric rule is enforced structurally: fact-bearing tables cannot store a row without a source receipt. Every import writes receipts and audit-log entries; unmatched or conflicting rows go to the manual review queue instead of being force-fitted; and /api/data-health reports coverage, freshness and failures.
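One way to enforce the no-source-no-metric rule structurally is at the schema level, so that a fact row without a valid receipt is rejected by the database itself. The sketch below uses an in-memory SQLite database; the table and column names are hypothetical and are not taken from the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE source_receipt (
    id         INTEGER PRIMARY KEY,
    source_url TEXT NOT NULL,
    fetched_at TEXT NOT NULL
);
CREATE TABLE candidate_fact (
    id             INTEGER PRIMARY KEY,
    seat           TEXT NOT NULL,
    candidate_name TEXT NOT NULL,
    -- NOT NULL plus the foreign key means no receipt, no row.
    receipt_id     INTEGER NOT NULL REFERENCES source_receipt(id)
);
""")


def add_receipt(source_url: str, fetched_at: str) -> int:
    """Store a source receipt and return its id."""
    with conn:
        cur = conn.execute(
            "INSERT INTO source_receipt (source_url, fetched_at) VALUES (?, ?)",
            (source_url, fetched_at),
        )
    return cur.lastrowid


def add_fact(seat: str, candidate_name: str, receipt_id) -> bool:
    """Try to store a fact; return False if the receipt is missing or unknown."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO candidate_fact (seat, candidate_name, receipt_id) "
                "VALUES (?, ?, ?)",
                (seat, candidate_name, receipt_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # in a fuller version this row would go to manual review
```

Putting the constraint in the schema rather than in application code means an import script cannot bypass it by accident.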

Known limitations: historical and demographic baselines inherit their source's own errors and vintages; news-derived signals over-represent online discourse and under-represent offline sentiment; entity linking over short texts is imperfect and therefore confidence-gated; and until live coverage is complete, probability estimates stay wide or are withheld ('Insufficient verified data') rather than rendered precise-looking.

03 What is a fact, what is a model output

A candidate name backed by an SPR nomination record is a fact. A win probability is a model estimate. A sentiment score is an analysis output with known error. An unverified circulating claim is a rumour. These four categories are stored in separate tables, labelled separately in the UI, and never blended.

If a metric has no source receipt, it is not rendered as fact — the UI shows an honest empty state instead.

04 Probability model design

Inputs: historical result baseline, candidate factor, party/coalition baseline, sentiment factor, turnout sensitivity, and data confidence. Outputs: party and candidate probabilities, a race rating (Safe / Likely / Lean / Toss-up / Insufficient Data), a confidence interval, model confidence, and the top movement drivers.

The model never claims certainty. Every published number is a probability estimate, not a prediction guarantee. When verified inputs are insufficient, the model reports 'Insufficient verified data' rather than forcing a number.

05 First Baseline Probability Model (baseline-1.0)

The first live model is a deliberately conservative heuristic built from two receipted inputs only: the winner-level 2022 Johor state election baseline (one winning row per DUN — party, vote share, majority, turnout) and the current 2026 candidate list imported from a trusted secondary source pending official verification. It does not yet include full candidate-level historical results, vote-transfer analysis, verified news or sentiment signals, polling, or demographic weighting — demographic context is treated as a caveat, not a signal.

Model logic: the 2022 winning coalition receives a baseline advantage anchored on its winning vote share, but only if it is contesting the seat in 2026; if it is absent, the model publishes a near-uniform split with increased uncertainty instead of guessing a successor. All challenger parties share the remaining probability equally, because winner-level data contains no information that could separate them. Larger candidate fields shrink probabilities toward uniform: races with four or more candidates cannot be rated Safe. A missing vote share or majority reduces confidence, and partially verified inputs cap it — baseline-1.0 never emits High confidence, and seats without sufficient inputs are withheld as Insufficient Data.
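The shape of this logic can be sketched as follows. This is a hedged illustration only: the rating cut-offs, the probability cap, the shrinkage weight and the function names are all assumptions invented for the example, not the parameters of baseline-1.0. The sketch also uses an exactly uniform split where the text says near-uniform, and assumes party labels in the candidate list are unique.

```python
from typing import Optional

SAFE, LIKELY, LEAN = 0.70, 0.60, 0.52  # assumed rating cut-offs
MAX_PROB = 0.95                        # assumed cap: never look certain


def rate(p_leader: float, n_candidates: int) -> str:
    """Map the leader's probability to a rating; 4+ candidates are never Safe."""
    if p_leader >= SAFE and n_candidates < 4:
        return "Safe"
    if p_leader >= LIKELY:
        return "Likely"
    if p_leader >= LEAN:
        return "Lean"
    return "Toss-up"


def baseline_estimate(incumbent: Optional[str], vote_share: Optional[float],
                      majority: Optional[int], parties: list[str]) -> dict:
    """Winner-level heuristic: one 2022 winning row plus the 2026 field."""
    n = len(parties)
    if incumbent is None or n < 2:
        return {"rating": "Insufficient Data", "probabilities": None,
                "confidence": "Insufficient"}

    uniform = 1.0 / n
    if incumbent not in parties:
        # 2022 winner absent: no successor is guessed.
        return {"rating": "Toss-up",
                "probabilities": {p: uniform for p in parties},
                "confidence": "Low"}

    # Anchor on the 2022 vote share; with no share there is no advantage.
    anchor = vote_share if vote_share is not None else uniform
    # Assumed shrinkage: larger fields pull the estimate toward uniform.
    weight = max(0.0, 1.0 - 0.1 * (n - 2))
    p_inc = min(MAX_PROB, max(uniform, uniform + weight * (anchor - uniform)))

    # Challengers share the remainder equally.
    rest = (1.0 - p_inc) / (n - 1)
    probs = {p: (p_inc if p == incumbent else rest) for p in parties}

    # Missing inputs reduce confidence; "High" is never emitted.
    confidence = "Low" if vote_share is None or majority is None else "Medium"
    return {"rating": rate(p_inc, n), "probabilities": probs,
            "confidence": confidence}
```

For example, under these assumed parameters a two-way race where the 2022 winner took 65% of the vote yields a probability of about 0.65 for that coalition and a Likely rating, while a four-way race with a 90% winning share is still capped below Safe.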

Every estimate records its confidence interval, model confidence, top drivers, the source receipts of its underlying candidate and historical rows, and explicit limitation notes — including the standing note that this is a winner-level 2022 baseline with full candidate-level history pending. These are probability estimates with intentionally wide uncertainty, not predictions or guarantees.

06 Confidence calculation

Confidence grades combine source tier, cross-check agreement, recency, extraction quality and entity-resolution scores. High requires Tier 1/2 agreement on the underlying facts; Medium indicates partial corroboration; Low indicates single-source or noisy inputs; Insufficient means the value is withheld.
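A simplified grading function might look like the sketch below. The `Evidence` fields, the quality threshold and the exact combination rule are assumptions for illustration; in particular, 'Tier 1/2 agreement' is read here as at least two agreeing sources drawn from Tier 1 or Tier 2, which is only one possible interpretation.

```python
from dataclasses import dataclass

MIN_QUALITY = 0.8  # assumed threshold for extraction and entity scores


@dataclass
class Evidence:
    source_tiers: list[int]    # tier of each independent source that agrees
    recent: bool               # whether the data passes the recency check
    extraction_quality: float  # 0.0 to 1.0
    entity_score: float        # 0.0 to 1.0, entity-resolution confidence


def grade(ev: Evidence) -> str:
    """Return High, Medium, Low or Insufficient for one value."""
    if not ev.source_tiers:
        return "Insufficient"  # nothing to stand on: the value is withheld

    noisy = (not ev.recent
             or min(ev.extraction_quality, ev.entity_score) < MIN_QUALITY)
    if len(ev.source_tiers) < 2 or noisy:
        return "Low"  # single-source or noisy inputs

    # Assumed reading of "Tier 1/2 agreement": two or more agreeing
    # sources from Tier 1 or Tier 2.
    top = [t for t in ev.source_tiers if t <= 2]
    return "High" if len(top) >= 2 else "Medium"
```

A model such as baseline-1.0 would then apply its own cap on top of this grade, since it never emits High.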

07 Sentiment and narrative limitations

Sentiment classification over Bahasa Melayu, English, and (where feasible) Chinese and Tamil text is imperfect: sarcasm, code-switching and dialect reduce accuracy. Scores are aggregates over public text, not measurements of voter intention. Low-confidence classifications are excluded from seat signals and queued for review.
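The confidence gate on sentiment can be illustrated with a short aggregation sketch. The threshold value, the record layout and the function name are hypothetical; the point is only that low-confidence items never reach the seat-level score and are set aside for review instead.

```python
from statistics import mean

MIN_CONFIDENCE = 0.7  # assumed threshold, not stated in the methodology


def seat_signal(classified: list[dict]) -> dict:
    """Aggregate sentiment for one seat from confident classifications only."""
    kept = [c for c in classified if c["confidence"] >= MIN_CONFIDENCE]
    review = [c for c in classified if c["confidence"] < MIN_CONFIDENCE]
    # With nothing confident enough, report no score rather than a guess.
    score = mean(c["score"] for c in kept) if kept else None
    return {"score": score, "n_used": len(kept), "review_queue": review}
```

Returning `None` when nothing clears the gate mirrors the honest empty state used elsewhere, rather than averaging over noise.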

Narrative summaries are grounded in retrieved source excerpts. If the text does not support a claim, the claim is not included. The pipeline is forbidden from inventing missing facts.

08 Crawler rules

The crawler respects robots.txt, honours per-source rate limits with backoff, and never accesses private groups, login-only content or paywalled text. Blocked or legally unclear sources are marked 'Unavailable — access restricted' and left uncrawled. Syndicated duplicates are removed by content hashing before they can inflate volume metrics.
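A minimal sketch of the robots.txt check and backoff, using Python's standard `urllib.robotparser`. The user-agent string, the default delay and the exponential backoff parameters are assumptions for the example; the robots.txt text is parsed from a string here so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MandatJohorBot"  # hypothetical user-agent string


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(parser: RobotFileParser, url: str,
               default_delay: float = 2.0) -> dict:
    """Decide whether a URL may be fetched and how long to wait first."""
    if not parser.can_fetch(USER_AGENT, url):
        # Disallowed: the source is left uncrawled and labelled as restricted.
        return {"fetch": False, "delay": None}
    delay = parser.crawl_delay(USER_AGENT)  # None if no Crawl-delay is set
    return {"fetch": True,
            "delay": float(delay) if delay is not None else default_delay}


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff after a failed request, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))
```

Login-only and paywalled content cannot be detected from robots.txt alone, so those exclusions would need separate per-source rules that this sketch does not cover.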

No private personal data is collected. Every crawl writes an audit log entry.

09 Sample mode

Until live feeds are connected and verified, the entire interface runs on a deterministic sample generator. Sample values are stamped with a Sample badge and a mock receipt, and are not derived from any real polling, reporting or social data. They exist to exercise the interface, not to inform anyone about Johor politics.

10 Political neutrality statement

Mandat Johor does not endorse, attack or promote any party, coalition or candidate. Copy is analytical and source-backed. The system publishes uncertainty, shows conflicts instead of hiding them, and keeps a full audit trail of every change to important entities.

Versioning

Methodology version m-2026.07.1 · Live model baseline-1.0 · Sample generator sample-0.3

Every probability estimate records the model and methodology versions that produced it, so historical outputs remain auditable after upgrades.