Section 00 · Technical methodology
01
Pipeline overview
§ 01.1

From raw data to structured findings

Political Prisoner Watch goes beyond simple aggregation. The platform runs a dedicated Python ML microservice that exposes country-specific models through a REST API to analyze risk, forecast repression trends, detect coordinated campaigns, and generate evidence for legal proceedings, for both Russia and Belarus.

Every case enters a multi-stage pipeline. Each stage is documented below, along with the techniques involved.

  • Countries tracked: 2
  • ML models: 15+
  • API endpoints: 50+
  • Model artifacts: 35+

02
ML stages
§ 02.1 — § 02.10

The ten-stage pipeline

A case enters at the top, exits at the bottom. Each stage adds structure, evidence, or context that the next one builds on.

01
Stage 01 · Real-time synchronization

Automated data ingestion

The system continuously ingests data from established human rights sources. Sync scripts keep our PostgreSQL database current with the latest arrests, prosecutions, sentencing, and prisoner locations across the regions we cover.

  • Primary sources: OVD-Info, Memorial (Russia), Viasna (Belarus), and the Committee to Protect Journalists (worldwide press-freedom cases)
  • Secondary sources: court record aggregators and direct family / legal counsel reports
  • Structured database with a normalized schema for prisoners, cases, articles, locations, and outcomes
  • Automated sync runs continuously to detect new arrests, status changes, and releases
  • Geolocation enrichment for coordinate standardization and region mapping
  • Native-to-English machine translation (Russian & Belarusian) for international accessibility
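
The change-detection step of a sync run can be sketched as a comparison between stored records and a fresh source snapshot. This is a minimal illustration, not the platform's sync code: the `PrisonerRecord` fields, the status values, and the `diff_snapshots` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrisonerRecord:
    # Hypothetical minimal record; the real schema is richer.
    prisoner_id: str
    status: str      # assumed values: "detained", "sentenced", "released"
    location: str


def diff_snapshots(stored, incoming):
    """Compare stored records with a fresh source snapshot, keyed by ID."""
    changes = {"new": [], "status_changed": [], "released": []}
    for pid, record in incoming.items():
        previous = stored.get(pid)
        if previous is None:
            changes["new"].append(pid)              # first time we see this ID
        elif previous.status != record.status:
            if record.status == "released":
                changes["released"].append(pid)
            else:
                changes["status_changed"].append(pid)
    return changes


stored = {
    "p1": PrisonerRecord("p1", "detained", "Moscow"),
    "p2": PrisonerRecord("p2", "sentenced", "Minsk"),
}
incoming = {
    "p1": PrisonerRecord("p1", "sentenced", "Moscow"),
    "p2": PrisonerRecord("p2", "released", "Minsk"),
    "p3": PrisonerRecord("p3", "detained", "Kazan"),
}
changes = diff_snapshots(stored, incoming)
print(changes)
```

Keying both snapshots by a stable identifier keeps the comparison linear in the number of records and makes each change category easy to route to its own alert.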
02
Stage 02 · Legal actor extraction

Named entity recognition

Custom NER models extract structured legal entities from unstructured case summaries — judges, prosecutors, investigators, lawyers, and courts — enabling analytics on the judicial system itself.

  • Custom NER models trained on multi-lingual legal text
  • Entity extraction runs on every case summary; results are stored for downstream analytics
  • Per-judge analytics: average sentence length, case count, deviation from global average, harshness classification
  • Surfaces systematically harsh judges and patterns of judicial behavior
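
The per-judge analytics listed above can be sketched with nothing more than grouped averages. The sample data, the `judge_stats` helper, and the 1.25x harshness threshold are assumptions made for illustration; the platform's actual classification rule is not documented here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (judge, sentence_months) pairs produced by entity extraction.
cases = [
    ("Judge A", 24), ("Judge A", 36),
    ("Judge B", 84), ("Judge B", 96),
    ("Judge C", 60),
]

HARSH_RATIO = 1.25  # assumed: "harsh" means 25% above the global average


def judge_stats(cases, harsh_ratio=HARSH_RATIO):
    """Case count, average sentence, deviation from global average, harshness flag."""
    by_judge = defaultdict(list)
    for judge, months in cases:
        by_judge[judge].append(months)
    global_avg = mean(months for _, months in cases)
    stats = {}
    for judge, sentences in by_judge.items():
        avg = mean(sentences)
        stats[judge] = {
            "case_count": len(sentences),
            "avg_months": avg,
            "deviation": avg - global_avg,
            "harsh": avg > harsh_ratio * global_avg,
        }
    return stats


stats = judge_stats(cases)
for judge, row in stats.items():
    print(judge, row)
```

With this sample the global average is 60 months, so only Judge B (average 90 months) crosses the assumed threshold.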
03
Stage 03 · Urgency & torture prediction

XGBoost risk classification

Our core models use XGBoost gradient-boosted trees to predict two critical probabilities for each case: the urgency level (whether immediate advocacy action is required) and the risk of torture while in custody. Russia and Belarus are trained as separate models so legal patterns do not bleed across borders.

  • Gradient-boosted tree classifiers with logistic objective; trained separately for Russia and Belarus
  • Inputs: age, gender, arrest location, case category, and criminal articles (multi-label encoded)
  • Preprocessing: median imputation for numerics, one-hot encoding with unknown-category handling
  • Class imbalance correction for minority outcomes (e.g. torture)
  • Trained on historical outcomes from thousands of documented cases across both countries
  • Outputs: urgency probability (0–1), torture probability (0–1), and feature importance rankings
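
The preprocessing steps above can be illustrated without any ML library. The sketch below shows median imputation, one-hot encoding that maps unseen categories to an all-zero vector, multi-label encoding of criminal articles, and a negative-to-positive ratio of the kind commonly passed to XGBoost as `scale_pos_weight`. The rows, field names, and helper functions are hypothetical, and a production pipeline would more likely use library encoders.

```python
from statistics import median

# Hypothetical training rows; None marks a missing age.
train = [
    {"age": 34, "category": "anti-war", "articles": ["207.3"], "torture": 0},
    {"age": None, "category": "religion", "articles": ["282.2"], "torture": 0},
    {"age": 52, "category": "anti-war", "articles": ["207.3", "280.3"], "torture": 1},
    {"age": 41, "category": "journalism", "articles": ["275"], "torture": 0},
]


def fit_encoder(rows):
    """Learn the imputation value and the known vocabularies from training data."""
    ages = [r["age"] for r in rows if r["age"] is not None]
    return {
        "age_median": median(ages),
        "categories": sorted({r["category"] for r in rows}),
        "articles": sorted({a for r in rows for a in r["articles"]}),
    }


def transform(row, enc):
    """Turn one case into a flat numeric feature vector."""
    age = row["age"] if row["age"] is not None else enc["age_median"]
    # One-hot: a category unseen in training yields an all-zero block.
    one_hot = [1 if row["category"] == c else 0 for c in enc["categories"]]
    # Multi-label: one indicator per known article, several may be set.
    multi_hot = [1 if a in row["articles"] else 0 for a in enc["articles"]]
    return [age] + one_hot + multi_hot


enc = fit_encoder(train)
labels = [r["torture"] for r in train]
# Negatives per positive; a common choice for weighting the minority class.
pos_weight = labels.count(0) / labels.count(1)
unseen = transform({"age": None, "category": "ecology", "articles": ["207.3"]}, enc)
print(unseen, pos_weight)
```

Fitting the encoder on training data only, then reusing it at prediction time, is what lets a new case with an unfamiliar category pass through without breaking the feature layout.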
04
Stage 04 · How they were caught

Surveillance tech attribution

A dedicated classifier predicts the likely surveillance technology used to identify and detain a political prisoner — surfacing patterns in state surveillance infrastructure across regions and case types.

  • Classifier trained on cases with known surveillance methods
  • Features: arrest location, case category, and temporal context
  • Predicts categories such as social media monitoring, CCTV / facial recognition, informant reports, and phone interception
  • Exposes regional surveillance deployment patterns (e.g. Moscow facial recognition vs. regional signals intelligence)
05
Stage 05 · Gap between act and charge

Charge inflation detection

This model detects prosecutorial overreach by measuring the gap between what a person actually did (based on case summaries) and the charges filed against them. Distinct severity mappings exist for the Russian and Belarusian Criminal Codes.

  • Severity mapping: criminal articles scored on a 1–10 scale across both Russian and Belarusian frameworks
  • AI-powered summarization analyzes the description of the actual act committed
  • Inflation Score = charged severity − actual severity, normalized to a 0–100 scale
  • A leaderboard ranks the most inflated cases per legal framework
  • Example: an anti-war post (actual severity ≈ 2) charged under Art. 207.3 "Fakes about army" (severity ≈ 8) → inflation score ≈ 75%
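
The text does not spell out the normalization used. The worked example (actual severity 2, charged severity 8, score of about 75) is consistent with dividing the severity gap by the charged severity, so the sketch below assumes that rule. Other normalizations are possible: dividing by the maximum gap of 9 on a 1–10 scale would give about 67 instead. The function name and the sample cases are hypothetical.

```python
def inflation_score(charged, actual):
    """Severity gap as a 0-100 score.

    Assumption: the gap is divided by the charged severity, which
    reproduces the documented example (2 vs. 8 gives 75).
    """
    if not (1 <= actual <= 10 and 1 <= charged <= 10):
        raise ValueError("severities must lie on the 1-10 scale")
    gap = max(charged - actual, 0)  # under-charging is not counted as inflation
    return 100 * gap / charged


score = inflation_score(charged=8, actual=2)

# Hypothetical cases as (charged severity, actual severity).
cases = {"case-a": (8, 2), "case-b": (6, 5), "case-c": (9, 3)}
leaderboard = sorted(cases, key=lambda cid: inflation_score(*cases[cid]), reverse=True)
print(score, leaderboard)
```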
06
Stage 06 · Expected sentence forecasting

Outcome prediction

Given a case's features, this model predicts the likely sentence length. Fitting the regression on log-transformed sentence lengths guarantees positive predictions, and each prediction is reported with a confidence interval to help legal professionals anticipate outcomes.

  • Log-transformed regression model ensuring all predictions are positive (sentence months)
  • Features: criminal articles, case category, gender, location, age, and charge severity
  • Outputs: predicted sentence length, confidence interval, and comparison to similar cases
  • Anomaly layer flags cases where the actual sentence deviates significantly from the prediction, a possible sign of political interference
  • Outlier detection identifies disproportionate punishments across the database
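
The reason a log transform guarantees positive predictions is that the model is fitted on log(months) and its output is exponentiated, and the exponential of any real number is positive. The sketch below shows this with a single hypothetical feature (charge severity) and ordinary least squares. The interval is a rough residual-based band that ignores parameter uncertainty; the data, function names, and the z = 1.96 band are illustrative assumptions rather than the platform's model.

```python
import math

# Hypothetical (charge_severity, sentence_months) training pairs.
history = [(2, 6), (4, 14), (5, 20), (6, 30), (8, 70), (9, 100)]


def fit_log_linear(pairs):
    """Ordinary least squares of log(months) on severity."""
    xs = [x for x, _ in pairs]
    ys = [math.log(months) for _, months in pairs]
    n = len(pairs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return intercept, slope, sigma


def predict(severity, model, z=1.96):
    """Return (low, point, high) in months; all three are positive."""
    intercept, slope, sigma = model
    log_pred = intercept + slope * severity
    return (
        math.exp(log_pred - z * sigma),
        math.exp(log_pred),
        math.exp(log_pred + z * sigma),
    )


def is_anomalous(severity, actual_months, model):
    """Flag sentences that fall outside the predicted band."""
    low, _, high = predict(severity, model)
    return not (low <= actual_months <= high)


model = fit_log_linear(history)
low, point, high = predict(5, model)
print(round(low, 1), round(point, 1), round(high, 1))
print(is_anomalous(5, 240, model))
```

Because the band is symmetric in log space, it is asymmetric in months: the upper bound sits further from the point estimate than the lower bound does.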
07
Stage 07 · 90-day arrest predictions

Prophet trend forecasting

Facebook Prophet time-series models forecast arrest trends up to 90 days into the future, with separate models for global, Russia-specific, and Belarus-specific trends.

  • Time-series models on weekly arrest counts with automatic changepoint detection
  • Isolated predictive models for Russia vs. Belarus to prevent cross-contamination of historical trends
  • Article-specific models for charges most commonly used in political persecution
  • Repression-wave detection via statistical anomaly detection on rolling windows, with alerts when a location or article shows an unusual spike
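
One simple way to detect spikes on rolling windows is a z-score against the preceding weeks. The sketch below assumes an 8-week window and a threshold of 3 standard deviations; both values, the `detect_spikes` helper, and the sample counts are hypothetical, and the platform's detector may use a different statistic.

```python
from statistics import mean, stdev


def detect_spikes(weekly_counts, window=8, threshold=3.0):
    """Indices of weeks whose count exceeds the trailing mean by `threshold` std devs."""
    spikes = []
    for i in range(window, len(weekly_counts)):
        baseline = weekly_counts[i - window:i]   # the `window` weeks before week i
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            continue  # flat baseline: z-score undefined, skip rather than guess
        if (weekly_counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Hypothetical weekly arrest counts for one location; week 10 is a surge.
weekly = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15, 48, 14]
spikes = detect_spikes(weekly)
print(spikes)
```

The baseline excludes the week being tested, so a surge cannot inflate its own reference mean.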
08
Stage 08 · LDA + BERTopic semantic analysis

Topic & campaign detection

Dual topic modeling identifies thematic clusters across cases — from anti-war speech patterns to religious persecution campaigns. This reveals coordinated repression strategies that are invisible at the level of individual case analysis.

  • Statistical topic modeling discovers latent thematic clusters from case text
  • A transformer-based semantic model captures nuance beyond simple keyword matching
  • Prosecution-template detection finds groups of cases with suspiciously similar summary text
  • Monthly topic tracking reveals how repression focus shifts over time (e.g. war-related charges spike after 2022)
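
Prosecution-template detection can be approximated by measuring textual overlap between case summaries. The sketch below uses Jaccard similarity over three-word shingles, a deliberately simple stand-in for the topic and transformer models described above. The summaries, the 0.5 threshold, and the function names are hypothetical.

```python
from itertools import combinations


def shingles(text, k=3):
    """Set of overlapping k-word sequences from a summary."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Share of shingles common to both sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def template_pairs(summaries, threshold=0.5):
    """Pairs of case IDs whose summaries overlap at or above the threshold."""
    sets = {cid: shingles(text) for cid, text in summaries.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(sets), 2)
        if jaccard(sets[a], sets[b]) >= threshold
    ]


# Hypothetical summaries; c1 and c2 differ only in their last two words.
summaries = {
    "c1": "the defendant publicly disseminated knowingly false information "
          "about the armed forces via a social network",
    "c2": "the defendant publicly disseminated knowingly false information "
          "about the armed forces via a messaging channel",
    "c3": "detained at a prayer meeting and charged with organizing "
          "a banned religious group",
}
pairs = template_pairs(summaries)
print(pairs)
```

Comparing every pair is quadratic in the number of cases, which is acceptable for a sketch; a larger corpus would need an approximate method such as MinHash.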
09
Stage 09 · Graph detection of coordinated repression

Case network analysis

We build similarity networks linking cases by shared charges, location, timing, and tactics. Community detection algorithms identify clusters that reveal coordinated repression campaigns — groups of people targeted simultaneously.

  • Network graphs built individually for Russian and Belarusian prisoners to map distinct domestic networks
  • Weighted edge scoring combines all similarity dimensions into a single 0–1 score
  • Community detection algorithms identify tightly-connected case clusters
  • Each community is profiled: dominant charges, geographic center, time span, and a descriptive label
  • Enables visualization and discovery of coordinated repression campaigns
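
The weighted edge score and the clustering step can be sketched as follows. The weights, the 30-day timing decay, and the 0.6 threshold are assumptions. Connected components over thresholded edges stand in here for a real community detection algorithm such as Louvain, which the text does not name. All case data is hypothetical.

```python
from itertools import combinations

# Assumed weights; they sum to 1 so the combined score stays in [0, 1].
WEIGHTS = {"charges": 0.4, "location": 0.2, "timing": 0.2, "tactic": 0.2}


def edge_score(a, b):
    """Combine four similarity dimensions into a single 0-1 score."""
    union = a["charges"] | b["charges"]
    charge_sim = len(a["charges"] & b["charges"]) / len(union) if union else 0.0
    location_sim = 1.0 if a["location"] == b["location"] else 0.0
    timing_sim = max(0.0, 1 - abs(a["day"] - b["day"]) / 30)  # decays to 0 at 30 days
    tactic_sim = 1.0 if a["tactic"] == b["tactic"] else 0.0
    return (WEIGHTS["charges"] * charge_sim + WEIGHTS["location"] * location_sim
            + WEIGHTS["timing"] * timing_sim + WEIGHTS["tactic"] * tactic_sim)


def communities(cases, threshold=0.6):
    """Group cases connected by edges at or above the threshold (union-find)."""
    parent = {cid: cid for cid in cases}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(cases, 2):
        if edge_score(cases[a], cases[b]) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for cid in cases:
        groups.setdefault(find(cid), set()).add(cid)
    return sorted(groups.values(), key=sorted)


cases = {
    "a": {"charges": {"207.3"}, "location": "Moscow", "day": 0, "tactic": "home search"},
    "b": {"charges": {"207.3"}, "location": "Moscow", "day": 3, "tactic": "home search"},
    "c": {"charges": {"207.3", "280.3"}, "location": "Moscow", "day": 5, "tactic": "home search"},
    "d": {"charges": {"282.2"}, "location": "Kazan", "day": 200, "tactic": "raid"},
    "e": {"charges": {"282.2"}, "location": "Kazan", "day": 204, "tactic": "raid"},
}
clusters = communities(cases)
print(clusters)
```

Each resulting cluster could then be profiled by its most common charges, location, and time span, as described in the list above.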
10
Stage 10 · AI-assisted legal support

Legal evidence & asylum tools

Our most impactful outputs: statistical persecution evidence, individual risk scores for asylum seekers, comparative case finding, and generative document drafting — designed to support real legal proceedings.

  • Persecution Evidence Score: statistical proof of pattern-based persecution (by category, location, or article) with confidence intervals and baselines
  • Personal Risk Score: 0–100% individualized risk assessment for asylum applicants based on their specific profile
  • Comparative Case Finder: identifies the most similar documented cases for precedent-building
  • Affidavit Generator: AI-assisted synthesis of prisoner data and country-condition reports into legal support documents
  • All tools are designed to produce court-admissible statistical evidence
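
A rate with a confidence interval compared against a baseline, as described for the Persecution Evidence Score, can be illustrated with the Wilson score interval. This is one standard choice for proportions, shown here as an assumption; the text does not state which interval the platform uses. The counts and the baseline rate below are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


def exceeds_baseline(successes, n, baseline_rate):
    """True when even the lower bound of the interval is above the baseline."""
    low, _ = wilson_interval(successes, n)
    return low > baseline_rate


# Hypothetical: 45 of 60 cases in one category ended in a custodial sentence,
# compared with an assumed baseline rate of 40% across all categories.
low, high = wilson_interval(45, 60)
print(round(low, 3), round(high, 3), exceeds_baseline(45, 60, 0.40))
```

Comparing the lower bound, rather than the point estimate, to the baseline keeps small categories from producing strong-looking claims on thin data.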
03
Architecture
§ 03.1

System architecture

Four discrete layers. Data flows from Layer 01 through Layer 04; each layer can be replaced or scaled independently.

  • Layer 01 · Data sources: OVD-Info API, Memorial HRC, Viasna, CPJ Data, court records
  • Layer 02 · Backend (Node.js): PostgreSQL, interoperable REST API, sync jobs, auth / admin
  • Layer 03 · ML microservice (Python): XGBoost models, Prophet forecasts, NER & BERTopic, risk & outcome
  • Layer 04 · Frontend (Next.js): interactive map, analytics dashboard, legal tools, risk radar
04
Limitations
§ 04.1

A note on predictive models

Probabilistic, not deterministic

Risk scores and forecasts are statistical estimates derived from historical data — not predictions of what will happen to any specific person.

Aid, never replacement

The pipeline aids researchers and legal professionals in prioritization. It does not replace human judgment, attorney review, or the lived expertise of families and witnesses.

Continuously retrained

Every model is retrained as new data arrives. A "high risk" classification reflects a statistical resemblance to past cases — outcomes for an individual may differ.

Outputs from these models should be treated as supporting evidence, not adjudication. They sit alongside source records on each case page so reviewers can see the basis of any score.
Methodology — How We Document Political Prisoners | Political Prisoner Watch