Methodology — How We Document Political Prisoners

Public registry document · Doc № PPW-METHODOLOGY · Last revised · 07 / 16 / 26

Public access

Section 00 · Technical methodology

Pipeline · Stages · Architecture · Limitations

Pipeline overview

§ 01.1

From raw data to structured findings

Political Prisoner Watch goes beyond simple aggregation. The platform runs a dedicated Python ML microservice that serves country-specific models through a REST API to analyze risk, forecast repression trends, detect coordinated campaigns, and generate evidence for legal proceedings, for both Russia and Belarus.

Every case enters a multi-stage pipeline. Each stage is documented below — expand "How it works" on any stage to inspect the techniques involved.

Countries tracked

15+

ML models

50+

API endpoints

35+

Model artifacts

ML stages

§ 02.1 — § 02.10

The ten-stage pipeline

A case enters at the top and exits at the bottom. Each stage adds structure, evidence, or context that the next one builds on.

Stage 01 · Real-time synchronization

Automated data ingestion

The system continuously ingests data from established human rights sources. Sync scripts keep our PostgreSQL database current with the latest arrests, prosecutions, sentencing, and prisoner locations across the regions we cover.

Primary sources: OVD-Info, Political Prisoners Support. Memorial (Independent Human Rights Project, Russia), Viasna (Belarus), and the Committee to Protect Journalists (worldwide press-freedom cases)
Secondary sources: court record aggregators and direct family / legal counsel reports
Structured database with a normalized schema for prisoners, cases, articles, locations, and outcomes
Automated sync runs continuously to detect new arrests, status changes, and releases
Geolocation enrichment for coordinate standardization and region mapping
Native-to-English machine translation (Russian & Belarusian) for international accessibility
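
The change-detection step of the sync can be sketched as a comparison between stored rows and a fresh source snapshot. This is a minimal illustration, not the platform's sync script: the `PrisonerRecord` fields, the status values, and the `diff_records` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrisonerRecord:
    # Hypothetical minimal record; the real schema has more tables and fields.
    source_id: str
    name: str
    status: str  # assumed values, e.g. "detained", "sentenced", "released"


def diff_records(stored, incoming):
    """Compare stored rows with a fresh source snapshot, keyed by source_id."""
    stored_by_id = {r.source_id: r for r in stored}
    changes = {"new_arrests": [], "status_changes": [], "releases": []}
    for record in incoming:
        previous = stored_by_id.get(record.source_id)
        if previous is None:
            # Not seen before: treat as a newly documented case.
            changes["new_arrests"].append(record)
        elif previous.status != record.status:
            bucket = "releases" if record.status == "released" else "status_changes"
            changes[bucket].append(record)
    return changes


stored = [
    PrisonerRecord("a1", "Person A", "detained"),
    PrisonerRecord("b2", "Person B", "sentenced"),
]
incoming = [
    PrisonerRecord("a1", "Person A", "sentenced"),
    PrisonerRecord("b2", "Person B", "released"),
    PrisonerRecord("c3", "Person C", "detained"),
]
changes = diff_records(stored, incoming)
```

Keying on a stable source identifier rather than on names is the design choice that matters here, since transliterated names vary between sources.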

Stage 02 · Legal actor extraction

Named entity recognition

Custom NER models extract structured legal entities from unstructured case summaries — judges, prosecutors, investigators, lawyers, and courts — enabling analytics on the judicial system itself.

Custom NER models trained on multi-lingual legal text
Entity extraction runs on every case summary; results are stored for downstream analytics
Per-judge analytics: average sentence length, case count, deviation from global average, harshness classification
Surfaces systematically harsh judges and patterns of judicial behavior
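
The per-judge analytics listed above could be computed along these lines once judges have been extracted. The sketch assumes a list of (judge, sentence in months) pairs; the `harsh_ratio` cutoff and the "harsh" / "typical" labels are assumptions for illustration, since the source does not state how harshness is classified.

```python
from collections import defaultdict
from statistics import mean


def judge_analytics(cases, harsh_ratio=1.25):
    """cases: list of (judge_name, sentence_months) pairs.

    harsh_ratio is an assumed cutoff: a judge whose average sentence is at
    least 25% above the global average is labeled "harsh".
    """
    by_judge = defaultdict(list)
    for judge, months in cases:
        by_judge[judge].append(months)

    global_avg = mean(m for _, m in cases)
    report = {}
    for judge, sentences in by_judge.items():
        avg = mean(sentences)
        report[judge] = {
            "case_count": len(sentences),
            "avg_sentence_months": avg,
            "deviation_from_global": avg - global_avg,
            "harshness": "harsh" if avg >= harsh_ratio * global_avg else "typical",
        }
    return report


# Toy input with placeholder judge names.
cases = [("Judge X", 96), ("Judge X", 84), ("Judge Y", 36), ("Judge Y", 24)]
report = judge_analytics(cases)
```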

Sample data available

JSON sample · CSV sample

Stage 03 · Urgency & torture prediction

XGBoost risk classification

Our core models use XGBoost gradient-boosted trees to predict two critical probabilities for each case: the urgency level (whether immediate advocacy action is required) and the risk of torture while in custody. Russia and Belarus are trained as separate models so legal patterns do not bleed across borders.

Gradient-boosted tree classifiers with logistic objective; trained separately for Russia and Belarus
Inputs: age, gender, arrest location, case category, and criminal articles (multi-label encoded)
Preprocessing: median imputation for numerics, one-hot encoding with unknown-category handling
Class imbalance correction for minority outcomes (e.g. torture)
Trained on historical outcomes from thousands of documented cases across both countries
Outputs: urgency probability (0–1), torture probability (0–1), and feature importance rankings
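
The preprocessing and class-imbalance steps can be illustrated without the model itself. The sketch below is plain Python and only shows the data preparation: median imputation, one-hot encoding that tolerates unseen categories, multi-label article encoding, and a negative-to-positive ratio for the rare class. In XGBoost that ratio is commonly passed as the `scale_pos_weight` parameter; whether the platform does so is an assumption. Field names, the "journalism" category, and the article codes other than 207.3 are placeholders.

```python
from statistics import median


def fit_preprocessor(rows):
    """Learn the age median, known categories and known articles from training rows."""
    ages = [r["age"] for r in rows if r["age"] is not None]
    return {
        "age_median": median(ages),
        "categories": sorted({r["category"] for r in rows}),
        "articles": sorted({a for r in rows for a in r["articles"]}),
    }


def transform(row, prep):
    """Turn one case into a numeric feature vector."""
    age = row["age"] if row["age"] is not None else prep["age_median"]
    # One-hot: an unseen category yields all zeros instead of raising an error.
    category = [1.0 if row["category"] == c else 0.0 for c in prep["categories"]]
    # Multi-label: one indicator per known criminal article.
    articles = [1.0 if a in row["articles"] else 0.0 for a in prep["articles"]]
    return [float(age)] + category + articles


def positive_class_weight(labels):
    """Ratio of negative to positive examples, a common weight for the rare class."""
    positives = sum(labels)
    return (len(labels) - positives) / positives


train = [
    {"age": 34, "category": "anti-war", "articles": ["207.3"]},
    {"age": None, "category": "religion", "articles": ["282.2"]},
    {"age": 52, "category": "anti-war", "articles": ["207.3", "280.3"]},
]
prep = fit_preprocessor(train)
vector = transform({"age": None, "category": "journalism", "articles": ["280.3"]}, prep)
weight = positive_class_weight([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
```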

Stage 04 · How they were caught

Surveillance tech attribution

A dedicated classifier predicts the likely surveillance technology used to identify and detain a political prisoner — surfacing patterns in state surveillance infrastructure across regions and case types.

Classifier trained on cases with known surveillance methods
Features: arrest location, case category, and temporal context
Predicts categories such as social media monitoring, CCTV / facial recognition, informant reports, and phone interception
Exposes regional surveillance deployment patterns (e.g. Moscow facial recognition vs. regional signals intelligence)
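
Once each case has a predicted method, the regional patterns mentioned above reduce to a per-region tally. The following sketch shows only that aggregation step, not the classifier; the method labels and the toy predictions are illustrative placeholders, not real output.

```python
from collections import Counter, defaultdict


def method_share_by_region(predictions):
    """predictions: list of (region, predicted_method) pairs.

    Returns, per region, the share of cases attributed to each method.
    """
    counts = defaultdict(Counter)
    for region, method in predictions:
        counts[region][method] += 1
    return {
        region: {m: n / sum(c.values()) for m, n in c.items()}
        for region, c in counts.items()
    }


# Toy predictions, invented for the example.
predictions = [
    ("Moscow", "facial_recognition"),
    ("Moscow", "facial_recognition"),
    ("Moscow", "social_media_monitoring"),
    ("Region X", "phone_interception"),
]
shares = method_share_by_region(predictions)
```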

Stage 05 · Gap between act and charge

Charge inflation detection

This model detects prosecutorial overreach by measuring the gap between what a person actually did (based on case summaries) and the charges filed against them. Distinct severity mappings exist for the Russian and Belarusian Criminal Codes.

Severity mapping: criminal articles scored on a 1–10 scale across both Russian and Belarusian frameworks
AI-powered summarization analyzes the description of the actual act committed
Inflation Score = charged severity − actual severity, normalized to a 0–100 scale
A leaderboard ranks the most inflated cases per legal framework
Example: an anti-war post (actual severity ≈ 2) charged under Art. 207.3 "Fakes about army" (severity ≈ 8) → inflation score ≈ 75%
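
The arithmetic behind the score can be sketched as follows. The source does not state how the gap is normalized; dividing by the charged severity is an assumption, chosen because it reproduces the example above (charged 8, actual 2, score 75). The function name and the clamping of negative gaps to zero are also assumptions.

```python
def inflation_score(charged_severity, actual_severity):
    """Gap between charged and actual severity as a 0-100 score.

    Assumption: the gap is divided by the charged severity. The source
    does not state the normalization.
    """
    for value in (charged_severity, actual_severity):
        if not 1 <= value <= 10:
            raise ValueError("severity must be on the 1-10 scale")
    gap = max(charged_severity - actual_severity, 0)  # no negative inflation
    return 100 * gap / charged_severity


score = inflation_score(charged_severity=8, actual_severity=2)  # 75.0
```

Dividing by the largest possible gap on a 1 to 10 scale (9) would instead give about 67 for the same case, so the choice of normalization changes how scores compare across cases.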

Stage 06 · Expected sentence forecasting

Outcome prediction

Given a case's features, this model predicts the likely sentence length. Log-transformed regression guarantees positive predictions and produces a confidence interval, helping legal professionals anticipate outcomes.

Log-transformed regression model ensuring all predictions are positive (sentence months)
Features: criminal articles, case category, gender, location, age, and charge severity
Outputs: predicted sentence length, confidence interval, and comparison to similar cases
Anomaly layer flags cases where the actual sentence deviates significantly from the predicted one, which may indicate political interference
Outlier detection identifies disproportionate punishments across the database
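
A minimal sketch of why the log transform keeps predictions positive: the model is fitted on log(months) and the result is mapped back with exp(), which is always greater than zero. For brevity this uses a single feature (charge severity) and ordinary least squares, which is a simplification of the multi-feature model described above. The interval is a rough residual-based band that assumes roughly normal errors in log space and ignores parameter uncertainty. The training values are toy numbers.

```python
import math


def fit_log_linear(severities, sentence_months):
    """Least-squares fit of log(months) = a + b * severity."""
    n = len(severities)
    logs = [math.log(m) for m in sentence_months]
    mean_x = sum(severities) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in severities)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(severities, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(severities, logs)]
    # Residual spread in log space, with n - 2 degrees of freedom.
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return intercept, slope, sigma


def predict_months(model, severity, z=1.96):
    """Point estimate plus a rough interval; exp() keeps all three positive."""
    intercept, slope, sigma = model
    center = intercept + slope * severity
    return math.exp(center), math.exp(center - z * sigma), math.exp(center + z * sigma)


def is_anomalous(model, severity, actual_months, z=1.96):
    """Flag a sentence that falls outside the interval for its severity."""
    _, low, high = predict_months(model, severity, z)
    return not low <= actual_months <= high


# Toy training data, invented for the example.
model = fit_log_linear([2, 4, 5, 6, 8, 9], [6, 12, 20, 30, 60, 96])
predicted, low, high = predict_months(model, severity=3)
```

The same interval doubles as a simple anomaly check: `is_anomalous(model, 3, 240)` flags a 240-month sentence for a severity-3 charge in this toy data.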

Stage 07 · 90-day arrest predictions

Prophet trend forecasting

Facebook Prophet time-series models forecast arrest trends up to 90 days ahead, with separate models for global, Russia-specific, and Belarus-specific trends.

Time-series models on weekly arrest counts with automatic changepoint detection
Isolated predictive models for Russia vs. Belarus to prevent cross-contamination of historical trends
Article-specific models for charges most commonly used in political persecution
Repression-wave detection via statistical anomaly on rolling windows — alerts when a location or article shows unusual spikes
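
The rolling-window anomaly check can be sketched with a trailing z-score. This illustrates the wave-detection idea only and does not involve Prophet; the window length, the threshold, and the weekly counts are assumptions made for the example.

```python
from statistics import mean, stdev


def detect_spikes(weekly_counts, window=8, threshold=3.0):
    """Return indices of weeks whose count is far above the trailing window.

    A week is flagged when it exceeds the mean of the previous `window`
    weeks by more than `threshold` standard deviations.
    """
    spikes = []
    for i in range(window, len(weekly_counts)):
        history = weekly_counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            sigma = 1.0  # flat history: fall back to an absolute gap of `threshold`
        if (weekly_counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Toy weekly arrest counts for one location or article.
weekly_arrests = [10, 12, 11, 9, 10, 13, 11, 12, 40, 12, 11]
spikes = detect_spikes(weekly_arrests)  # [8]
```

Note that a flagged spike stays in the trailing window for the following weeks and inflates its spread, so a production version might exclude flagged weeks from the history.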

Stage 08 · LDA + BERTopic semantic analysis

Topic & campaign detection

Dual topic modeling identifies thematic clusters across cases — from anti-war speech patterns to religious persecution campaigns. This reveals coordinated repression strategies that are invisible at the level of individual case analysis.

Statistical topic modeling discovers latent thematic clusters from case text
A transformer-based semantic model captures nuance beyond simple keyword matching
Prosecution-template detection finds groups of cases with suspiciously similar summary text
Monthly topic tracking reveals how repression focus shifts over time (e.g. war-related charges spike after 2022)
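
Prosecution-template detection can be approximated with a plain text-overlap measure. The sketch below uses Jaccard similarity over word trigrams, which is a swapped-in technique for illustration and not LDA or BERTopic; the source does not say how similarity is measured. The threshold and the case summaries are invented.

```python
import re
from itertools import combinations


def shingles(text, size=3):
    """Set of overlapping word n-grams from lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_template_pairs(summaries, threshold=0.6):
    """Return (id_a, id_b, similarity) for pairs of near-identical summaries."""
    sets = {case_id: shingles(text) for case_id, text in summaries.items()}
    pairs = []
    for a, b in combinations(sorted(sets), 2):
        score = jaccard(sets[a], sets[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


# Invented summaries: the first two differ only in their final words.
summaries = {
    "case-1": "The defendant published a post discrediting the armed forces on a social network",
    "case-2": "The defendant published a post discrediting the armed forces on a messaging channel",
    "case-3": "A journalist was detained after covering a court hearing",
}
pairs = find_template_pairs(summaries)
```

Comparing every pair is quadratic in the number of cases, so a larger corpus would need an approximate method such as MinHash.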

Stage 09 · Graph detection of coordinated repression

Case network analysis

We build similarity networks linking cases by shared charges, location, timing, and tactics. Community detection algorithms identify clusters that reveal coordinated repression campaigns — groups of people targeted simultaneously.

Network graphs built individually for Russian and Belarusian prisoners to map distinct domestic networks
Weighted edge scoring combines all similarity dimensions into a single 0–1 score
Community detection algorithms identify tightly-connected case clusters
Each community is profiled: dominant charges, geographic center, time span, and a descriptive label
Enables visualization and discovery of coordinated repression campaigns
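
The edge scoring and clustering steps can be sketched in a few lines. The weights, the 90-day timing decay, and the field names are assumptions. Grouping uses connected components over thresholded edges (via union-find), a deliberately simple stand-in for community detection, since the source does not name the algorithm it uses.

```python
from itertools import combinations

# Assumed weights for the four similarity dimensions; they sum to 1.
WEIGHTS = {"charges": 0.4, "location": 0.2, "timing": 0.2, "tactics": 0.2}


def edge_score(a, b, max_days=90):
    """Combine charge, location, timing and tactic similarity into a 0-1 score.

    Each case needs a non-empty set of articles.
    """
    charges = len(a["articles"] & b["articles"]) / len(a["articles"] | b["articles"])
    location = 1.0 if a["region"] == b["region"] else 0.0
    timing = max(0.0, 1.0 - abs(a["arrest_day"] - b["arrest_day"]) / max_days)
    tactics = 1.0 if a["tactic"] == b["tactic"] else 0.0
    parts = {"charges": charges, "location": location, "timing": timing, "tactics": tactics}
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)


def clusters(cases, threshold=0.7):
    """Group cases connected by edges at or above the threshold (union-find)."""
    parent = {cid: cid for cid in cases}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(cases, 2):
        if edge_score(cases[a], cases[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for cid in cases:
        groups.setdefault(find(cid), set()).add(cid)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Toy cases; "art-B" and "Region X" are placeholders.
cases = {
    "c1": {"articles": {"207.3"}, "region": "Moscow", "arrest_day": 0, "tactic": "social_media"},
    "c2": {"articles": {"207.3"}, "region": "Moscow", "arrest_day": 5, "tactic": "social_media"},
    "c3": {"articles": {"art-B"}, "region": "Region X", "arrest_day": 400, "tactic": "informant"},
}
found = clusters(cases)
```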

Stage 10 · AI-assisted legal support

Legal evidence & asylum tools

Our most impactful outputs: statistical persecution evidence, individual risk scores for asylum seekers, comparative case finding, and generative document drafting — designed to support real legal proceedings.

Persecution Evidence Score: statistical proof of pattern-based persecution (by category, location, or article) with confidence intervals and baselines
Personal Risk Score: 0–100% individualized risk assessment for asylum applicants based on their specific profile
Comparative Case Finder: identifies the most similar documented cases for precedent-building
Affidavit Generator: AI-assisted synthesis of prisoner data and country-condition reports into legal support documents
All tools are designed to produce court-admissible statistical evidence
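
One way to pair a group-level rate with a confidence interval and a baseline, as the Persecution Evidence Score describes, is a Wilson score interval for a proportion. This is a sketch under stated assumptions, not the platform's scoring method: the choice of the Wilson interval, the `persecution_evidence` helper, its field names, and the input numbers are all hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def persecution_evidence(group_convicted, group_total, baseline_rate):
    """Compare a group's outcome rate with a baseline rate."""
    low, high = wilson_interval(group_convicted, group_total)
    return {
        "group_rate": group_convicted / group_total,
        "interval": (low, high),
        "baseline_rate": baseline_rate,
        # True only when the whole interval sits above the baseline.
        "exceeds_baseline": low > baseline_rate,
    }


# Invented numbers: 45 of 50 cases in a group versus a 60% baseline.
evidence = persecution_evidence(group_convicted=45, group_total=50, baseline_rate=0.6)
```

Reporting the interval alongside the baseline lets a reviewer see how much of the difference could be explained by a small sample.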

Architecture

§ 03.1

System architecture

Four discrete layers. Data flows left to right; each layer can be replaced or scaled independently.

Layer 01 · Data sources

OVD-Info API
Political Prisoners Support. Memorial
Viasna
CPJ Data
Court records

Layer 02 · Backend (Node.js)

PostgreSQL
Interoperable REST API
Sync jobs
Auth / admin

Layer 03 · ML microservice (Python)

XGBoost models
Prophet forecasts
NER & BERTopic
Risk & outcome

Layer 04 · Frontend (Next.js)

Interactive map
Analytics dashboard
Legal tools
Risk radar

Limitations

§ 04.1

A note on predictive models

Probabilistic, not deterministic

Risk scores and forecasts are statistical estimates derived from historical data — not predictions of what will happen to any specific person.

Aid, never replacement

The pipeline aids researchers and legal professionals in prioritization. It does not replace human judgment, attorney review, or the lived expertise of families and witnesses.

Continuously retrained

Every model is retrained as new data arrives. A "high risk" classification reflects a statistical resemblance to past cases — outcomes for an individual may differ.

Outputs from these models should be treated as supporting evidence, not adjudication. They sit alongside source records on each case page so reviewers can see the basis of any score.