System Design: Clinical Trial Management System
Design a scalable clinical trial management system handling protocol design, patient recruitment and enrollment, randomization, data collection (EDC), adverse event reporting, and regulatory submissions for multi-site global trials.
Requirements
Functional Requirements:
- Define clinical trial protocols with study arms, visit schedules, eligibility criteria, and endpoint definitions
- Electronic Data Capture (EDC) with configurable case report forms (CRFs), edit checks, and medical coding (MedDRA, WHO Drug)
- Patient randomization with stratified, block, and adaptive randomization algorithms
- Adverse event capture and reporting with automated MedDRA coding and regulatory timeline tracking (FDA IND Safety Reports)
- Multi-site management with site activation, monitoring visit scheduling, and enrollment tracking dashboards
- Audit trail compliant with 21 CFR Part 11 (electronic records/signatures) and ICH GCP E6(R2)
Non-Functional Requirements:
- Support 5,000 concurrent clinical trials with 10,000 investigator sites globally
- EDC form submission latency under 3 seconds including edit check validation
- 21 CFR Part 11 compliance: electronic signatures, complete audit trail, validated system with documented testing
- Data integrity: no data loss or silent corruption — every data modification is versioned and traceable
- HIPAA compliance for all protected health information with minimum necessary access enforcement and BAAs with all cloud vendors
- 99.9% availability with a maximum 4-hour recovery time objective (RTO) for disaster recovery
Scale Estimation
With 5,000 active trials, each averaging 500 enrolled patients with 15 study visits producing 20 CRF pages per visit, the system manages 750M CRF pages. Daily data entry: 100K CRF page submissions/day from 50,000 active data entry users across all sites. Each CRF submission triggers 10-50 edit checks (range validations, cross-field logic, medical coding lookups) — roughly 2M edit check evaluations/day at an average of 20 checks per submission. Query management: 50K data queries open at any time requiring site response. Randomization calls: 5,000/day across all trials. Adverse event reports: 10,000/day, of which 500 require expedited regulatory reporting within 7-15 days. Document storage: 500TB of trial documents (informed consent forms, monitoring reports, source documents).
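The headline figures above follow directly from the stated assumptions; a quick sketch makes the arithmetic checkable (all inputs are the assumptions from the text, not measured values):

```python
# Back-of-envelope check of the scale estimates; inputs are the
# assumptions stated in the text.
trials = 5_000
patients_per_trial = 500
visits_per_patient = 15
crf_pages_per_visit = 20

total_crf_pages = trials * patients_per_trial * visits_per_patient * crf_pages_per_visit
# 5,000 * 500 * 15 * 20 = 750,000,000

daily_submissions = 100_000
avg_edit_checks_per_submission = 20  # within the stated 10-50 range
daily_edit_checks = daily_submissions * avg_edit_checks_per_submission

print(f"{total_crf_pages:,} CRF pages, {daily_edit_checks:,} edit checks/day")
# → 750,000,000 CRF pages, 2,000,000 edit checks/day
```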
High-Level Architecture
The CTMS follows a multi-tenant architecture where each clinical trial is a logical tenant with isolated data, configurable workflows, and independent access controls. The system is built on a service-oriented architecture with domain services: Protocol Design Service, EDC Service, Randomization Service, Safety (Adverse Event) Service, Site Management Service, and Regulatory Submission Service.
The EDC Service is the core, handling form definition, data entry, edit check execution, and query management. Trial-specific CRF definitions are stored as JSON schemas that drive dynamic form rendering on the web client (React-based). Edit checks are defined as rules in a DSL by data managers and compiled into executable validators that run both client-side (for immediate feedback) and server-side (as the authoritative check). Every data modification is captured as a versioned audit record: old value, new value, reason for change, user identity, timestamp, and electronic signature hash — fully compliant with 21 CFR Part 11.
The Randomization Service runs as an isolated, stateless microservice with its own dedicated PostgreSQL database to prevent any possibility of unblinding. It implements multiple randomization algorithms: simple, permuted block, stratified, and response-adaptive (Bayesian). Randomization lists are generated by biostatisticians and sealed (encrypted) in the system; the service dispenses the next assignment upon request without revealing the full list. Emergency unblinding requires two-person authorization and is logged immutably.
All data is encrypted at rest (AES-256) and in transit (TLS 1.3). Multi-region deployment supports global trials with data residency requirements — EU trial data stays in eu-west-1, Chinese trial data in cn-north-1, per local regulations. Authentication uses SAML 2.0 for enterprise SSO with mandatory MFA for all users.
Core Components
Electronic Data Capture Engine
The EDC engine is a metadata-driven platform where CRF definitions (stored as JSON Schema with custom extensions for edit checks, medical coding, and conditional display logic) drive the entire data capture experience. Data managers build forms in a visual form designer that generates the JSON schema. At runtime, the React client renders forms dynamically from the schema, applying conditional visibility rules client-side. Server-side, each CRF submission is validated against the schema and edit checks: range checks (systolic BP must be 60-250), cross-form checks (randomization date must be after consent date), and medical coding (adverse event terms auto-coded to MedDRA using fuzzy matching with clinical synonym expansion). Failed edit checks generate data queries assigned to the site coordinator for resolution. The audit trail is append-only in PostgreSQL: every field-level change creates a new audit_trail row with columns (trial_id, subject_id, form_id, field_id, old_value, new_value, change_reason, user_id, electronic_signature, timestamp).
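A cross-form check like the one cited above (randomization date must follow consent date) needs data from more than one form, which is why it runs server-side. A hedged sketch, with illustrative form and field names rather than the actual CRF schema:

```python
# Sketch of a server-side cross-form edit check: randomization date
# must be strictly after the consent date.
from datetime import date

def check_randomization_after_consent(subject_forms: dict) -> list:
    consent = subject_forms.get("consent", {}).get("consent_date")
    rand = subject_forms.get("randomization", {}).get("randomization_date")
    if consent is None or rand is None:
        return []  # cannot evaluate until both forms exist
    if rand <= consent:
        return ["Randomization date must be after consent date"]
    return []

forms = {
    "consent": {"consent_date": date(2024, 3, 1)},
    "randomization": {"randomization_date": date(2024, 2, 28)},
}
print(check_randomization_after_consent(forms))  # query raised
```

A failed check like this would create a data query row assigned to the site coordinator, exactly as described for the query workflow above.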
Randomization Engine
The Randomization Engine is deliberately isolated from all other services to maintain study blinding integrity. It has its own database, its own deployment, and its own access controls — no user with EDC access can query randomization assignments except through controlled unblinding procedures. The engine supports: (1) permuted block randomization (block sizes of 4-8, randomly varying) to prevent prediction, (2) stratified randomization, plus covariate-adaptive minimization that balances treatment arms across key prognostic factors (age group, disease severity, site), (3) interactive response technology (IRT/IXRS) for drug supply management, assigning both a treatment arm and a medication kit number. The randomization list is pre-generated, encrypted with a trial-specific key, and stored as sealed envelopes in the database. Each randomization call consumes the next envelope and returns only the assignment for that subject. The biostatistician's unblinding key is protected by a dedicated AWS KMS key with a two-person access policy.
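The list-generation step for permuted blocks can be sketched as follows. This is only an illustration of the idea for a 1:1 two-arm trial with the varying block sizes mentioned above; in production the list is generated by a biostatistician, encrypted, and sealed before the service ever sees it:

```python
# Illustrative permuted-block randomization list generation with
# randomly varying block sizes (even sizes for a 1:1 two-arm trial).
import random

def permuted_block_list(n_subjects: int, arms=("A", "B"), seed=42) -> list:
    rng = random.Random(seed)  # seeded here only for reproducibility
    assignments = []
    while len(assignments) < n_subjects:
        block_size = rng.choice([4, 6, 8])  # varying sizes defeat prediction
        block = list(arms) * (block_size // len(arms))  # balanced block
        rng.shuffle(block)  # permute within the block
        assignments.extend(block)
    return assignments[:n_subjects]

lst = permuted_block_list(20)
print(lst.count("A"), lst.count("B"))  # near-balanced by construction
```

Because every complete block is balanced, arm counts can differ by at most half the final (possibly truncated) block, which is why varying the block size between 4 and 8 preserves balance while making the next assignment hard to predict.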
Safety & Adverse Event Reporting
The Safety Service manages the lifecycle of adverse events from initial capture through regulatory submission. When a site enters an adverse event on the CRF, the system auto-codes the event term to MedDRA (Lowest Level Term → Preferred Term → System Organ Class) using NLP-assisted fuzzy matching. Serious adverse events (SAEs) trigger an automated workflow: the Safety Service evaluates whether the event is unexpected (not listed in the Investigator's Brochure) and related to the study drug — if both criteria are met, the event qualifies as a SUSAR (Suspected Unexpected Serious Adverse Reaction) requiring expedited reporting to the FDA (IND Safety Report within 15 calendar days, or 7 days for fatal/life-threatening events). The system tracks regulatory timelines, generates pre-populated MedWatch 3500A forms, and submits electronically via the FDA ESG (Electronic Submissions Gateway). A compliance dashboard shows all open safety events color-coded by days remaining until regulatory deadline.
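The SUSAR qualification and deadline rule described above (serious + unexpected + related, with 15 calendar days, or 7 for fatal/life-threatening events) reduces to a small decision function. A sketch with illustrative parameter names:

```python
# Sketch of SUSAR qualification and expedited IND Safety Report deadlines.
from datetime import date, timedelta
from typing import Optional

def expedited_report_due(serious: bool, unexpected: bool, related: bool,
                         life_threatening_or_fatal: bool,
                         sponsor_aware_date: date) -> Optional[date]:
    """Return the expedited report due date, or None if not a SUSAR."""
    if not (serious and unexpected and related):
        return None  # routine (non-expedited) reporting applies
    days = 7 if life_threatening_or_fatal else 15  # calendar days
    return sponsor_aware_date + timedelta(days=days)

print(expedited_report_due(True, True, True, False, date(2024, 6, 1)))  # 2024-06-16
print(expedited_report_due(True, True, True, True, date(2024, 6, 1)))   # 2024-06-08
```

The compliance dashboard's color coding falls out of the same computation: days remaining is simply the due date minus today.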
Database Design
PostgreSQL serves as the primary database with multi-tenant isolation via the trial_id column present in every table (row-level security policies enforce tenant isolation). Core tables: trials (trial_id, protocol_number, sponsor, phase, status, created_at), subjects (subject_id, trial_id, site_id, screening_number, randomization_number, consent_date, status), crf_submissions (submission_id, trial_id, subject_id, visit_id, form_definition_id, data JSONB, status DRAFT/SUBMITTED/VERIFIED/LOCKED, submitted_by, submitted_at, electronic_signature_hash), audit_trail (audit_id, trial_id, subject_id, entity_type, entity_id, field_name, old_value, new_value, change_reason, user_id, signature_hash, timestamp), adverse_events (ae_id, trial_id, subject_id, ae_term, meddra_pt_code, meddra_soc_code, onset_date, severity, seriousness_criteria JSONB, causality, outcome, regulatory_report_due_date, report_status).
The Randomization database (separate instance): randomization_lists (list_id, trial_id, stratum_key, assignments_encrypted BYTEA, current_index, created_by, sealed_at), randomization_log (log_id, trial_id, subject_id, stratum_key, assignment_encrypted, assigned_at, assigned_by). All assignment values are encrypted; only the unblinding key holder can decrypt.
API Design
- POST /v1/trials/{trialId}/subjects/{subjectId}/forms/{formId} — Submit a CRF page; body contains form data (JSON matching the form schema) and an electronic_signature (user credentials hash); returns validation results, including any triggered edit checks and queries
- POST /v1/trials/{trialId}/randomize — Randomize a subject; body contains subject_id and stratification_factors; returns randomization_number and treatment_arm (blinded label, e.g., "Arm A")
- POST /v1/trials/{trialId}/adverse-events — Report an adverse event; body contains subject_id, ae_term, onset_date, severity, seriousness criteria, and causality assessment; triggers auto-coding and regulatory timeline evaluation
- GET /v1/trials/{trialId}/dashboard — Retrieve trial dashboard metrics: enrollment by site, screen failure rate, query rate, open SAEs, protocol deviations
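For concreteness, a hypothetical request/response shape for the CRF submission endpoint. The field names follow the API description above; the values and response structure are illustrative assumptions, not a real deployed API:

```python
# Illustrative payload shapes for the CRF submission endpoint.
import json

crf_submission = {
    "data": {"systolic_bp": 132, "diastolic_bp": 84, "weight_kg": 71.5},
    # Placeholder for the hash of the signer's credentials (21 CFR Part 11)
    "electronic_signature": "sha256-of-signed-credentials",
}

# A validation response where no edit checks fired might look like:
validation_result = {
    "submission_id": "sub-20240601-0001",
    "status": "SUBMITTED",
    "queries": [],  # failed edit checks would appear here as data queries
}

print(json.dumps(crf_submission, indent=2))
```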
Scaling & Bottlenecks
The EDC edit check engine is the primary performance bottleneck. Complex trials have 500+ edit checks per CRF page, including cross-form validations that require loading data from other visits. At 100K CRF submissions/day (1.15/sec average, 20/sec peak during site monitoring visits), each triggering 10-50 edit checks with some requiring database lookups, the system must handle 1,000 edit check evaluations/sec at peak. This is optimized by: (1) caching subject data in Redis (all CRF data for a subject keyed by subject_id with 1-hour TTL), reducing database reads by 80%, (2) compiling edit check rules into optimized JavaScript functions at trial setup time rather than interpreting DSL at runtime, (3) running client-side edit checks for simple range validations and reserving server-side checks for cross-form logic.
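Optimization (1), the subject-data cache, is a standard cache-aside pattern. A minimal sketch with an in-process dict standing in for Redis (illustrative only; the real deployment would use a Redis client and per-key TTLs):

```python
# Cache-aside sketch for per-subject CRF data with a 1-hour TTL.
import time

CACHE_TTL_SECONDS = 3600  # the 1-hour TTL from the text
_cache = {}  # subject_id -> (expiry_epoch, data); stand-in for Redis

def load_subject_data_from_db(subject_id: str) -> dict:
    # Stand-in for the PostgreSQL read of all CRF data for the subject.
    return {"subject_id": subject_id, "forms": {}}

def get_subject_data(subject_id: str) -> dict:
    entry = _cache.get(subject_id)
    if entry and entry[0] > time.time():
        return entry[1]  # cache hit: cross-form checks skip the DB read
    data = load_subject_data_from_db(subject_id)
    _cache[subject_id] = (time.time() + CACHE_TTL_SECONDS, data)
    return data

get_subject_data("SUBJ-001")        # miss: loads from DB and caches
print(get_subject_data("SUBJ-001")) # hit: served from cache
```

One design note: on every CRF submission the subject's cache entry should be invalidated or rewritten, otherwise cross-form edit checks could evaluate against stale data within the TTL window.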
Multi-region deployment for data residency adds operational complexity. Each region runs an independent database cluster with no cross-region replication of subject data. Global trial dashboards aggregate metrics (not PHI) from regional deployments via a centralized analytics service that receives anonymized aggregates. This ensures compliance with GDPR (EU data stays in EU), China PIPL, and other data localization laws while maintaining global trial oversight.
Key Trade-offs
- JSON Schema-driven EDC over hard-coded forms: Metadata-driven forms enable rapid trial setup (days instead of months) and mid-study amendments without code changes, but add complexity to the rendering engine and limit form UX customization — mitigated by a rich component library covering 95% of clinical data capture patterns
- Isolated randomization database over shared database with access controls: Physical isolation prevents any possibility of accidental unblinding through database queries or joins, but increases operational overhead (separate backups, separate monitoring) — essential for regulatory confidence in blinding integrity
- Append-only audit trail over soft deletes with change tracking: 21 CFR Part 11 requires a complete, unalterable audit trail — append-only storage ensures no audit record can be modified after creation, but increases storage significantly (audit_trail table is typically 10x larger than the data tables)
- Regional data isolation over global database with encryption: Regional deployment ensures data residency compliance without relying solely on encryption policies, but complicates multi-region trial analytics — aggregated, de-identified metrics bridge the gap for global oversight