
System Design: Online Exam Proctoring System

Design a scalable online exam proctoring system that monitors remote test-takers via webcam and screen capture, detects suspicious behavior using computer vision, and provides audit trails for exam administrators.


Requirements

Functional Requirements:

  • Capture and store webcam video and screen recordings for all exam sessions
  • Real-time suspicious behavior detection: multiple faces, face not visible, gaze off-screen, unauthorized applications open
  • Live proctor monitoring: human proctors can monitor flagged sessions and intervene via text chat
  • Automated pre-exam identity verification via face matching against ID document photo
  • Detailed incident reports with video timestamps of all flagged events for administrator review
  • Integration with LMS platforms via LTI for seamless single sign-on into proctored exam sessions

Non-Functional Requirements:

  • Support 200,000 simultaneous proctored exam sessions
  • Video recording at minimum 720p/15fps per session; stored for 2 years
  • AI detection pipeline latency under 2 seconds from event occurrence to flag creation
  • Zero exam data accessible after student deletion request (GDPR Right to Erasure)
  • False positive rate for AI flags below 5% to avoid unnecessary disruptions

Scale Estimation

200k simultaneous sessions at 720p/15fps (approximately 500 Kbps per session) = 100 Gbps of video ingestion. Over a 3-hour exam, each session generates ~675 MB of video; 200k sessions = 135 TB of new video storage per major exam window. Annual storage: if 200k-session peaks happen 50 times/year with 2-year retention, that's 200k × 675 MB × 50 events/year × 2 years ≈ 13.5 PB retained at steady state. AI inference: 200k sessions each processed at 1 frame/second = 200k inference calls/second. A single A100 GPU handles ~2,000 frames/second for a lightweight face detection model, so real-time inference at peak requires 100 GPUs.
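
These estimates are easy to sanity-check in a few lines; the constants below are the assumptions stated above, not measurements:

```python
# Back-of-envelope numbers from the estimates above.
SESSIONS = 200_000
BITRATE_BPS = 500_000            # ~500 Kbps per session at 720p/15fps
EXAM_SECONDS = 3 * 3600          # 3-hour exam window
PEAKS_PER_YEAR = 50              # major exam windows per year
RETENTION_YEARS = 2

ingest_gbps = SESSIONS * BITRATE_BPS / 1e9
per_session_mb = BITRATE_BPS * EXAM_SECONDS / 8 / 1e6
peak_storage_tb = SESSIONS * per_session_mb / 1e6
retained_pb = peak_storage_tb * PEAKS_PER_YEAR * RETENTION_YEARS / 1e3

print(f"ingest: {ingest_gbps:.0f} Gbps")          # 100 Gbps
print(f"per session: {per_session_mb:.0f} MB")    # 675 MB
print(f"per peak window: {peak_storage_tb:.0f} TB")  # 135 TB
print(f"retained: {retained_pb:.1f} PB")          # 13.5 PB
```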

High-Level Architecture

The system separates video ingestion, AI analysis, and human review into independent planes. Video ingestion: the browser-based exam client uses WebRTC to stream webcam and screen capture to regional media servers (Janus or mediasoup WebRTC gateways). The media server records the stream to object storage (S3) in rolling 5-minute chunks. This chunked recording approach allows immediate access to recent video for AI analysis without waiting for the full session to complete.
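
A minimal sketch of what the chunk layout in S3 might look like; the key convention below is an assumption for illustration, since the design only specifies rolling 5-minute chunks:

```python
# Illustrative S3 key scheme for 5-minute rolling chunks. The naming
# convention is an assumption, not part of the design above.
from datetime import datetime, timezone

CHUNK_SECONDS = 300  # 5-minute segments

def chunk_key(session_id: str, track: str, started_at: datetime) -> str:
    """Key for one recorded segment; zero-padded epoch keeps keys sortable."""
    epoch = int(started_at.timestamp())
    aligned = epoch - (epoch % CHUNK_SECONDS)  # align to 5-minute boundary
    return f"recordings/{session_id}/{track}/{aligned:012d}.webm"

print(chunk_key("sess-42", "webcam", datetime.now(timezone.utc)))
# -> recordings/sess-42/webcam/<aligned-epoch>.webm
```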

AI analysis runs as a streaming pipeline. As new video chunks are written to S3, an S3 event notification triggers a Lambda that enqueues the chunk for the AI analysis service. GPU inference workers (EC2 P4d instances, which carry the A100s assumed in the scale estimate) pull chunks, run the detection model (sampling frames at 1 FPS), and emit detection events: FaceNotVisible, MultipleFaces, GazeDeviation, UnauthorizedAppDetected. Detection events are published to Kafka and consumed by the incident aggregation service, which groups related events into incidents (e.g., 5 consecutive GazeDeviation events = one "prolonged gaze-away" incident). Incidents above a severity threshold are pushed to the live proctor dashboard in real time.
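
A minimal sketch of the aggregation rule described above (five consecutive GazeDeviation events collapse into one incident); the event and incident shapes are assumptions, and Kafka consumption is elided:

```python
# Sketch of the consecutive-event grouping rule; shapes are illustrative.
from dataclasses import dataclass, field

@dataclass
class Incident:
    session_id: str
    kind: str
    frame_timestamps: list = field(default_factory=list)

class GazeAggregator:
    THRESHOLD = 5  # consecutive events that constitute one incident

    def __init__(self):
        self._streak: dict[str, list] = {}  # session_id -> pending frame timestamps

    def on_event(self, session_id: str, detection_type: str, frame_ts: float):
        """Returns an Incident when a streak crosses the threshold, else None."""
        if detection_type != "GazeDeviation":
            self._streak.pop(session_id, None)  # any other event breaks the streak
            return None
        streak = self._streak.setdefault(session_id, [])
        streak.append(frame_ts)
        if len(streak) >= self.THRESHOLD:
            del self._streak[session_id]
            return Incident(session_id, "prolonged_gaze_away", streak)
        return None

agg = GazeAggregator()
for ts in range(5):
    incident = agg.on_event("sess-42", "GazeDeviation", float(ts))
print(incident)  # Incident with 5 frame timestamps on the fifth event
```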

Human proctor monitoring is a separate web application. Proctors see a grid of live thumbnails (1 frame every 5 seconds per session), with flagged sessions highlighted. Clicking a session opens the live WebRTC stream directly from the media server. Proctors can send text messages to students (displayed as an overlay in the exam interface) or terminate an exam session. All proctor actions are logged for audit.

Core Components

WebRTC Media Ingestion Service

The exam client establishes two WebRTC tracks: webcam and screen share. The media gateway (Janus) receives both tracks, records them to S3 using HLS-like segmented recording (5-minute .webm chunks), and maintains a session registry in Redis (session_id → media server node mapping). For 200k simultaneous sessions, media servers are distributed across multiple regions and scaled horizontally — each Janus node handles ~500 concurrent sessions (CPU-bound by recording/muxing). A DNS-based load balancer routes students to the nearest regional media server cluster. If a media server fails mid-exam, the client automatically reconnects and a new segment chain begins (with a gap flag for auditors).
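
A sketch of the Redis session registry, assuming the redis-py client; the key names and TTL are illustrative assumptions:

```python
# Session registry sketch: session_id -> media server node, kept in Redis.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

SESSION_TTL = 4 * 3600  # outlives the longest 3-hour exam with margin (assumed)

def register_session(session_id: str, media_node: str) -> None:
    # TTL guards against leaked entries if a session never closes cleanly
    r.set(f"media_route:{session_id}", media_node, ex=SESSION_TTL)

def route_for(session_id: str) -> str | None:
    return r.get(f"media_route:{session_id}")

def close_session(session_id: str) -> None:
    r.delete(f"media_route:{session_id}")
```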

AI Behavioral Analysis Engine

The analysis engine uses a lightweight face detection model (BlazeFace or RetinaFace, running at ~1ms/frame on GPU) combined with a gaze estimation model. Inference runs on 1 frame/second sampled from the video stream. The face detection model produces: face bounding boxes and count (for multiple-face detection), face landmarks (for liveness check — detecting photo spoofing), and head pose angles (for gaze deviation estimation). The screen analysis module takes screenshot captures every 10 seconds and runs a classifier for unauthorized applications (based on window title and visual similarity to known exam-banned apps). Models are deployed as TorchServe endpoints on GPU instances behind an internal load balancer.
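
A sketch of the per-frame rule layer that turns model outputs into detection events; the event names come from the design above, while the head-pose thresholds are illustrative assumptions:

```python
# Rule evaluation over face detector / head pose outputs for one sampled frame.
from dataclasses import dataclass

@dataclass
class FrameAnalysis:
    face_count: int
    yaw_deg: float    # head pose angles from the landmark model
    pitch_deg: float

MAX_YAW = 35.0    # assumed tolerance before flagging gaze deviation
MAX_PITCH = 25.0

def detect(frame: FrameAnalysis) -> list[str]:
    events = []
    if frame.face_count == 0:
        events.append("FaceNotVisible")
    elif frame.face_count > 1:
        events.append("MultipleFaces")
    elif abs(frame.yaw_deg) > MAX_YAW or abs(frame.pitch_deg) > MAX_PITCH:
        events.append("GazeDeviation")
    return events

print(detect(FrameAnalysis(face_count=1, yaw_deg=48.0, pitch_deg=5.0)))
# ['GazeDeviation']
```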

Identity Verification Service

At exam start, students complete an identity verification flow: (1) photograph their government ID using the webcam, (2) take a live selfie. The verification service runs OCR on the ID photo (extracting name and ID number) and a face similarity match between the ID photo and live selfie using a FaceNet embedding (cosine similarity > 0.85 threshold). The service also runs a liveness detection check (blink detection, head movement challenge) to prevent photo spoofing. Verification results are stored as a signed record (institution name, exam ID, student name, similarity score, timestamp) and linked to the exam session. In case of a verification failure, a fallback human review flow is triggered.
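
The embedding match reduces to a cosine-similarity check against the 0.85 threshold. A minimal sketch with NumPy, eliding the embedding extraction itself:

```python
# Face match sketch: cosine similarity between FaceNet embeddings of the
# ID photo and the live selfie, thresholded at 0.85 per the design above.
import numpy as np

MATCH_THRESHOLD = 0.85

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_identity(id_embedding: np.ndarray, selfie_embedding: np.ndarray) -> dict:
    score = cosine_similarity(id_embedding, selfie_embedding)
    return {"verified": score > MATCH_THRESHOLD, "similarity_score": round(score, 4)}

# Toy vectors in place of real 512-d FaceNet embeddings:
print(verify_identity(np.array([0.9, 0.1, 0.4]), np.array([0.85, 0.15, 0.45])))
# {'verified': True, 'similarity_score': 0.9963}
```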

Database Design

  • PostgreSQL: exam_sessions (session_id, student_id, exam_id, institution_id, started_at, ended_at, status, identity_verified, media_session_id), incidents (incident_id, session_id, type, severity, started_at, ended_at, frame_timestamps[], reviewed_by, resolution), proctor_actions (action_id, session_id, proctor_id, action_type, message, occurred_at)
  • S3: raw video chunks (organized by session_id/timestamp) and a master session manifest (a JSON file listing all chunks for a session)
  • Redis: session routing table (session_id → media server), active session count per institution, live thumbnail cache (base64 JPEG per session, updated every 5 seconds for the proctor grid)
  • ClickHouse: detection_events (session_id, frame_ts, detection_type, confidence, bounding_box_json): the full detection log for audit and model retraining

API Design

  • POST /sessions — body: {exam_id, student_id, institution_id}, initializes session, returns {session_id, media_server_url, webrtc_token}; called at exam start
  • POST /sessions/{session_id}/verify-identity — multipart upload: {id_photo, selfie_photo}, runs face match, returns {verified: true/false, similarity_score}
  • GET /proctoring/live — SSE or WebSocket stream of incident events across all active sessions for a proctor; filtered by institution, severity
  • GET /sessions/{session_id}/report — returns full incident report with video timestamp links; S3 pre-signed URLs for each flagged segment
  • DELETE /sessions/{session_id}/data — GDPR erasure: deletes video from S3, anonymizes DB records, returns deletion confirmation with audit certificate; see the erasure sketch below
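
A sketch of what the erasure endpoint might do server-side, assuming boto3 for S3 and a psycopg2-style DB-API connection; the bucket name, key prefix, and anonymization column are assumptions:

```python
# GDPR erasure sketch: purge all video chunks, then anonymize DB rows.
import boto3

BUCKET = "exam-recordings"  # assumed bucket name

def erase_session_data(s3, db_conn, session_id: str) -> int:
    """Delete every video chunk for a session and anonymize its DB records."""
    deleted = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"recordings/{session_id}/"):
        objects = [{"Key": o["Key"]} for o in page.get("Contents", [])]
        if objects:
            s3.delete_objects(Bucket=BUCKET, Delete={"Objects": objects})
            deleted += len(objects)
    with db_conn.cursor() as cur:
        cur.execute(
            "UPDATE exam_sessions SET student_id = NULL WHERE session_id = %s",
            (session_id,),
        )
    db_conn.commit()
    return deleted

# Usage (credentials and connection setup elided):
# erase_session_data(boto3.client("s3"), psycopg2.connect(...), "sess-42")
```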

Scaling & Bottlenecks

WebRTC media ingestion at 200k concurrent sessions is the hardest scaling challenge. Each Janus node consumes ~250 Mbps of ingest bandwidth (500 sessions × 500 Kbps, matching the scale estimate) and 16 CPU cores (for recording/muxing). A fleet of 400 Janus nodes across 5 regions handles the load. Session routing must be sticky (a session can't move mid-exam without interruption), so a consistent hash ring assigns sessions to nodes, with 10% over-provisioning for node failure. Auto-scaling cannot add nodes during an active exam window because existing sessions are pinned to their nodes, so the fleet must be pre-scaled before each exam begins.
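
A minimal consistent-hash-ring sketch for the sticky session-to-node assignment; the virtual-node count is an assumption, and the 400-node fleet follows the estimate above:

```python
# Consistent hash ring: sessions map to the same Janus node for their lifetime,
# and a node failure remaps only that node's sessions.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        self._ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, session_id: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(session_id)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing([f"janus-{i}" for i in range(400)])
print(ring.node_for("sess-42"))  # stable for the session's lifetime
```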

AI inference at 200k frames/second is GPU-bound. The 1 FPS sample rate keeps this at 200k inference calls/second, which 100 A100 GPUs handle. However, GPUs are expensive. Batching frames from 32 sessions into each GPU call (the model supports batch inference) cuts the load to 6,250 batched calls/second and raises per-GPU throughput enough to run on roughly 20 A100s. GPU spot instances reduce cost by ~60%, paired with a preemption-safe checkpoint mechanism that reprocesses any missed frames from S3.
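
A sketch of the cross-session batching idea, assuming PyTorch; the model here is a stand-in for the deployed detector, and the stacking assumes uniformly sized frames:

```python
# Cross-session frame batching: frames from up to 32 sessions are stacked
# into one tensor so each forward pass amortizes the GPU kernel launch.
import torch

BATCH_SIZE = 32  # sessions batched per GPU call, per the estimate above

def run_batched(model: torch.nn.Module, frames: list[torch.Tensor]) -> list[torch.Tensor]:
    """frames: one CHW tensor per session, sampled at 1 FPS."""
    results = []
    with torch.no_grad():
        for i in range(0, len(frames), BATCH_SIZE):
            batch = torch.stack(frames[i : i + BATCH_SIZE])  # (B, C, H, W)
            results.extend(model(batch))  # one forward pass per batch
    return results
```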

Key Trade-offs

  • Server-side recording vs. client-side recording: Server-side (WebRTC to media server) is tamper-proof but requires 100 Gbps of ingestion infrastructure; client-side recording (browser MediaRecorder API upload) is cheaper but students could potentially manipulate the recorded file. Server-side is required for legally defensible exams.
  • Real-time AI analysis vs. post-exam review: Real-time analysis enables proctors to intervene during the exam but requires expensive always-on GPU fleets; post-exam analysis is 10× cheaper (no real-time constraint) but misses the intervention window. A tiered approach (real-time for high-severity flags like multiple faces, post-exam for subtle behavioral patterns) balances cost and responsiveness.
  • Automated termination vs. human review: Automatically terminating exams on an AI flag alone would cause unjust outcomes given a 5% false-positive rate; all AI flags should queue for human review before any action. Full automation is appropriate only for extreme violations (e.g., another person detected taking over the exam).
  • Privacy vs. integrity: Continuous webcam surveillance is invasive; some jurisdictions restrict it. Offering a tiered proctoring model (no proctoring, AI-only, AI + human) lets institutions choose their compliance and privacy balance.
