System Design: Student Progress Tracking System
Design a scalable student progress tracking system that aggregates learning activities across courses, assessments, and platforms into a unified learner profile with real-time dashboards and predictive at-risk alerts.
Requirements
Functional Requirements:
- Aggregate learner activity events from multiple source systems: LMS, video platform, quiz engine, coding judge
- Compute per-learner progress metrics: course completion %, assignment grades, time-on-task, quiz scores
- Real-time dashboards for instructors showing per-student and per-cohort progress
- At-risk student identification: flag students below grade threshold or with insufficient engagement
- Learning path recommendations based on progress and performance gaps
- Progress history and audit trail for accreditation and compliance reporting
Non-Functional Requirements:
- Ingest 1 million learning events per minute from all integrated platforms
- Instructor dashboards must reflect new events within 30 seconds (near real-time)
- Support 5 million concurrent learners with personalized progress state
- Historical data retained for 7 years for compliance
- Predictive at-risk model must have >80% precision to avoid alert fatigue
Scale Estimation
- Throughput: 1M events/minute = ~16,667 events/second average; a 10× burst factor during exam periods gives ~167k events/second.
- Bandwidth: each event averages 500 bytes (learner ID, activity type, timestamp, metadata), so steady-state ingestion is ~8 MB/second and burst is ~83 MB/second.
- Storage: 7 years of events = 1M events/min × 525,600 min/year × 7 years × 500 bytes ≈ 1.8 PB of raw event storage. A columnar store with compression (Parquet in S3 achieves ~10× compression) reduces this to ~180 TB.
- Hot state: per-learner progress state is 5M learners × 10 KB average = 50 GB, which fits in a well-sized Redis cluster.
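The estimates above can be checked with a quick back-of-the-envelope script. All constants are taken from the stated requirements (1M events/min, 500 B/event, 10× burst, 7-year retention, ~10× Parquet compression):

```python
# Back-of-the-envelope verification of the scale estimation.
EVENTS_PER_MIN = 1_000_000
BYTES_PER_EVENT = 500
BURST_FACTOR = 10
MIN_PER_YEAR = 525_600
YEARS = 7
LEARNERS = 5_000_000
STATE_BYTES = 10_000          # 10 KB average per-learner state
COMPRESSION = 10              # assumed Parquet + zstd ratio

events_per_sec = EVENTS_PER_MIN / 60                          # ~16,667
burst_eps = events_per_sec * BURST_FACTOR                     # ~166,667
steady_mb_s = events_per_sec * BYTES_PER_EVENT / 1e6          # ~8.3 MB/s
burst_mb_s = burst_eps * BYTES_PER_EVENT / 1e6                # ~83 MB/s
raw_pb = EVENTS_PER_MIN * MIN_PER_YEAR * YEARS * BYTES_PER_EVENT / 1e15  # ~1.8 PB
compressed_tb = raw_pb * 1000 / COMPRESSION                   # ~184 TB
hot_state_gb = LEARNERS * STATE_BYTES / 1e9                   # 50 GB
```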
High-Level Architecture
The system is built around a Lambda Architecture pattern: a speed layer for real-time instructor dashboards and a batch layer for compliance reporting and ML model training. The speed layer uses Kafka for event ingestion, Flink for stream processing, and Redis for serving computed metrics. The batch layer uses S3 (Parquet) for durable event storage, Spark for historical aggregations, and Redshift for analytics queries.
Event sources (LMS, video platform, quiz engine) publish events to Kafka via an event gateway that validates schemas (using Avro schema registry), enriches events with canonical learner and course identifiers (resolving source-system IDs to platform UUIDs), and routes to the appropriate Kafka topic by event type. This normalization step is critical for cross-platform aggregation — a student completing a video lesson in Platform A and a quiz in Platform B must map to the same learner profile.
Flink stream processors consume from Kafka and maintain per-learner aggregate state (completion percentages, running grade averages, engagement scores) in RocksDB state backends. Every 10 seconds, Flink flushes updated learner state to Redis. Instructor dashboards read from Redis — sub-millisecond reads for any learner's current progress. The at-risk detection model runs as a Flink operator: it evaluates a logistic regression model per-learner on each state update and publishes alert events to a notification queue when the risk score crosses a threshold.
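The keyed-state-plus-periodic-flush pattern the Flink job implements can be sketched without the framework. This is a minimal, framework-free analogue, with a plain dict standing in for RocksDB keyed state and a `redis_write` callback standing in for the Redis sink:

```python
import time

FLUSH_INTERVAL_S = 10  # matches the 10-second flush cadence described above

class LearnerStateAggregator:
    """Per-learner aggregate state with interval flushing of dirty entries."""

    def __init__(self, redis_write):
        self.state = {}           # learner_id -> aggregate dict (RocksDB analogue)
        self.dirty = set()        # learners updated since the last flush
        self.redis_write = redis_write
        self.last_flush = time.monotonic()

    def on_event(self, event):
        s = self.state.setdefault(
            event["learner_id"], {"events": 0, "grade_sum": 0.0, "graded": 0}
        )
        s["events"] += 1
        if "grade" in event:                  # assessment events carry a grade
            s["grade_sum"] += event["grade"]
            s["graded"] += 1
        self.dirty.add(event["learner_id"])
        if time.monotonic() - self.last_flush >= FLUSH_INTERVAL_S:
            self.flush()

    def flush(self):
        # Only write learners whose state actually changed since the last flush.
        for lid in self.dirty:
            s = self.state[lid]
            avg = s["grade_sum"] / s["graded"] if s["graded"] else None
            self.redis_write(lid, {"events": s["events"], "grade_avg": avg})
        self.dirty.clear()
        self.last_flush = time.monotonic()
```

Tracking a dirty set means a burst of events for one learner still costs only one Redis write per flush interval, which is why the serving-store write rate stays well below the raw event rate.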
Core Components
Event Ingestion Gateway
The event gateway is a horizontally scaled API service (50+ nodes) that receives events from source systems via a REST or Kafka-compatible producer API. Each incoming event is validated against a versioned Avro schema (e.g., VideoWatchEvent v2, QuizSubmitEvent v3) stored in Confluent Schema Registry. Invalid events are routed to a dead-letter queue with the validation error for debugging. Valid events are enriched with canonical IDs from a Redis-cached identity mapping table and published to Kafka. The gateway is the single ingestion choke point — it runs at 167k events/second burst and is designed with backpressure signaling to source systems to prevent Kafka producer overload.
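The gateway's validate, enrich, and route path can be sketched as follows. Schema checking is reduced here to required-field validation (a production gateway would validate against the versioned Avro schemas in the registry), and the topic map and field names are illustrative:

```python
# Hypothetical sketch of the gateway's validate -> enrich -> route pipeline.
REQUIRED = {"event_type", "learner_id", "source_system", "payload", "occurred_at"}
TOPIC_BY_TYPE = {
    "video_watch": "engagement-events",
    "quiz_submit": "assessment-events",
    "login": "system-events",
}

def handle_event(event, identity_cache, publish, dead_letter):
    """Validate, enrich with canonical IDs, and route one incoming event.

    identity_cache maps (source_system, source_learner_id) -> platform UUID,
    standing in for the Redis-cached identity mapping table.
    """
    missing = REQUIRED - event.keys()
    if missing:
        # Route invalid events to the DLQ with the validation error attached.
        dead_letter(event, f"missing fields: {sorted(missing)}")
        return False
    canonical = identity_cache.get((event["source_system"], event["learner_id"]))
    if canonical is None:
        dead_letter(event, "unknown learner identity")
        return False
    # Enrich: replace the source-system ID with the canonical platform UUID,
    # so events from different platforms land on the same learner profile.
    enriched = {**event, "learner_id": canonical}
    publish(TOPIC_BY_TYPE.get(enriched["event_type"], "system-events"), enriched)
    return True
```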
Stream Processing Engine
Flink jobs consume from Kafka topic partitions. Each job is scoped to an event type category: engagement events (video watches, page views), assessment events (quiz scores, assignment grades), and system events (logins, course enrollments). Each job maintains a KeyedState store in RocksDB, keyed by learner_id. The engagement job computes a rolling 7-day engagement score (weighted sum of activity counts by type) and a session-level time-on-task metric. The assessment job maintains a running weighted average GPA per course. Both jobs emit updated LearnerState snapshots to an output Kafka topic, which a Redis sink connector applies to the in-memory learner state hash.
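The rolling 7-day engagement score (a weighted sum of activity counts by type) reduces to a few lines. The weights below are hypothetical placeholders; the real job would tune them:

```python
from datetime import datetime, timedelta

# Illustrative per-activity weights (not taken from the design).
WEIGHTS = {"video_watch": 1.0, "page_view": 0.2,
           "quiz_submit": 2.0, "assignment_submit": 3.0}
WINDOW = timedelta(days=7)

def engagement_score(events, now):
    """Weighted sum of activity counts within the trailing 7-day window.

    events: iterable of (timestamp, activity_type) tuples.
    """
    cutoff = now - WINDOW
    return sum(WEIGHTS.get(kind, 0.0) for ts, kind in events if ts >= cutoff)
```

In the streaming job this would be maintained incrementally in keyed state (adding new events, expiring old ones) rather than recomputed over the full list on each update.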
At-Risk Prediction Service
The at-risk model is a logistic regression trained on 3 years of historical data, with features: days since last login, current completion %, current grade average, engagement score trend (7-day slope), and assignment submission timeliness rate. The model is retrained weekly using Spark MLlib on the Redshift data warehouse. The trained model coefficients are published to a feature store (Redis), and the Flink at-risk operator applies the model in-stream per learner state update. When risk score > 0.7, an alert event is published to the notification topic, triggering an email to the instructor (batched to avoid notification flooding — max 1 alert per student per 24 hours per instructor).
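Applying the logistic regression in-stream is cheap: one dot product and a sigmoid per state update. The coefficients below are made-up placeholders; in the design they are retrained weekly and published to the feature store:

```python
import math

# Placeholder coefficients over the five features named above (illustrative only).
COEFFS = {
    "days_since_login": 0.15,
    "completion_pct": -0.03,
    "grade_avg": -0.02,
    "engagement_slope": -0.5,
    "timeliness_rate": -1.2,
}
INTERCEPT = 1.0
RISK_THRESHOLD = 0.7  # alert threshold from the design

def risk_score(features):
    """Logistic regression: sigmoid of the linear combination of features."""
    z = INTERCEPT + sum(COEFFS[k] * features[k] for k in COEFFS)
    return 1.0 / (1.0 + math.exp(-z))

def should_alert(features):
    return risk_score(features) > RISK_THRESHOLD
```

Note that `should_alert` only gates model output; the 24-hour per-student, per-instructor dedup described above would be enforced downstream against the alert_log table.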
Database Design
Kafka (durable event log, 30-day retention): topics engagement-events, assessment-events, system-events. S3 (long-term archive): Parquet files partitioned by event_type / year / month / day, compressed with zstd, retained 7 years — queryable via Athena or Spark. Redis: learner:{learner_id}:state (JSON hash, current progress snapshot), learner:{learner_id}:risk (float, current risk score), dashboard:course:{course_id}:cohort (sorted set of learner_ids by risk score for instructor dashboard). PostgreSQL: learner_profiles (canonical learner records), course_enrollments, alert_log (audit of sent alerts). Redshift: fact_events (all events, denormalized), dim_learners, dim_courses — for compliance reporting and ML training.
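Since both the Flink sink and the dashboard API must agree on the Redis key scheme above, it helps to centralize key construction in shared helpers rather than formatting keys ad hoc at each call site (a common source of drift). A minimal sketch:

```python
# Shared key builders for the Redis schema described above.
def learner_state_key(learner_id):
    """Hash holding the learner's current progress snapshot."""
    return f"learner:{learner_id}:state"

def learner_risk_key(learner_id):
    """Float: the learner's current at-risk score."""
    return f"learner:{learner_id}:risk"

def cohort_key(course_id):
    """Sorted set of learner_ids scored by risk, for the instructor dashboard."""
    return f"dashboard:course:{course_id}:cohort"
```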
API Design
POST /events — body: {event_type, learner_id, source_system, payload, occurred_at}. Ingests the event, validates its schema, publishes to Kafka; returns 202 Accepted.
GET /learners/{learner_id}/progress — returns the current progress snapshot from Redis: {courses: [{course_id, completion_pct, grade, last_activity_at}], engagement_score, risk_score}.
GET /courses/{course_id}/dashboard — returns the instructor view: a sorted list of enrolled learners with progress metrics and risk flags; served from the Redis cohort sorted set.
GET /reports/compliance?learner_id={id}&from={date}&to={date} — queries S3 via Athena for the full activity audit trail; asynchronous: returns a job_id, with results fetched via polling.
Scaling & Bottlenecks
Kafka partition count is the primary throughput lever. At 167k events/second burst, with each Kafka partition handling ~5k events/second, 34 partitions per topic are needed for burst. In practice, over-provision to 64 partitions per topic with 3× replication. Flink parallelism matches Kafka partition count — 64 parallel task slots per job. RocksDB state backends handle 5M keyed learner states; checkpoint intervals of 60 seconds to HDFS/S3 allow recovery with <1 minute of reprocessing on job restart.
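The partition arithmetic is worth sanity-checking, since parallelism downstream (Flink task slots) is pinned to it. Assumptions as stated: ~5k events/second per partition and a 10× burst over the 16,667/second average:

```python
import math

# Partition sizing from the stated assumptions.
avg_eps = 16_667
burst_eps = avg_eps * 10                              # ~166,670 events/s
per_partition_eps = 5_000                             # assumed per-partition throughput
needed = math.ceil(burst_eps / per_partition_eps)     # minimum partitions for burst
provisioned = 64                                      # over-provisioned per the design
headroom = provisioned * per_partition_eps / burst_eps  # ~1.9x burst capacity
```

Over-provisioning to 64 matters because adding partitions later reshuffles key-to-partition assignment, which disturbs keyed state in the consuming Flink jobs.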
Redis write throughput: Flink emits ~100k state updates/second during burst (only ~60% of events actually change learner state). Redis handles this comfortably on a 16-node cluster. The instructor dashboard query for a 500-student course reads 500 Redis hash entries — a single pipelined call completing in <5 ms. For very large courses (5,000 students), use a pre-computed sorted set for the dashboard rather than individual lookups.
Key Trade-offs
- Lambda architecture vs. Kappa: Lambda (batch + speed layer) provides the most accurate batch aggregations and simplest recovery from stream processing bugs, but doubles the codebase; Kappa (stream only, reprocess from Kafka) is simpler but requires Kafka retention long enough to reprocess — impractical at 7-year compliance requirements.
- Real-time at-risk vs. nightly batch: Real-time detection (Flink) catches disengagement within 30 seconds but runs a simpler model; nightly batch (Spark on full history) supports richer features (longitudinal patterns) with higher accuracy. A tiered approach — real-time for urgent signals, batch for weekly intervention planning — captures both.
- Push vs. pull for instructor dashboards: Push (WebSocket) delivers updates sub-second but requires persistent connections for every instructor; pull (polling every 30s) is simpler and scales better given instructors check dashboards infrequently.
- Centralized event schema registry vs. per-source schemas: A canonical schema registry forces source systems to conform but adds integration work; accepting heterogeneous events and normalizing at the gateway is more flexible but creates schema drift risk.