System Design: Time Tracking System
System design of a time tracking system covering clock-in/out, timesheet management, overtime calculation, project-based billing, and integration with payroll for enterprise workforce management.
Requirements
Functional Requirements:
- Employees clock in/out via web app, mobile app, physical time clocks (kiosks), or biometric scanners
- Managers approve timesheets weekly with the ability to edit or flag entries
- Automatic overtime calculation based on jurisdiction-specific labor laws (daily OT, weekly OT, double-time)
- Project/task-based time allocation for client billing and internal cost tracking
- PTO (paid time off) tracking with accrual rules, balance management, and absence calendars
- Geofencing and GPS validation for mobile clock-ins (field workers must be on-site)
Non-Functional Requirements:
- Support 5M active employees across 20K companies
- Clock-in/out processed within 1 second with offline support (sync when connectivity returns)
- 99.95% availability; missed clock events mean incorrect pay
- Idempotent clock events (duplicate submissions from network retries must not create duplicate entries)
- Real-time dashboard showing who is currently clocked in across all locations
Scale Estimation
- Employees: 5M active across 20K companies
- Clock events: 2 per employee per day (in + out) = 10M events/day; with breaks and project switches, 6 events/day average = 30M events/day ≈ 347 events/sec
- Peak: shift changes at 8 AM, 3 PM, and 5 PM = 3x average ≈ 1,000 events/sec for 30-minute windows
- Timesheet records: 5M employees × 52 weeks = 260M timesheets/year; each timesheet is 7 daily entries × 500 bytes = 3.5KB, so 260M × 3.5KB ≈ 910GB/year
- GPS data: 1M field workers × 6 clock events × 100 bytes GPS = 600MB/day
- Real-time dashboard: 20K companies querying active clock-ins ≈ 200 queries/sec
High-Level Architecture
The system uses an event-sourced architecture where every clock event is an immutable fact. When an employee clocks in, the Clock Event Service receives the event (timestamp, employee_id, event_type, location, device_id) and validates it: it checks for duplicate submissions using an idempotency key (employee_id + event_type + timestamp rounded to the nearest minute), validates geofencing (if enabled, GPS coordinates must fall within the configured radius of a work site), and detects anomalies (a clock-in without a prior clock-out, impossible travel between locations). Valid events are appended to an event store (a Kafka topic partitioned by employee_id) and written to the primary database (PostgreSQL).
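A minimal sketch of the event shape and idempotency-key derivation described above; the field names and the SHA-256 hashing are illustrative assumptions, and the timestamp is truncated to the minute as a simplification of the minute-rounding:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional
import hashlib

@dataclass
class ClockEvent:
    employee_id: str
    event_type: str            # "clock_in", "clock_out", "break_start", "break_end"
    timestamp: datetime        # device-reported time in UTC
    device_id: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None

def idempotency_key(event: ClockEvent) -> str:
    """Employee + event type + timestamp truncated to the minute, so network
    retries of the same submission map to the same key."""
    minute = event.timestamp.replace(second=0, microsecond=0, tzinfo=timezone.utc)
    raw = f"{event.employee_id}:{event.event_type}:{minute.isoformat()}"
    return hashlib.sha256(raw.encode()).hexdigest()
```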
The Timesheet Aggregation Service consumes clock events and materializes weekly timesheets. Each timesheet summarizes: daily hours worked, breaks taken, overtime hours (calculated per applicable labor law), and project time allocations. The aggregation runs continuously, updating timesheets in near-real-time as new clock events arrive. Managers view pending timesheets via a dashboard and approve/reject/edit them. Approved timesheets are published to Kafka for consumption by payroll systems.
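A simplified sketch of how an aggregation worker might pair clock events into worked hours for a single day. It assumes a well-formed event sequence; the real service also has to handle missing clock-outs, shifts crossing midnight, and per-project splits:

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

def daily_worked_hours(events: Iterable[Tuple[str, datetime]]) -> float:
    """Pair clock_in/clock_out (and break_end/break_start) events, sorted by
    time, into worked hours for one day."""
    worked = timedelta()
    open_since = None
    for event_type, ts in sorted(events, key=lambda e: e[1]):
        if event_type in ("clock_in", "break_end"):
            open_since = ts
        elif event_type in ("clock_out", "break_start") and open_since is not None:
            worked += ts - open_since
            open_since = None
    return worked.total_seconds() / 3600
```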
The Real-Time Presence Service maintains a live view of who is currently clocked in across the organization. It uses Redis with a sorted set per company (employee_id scored by clock-in timestamp): clock-in events add the employee to the set, and clock-out events remove them. The dashboard queries this Redis structure for instant results. For companies with multiple locations, the presence data is additionally grouped by location_id.
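A sketch of the presence service's Redis operations using redis-py; the key naming scheme (presence:{company_id} and presence:{company_id}:{location_id}) is an assumption:

```python
import time
import redis  # assumes the redis-py client is installed

r = redis.Redis()

def on_clock_in(company_id: str, location_id: str, employee_id: str) -> None:
    # Sorted set per company (and per location), scored by clock-in time.
    now = time.time()
    r.zadd(f"presence:{company_id}", {employee_id: now})
    r.zadd(f"presence:{company_id}:{location_id}", {employee_id: now})

def on_clock_out(company_id: str, location_id: str, employee_id: str) -> None:
    r.zrem(f"presence:{company_id}", employee_id)
    r.zrem(f"presence:{company_id}:{location_id}", employee_id)

def currently_clocked_in(company_id: str):
    # All members, ordered by clock-in time (earliest first).
    return r.zrange(f"presence:{company_id}", 0, -1)
```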
Core Components
Clock Event Processing
The event processing pipeline handles 347 events/sec with spikes to 1,000/sec during shift changes. Each event goes through a validation pipeline: (1) Idempotency check — a Redis set stores recent event hashes (employee_id + event_type + minute-rounded timestamp) with a 24-hour TTL; duplicates are silently acknowledged. (2) Sequence validation — the system checks the employee's last event type; a clock-in following a clock-in triggers an alert (missed clock-out). (3) Geofence validation — for mobile clock-ins, GPS coordinates are checked against the employee's assigned work site polygon using PostGIS point-in-polygon queries. (4) Anomaly detection — travel time between consecutive events at different locations is compared against realistic driving time (Google Maps Distance Matrix API, cached). Anomalous events are flagged for manager review.
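A sketch of steps 1 and 2 of the pipeline. The text describes a Redis set of recent event hashes; the sketch below instead uses one Redis key per hash with SET NX and a 24-hour TTL, since a single set cannot expire individual members — the effect (a 24-hour dedup window) is the same. Key prefixes are assumptions:

```python
import redis

r = redis.Redis()
IDEMPOTENCY_TTL = 24 * 3600  # 24-hour dedup window

def is_duplicate(key: str) -> bool:
    """Step 1: atomic 'seen recently' check.
    SET NX returns a truthy value only if the key did not already exist."""
    first_seen = r.set(f"idem:{key}", 1, nx=True, ex=IDEMPOTENCY_TTL)
    return not first_seen

def violates_sequence(employee_id: str, event_type: str) -> bool:
    """Step 2: a clock_in following another clock_in indicates a missed
    clock-out; the event is flagged for manager review, not rejected."""
    last = r.getset(f"last_event:{employee_id}", event_type)
    return last == b"clock_in" and event_type == "clock_in"
```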
Overtime Calculation Engine
Overtime rules vary dramatically by jurisdiction. California requires daily overtime (time-and-a-half after 8 hours/day, double-time after 12 hours) and weekly overtime (after 40 hours). Federal FLSA only requires weekly overtime (after 40 hours). Some jurisdictions have consecutive day rules (7th consecutive day at overtime rate). The engine uses a rules framework: each jurisdiction has a rule module that receives daily hours and weekly cumulative hours, and returns the overtime classification for each worked hour. The engine evaluates rules on every timesheet update, ensuring overtime is calculated correctly even when timesheets are edited retroactively. Edge cases (employees working across state lines, shift workers crossing midnight) are handled by the jurisdiction resolver.
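A minimal sketch of two such rule modules, covering only the daily California split and the federal FLSA weekly rule described above; the consecutive-day rule, weekly stacking on top of daily overtime, and jurisdiction resolution are omitted:

```python
def classify_california_daily(hours_worked_today: float) -> dict:
    """California daily rule: regular up to 8h, time-and-a-half for hours 8-12,
    double-time beyond 12. Weekly (>40h) and 7th-consecutive-day rules would be
    applied by additional rule modules."""
    regular = min(hours_worked_today, 8.0)
    overtime = min(max(hours_worked_today - 8.0, 0.0), 4.0)
    double_time = max(hours_worked_today - 12.0, 0.0)
    return {"regular": regular, "overtime": overtime, "double_time": double_time}

def classify_flsa_weekly(weekly_hours: float) -> dict:
    """Federal FLSA: overtime only after 40 hours in a workweek."""
    regular = min(weekly_hours, 40.0)
    overtime = max(weekly_hours - 40.0, 0.0)
    return {"regular": regular, "overtime": overtime, "double_time": 0.0}
```

For example, a 14-hour day under the California module yields 8 regular, 4 overtime, and 2 double-time hours.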
Offline Sync for Mobile
Field workers in areas with poor connectivity need reliable offline clock-in. The mobile app stores clock events locally (SQLite) with a monotonic sequence number. When connectivity is restored, the sync engine sends queued events to the server in order. The server processes them idempotently, comparing the event's original timestamp against the current state. Conflicts (e.g., a manager edited the timesheet while the worker was offline) are resolved by the server — worker clock events take precedence over manual edits for time-of-day, but manager-approved totals are preserved. The sync protocol uses a last-synced-sequence-number cursor to avoid re-processing already synced events.
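A minimal sketch of the client-side sync loop, assuming a local SQLite table named clock_events with a monotonic seq column; send_batch stands in for the HTTP upload and is hypothetical:

```python
import sqlite3

def pending_events(db: sqlite3.Connection, last_synced_seq: int):
    """Return locally queued events newer than the server-acknowledged cursor."""
    return db.execute(
        "SELECT seq, event_type, timestamp FROM clock_events "
        "WHERE seq > ? ORDER BY seq",
        (last_synced_seq,),
    ).fetchall()

def sync(db: sqlite3.Connection, last_synced_seq: int, send_batch) -> int:
    """Send queued events in sequence order; advance the cursor only after the
    server acknowledges, so a failed upload is simply retried and the server
    dedupes via idempotency keys."""
    events = pending_events(db, last_synced_seq)
    if events:
        send_batch(events)                # hypothetical upload call
        last_synced_seq = events[-1][0]   # highest acknowledged sequence number
    return last_synced_seq
```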
Database Design
PostgreSQL schema: clock_events (event_id UUID PK, employee_id, company_id, event_type ENUM(clock_in, clock_out, break_start, break_end), timestamp TIMESTAMPTZ, device_type ENUM(web, mobile, kiosk, biometric), location GEOGRAPHY(Point), project_id nullable, idempotency_key VARCHAR UNIQUE, created_at). timesheets (timesheet_id, employee_id, company_id, week_start_date, status ENUM(draft, submitted, approved, rejected), daily_hours JSONB, total_regular_hours DECIMAL, total_overtime_hours DECIMAL, total_doubletime_hours DECIMAL, approved_by nullable, approved_at nullable). Indexes: (company_id, employee_id, timestamp DESC) for event queries, (company_id, week_start_date, status) for manager approval dashboards.
PTO balances are tracked in a separate table: pto_balances (employee_id, company_id, pto_type ENUM(vacation, sick, personal), balance_hours DECIMAL, accrual_rate DECIMAL, max_carryover DECIMAL, used_this_year DECIMAL). Accrual rules (e.g., 4 hours per bi-weekly pay period, max 160 hours carryover) are configured per company and evaluated by a scheduled job on each pay period boundary.
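A sketch of what the scheduled accrual job might apply per pay period; applying the carryover cap at the year boundary is an assumption, since the source only states that accrual rules are evaluated on pay period boundaries:

```python
def apply_accrual(balance_hours: float, accrual_rate: float, max_carryover: float,
                  is_year_boundary: bool = False) -> float:
    """Run once per pay period: add the configured accrual, then (assumption)
    cap the balance at the carryover limit when crossing the year boundary."""
    balance_hours += accrual_rate          # e.g. 4.0 hours per bi-weekly period
    if is_year_boundary:
        balance_hours = min(balance_hours, max_carryover)   # e.g. 160.0 hours
    return balance_hours
```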
API Design
- POST /api/v1/clock — Submit a clock event; body contains employee_id, event_type, timestamp, gps_coordinates, device_type; idempotent via the idempotency_key header
- GET /api/v1/timesheets?employee_id={id}&week={date} — Fetch the timesheet for a specific week
- PUT /api/v1/timesheets/{timesheet_id}/approve — Manager approves a timesheet; triggers payroll export
- GET /api/v1/presence?company_id={id}&location_id={id} — Fetch currently clocked-in employees (real-time)
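An illustrative clock-event submission; the host name, exact header spelling, and payload shape are assumptions based on the endpoint description above:

```python
import requests  # assumes the requests library

resp = requests.post(
    "https://api.example.com/api/v1/clock",
    headers={"Idempotency-Key": "sha256-of-employee-event-minute"},  # placeholder
    json={
        "employee_id": "emp_123",
        "event_type": "clock_in",
        "timestamp": "2024-03-04T08:01:32Z",
        "gps_coordinates": {"lat": 37.7749, "lng": -122.4194},
        "device_type": "mobile",
    },
    timeout=5,
)
resp.raise_for_status()  # a duplicate submission is acknowledged, not re-recorded
```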
Scaling & Bottlenecks
Shift-change spikes (1,000 events/sec) require the clock event API to scale horizontally. The stateless API servers behind a load balancer handle burst traffic. The idempotency check in Redis adds 1ms per request. The geofence validation (PostGIS query) adds 5ms but is only evaluated for mobile clock-ins (30% of events). The system processes clock events asynchronously: the API immediately writes to Kafka and returns success; timesheet aggregation happens downstream. This ensures clock-in latency stays under 200ms even during spikes.
The timesheet aggregation service processes 30M events/day. Using Kafka consumer groups with 10 partitions (partitioned by employee_id), 10 consumers process events in parallel. Each consumer maintains in-memory state for the current week's timesheet per employee and flushes to PostgreSQL every 5 seconds. The overtime calculation engine adds computational overhead for complex jurisdictions — cached rule evaluations (jurisdiction + hours worked → overtime classification) reduce repeated calculations.
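A sketch of one aggregation consumer using kafka-python; the topic name, group id, and flush_to_postgres helper are assumptions, and the flush here only triggers as messages arrive:

```python
import json
import time
from collections import defaultdict
from kafka import KafkaConsumer  # assumes kafka-python is installed

def flush_to_postgres(state: dict) -> None:
    """Hypothetical: upsert the aggregated weekly timesheets (not shown)."""
    ...

consumer = KafkaConsumer(
    "clock-events",                       # topic name is an assumption
    bootstrap_servers="localhost:9092",
    group_id="timesheet-aggregation",     # 10 consumers share the 10 partitions
    value_deserializer=json.loads,
)

pending = defaultdict(list)   # employee_id -> events for the current week
last_flush = time.time()

for msg in consumer:
    event = msg.value
    pending[event["employee_id"]].append(event)
    if time.time() - last_flush >= 5:     # flush in-memory state every 5 seconds
        flush_to_postgres(pending)
        pending.clear()
        last_flush = time.time()
```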
Key Trade-offs
- Event sourcing vs CRUD for clock events: Event sourcing provides a complete, immutable audit trail of all clock events (critical for labor law compliance) but makes corrections more complex — corrections are modeled as new events (adjustment events) rather than edits to existing records
- Real-time overtime calculation vs end-of-week batch: Real-time calculation lets employees and managers see overtime accruing throughout the week, enabling proactive scheduling — the computational cost is justified by the labor cost savings from better overtime management
- GPS geofencing vs no location validation: Geofencing prevents time theft for field workers but raises privacy concerns and requires GPS permission — configurable per company/role, defaulting to off for office workers
- Synchronous clock event processing vs async: Async processing (immediate ACK, background aggregation) provides sub-200ms clock-in response but means the timesheet view may be 5-10 seconds behind the latest event — acceptable for this use case