System Design: Workflow Automation Engine (Zapier-style)
System design of a Zapier-style workflow automation engine covering DAG execution, webhook triggers, third-party API integration, and scalable task scheduling for millions of automated workflows.
Requirements
Functional Requirements:
- Users create multi-step workflows (Zaps) connecting trigger events to action sequences across 3,000+ third-party apps
- Trigger types: webhook-based (instant), polling-based (scheduled checks for new data), and schedule-based (cron)
- Action steps: API calls to third-party services with data transformation and field mapping between steps
- Conditional logic: filters (skip actions if conditions not met), branching (if/else paths), and loops (iterate over arrays)
- Error handling: automatic retries with exponential backoff, error notification, and fallback paths
- Workflow versioning, history logging, and replay of failed executions
Non-Functional Requirements:
- Support 10M active workflows across 2M users
- Process 500M workflow executions/day with P95 latency under 30 seconds per execution
- Webhook trigger processing within 1 second of receipt
- 99.95% availability for the execution engine
- Rate limiting per user and per connected app to prevent abuse and protect third-party API quotas
Scale Estimation
- Executions: 500M/day ≈ 5,787/sec; at an average of 4 steps each, 2B step executions/day ≈ 23,148 steps/sec
- Concurrency: each step involves a third-party API call averaging 500ms latency, so in-flight steps ≈ 23,148 × 0.5 = 11,574 concurrent outbound API calls
- Polling triggers: 5M workflows polled every 15 minutes ≈ 333K polls/minute ≈ 5,555/sec
- Webhook ingestion: 5M instant-trigger workflows × 50 events/day = 250M webhooks/day ≈ 2,894/sec
- Execution logs: 500M executions × 4 steps × 1KB per step log ≈ 2TB/day
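The arithmetic above can be checked with a short back-of-envelope script (all inputs are the figures stated in the requirements):

```python
# Back-of-envelope check of the traffic figures above.
SECONDS_PER_DAY = 86_400

executions_per_day = 500_000_000
steps_per_execution = 4
avg_step_latency_s = 0.5

executions_per_sec = executions_per_day / SECONDS_PER_DAY           # ~5,787
steps_per_sec = executions_per_sec * steps_per_execution            # ~23,148
concurrent_api_calls = steps_per_sec * avg_step_latency_s           # ~11,574

polling_workflows = 5_000_000
poll_interval_s = 15 * 60
polls_per_sec = polling_workflows / poll_interval_s                 # ~5,555

webhooks_per_day = 5_000_000 * 50
webhooks_per_sec = webhooks_per_day / SECONDS_PER_DAY               # ~2,894

log_bytes_per_day = executions_per_day * steps_per_execution * 1_024  # ~2 TB
```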
High-Level Architecture
The architecture has three planes: Trigger Plane, Execution Plane, and Integration Plane. The Trigger Plane handles event ingestion from three sources. Webhook Receivers accept HTTP callbacks from third-party services at dedicated per-workflow URLs (e.g., hooks.platform.com/{workflow_id}). Each incoming webhook is validated (signature verification using the app's webhook secret), parsed, and published to an execution queue (Kafka topic). Poll Workers run on a schedule, checking third-party APIs for new data. Each poll-based workflow has a last_polled cursor; the worker fetches new records since the cursor, and for each new record, publishes an execution event. Schedule Workers use a distributed cron system (backed by a time-wheel algorithm) to trigger workflows at configured intervals.
The Execution Plane processes workflow executions as DAGs. When an execution event arrives, the DAG Executor retrieves the workflow definition (a graph of steps with edges representing data flow and control flow), initializes the execution context (trigger data), and begins step execution. Steps are executed in topological order respecting dependencies. Each step is dispatched to a Step Worker that: (1) resolves data mappings (field references from previous steps using JSONPath expressions), (2) calls the third-party API via the Integration Plane, (3) stores the step result, and (4) evaluates any post-step conditions (filters, branches). The executor handles parallelism — independent steps (no data dependencies) run concurrently.
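Step (1), resolving field mappings, can be sketched as a small template resolver; the `{{step.field}}` placeholder syntax here is an illustrative assumption standing in for the JSONPath expressions the text describes:

```python
import re

# Hypothetical field-mapping resolver: placeholders like {{trigger.email}}
# are replaced with values pulled from the execution context (trigger data
# plus previous steps' outputs).
PLACEHOLDER = re.compile(r"\{\{([\w.]+)\}\}")

def resolve_mappings(template: dict, context: dict) -> dict:
    def lookup(path: str):
        node = context
        for key in path.split("."):
            node = node[key]
        return node

    def render(value):
        if isinstance(value, str):
            match = PLACEHOLDER.fullmatch(value.strip())
            if match:                      # whole-value reference: preserve type
                return lookup(match.group(1))
            return PLACEHOLDER.sub(lambda m: str(lookup(m.group(1))), value)
        if isinstance(value, dict):
            return {k: render(v) for k, v in value.items()}
        if isinstance(value, list):
            return [render(v) for v in value]
        return value

    return {k: render(v) for k, v in template.items()}
```

Whole-value references keep the original type (numbers stay numbers), while embedded references are stringified into the surrounding text.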
The Integration Plane abstracts third-party API interactions. Each supported app has an App Definition containing: authentication method (OAuth2, API key, basic auth), available triggers and actions, API endpoint templates, request/response schemas, and rate limit specifications. When a step worker executes an action, the Integration Plane resolves the app definition, injects the user's stored credentials, constructs the HTTP request, applies rate limiting, sends the request, and normalizes the response.
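A minimal sketch of an app definition and request construction follows; the field names, header conventions, and example URL are illustrative assumptions, not a real app's spec:

```python
from dataclasses import dataclass

@dataclass
class AppDefinition:
    name: str
    auth_type: str            # "oauth2" | "api_key" | "basic"
    base_url: str
    actions: dict             # action_name -> {"method": ..., "path": ...}

def build_request(app: AppDefinition, action: str,
                  credentials: dict, params: dict) -> dict:
    """Resolve the action's endpoint template and inject the user's credentials."""
    spec = app.actions[action]
    headers = {}
    if app.auth_type == "oauth2":
        headers["Authorization"] = f"Bearer {credentials['access_token']}"
    elif app.auth_type == "api_key":
        headers["X-API-Key"] = credentials["api_key"]
    return {
        "method": spec["method"],
        "url": app.base_url + spec["path"],
        "headers": headers,
        "json": params,
    }
```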
Core Components
DAG Executor
The DAG executor processes each workflow execution as a directed acyclic graph. The execution state machine tracks each step's status: pending, running, completed, failed, skipped. The executor uses a work-queue model: it pushes ready steps (all dependencies satisfied) to a per-execution priority queue. Step workers pull from this queue, execute the step, and report the result back to the executor. The executor then evaluates which downstream steps become ready. For workflows with conditional branches (if/else), the executor evaluates the branch predicate against step outputs and marks the non-taken branch's steps as skipped. Loops (for-each over arrays) are expanded at execution time into parallel step instances, one per array element, with results aggregated via a join step.
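The ready-step scheduling described above can be sketched with dependency counting; `run_step` stands in for dispatching to a step worker, and this synchronous version omits the parallelism, branching, and loop expansion:

```python
from collections import deque

# Minimal sketch of the work-queue model: a step becomes "ready" once all
# of its dependencies have completed.
def execute_dag(steps: dict[str, list[str]], run_step) -> list[str]:
    """steps maps step_id -> list of dependency step_ids; returns run order."""
    remaining = {s: set(deps) for s, deps in steps.items()}
    dependents: dict[str, list[str]] = {s: [] for s in steps}
    for step, deps in steps.items():
        for dep in deps:
            dependents[dep].append(step)

    ready = deque(s for s, deps in remaining.items() if not deps)
    order = []
    while ready:
        step = ready.popleft()
        run_step(step)                 # in production: push to the step-worker queue
        order.append(step)
        for child in dependents[step]:
            remaining[child].discard(step)
            if not remaining[child]:   # all dependencies satisfied
                ready.append(child)
    return order
```

In the real executor the `ready` queue is the per-execution priority queue, and multiple steps in it can run concurrently since they have no dependencies on each other.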
Webhook Ingestion at Scale
Webhook receivers handle 2,894 webhooks/sec across millions of unique endpoint URLs. The URL routing uses a hash-based mapping: the workflow_id in the URL directly identifies the workflow, eliminating a database lookup on the hot path. Signature verification (HMAC-SHA256 for most apps) ensures authenticity. The receiver performs minimal processing — validate, parse, publish to Kafka — and returns 200 OK within 200ms. Kafka partitioning by workflow_id ensures ordered processing of events for the same workflow. For apps with unreliable webhook delivery (no retries), the platform provides a webhook relay that proxies calls and adds retries.
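The hot-path signature check can be sketched as below; the hex encoding is an assumption, and real apps vary (some sign a timestamp plus body, some use base64):

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """HMAC-SHA256 check of the raw request body against the app's webhook secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # constant-time comparison to avoid leaking the signature via timing
    return hmac.compare_digest(expected, signature_header)
```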
Rate Limiting & Credential Management
Third-party API rate limits are a critical constraint. The system maintains per-app, per-user rate limit counters in Redis using the token bucket algorithm. When a step worker attempts an API call, it first acquires a token from the rate limiter. If the limit is exhausted, the step is requeued with a delay. Global app-level limits (e.g., Slack allows 1 message/sec per workspace) are enforced across all users sharing that workspace credential. User credentials (OAuth2 tokens, API keys) are stored in a Vault-backed credential store with encryption at rest. OAuth2 token refresh is handled transparently by the Integration Plane — when an API call returns 401, the system attempts a token refresh before marking the step as failed.
Database Design
PostgreSQL stores workflow definitions and user data: workflows (workflow_id UUID PK, user_id, name, trigger_config JSONB, steps JSONB, status ENUM(active, paused, draft, deleted), version INT, created_at, updated_at). The steps JSONB contains the DAG definition: [{step_id, app_id, action_type, field_mappings, conditions, next_steps}]. Credentials: user_app_credentials (credential_id, user_id, app_id, auth_type, encrypted_tokens JSONB, created_at, expires_at).
Execution logs are stored in a time-series-optimized store (TimescaleDB or ClickHouse): executions (execution_id, workflow_id, trigger_data JSONB, status, started_at, completed_at, error_message), step_executions (step_execution_id, execution_id, step_id, status, input JSONB, output JSONB, started_at, completed_at, duration_ms, error). Partitioned by started_at with automatic retention (30 days for free tier, 1 year for paid). Indexes on (workflow_id, started_at DESC) for execution history views. Polling cursors: poll_state (workflow_id PK, last_polled_at, cursor_value JSONB) — updated atomically after each successful poll to avoid reprocessing.
API Design
POST /api/v1/workflows — Create a workflow; body contains trigger config, steps DAG, field mappings; returns workflow_id
POST /hooks/{workflow_id} — Webhook endpoint; receives external events and triggers workflow execution
GET /api/v1/workflows/{workflow_id}/executions?limit=20&status=failed — Fetch execution history with filtering
POST /api/v1/executions/{execution_id}/replay — Replay a failed execution from the failed step with original or modified input
Scaling & Bottlenecks
Third-party API latency and rate limits are the dominant bottleneck. The average step takes 500ms due to external API latency, meaning the system sustains roughly 11,574 concurrent outbound connections. Step workers are designed for high concurrency using async I/O (Go goroutines or Node.js async/await), with each worker handling 1,000 concurrent connections — a fleet of roughly 12 workers covers steady-state load, with additional instances for peak headroom. When a third-party API is slow or down, circuit breakers prevent cascading failures — the circuit opens after 5 consecutive failures, and affected executions are paused with automatic retry after a cooldown period.
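The circuit-breaker behavior can be sketched per third-party app; the 5-failure threshold comes from the text, while the cooldown value is an illustrative assumption:

```python
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self, now: float) -> bool:
        """True if a call may proceed; paused executions retry after the cooldown."""
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_s:
            self.opened_at = None      # half-open: let a probe call through
            self.failures = 0
            return True
        return False

    def record(self, success: bool, now: float) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now   # open the circuit
```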
Polling trigger scalability is the second bottleneck. 5,555 polls/sec, each requiring an API call to check for new data, generates significant outbound traffic. The system optimizes by: (1) adaptive polling intervals — workflows with frequent events poll every 1 minute; rarely-triggered workflows poll every 15 minutes, (2) batch polling — checking multiple users' data in a single API call when the app supports it (e.g., fetching all new rows in a shared Google Sheet), and (3) encouraging webhook-based triggers over polling (webhooks are instant and cheaper). A polling scheduler distributes polls evenly across time to avoid thundering herd on the minute/hour boundary.
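One way to spread polls evenly — a sketch, not necessarily the scheduler the text describes — is deterministic jitter derived from the workflow id, so each workflow gets a stable offset within its polling interval instead of firing at the minute boundary:

```python
import hashlib

def next_poll_offset(workflow_id: str, interval_s: int) -> int:
    """Stable per-workflow offset in [0, interval_s) for scheduling its polls."""
    digest = hashlib.sha256(workflow_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % interval_s
```

Because the offset is a pure function of the workflow id, every scheduler instance computes the same schedule without coordination.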
Key Trade-offs
- DAG execution model vs sequential steps only: DAGs enable parallel execution of independent steps and complex branching logic, but add complexity to the executor and the workflow builder UI — the power-user benefit justifies the complexity
- Webhook-first vs polling-first trigger model: Webhooks are instant and efficient but require apps to support them and users to configure them — polling is universal but introduces latency (up to 15 minutes) and wastes API calls; the platform supports both with webhook as preferred
- Per-user credential isolation vs shared connections: Each user authenticates their own app connections, providing security isolation but requiring N OAuth flows for N users of the same app — shared organizational credentials are offered for team plans
- Execution log retention (30 days) vs infinite: Retaining all execution data enables full history replay but grows storage at 2TB/day — tiered retention (30 days hot, 1 year cold in S3) balances debugging needs with cost