System Design: Digital Media Platform

Requirements

Functional Requirements:

Multi-brand content management: a single platform hosts multiple media brands with independent editorial workflows
Publish articles, videos, interactive graphics, and live blogs to web, mobile, and third-party platforms (Apple News, Google AMP)
Dynamic paywall: meter article reads, enforce subscription limits, and upsell to paid plans
Personalization: serve different content rankings and recommendations per reader based on interest profile
Real-time analytics: live article view counts, trending content, and audience segmentation for editorial teams
SEO: structured data (JSON-LD), canonical URLs, sitemap generation, and fast page loads (Largest Contentful Paint under 2s, inside the Core Web Vitals "good" threshold of 2.5s)
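
To make the structured-data requirement concrete, here is a minimal sketch of building a schema.org NewsArticle JSON-LD payload from an article record. The input field names (headline, author, published_at, canonical_url) and the example values are assumptions for illustration, not the platform's actual content model.

```python
import json
from datetime import datetime, timezone


def build_json_ld(article: dict) -> str:
    """Serialize a schema.org NewsArticle description as a JSON-LD string.

    The result is meant to be embedded in a
    <script type="application/ld+json"> tag in the rendered page.
    """
    payload = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": article["headline"],
        "datePublished": article["published_at"].isoformat(),
        "author": [{"@type": "Person", "name": article["author"]}],
        "mainEntityOfPage": article["canonical_url"],
    }
    return json.dumps(payload, ensure_ascii=False)


# Hypothetical article record, used only to exercise the function.
example = {
    "headline": "Example headline",
    "author": "A. Writer",
    "published_at": datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    "canonical_url": "https://example.com/news/2024/01/15/example-headline",
}
json_ld = build_json_ld(example)
```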

Non-Functional Requirements:

Support 500 million monthly page views across all brands
Article publish-to-live latency under 30 seconds
Paywall enforcement latency under 100ms (must not add visible delay to page loads)
CMS handles 1,000 concurrent editorial users across all brands
99.99% uptime for article serving (revenue-critical)

Scale Estimation

Page views: 500M monthly page views / 30 days / 86,400 seconds ≈ 193 requests/second on average; assuming a 10× peak factor, about 1,930 requests/second at peak. With a 95% CDN hit rate, the origin serves roughly 100 requests/second at peak, which is easily handled. Article publishing: 1,000 editorial users × 10 publishes/day = 10,000 publishes/day ≈ 0.12 publishes/second, trivial for the CMS API. Paywall checks: every page view triggers a paywall check (is the user subscribed? how many free articles have they used?), so 193 checks/second on average and 1,930 checks/second at peak. To meet the 100ms latency target, the paywall must serve from in-memory state (Redis), not a DB query per request. Personalization: 193 requests/second on average (1,930 at peak), each triggering a ranking call served from a Redis-cached interest profile per user.
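
The arithmetic above can be reproduced with a short script. The inputs are the figures stated in the requirements; the 10× peak factor and 95% CDN hit rate are the assumptions the estimate already makes. Because the script multiplies the unrounded average, its peak figure comes out one request lower than the rounded 193 × 10 used in the text.

```python
SECONDS_PER_DAY = 86_400

monthly_page_views = 500_000_000
peak_factor = 10          # assumed peak-to-average ratio
cdn_hit_rate = 0.95       # assumed CDN hit rate

# Reader traffic
avg_rps = monthly_page_views / 30 / SECONDS_PER_DAY   # about 193
peak_rps = avg_rps * peak_factor                      # about 1,930
origin_peak_rps = peak_rps * (1 - cdn_hit_rate)       # about 96, i.e. ~100

# Editorial traffic
publishes_per_day = 1_000 * 10
publishes_per_second = publishes_per_day / SECONDS_PER_DAY  # about 0.12

print(f"average: {avg_rps:.0f} req/s, peak: {peak_rps:.0f} req/s")
print(f"origin at peak: {origin_peak_rps:.0f} req/s")
print(f"publishes: {publishes_per_second:.2f} per second")
```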

High-Level Architecture

The platform separates the content management system (CMS), the content delivery system (CDS), and the audience system (paywall + personalization) into independent services. This allows the editorial team's publishing workflow to scale independently from reader traffic, and the audience system to be updated without affecting content delivery.

CMS: a headless CMS (custom-built or a platform like Contentful/Sanity) stores article content as structured JSON documents. Articles have a rich content model: headline, byline, body (block-based rich text), primary media, tags, section, brand, SEO metadata, publish schedule, and distribution channels. Editorial workflow: draft → reviewed → scheduled → published → archived. Publishing triggers a webhook to the CDS, which generates and caches the rendered HTML.
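
The editorial workflow can be modeled as a small state machine. The sketch below encodes the five states named above; the extra transition that sends a reviewed article back to draft is an assumption added for illustration, as are the names Status, ALLOWED, and transition.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    REVIEWED = "reviewed"
    SCHEDULED = "scheduled"
    PUBLISHED = "published"
    ARCHIVED = "archived"


# Allowed transitions. The forward chain follows the workflow in the text;
# REVIEWED -> DRAFT (editor requests changes) is an assumption.
ALLOWED = {
    Status.DRAFT: {Status.REVIEWED},
    Status.REVIEWED: {Status.SCHEDULED, Status.DRAFT},
    Status.SCHEDULED: {Status.PUBLISHED},
    Status.PUBLISHED: {Status.ARCHIVED},
    Status.ARCHIVED: set(),
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Rejecting illegal transitions in one place keeps a draft from being published without passing through review and scheduling.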

Content Delivery: articles are rendered server-side (Next.js SSR or a custom template engine) and cached at multiple layers: CDN edge (CloudFront, 30-minute TTL), origin cache (Varnish, 5-minute TTL), and application memory cache (60-second TTL for hot articles). Article URLs follow SEO-friendly patterns (/section/year/month/day/slug). AMP and Apple News versions are generated automatically from the article JSON and served at separate endpoints. Sitemaps are regenerated every 5 minutes and submitted to Google Search Console.
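
As an illustration of the /section/year/month/day/slug URL pattern, the following sketch derives a path from a headline and publish date. The slugify rule (lowercase, non-alphanumeric runs collapsed to hyphens) is an assumption; a real CMS would typically store an editor-approved slug.

```python
import re
from datetime import date


def slugify(headline: str) -> str:
    """Lowercase the headline and collapse non-alphanumeric runs into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", headline.lower())
    return slug.strip("-")


def article_path(section: str, published: date, headline: str) -> str:
    """Build a path of the form /section/year/month/day/slug."""
    return f"/{section}/{published:%Y/%m/%d}/{slugify(headline)}"


path = article_path("sport", date(2024, 3, 5), "City Wins the Cup, 2-1!")
```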

Audience system: the paywall service checks subscription status from a Stripe/subscription DB (Redis-cached per user session, 15-minute TTL) and tracks metered article reads in Redis (counter per user, reset monthly). Personalization: a lightweight interest profile per user (JSON map of topic → affinity score) is stored in Redis, updated via a Kafka consumer processing click events. Homepage and section page rankings are computed by a scoring function (recency × relevance × personalization score) running in the application tier for each request, with article metadata cached in Redis.
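
A minimal sketch of the recency × relevance × personalization scoring function follows. The text does not define the individual terms, so the choices here are assumptions: recency is an exponential decay with a 6-hour half-life, relevance is an editorial weight carried on the article, and personalization is the mean topic affinity across the article's tags with a small floor so that unfamiliar topics are not zeroed out.

```python
def recency_score(age_hours: float, half_life_hours: float = 6.0) -> float:
    """Exponential decay: the score halves every half_life_hours."""
    return 0.5 ** (age_hours / half_life_hours)


def personalization_score(tags: list, interests: dict, floor: float = 0.1) -> float:
    """Mean affinity over the article's tags, never below the floor."""
    if not tags:
        return floor
    affinity = sum(interests.get(tag, 0.0) for tag in tags) / len(tags)
    return max(floor, affinity)


def rank(articles: list, interests: dict) -> list:
    """Sort articles by recency x relevance x personalization, best first."""
    def score(article: dict) -> float:
        return (
            recency_score(article["age_hours"])
            * article["relevance"]
            * personalization_score(article["tags"], interests)
        )
    return sorted(articles, key=score, reverse=True)
```

Because the three factors are multiplied, the floor matters: without it, any article outside the reader's known interests would score zero regardless of how recent or important it is.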

Core Components

Headless CMS

The CMS stores content in a document database (MongoDB or PostgreSQL with JSONB). Each article document contains all content variants: web HTML, AMP, Apple News JSON, and social sharing snippets — generated at publish time, not at read time. The CMS API provides: a GraphQL editorial API (for the editor UI, allowing complex queries like "fetch all articles in Sport section scheduled for next 24 hours") and a REST content delivery API (for the website and apps, optimized for high-read throughput). Editorial users authenticate via OAuth 2.0 with role-based permissions: writer (create/edit own articles), editor (review and publish all articles), and administrator (manage brand settings).
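
The role model can be sketched as a permission table plus an ownership check. The role names come from the text; the action names and the is_allowed helper are hypothetical. Whether administrators also inherit editor rights is not stated, so this sketch grants them only brand management.

```python
# Role names are from the text; action names are hypothetical.
ROLE_ACTIONS = {
    "writer": {"create", "edit_own"},
    "editor": {"create", "edit_own", "edit_any", "review", "publish"},
    "administrator": {"manage_brand"},
}


def is_allowed(role, action, *, user_id=None, author_id=None):
    """Check whether a role may perform an action.

    For "edit", writers are limited to their own articles, so the caller
    passes the acting user's id and the article's author id.
    """
    actions = ROLE_ACTIONS.get(role, set())
    if action == "edit":
        if "edit_any" in actions:
            return True
        return "edit_own" in actions and user_id is not None and user_id == author_id
    return action in actions
```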

Paywall Enforcement Service

The paywall service is injected into every page request via a JavaScript tag (client-side check) backed by a server-side API call. Architecture: (1) the client loads the article page (CDN-cached HTML with content truncated beyond the paywall); (2) JavaScript calls the paywall API with the user's session token; (3) the paywall API checks Redis for the user's subscription status (TTL 15 min, refreshed from Stripe), the user's meter count (incremented the first time the user reads a given article), and the article's free/metered/premium classification; (4) if the user is subscribed, the article is free, or the article is metered and the user is still under the monthly limit, the API returns a full content token; otherwise it returns a blocked response and the client shows the upsell; (5) on success, client JavaScript replaces the truncated content with the full text. The check completes end-to-end in <100ms because every lookup is a Redis read. Premium brands use a hard paywall (no free articles); general news brands use a metered paywall (5 free articles/month).
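
The decision logic in steps (3) and (4) can be sketched as follows. Plain Python sets and dicts stand in for Redis so the example runs on its own; the class and field names are hypothetical. The access values mirror the full|metered|blocked vocabulary of the paywall API, with "metered" meaning access granted against the free allowance.

```python
from dataclasses import dataclass

FREE_LIMIT = 5  # free metered articles per month, per the text


@dataclass
class Decision:
    access: str          # "full", "metered", or "blocked"
    remaining_free: int


class PaywallService:
    """In-memory stand-in for the Redis-backed paywall checks."""

    def __init__(self, free_limit: int = FREE_LIMIT):
        self.free_limit = free_limit
        self.subscribed = set()      # user ids with an active subscription
        self.read_articles = {}      # user id -> article ids read this month

    def check(self, user_id: str, article_id: str, classification: str) -> Decision:
        reads = self.read_articles.setdefault(user_id, set())
        remaining = max(0, self.free_limit - len(reads))
        if user_id in self.subscribed or classification == "free":
            return Decision("full", remaining)
        if classification == "premium":
            return Decision("blocked", remaining)
        # Metered article: re-reading an already counted article is not charged again.
        if article_id in reads:
            return Decision("metered", remaining)
        if remaining == 0:
            return Decision("blocked", 0)
        reads.add(article_id)
        return Decision("metered", remaining - 1)
```

Tracking the set of articles read, rather than a bare counter, is one way to implement "incremented on first read"; a production version would need an equivalent Redis structure.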

Real-Time Editorial Analytics

Editorial teams see live view counts and engagement metrics for published articles. Architecture: a JavaScript analytics beacon on every page sends a view event to an analytics collector API (Kafka write, <10ms response). A Flink stream processor aggregates view counts in a 1-minute tumbling window and writes to Redis: article:{article_id}:views_1m (1-minute view count) and article:{article_id}:views_total (total view count). The editorial dashboard reads from Redis for live data and from ClickHouse for historical trend data. Trending articles (ranked by view velocity over the past 30 minutes) are surfaced on the editorial dashboard to guide homepage curation decisions.
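
To show what the 1-minute tumbling window computes, here is a plain-Python sketch of the aggregation. It is not Flink code; it only illustrates how each event is assigned to exactly one non-overlapping window. The event shape (article_id, epoch_seconds) is an assumption.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60


def tumbling_counts(events):
    """Group view events into 1-minute tumbling windows.

    events: iterable of (article_id, epoch_seconds) pairs.
    Returns {window_start_epoch: Counter({article_id: views})}.
    """
    windows = defaultdict(Counter)
    for article_id, ts in events:
        # Each timestamp belongs to exactly one window: [start, start + 60).
        window_start = int(ts) - int(ts) % WINDOW_SECONDS
        windows[window_start][article_id] += 1
    return dict(windows)


counts = tumbling_counts([("a", 0), ("a", 59), ("b", 30), ("a", 60)])
```

In this example the events at seconds 0, 30, and 59 fall in the window starting at 0, while the event at second 60 opens the next window.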

Database Design

PostgreSQL + JSONB: articles (article_id, brand_id, slug, headline, body_json, section_id, tags[], status, published_at, author_id, seo_meta_json), brands (brand_id, name, domain, paywall_config_json), sections (section_id, brand_id, slug, name).
Redis Cluster: article:{article_id}:meta (cached article metadata, TTL 5 min), user:{user_id}:paywall (subscription status, TTL 15 min), user:{user_id}:meter (metered read count, kept until the monthly reset rather than expiring with the subscription cache), user:{user_id}:interests (topic affinity map, TTL 1 hour), trending:global (sorted set of article_ids by 30-min view velocity).
ClickHouse: page_views (article_id, user_id, session_id, client_ts, referrer, device_type) for analytics, with 90-day retention.
S3: article media (images, videos), sitemap files, AMP and Apple News JSON exports.
Kafka: analytics-events (page views, click events), publish-events (article publish/update triggers).

API Design

GET /articles/{slug} — returns article content; CDN-cached at edge for 30 minutes; MISS triggers SSR with 60-second application cache
POST /paywall/check — body: {article_id, user_session_token}, returns {access: full|metered|blocked, remaining_free: N}; served from Redis in <10ms
POST /cms/articles — body: article document JSON, validates content, stores in PostgreSQL CMS DB, returns article_id
POST /cms/articles/{article_id}/publish — triggers rendering, CDN invalidation, sitemap update, and distribution to Apple News/Google AMP
GET /analytics/trending?brand_id={b}&window=30m — returns top-20 trending articles by view velocity; from Redis sorted set
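
As a sketch of what the trending endpoint returns, the function below ranks articles by view velocity (views per minute over the window) and keeps the top N. A dict stands in for the Redis sorted set, and the function name and input shape are assumptions.

```python
import heapq


def top_trending(view_counts: dict, window_minutes: int = 30, limit: int = 20) -> list:
    """Return (article_id, views_per_minute) pairs, highest velocity first.

    view_counts maps article_id to the number of views in the window.
    """
    velocity = {
        article_id: count / window_minutes
        for article_id, count in view_counts.items()
    }
    return heapq.nlargest(limit, velocity.items(), key=lambda item: item[1])


trending = top_trending({"a": 300, "b": 900, "c": 30}, limit=2)
```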

Scaling & Bottlenecks

CDN is the primary scaling mechanism: a 95% hit rate reduces origin load from 1,930 to 97 requests/second at peak. The remaining 5% (cache misses + dynamic personalization) is served by the application tier. For breaking news events (a major story drives 100k concurrent readers to one article), nearly all requests target a single cached object, so the CDN cache hit rate approaches 99.9% and origin load stays near ~200 requests/second, growing far more slowly than reader traffic. CDN cache invalidation on article update: use a surrogate key pattern (an article-specific cache tag) to invalidate only the affected article's cache entries across all edge nodes, rather than flushing the whole cache.
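
The surrogate key pattern can be illustrated with a toy in-memory cache. Real CDNs expose this through response headers and a purge API; the EdgeCache class, its methods, and the "article-42" tag format here are hypothetical and only show the indexing idea.

```python
from collections import defaultdict


class EdgeCache:
    """Toy cache that indexes entries by surrogate key for targeted purges."""

    def __init__(self):
        self._entries = {}                 # url -> cached body
        self._by_key = defaultdict(set)    # surrogate key -> urls tagged with it

    def put(self, url: str, body: str, surrogate_keys: list) -> None:
        self._entries[url] = body
        for key in surrogate_keys:
            self._by_key[key].add(url)

    def get(self, url: str):
        return self._entries.get(url)

    def purge(self, surrogate_key: str) -> int:
        """Drop every entry tagged with the key; return how many urls were tagged."""
        urls = self._by_key.pop(surrogate_key, set())
        for url in urls:
            self._entries.pop(url, None)
        return len(urls)


cache = EdgeCache()
cache.put("/news/2024/01/15/story", "<html>web</html>", ["article-42"])
cache.put("/amp/news/2024/01/15/story", "<html>amp</html>", ["article-42"])
cache.put("/news/2024/01/15/other", "<html>other</html>", ["article-43"])
purged = cache.purge("article-42")
```

Tagging the web and AMP variants with the same key means one purge call after an article update clears both, while unrelated articles stay cached.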

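Paywall Redis throughput: 1,930 checks/second × 2 Redis reads (subscription status + meter count) = 3,860 reads/second at peak, which is trivial for Redis. Meter count increments (Redis INCR) of at most 1,930/second are also well within Redis capacity. The monthly meter reset (100M users × 1 key each) is a batch operation run on the first day of each month. It should delete keys in small batches (for example, a Lua script invoked once per batch) rather than in a single call, because Redis executes a script atomically and blocks other commands while it runs.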
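
One alternative to a bulk reset, offered here as an assumption rather than as part of the design above, is to include the month in the meter key and give each key an expiry at the next month boundary, so old counters disappear on their own. The key format and helper names below are hypothetical; the returned number of seconds is the value that would be passed to a Redis EXPIRE call.

```python
from datetime import datetime, timezone


def meter_key(user_id: str, now: datetime) -> str:
    """Month-scoped meter key, e.g. user:u1:meter:202401 (hypothetical format)."""
    return f"user:{user_id}:meter:{now:%Y%m}"


def seconds_until_next_month(now: datetime) -> int:
    """Seconds from now until 00:00 on the first day of the next month."""
    if now.month == 12:
        next_month = now.replace(
            year=now.year + 1, month=1, day=1,
            hour=0, minute=0, second=0, microsecond=0,
        )
    else:
        next_month = now.replace(
            month=now.month + 1, day=1,
            hour=0, minute=0, second=0, microsecond=0,
        )
    return int((next_month - now).total_seconds())


now = datetime(2024, 1, 31, 23, 59, 0, tzinfo=timezone.utc)
key = meter_key("u1", now)
ttl = seconds_until_next_month(now)
```

The trade-off is that the month boundary is fixed to one time zone (UTC in this sketch), which may not match every brand's billing calendar.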

Key Trade-offs

Headless CMS vs. monolithic CMS (WordPress): Headless CMSes separate content from presentation, enabling multi-channel publishing (web, app, Apple News, AMP) from a single content store; monolithic CMSes (WordPress) have richer editorial UIs and plugin ecosystems but couple content and rendering, making performance optimization and multi-channel distribution harder.
Server-side paywall vs. client-side: Client-side paywall (JavaScript truncation) is easy to bypass by disabling JavaScript; server-side rendering of truncated content with a paywall token exchange is more robust but adds latency to the paywall check. A hybrid approach (CDN-cached truncated content + client-side token for full content) balances cacheability and security.
Personalization depth vs. privacy: Deep personalization (track every click, model interests per article) maximizes engagement but requires data collection that faces GDPR/CCPA consent challenges; lightweight personalization (track only topic-level preferences from explicit follows) is privacy-preserving but less accurate.
Static site generation vs. SSR: Pre-generating all article pages as static HTML files (Gatsby, Next.js SSG) is the most CDN-friendly approach (100% cache hit rate) but breaks for dynamic content (personalized paywalls, live blogs). SSR with CDN caching and edge-side includes (ESI) for dynamic fragments is the practical approach for news platforms.