System Design: Travel Review Platform
Design a travel review platform like TripAdvisor — covering review ingestion, ranking algorithms, fake review detection, and multi-language content serving.
Requirements
Functional Requirements:
- Travelers submit text reviews with star ratings, photos, and visit date for hotels/restaurants/attractions
- Review listing pages display sorted and filtered reviews with summary statistics
- Property owners can respond to reviews
- Helpful votes and review flagging by community members
- Review summary: AI-generated highlights from recent reviews
- Multi-language support: reviews submitted and read in 40+ languages
Non-Functional Requirements:
- Review submission processed within 5 seconds; moderation within 24 hours
- Review pages load in under 300ms at the 95th percentile
- Support 1 billion reviews across 8 million properties
- Fake review detection must process each review within 60 minutes of submission
- 99.9% availability; peak travel season doubles traffic
Scale Estimation
Storage and throughput estimates:
- Review text: 1 billion reviews × 500 bytes average = 500 GB
- Photos: 1 billion reviews × 2 photos average × 1.5 MB compressed = 3 PB in S3
- New reviews: 3 million/day ≈ 35/second
- Review page loads: 1 billion/day ≈ 11,574/second
- Helpful votes: 500 million/day ≈ 5,787/second
- Moderation queue: 3 million reviews/day; automated ML filters handle 95% and human review handles the remaining 5%, or 150,000 human moderations/day across a global team of 2,000 moderators (about 75 per moderator per day)
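The estimates above are plain arithmetic, and a short script makes them easy to re-run when an input changes. This is a minimal sketch: the function and key names are illustrative, and storage uses decimal units (1 GB = 10^9 bytes).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(events_per_day: int) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


def estimate() -> dict:
    """Recompute the scale figures from the inputs stated in the text."""
    total_reviews = 1_000_000_000
    new_reviews_per_day = 3_000_000
    # 5% of new reviews go to human moderators (integer math avoids float noise).
    human_per_day = new_reviews_per_day * 5 // 100
    return {
        "text_bytes": total_reviews * 500,              # 500 bytes per review
        "photo_bytes": total_reviews * 2 * 1_500_000,   # 2 photos x 1.5 MB
        "new_reviews_per_s": per_second(new_reviews_per_day),
        "page_loads_per_s": per_second(1_000_000_000),
        "helpful_votes_per_s": per_second(500_000_000),
        "human_moderations_per_day": human_per_day,
        "per_moderator_per_day": human_per_day / 2_000,  # 2,000 moderators
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.0f}")
```

These are daily averages. The stated requirement that peak travel season doubles traffic means capacity should be planned for roughly twice these per-second rates.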
High-Level Architecture
The review platform splits into: Review Ingestion & Moderation (processing new submissions), Review Serving (delivering review pages at scale), and Trust & Safety (detecting and removing fake reviews).
New review submissions flow into a Review Ingestion Service that validates content (length, language detection, profanity check), stores the review in PostgreSQL as PENDING_MODERATION, enqueues it in Kafka for asynchronous processing, and returns 202 Accepted to the client. The asynchronous pipeline then runs the fake review ML classifier, photo processing (resize, CDN upload), sentiment analysis, and the AI summary trigger. After automated moderation, the review transitions to PUBLISHED or FLAGGED_FOR_HUMAN_REVIEW, or is auto-rejected.
The Review Serving path is heavily cached. Property review pages are pre-rendered as HTML fragments (review list with default sort, rating histogram, top reviewer highlights) and cached in two layers: Varnish as a reverse-proxy cache in front of the origin and CloudFront as the CDN. Cache invalidation fires when a new review is published or when a review's helpful votes cross a threshold.
Core Components
Review Moderation Pipeline
Every new review triggers a Kafka event consumed by the Moderation Orchestrator, which runs three checks in parallel: (1) a spam classifier (gradient boosted model; features include reviewer profile age, IP, text similarity to other reviews, and review velocity), (2) a content policy check (profanity, PII, off-topic content), and (3) photo moderation (Google Vision SafeSearch API for NSFW detection). Each check returns PASS, FAIL, or UNCERTAIN. Any FAIL auto-rejects the review; otherwise any UNCERTAIN routes it to the human moderation queue; a review with all PASS results is auto-published. Human moderators use a web tool that shows the review alongside its ML scores, and their decisions feed back into model training.
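The decision rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's implementation: the three check functions are hypothetical stubs standing in for the real classifier, policy engine, and image moderation call, and the REJECTED status name is assumed (the text only says such reviews are auto-rejected).

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNCERTAIN = "UNCERTAIN"


def decide(verdicts) -> str:
    """Combine per-check verdicts: FAIL wins, then UNCERTAIN, else publish."""
    if any(v is Verdict.FAIL for v in verdicts):
        return "REJECTED"  # assumed status name
    if any(v is Verdict.UNCERTAIN for v in verdicts):
        return "FLAGGED_FOR_HUMAN_REVIEW"
    return "PUBLISHED"


def moderate(review: dict, checks) -> str:
    """Run all checks concurrently and combine their verdicts."""
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        verdicts = list(pool.map(lambda check: check(review), checks))
    return decide(verdicts)


# Hypothetical stub checks, for illustration only.
def spam_check(review: dict) -> Verdict:
    # Stand-in for the gradient boosted spam classifier.
    if review.get("account_age_days", 0) < 7:
        return Verdict.UNCERTAIN
    return Verdict.PASS


def policy_check(review: dict) -> Verdict:
    # Stand-in rule: treat an email-like token as PII.
    return Verdict.FAIL if "@" in review.get("body", "") else Verdict.PASS


def photo_check(review: dict) -> Verdict:
    # A real system would call an image moderation service here.
    return Verdict.PASS


CHECKS = [spam_check, policy_check, photo_check]

if __name__ == "__main__":
    print(moderate({"body": "Lovely stay", "account_age_days": 400}, CHECKS))
```

Threads suit this step because the checks are dominated by network calls to model and vision services, so they spend most of their time waiting rather than computing.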
Fake Review Detection
Fake review patterns: coordinated campaigns (multiple accounts reviewing the same property within hours), review farms (accounts with no travel history suddenly posting dozens of reviews), incentivized reviews (text patterns like "I received a discount"). The detection model uses graph features (reviewer-property relationship graph, shared IPs, device fingerprints) plus text features (review similarity to other reviews for the same property, linguistic markers of non-native writing). Suspicious reviews are shadow-flagged (excluded from ranking but not visibly removed) pending further investigation.
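One of the patterns above, coordinated campaigns, can be approximated with a sliding-window count of distinct accounts per property. The sketch below is a simple heuristic rather than the graph-based model the text describes. The function name, the 6-hour window, and the threshold of 3 accounts are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_bursts(reviews, window=timedelta(hours=6), min_accounts=3):
    """Return property_ids where at least `min_accounts` distinct users
    posted reviews within any span of length `window`.

    `reviews` is an iterable of (property_id, user_id, created_at) tuples.
    """
    by_property = defaultdict(list)
    for property_id, user_id, created_at in reviews:
        by_property[property_id].append((created_at, user_id))

    flagged = set()
    for property_id, events in by_property.items():
        events.sort()  # chronological order
        recent = deque()  # reviews inside the current window
        for created_at, user_id in events:
            recent.append((created_at, user_id))
            # Drop reviews that fell out of the window.
            while created_at - recent[0][0] > window:
                recent.popleft()
            if len({user for _, user in recent}) >= min_accounts:
                flagged.add(property_id)
                break
    return flagged


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    sample = [
        ("hotel-1", "u1", t0),
        ("hotel-1", "u2", t0 + timedelta(hours=1)),
        ("hotel-1", "u3", t0 + timedelta(hours=2)),
        ("hotel-2", "u1", t0),
        ("hotel-2", "u4", t0 + timedelta(days=3)),
    ]
    print(find_bursts(sample))
```

A fixed threshold would over-flag popular properties that legitimately receive many reviews per hour, so in practice the threshold would likely be set relative to each property's normal review velocity and combined with the account and device signals described above.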
Review Ranking & Aggregation
Review sort orders: Most Recent (default), Most Helpful (helpful_votes - unhelpful_votes score), Highest Rated, Lowest Rated. The default displayed rating is a Bayesian average (bayesian_avg = (rating_count × avg_rating + prior_count × prior_mean) / (rating_count + prior_count)) which prevents properties with 1 review and 5 stars from outranking properties with 10,000 reviews and 4.5 stars. Review highlights (pinned at the top of the listing): selected by a BERT-based extractive summarizer that identifies the most informative sentences about key aspects (location, cleanliness, staff, food).
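The Bayesian average formula can be checked against the two properties mentioned above. In this sketch the prior mean of 3.9 comes from the trade-offs section of this document, while the prior count of 50 is an assumed value chosen only for illustration.

```python
def bayesian_average(rating_count: int, avg_rating: float,
                     prior_count: int = 50, prior_mean: float = 3.9) -> float:
    """Shrink a property's average rating toward the global prior mean.

    prior_count acts like that many "virtual" reviews at prior_mean;
    50 is an assumed value, not one given in the text.
    """
    return (rating_count * avg_rating + prior_count * prior_mean) / (
        rating_count + prior_count
    )


if __name__ == "__main__":
    new_property = bayesian_average(rating_count=1, avg_rating=5.0)
    established = bayesian_average(rating_count=10_000, avg_rating=4.5)
    print(f"1 review at 5.0 stars      -> {new_property:.2f}")
    print(f"10,000 reviews at 4.5 stars -> {established:.2f}")
```

With these inputs the single 5-star review scores (1 × 5.0 + 50 × 3.9) / 51 ≈ 3.92, while the established property stays close to its raw 4.5 average, so the established property ranks higher. A larger prior_count makes new properties converge to their own average more slowly.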
Database Design
Reviews are stored in PostgreSQL, sharded by property_id: (review_id, property_id, user_id, rating, title, body, visit_date, language, status, helpful_votes, created_at). An index on (property_id, status, created_at DESC) serves review listing pages. Rating aggregates live in a summary table updated by a trigger on status change; a PostgreSQL materialized view is a poor fit here because it changes only on an explicit REFRESH MATERIALIZED VIEW, which recomputes the whole view. Photo metadata is in PostgreSQL; the photos themselves are stored in S3 and served through the CloudFront CDN. Full-text review search runs on Elasticsearch: reviews are indexed by property_id and text content, supporting keyword search within a property's reviews. Helpful votes are kept in Redis counters (INCR review:{review_id}:helpful_votes) with a periodic batch sync to PostgreSQL.
API Design
- POST /v1/properties/{property_id}/reviews — Submits review with rating, text, photos, visit_date; returns review_id and PENDING_MODERATION status
- GET /v1/properties/{property_id}/reviews?sort={}&lang={}&page={} — Returns paginated reviews for a property with sort and language filter
- POST /v1/reviews/{review_id}/helpful — Records helpful/unhelpful vote from authenticated user; Redis INCR + deduplication by user_id
- POST /v1/reviews/{review_id}/response — Property manager posts official response to review; stored in PostgreSQL, displayed below the review
Scaling & Bottlenecks
Review page serving at 11,574 requests/second is the primary bottleneck, and CDN caching is the main mitigation. The top 10,000 properties (by search volume) have their review pages cached as HTML fragments at CloudFront edge nodes, with a 5-minute TTL for the review list and early invalidation when a new review is published. Cache hit rate for these properties exceeds 90%. Assuming roughly 90% of all page requests are served from cache, origin load drops to ~1,157 requests/second, which a modest PostgreSQL read-replica fleet can handle.
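The origin-load figure follows from total traffic × (1 − hit rate). The sketch below assumes the 90% hit rate applies to all page traffic, which is how the ~1,157 requests/second figure is obtained; the function name is illustrative.

```python
def origin_rps(total_rps: float, cache_hit_rate: float) -> float:
    """Requests per second that miss the cache and reach the origin."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return total_rps * (1.0 - cache_hit_rate)


if __name__ == "__main__":
    for hit_rate in (0.95, 0.90, 0.80):
        print(f"hit rate {hit_rate:.0%}: {origin_rps(11_574, hit_rate):,.0f} rps at origin")
```

The relationship is sensitive at the high end: falling from a 90% to an 80% hit rate doubles origin load to about 2,315 requests/second, so the long tail of properties outside the top 10,000 deserves attention when sizing the read replicas.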
Helpful vote writes at 5,787/second hit Redis counters (O(1) INCR operation) rather than PostgreSQL, avoiding row-level lock contention on the reviews table. A batch job syncs Redis counters to PostgreSQL every 5 minutes. The Elasticsearch full-text search index (1 billion documents across 20 shards) handles the rare keyword-in-reviews search query at ~100 QPS.
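The vote path can be sketched with an in-memory stand-in for Redis. This is a simplified illustration under assumptions: the class and function names are hypothetical, a Python set plays the role of a per-review Redis set (SADD) for deduplication, and a dictionary plays the role of the INCR counters. Unlike the absolute counters described above, the sketch tracks only unsynced deltas, which is one possible way to make the batch job safe to re-run.

```python
from collections import defaultdict


class VoteBuffer:
    """In-memory stand-in for the Redis layer (illustrative only)."""

    def __init__(self) -> None:
        self._voters = defaultdict(set)   # review_id -> user_ids who voted
        self._pending = defaultdict(int)  # review_id -> votes not yet synced

    def record_helpful(self, review_id: str, user_id: str) -> bool:
        """Count a vote once per (review, user). Returns False on a repeat."""
        if user_id in self._voters[review_id]:
            return False
        self._voters[review_id].add(user_id)
        self._pending[review_id] += 1
        return True

    def drain(self) -> dict:
        """Return the unsynced deltas and reset them."""
        deltas = dict(self._pending)
        self._pending.clear()
        return deltas


def sync_to_store(buffer: VoteBuffer, store: dict) -> None:
    """Batch job: apply buffered deltas to the durable store.

    `store` is a dict standing in for the helpful_votes column in PostgreSQL.
    """
    for review_id, delta in buffer.drain().items():
        store[review_id] = store.get(review_id, 0) + delta


if __name__ == "__main__":
    buffer = VoteBuffer()
    buffer.record_helpful("r1", "alice")
    buffer.record_helpful("r1", "alice")  # duplicate, ignored
    buffer.record_helpful("r1", "bob")
    durable = {"r1": 10}
    sync_to_store(buffer, durable)
    print(durable)
```

In a real deployment the check-and-increment would need to be atomic, for example by relying on the return value of a single set-add command, since two concurrent requests from the same user could otherwise both pass the membership check.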
Key Trade-offs
- Auto-publish vs. pre-moderation: pre-moderation blocks most fake or harmful reviews before they appear but can delay legitimate reviews by up to 24 hours, hurting reviewer experience; auto-publish with post-publication removal is the industry standard at this scale
- Bayesian average vs. simple average: a Bayesian average blunts gaming by new properties posting fake 5-star reviews; the prior (global property mean of ~3.9 stars) anchors new properties to the average until enough genuine reviews accumulate
- Review response rights: allowing property owners to respond is good for engagement but creates abuse risk (such as harassing negative reviewers), so a strict response content policy and response moderation are required
- Language translation: auto-translating all reviews into the user's language improves accessibility but introduces translation errors that can misrepresent the original sentiment; always showing the original alongside a "Translated by Google" disclosure is the honest approach