System Design: Real Estate Marketplace (Zillow-scale)
Design a Zillow-scale real estate marketplace supporting property listings, valuation estimates, agent matching, and mortgage pre-qualification for millions of daily users. Covers geo-search, image handling, and real-time market data.
Requirements
Functional Requirements:
- Browse and search for-sale, for-rent, and recently sold properties with geo-based filtering
- Property detail pages with photos, virtual tours, price history, school ratings, and neighborhood data
- Automated valuation model (AVM) providing estimated property values
- Agent directory with reviews, recent sales, and contact/inquiry forms
- Saved searches and alerts when new listings matching criteria are published
- Mortgage calculator and pre-qualification flow with lender partner integrations
Non-Functional Requirements:
- Geo-search returning results within 500 ms at the 95th percentile
- Support 10M daily active users with peak traffic during evening home browsing hours
- Property photos: 160M properties × 20 photos × 3 resolutions ≈ 10B image objects
- New listings appear in search results within 5 minutes of MLS ingestion
- 99.9% uptime; search downtime directly impacts business revenue
Scale Estimation
At Zillow scale:
- 160M US property records (all properties, not just active listings); ~1.5M active for-sale listings at any time.
- Daily traffic: 10M DAU × 15 searches/session = 150M search queries/day ≈ 1,736/second average; 10x peak ≈ 17,360 searches/second.
- Photo uploads from MLS feeds: 50k new listings/day × 20 photos = 1M photo uploads/day.
- Valuation model runs: triggered on listing events plus scheduled refreshes ≈ 500k AVM computations/day.
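The arithmetic above can be checked with a short back-of-envelope script (pure arithmetic, nothing assumed beyond the numbers in the text):

```python
# Back-of-envelope check of the scale estimates above.
DAU = 10_000_000
SEARCHES_PER_SESSION = 15

searches_per_day = DAU * SEARCHES_PER_SESSION      # 150M queries/day
avg_qps = searches_per_day / 86_400                # ~1,736/second average
peak_qps = 10 * round(avg_qps)                     # 10x peak ~ 17,360/second

photo_uploads_per_day = 50_000 * 20                # 1M photo uploads/day

print(f"avg search QPS:    {avg_qps:,.0f}")
print(f"peak search QPS:   {peak_qps:,}")
print(f"photo uploads/day: {photo_uploads_per_day:,}")
```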
High-Level Architecture
The platform is organized around a Property Data Platform, a Search Service, a Media Service, and a User Engagement Platform. The Property Data Platform ingests listing data from MLS (Multiple Listing Service) feeds via RETS/RESO API protocols, normalizing data from hundreds of local MLS systems into a canonical property schema. This is the most complex data engineering challenge — each MLS has different field names, formats, and update frequencies.
The Search Service is Elasticsearch-backed, with a geo-spatial component for map-based search. Property documents in Elasticsearch contain all search-relevant fields plus a geo_point for radius and bounding-box queries. The Elasticsearch cluster is sized for 160M property documents with a 10-shard, 2-replica configuration. Search queries use a multi-stage approach: broad geo-filter first, then attribute filters, then relevance ranking.
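An illustrative index definition makes the setup above concrete. This is a sketch, not Zillow's actual schema — field names are assumptions; only the shard/replica counts and the geo_point come from the text:

```python
# Illustrative Elasticsearch property-index definition.
# geo_point backs radius and bounding-box queries; keyword/numeric
# fields back the attribute filters applied after the geo stage.
PROPERTY_MAPPING = {
    "settings": {"number_of_shards": 10, "number_of_replicas": 2},
    "mappings": {
        "properties": {
            "property_id":    {"type": "keyword"},
            "location":       {"type": "geo_point"},
            "listing_status": {"type": "keyword"},
            "property_type":  {"type": "keyword"},
            "list_price":     {"type": "long"},
            "beds":           {"type": "short"},
            "baths":          {"type": "half_float"},
            "sqft":           {"type": "integer"},
            "price_per_sqft": {"type": "float"},
            "days_on_market": {"type": "short"},
        }
    },
}
```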
The AVM (Automated Valuation Model) Service runs ML models (gradient boosting or neural networks) trained on historical sale prices, property attributes, and neighborhood features. The model is retrained weekly on new sales data. Valuation outputs are cached in PostgreSQL and served synchronously on property detail page load. The AVM is a major trust-building feature — accuracy and freshness directly impact user trust.
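The cached-valuation serving path can be sketched as a read-through with async refresh. A minimal sketch, assuming hypothetical `fetch_cached_avm` and `enqueue_avm_job` helpers and a daily freshness window:

```python
# Sketch: serve the cached AVM value; if it is missing or stale,
# enqueue an async recompute rather than blocking the page load.
from datetime import datetime, timedelta, timezone

AVM_MAX_AGE = timedelta(days=1)  # daily refresh window

def get_estimate(property_id, fetch_cached_avm, enqueue_avm_job):
    row = fetch_cached_avm(property_id)   # (value, updated_at) or None
    if row is None:
        enqueue_avm_job(property_id)
        return None                        # detail page shows "estimate pending"
    value, updated_at = row
    if datetime.now(timezone.utc) - updated_at > AVM_MAX_AGE:
        enqueue_avm_job(property_id)       # refresh asynchronously, serve stale now
    return value
```

Serving stale-while-revalidating keeps detail-page latency flat while bounding how far a valuation can lag.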
Core Components
MLS Ingestion Pipeline
A Kafka-based pipeline ingesting from hundreds of MLS data feeds on update schedules (some real-time, some daily batch). Each feed adapter normalizes to a canonical PropertyEvent schema and publishes to Kafka. A Stream Processor (Flink) enriches events with geocoding (address → lat/lon), school district lookup, flood zone data, and historical price data. Enriched records flow to both the Elasticsearch indexing pipeline (5-minute SLA to search) and the PostgreSQL property store (for detailed pages). Deduplication logic handles the same property appearing in multiple regional MLS feeds.
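The canonical event and the dedup key can be sketched as follows. Field names are assumptions for illustration; the key idea from the text is that deduplication must be independent of which regional feed sent the event:

```python
# Sketch of a canonical PropertyEvent that each feed adapter normalizes into.
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class PropertyEvent:
    mls_id: str
    source_feed: str
    street: str
    city: str
    state: str
    zip_code: str
    list_price: int
    event_type: str  # e.g. "NEW" | "PRICE_CHANGE" | "STATUS_CHANGE" | "DELETE"

    def dedup_key(self) -> str:
        # Hash of the normalized address only, so the same physical property
        # arriving from two regional MLS feeds collapses to one key.
        addr = f"{self.street}|{self.city}|{self.state}|{self.zip_code}".lower()
        return hashlib.sha1(addr.encode()).hexdigest()
```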
Geo-Search Service
Built on Elasticsearch with geo_shape polygon queries for map-drag search and geo_distance for radius search. Map-based search uses a two-pass approach: first pass retrieves property IDs within the viewport polygon (fast geo query), second pass hydrates results with display data from Redis cache. Search filters (price range, bed/bath, property type, listing status) are applied as Elasticsearch term and range filters. A separate "Draw on Map" feature stores user-drawn polygon queries and re-evaluates them against new listings as they arrive.
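The first pass of the viewport search can be sketched as a query builder. This assumes the illustrative field names used elsewhere in this write-up (`location`, `listing_status`, `list_price`); the bounding-box corners follow Elasticsearch's top-left/bottom-right convention:

```python
# Sketch: build the IDs-only first-pass query for a map-viewport search.
def viewport_query(ne_lat, ne_lon, sw_lat, sw_lon, price_min=None, price_max=None):
    filters = [
        # Viewport bounding box: top-left = (NE lat, SW lon),
        # bottom-right = (SW lat, NE lon).
        {"geo_bounding_box": {"location": {
            "top_left":     {"lat": ne_lat, "lon": sw_lon},
            "bottom_right": {"lat": sw_lat, "lon": ne_lon},
        }}},
        {"term": {"listing_status": "FOR_SALE"}},
    ]
    price = {}
    if price_min is not None:
        price["gte"] = price_min
    if price_max is not None:
        price["lte"] = price_max
    if price:
        filters.append({"range": {"list_price": price}})
    # Return only IDs; display data is hydrated from Redis in the second pass.
    return {"query": {"bool": {"filter": filters}},
            "_source": ["property_id"], "size": 500}
```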
Media & Virtual Tour Service
Handles photo upload, resizing, and delivery. MLS photos are ingested via the pipeline and uploaded to S3. A Lambda-triggered processing pipeline generates three sizes: thumbnail (200px), gallery (800px), and full (1600px). Virtual tour integration embeds third-party 3D tour players (Matterport). A CDN (CloudFront) serves all images globally with edge caching. Photo ordering, deletion, and compliance (MLS copyright rules on photo use) are managed by the Media Service.
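The resize-target computation the pipeline would run per photo can be sketched in a few lines. The three sizes come from the text; the longest-edge convention and the no-upscaling rule are assumptions:

```python
# Sketch: compute output dimensions for the three photo renditions,
# constraining the longest edge and preserving aspect ratio.
SIZES = {"thumb": 200, "gallery": 800, "full": 1600}

def resize_targets(width, height):
    targets = {}
    for name, max_edge in SIZES.items():
        scale = min(1.0, max_edge / max(width, height))  # never upscale
        targets[name] = (round(width * scale), round(height * scale))
    return targets
```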
Database Design
Property records in PostgreSQL: property_id UUID, address, city, state, zip, geo POINT, property_type, beds, baths, sqft, lot_size, year_built, listing_status ENUM, list_price, avm_value, avm_updated_at, mls_id, last_updated_at. A price_history table records every price change: (property_id, price, changed_at, change_type). A listing_photos table maps property to ordered photo S3 keys.
Elasticsearch documents mirror the PostgreSQL fields plus computed fields for search: price_per_sqft, days_on_market, price_reduction_pct. Saved searches are stored in PostgreSQL with the search criteria as JSONB; a scheduled job re-runs each saved search against Elasticsearch daily and sends alerts for new matches. Agent profiles and reviews use a separate schema; review data feeds a rating aggregation job that updates agent summary scores nightly.
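The nightly saved-search job can be sketched as follows. `run_search`, `send_alert`, and the stored row shape are assumptions for illustration; the core logic is alerting only on listings not previously seen for that search:

```python
# Sketch: re-run one saved search and alert on newly matching listings.
def process_saved_search(saved, run_search, send_alert):
    """saved: dict with 'criteria', 'user_id', and 'seen_ids'
    (property IDs this search has already alerted on)."""
    hits = run_search(saved["criteria"])          # list of matching property IDs
    new_ids = [pid for pid in hits if pid not in saved["seen_ids"]]
    if new_ids:
        send_alert(saved["user_id"], new_ids)
        saved["seen_ids"].update(new_ids)         # persist in the real job
    return new_ids
```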
API Design
GET /api/v1/properties/search?bounds={ne_lat,ne_lon,sw_lat,sw_lon}&price_min=&price_max=&beds_min=&type= — map-viewport search; returns property summaries with geo coordinates.
GET /api/v1/properties/{propertyId} — full property detail including photos, price history, AVM, and school data.
POST /api/v1/saved-searches — body: {criteria JSONB, alert_frequency}; registers a saved search with email/push alerts.
GET /api/v1/properties/{propertyId}/estimate — returns current AVM with confidence interval and comparable sales.
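A client-side sketch of how the search endpoint's `bounds` parameter would be assembled (the base host is a placeholder; no request is actually sent):

```python
# Sketch: build the viewport-search URL from the API contract above.
from urllib.parse import urlencode

def search_url(base, ne_lat, ne_lon, sw_lat, sw_lon, **filters):
    params = {"bounds": f"{ne_lat},{ne_lon},{sw_lat},{sw_lon}", **filters}
    return f"{base}/api/v1/properties/search?{urlencode(params)}"
```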
Scaling & Bottlenecks
Elasticsearch geo-queries are the main latency bottleneck. Map drag events fire a new search on every mouse-up, generating high query rates from active users. Query result caching with a short TTL (30 seconds) for identical viewport+filter combinations reduces Elasticsearch load during peak browsing. A dedicated Elasticsearch coordinating node tier (no data, just query routing and aggregation) prevents data node saturation during complex aggregation queries.
Photo delivery is handled entirely by CDN — S3 is never hit for reads after initial CDN warm-up. Cache-Control headers with long max-age (1 year) for immutable photo URLs eliminate cache churn. The main CDN cost concern is invalidation during compliance-driven photo removal (MLS listings that expire must have photos removed within SLA). A bulk invalidation batch job handles this within 4 hours of listing expiry.
Key Trade-offs
- AVM freshness vs. compute cost: Running AVM on every page load would provide always-current values but is too expensive; caching AVM with daily refresh balances freshness and cost, though values can lag recent comparable sales.
- MLS data freshness vs. normalization complexity: Real-time MLS feeds give up-to-the-minute listings but require real-time normalization of hundreds of schema variants; daily batch ingestion is simpler but misses same-day listing changes.
- Elasticsearch vs. PostGIS for geo-search: Elasticsearch handles text + geo in a single index and scales horizontally, but is operationally complex; PostGIS on PostgreSQL is simpler but requires careful indexing to hit 500ms query SLAs at 160M records.
- AVM confidence display: Showing a single number creates a false sense of precision; showing a range ("$450k–$490k") is more honest but users often anchor on the high end; this is as much a UX problem as a systems problem.