System Design: Real Estate Marketplace (Zillow-scale)
Design a Zillow-scale real estate marketplace supporting property listings, valuation estimates, agent matching, and mortgage pre-qualification for millions of daily users. Covers geo-search, image handling, and real-time market data.
Requirements
Functional Requirements:
- Browse and search for-sale, for-rent, and recently sold properties with geo-based filtering
- Property detail pages with photos, virtual tours, price history, school ratings, and neighborhood data
- Automated valuation model (AVM) providing estimated property values
- Agent directory with reviews, recent sales, and contact/inquiry forms
- Saved searches and alerts when new listings matching criteria are published
- Mortgage calculator and pre-qualification flow with lender partner integrations
Non-Functional Requirements:
- Geo-search returning results within 500 ms at the 95th percentile
- Support 10M daily active users with peak traffic during evening home browsing hours
- Property photos: 160M properties × 20 photos × 3 resolutions ≈ 10B image objects
- New listings appear in search results within 5 minutes of MLS ingestion
- 99.9% uptime; search downtime directly impacts business revenue
Scale Estimation
At Zillow scale:
- 160M US property records (all properties, not just active listings); ~1.5M active for-sale listings at any time.
- Daily traffic: 10M DAU × 15 searches/session = 150M search queries/day ≈ 1,736/second average; 10x peak ≈ 17,360 searches/second.
- Photo uploads from MLS feeds: 50k new listings/day × 20 photos = 1M photo uploads/day.
- Valuation model runs: triggered on listing events plus scheduled refreshes ≈ 500k AVM computations/day.
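The arithmetic above can be checked with a short back-of-envelope script (pure arithmetic, nothing assumed beyond the numbers in the text):

```python
# Back-of-envelope check of the scale estimates above.
DAU = 10_000_000
SEARCHES_PER_SESSION = 15

searches_per_day = DAU * SEARCHES_PER_SESSION      # 150M queries/day
avg_qps = searches_per_day / 86_400                # ~1,736/second average
peak_qps = 10 * round(avg_qps)                     # 10x peak ~ 17,360/second

photo_uploads_per_day = 50_000 * 20                # 1M photo uploads/day

print(f"avg search QPS:    {avg_qps:,.0f}")
print(f"peak search QPS:   {peak_qps:,}")
print(f"photo uploads/day: {photo_uploads_per_day:,}")
```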
High-Level Architecture
The platform is organized around a Property Data Platform, a Search Service, a Media Service, and a User Engagement Platform. The Property Data Platform ingests listing data from MLS (Multiple Listing Service) feeds via RETS/RESO API protocols, normalizing data from hundreds of local MLS systems into a canonical property schema. This is the most complex data engineering challenge — each MLS has different field names, formats, and update frequencies.
The Search Service is Elasticsearch-backed, with a geo-spatial component for map-based search. Property documents in Elasticsearch contain all search-relevant fields plus a geo_point for radius and bounding-box queries. The Elasticsearch cluster is sized for 160M property documents with a 10-shard, 2-replica configuration. Search queries use a multi-stage approach: broad geo-filter first, then attribute filters, then relevance ranking.
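An illustrative index definition makes the setup above concrete. This is a sketch, not Zillow's actual schema — field names are assumptions; only the shard/replica counts and the geo_point come from the text:

```python
# Illustrative Elasticsearch property-index definition.
# geo_point backs radius and bounding-box queries; keyword/numeric
# fields back the attribute filters applied after the geo stage.
PROPERTY_MAPPING = {
    "settings": {"number_of_shards": 10, "number_of_replicas": 2},
    "mappings": {
        "properties": {
            "property_id":    {"type": "keyword"},
            "location":       {"type": "geo_point"},
            "listing_status": {"type": "keyword"},
            "property_type":  {"type": "keyword"},
            "list_price":     {"type": "long"},
            "beds":           {"type": "short"},
            "baths":          {"type": "half_float"},
            "sqft":           {"type": "integer"},
            "price_per_sqft": {"type": "float"},
            "days_on_market": {"type": "short"},
        }
    },
}
```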
The AVM (Automated Valuation Model) Service runs ML models (gradient boosting or neural networks) trained on historical sale prices, property attributes, and neighborhood features. The model is retrained weekly on new sales data. Valuation outputs are cached in PostgreSQL and served synchronously on property detail page load. The AVM is a major trust-building feature — accuracy and freshness directly impact user trust.
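The cached-valuation serving path can be sketched as a read-through with async refresh. A minimal sketch, assuming hypothetical `fetch_cached_avm` and `enqueue_avm_job` helpers and a daily freshness window:

```python
# Sketch: serve the cached AVM value; if it is missing or stale,
# enqueue an async recompute rather than blocking the page load.
from datetime import datetime, timedelta, timezone

AVM_MAX_AGE = timedelta(days=1)  # daily refresh window

def get_estimate(property_id, fetch_cached_avm, enqueue_avm_job):
    row = fetch_cached_avm(property_id)   # (value, updated_at) or None
    if row is None:
        enqueue_avm_job(property_id)
        return None                        # detail page shows "estimate pending"
    value, updated_at = row
    if datetime.now(timezone.utc) - updated_at > AVM_MAX_AGE:
        enqueue_avm_job(property_id)       # refresh asynchronously, serve stale now
    return value
```

Serving stale-while-revalidating keeps detail-page latency flat while bounding how far a valuation can lag.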
Core Components
MLS Ingestion Pipeline
A Kafka-based pipeline ingesting from hundreds of MLS data feeds on update schedules (some real-time, some daily batch). Each feed adapter normalizes to a canonical PropertyEvent schema and publishes to Kafka. A Stream Processor (Flink) enriches events with geocoding (address → lat/lon), school district lookup, flood zone data, and historical price data. Enriched records flow to both the Elasticsearch indexing pipeline (5-minute SLA to search) and the PostgreSQL property store (for detailed pages). Deduplication logic handles the same property appearing in multiple regional MLS feeds.
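The canonical event and the dedup key can be sketched as follows. Field names are assumptions for illustration; the key idea from the text is that deduplication must be independent of which regional feed sent the event:

```python
# Sketch of a canonical PropertyEvent that each feed adapter normalizes into.
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class PropertyEvent:
    mls_id: str
    source_feed: str
    street: str
    city: str
    state: str
    zip_code: str
    list_price: int
    event_type: str  # e.g. "NEW" | "PRICE_CHANGE" | "STATUS_CHANGE" | "DELETE"

    def dedup_key(self) -> str:
        # Hash of the normalized address only, so the same physical property
        # arriving from two regional MLS feeds collapses to one key.
        addr = f"{self.street}|{self.city}|{self.state}|{self.zip_code}".lower()
        return hashlib.sha1(addr.encode()).hexdigest()
```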
Geo-Search Service
Built on Elasticsearch with geo_shape polygon queries for map-drag search and geo_distance for radius search. Map-based search uses a two-pass approach: first pass retrieves property IDs within the viewport polygon (fast geo query), second pass hydrates results with display data from Redis cache. Search filters (price range, bed/bath, property type, listing status) are applied as Elasticsearch term and range filters. A separate "Draw on Map" feature stores user-drawn polygon queries and re-evaluates them against new listings as they arrive.
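The first pass of the viewport search can be sketched as a query builder. This assumes the illustrative field names used elsewhere in this write-up (`location`, `listing_status`, `list_price`); the bounding-box corners follow Elasticsearch's top-left/bottom-right convention:

```python
# Sketch: build the IDs-only first-pass query for a map-viewport search.
def viewport_query(ne_lat, ne_lon, sw_lat, sw_lon, price_min=None, price_max=None):
    filters = [
        # Viewport bounding box: top-left = (NE lat, SW lon),
        # bottom-right = (SW lat, NE lon).
        {"geo_bounding_box": {"location": {
            "top_left":     {"lat": ne_lat, "lon": sw_lon},
            "bottom_right": {"lat": sw_lat, "lon": ne_lon},
        }}},
        {"term": {"listing_status": "FOR_SALE"}},
    ]
    price = {}
    if price_min is not None:
        price["gte"] = price_min
    if price_max is not None:
        price["lte"] = price_max
    if price:
        filters.append({"range": {"list_price": price}})
    # Return only IDs; display data is hydrated from Redis in the second pass.
    return {"query": {"bool": {"filter": filters}},
            "_source": ["property_id"], "size": 500}
```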
Media & Virtual Tour Service
Handles photo upload, resizing, and delivery. MLS photos are ingested via the pipeline and uploaded to S3. A Lambda-triggered processing pipeline generates three sizes: thumbnail (200px), gallery (800px), and full (1600px). Virtual tour integration embeds third-party 3D tour players (Matterport). A CDN (CloudFront) serves all images globally with edge caching. Photo ordering, deletion, and compliance (MLS copyright rules on photo use) are managed by the Media Service.
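The resize-target computation the pipeline would run per photo can be sketched in a few lines. The three sizes come from the text; the longest-edge convention and the no-upscaling rule are assumptions:

```python
# Sketch: compute output dimensions for the three photo renditions,
# constraining the longest edge and preserving aspect ratio.
SIZES = {"thumb": 200, "gallery": 800, "full": 1600}

def resize_targets(width, height):
    targets = {}
    for name, max_edge in SIZES.items():
        scale = min(1.0, max_edge / max(width, height))  # never upscale
        targets[name] = (round(width * scale), round(height * scale))
    return targets
```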
Database Design
Property records in PostgreSQL: property_id UUID, address, city, state, zip, geo POINT, property_type, beds, baths, sqft, lot_size, year_built, listing_status ENUM, list_price, avm_value, avm_updated_at, mls_id, last_updated_at. A price_history table records every price change: (property_id, price, changed_at, change_type). A listing_photos table maps property to ordered photo S3 keys.
Elasticsearch documents mirror the PostgreSQL fields plus computed fields for search: price_per_sqft, days_on_market, price_reduction_pct. Saved searches are stored in PostgreSQL with the search criteria as JSONB; a scheduled job re-runs each saved search against Elasticsearch daily and sends alerts for new matches. Agent profiles and reviews use a separate schema; review data feeds a rating aggregation job that updates agent summary scores nightly.
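The nightly saved-search job can be sketched as follows. `run_search`, `send_alert`, and the stored row shape are assumptions for illustration; the core logic is alerting only on listings not previously seen for that search:

```python
# Sketch: re-run one saved search and alert on newly matching listings.
def process_saved_search(saved, run_search, send_alert):
    """saved: dict with 'criteria', 'user_id', and 'seen_ids'
    (property IDs this search has already alerted on)."""
    hits = run_search(saved["criteria"])          # list of matching property IDs
    new_ids = [pid for pid in hits if pid not in saved["seen_ids"]]
    if new_ids:
        send_alert(saved["user_id"], new_ids)
        saved["seen_ids"].update(new_ids)         # persist in the real job
    return new_ids
```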
API Design
GET /api/v1/properties/search?bounds={ne_lat,ne_lon,sw_lat,sw_lon}&price_min=&price_max=&beds_min=&type= — map-viewport search; returns property summaries with geo coordinates.
GET /api/v1/properties/{propertyId} — full property detail including photos, price history, AVM, and school data.
POST /api/v1/saved-searches — body: {criteria JSONB, alert_frequency}; registers a saved search with email/push alerts.
GET /api/v1/properties/{propertyId}/estimate — returns current AVM with confidence interval and comparable sales.
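A client-side sketch of how the search endpoint's `bounds` parameter would be assembled (the base host is a placeholder; no request is actually sent):

```python
# Sketch: build the viewport-search URL from the API contract above.
from urllib.parse import urlencode

def search_url(base, ne_lat, ne_lon, sw_lat, sw_lon, **filters):
    params = {"bounds": f"{ne_lat},{ne_lon},{sw_lat},{sw_lon}", **filters}
    return f"{base}/api/v1/properties/search?{urlencode(params)}"
```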
Scaling & Bottlenecks
Elasticsearch geo-queries are the main latency bottleneck. Map drag events fire a new search on every mouse-up, generating high query rates from active users. Query result caching with a short TTL (30 seconds) for identical viewport+filter combinations reduces Elasticsearch load during peak browsing. A dedicated Elasticsearch coordinating node tier (no data, just query routing and aggregation) prevents data node saturation during complex aggregation queries.
Photo delivery is handled entirely by CDN — S3 is never hit for reads after initial CDN warm-up. Cache-Control headers with long max-age (1 year) for immutable photo URLs eliminate cache churn. The main CDN cost concern is invalidation during compliance-driven photo removal (MLS listings that expire must have photos removed within SLA). A bulk invalidation batch job handles this within 4 hours of listing expiry.
Key Trade-offs
- AVM freshness vs. compute cost: Running AVM on every page load would provide always-current values but is too expensive; caching AVM with daily refresh balances freshness and cost, though values can lag recent comparable sales.
- MLS data freshness vs. normalization complexity: Real-time MLS feeds give up-to-the-minute listings but require real-time normalization of hundreds of schema variants; daily batch ingestion is simpler but misses same-day listing changes.
- Elasticsearch vs. PostGIS for geo-search: Elasticsearch handles text + geo in a single index and scales horizontally, but is operationally complex; PostGIS on PostgreSQL is simpler but requires careful indexing to hit 500ms query SLAs at 160M records.
- AVM confidence display: Showing a single number creates a false sense of precision; showing a range ("$450k–$490k") is more honest but users often anchor on the high end; this is as much a UX problem as a systems problem.