Vector Database Interview Questions for Senior Engineers (2026)
15 advanced vector database interview questions with detailed answer frameworks covering similarity search algorithms, HNSW, IVF, product quantization, embedding storage, hybrid search, and practical comparisons of Pinecone, Weaviate, Qdrant, and pgvector.
Why Vector Databases Matter in Senior Engineering Interviews
Vector databases have become foundational infrastructure for modern AI applications. Every RAG system, recommendation engine, image search platform, and anomaly detection pipeline relies on efficient vector similarity search. As companies move from AI prototypes to production systems, the ability to design, operate, and optimize vector storage and retrieval has become a critical senior engineering skill.
Interviewers asking vector database questions want to assess whether you understand the algorithmic foundations of approximate nearest neighbor search, the trade-offs between different indexing strategies, and the practical considerations of operating vector infrastructure at scale. A strong candidate can explain why HNSW works, when to use product quantization, how to benchmark vector search quality, and how to select the right vector database for a specific use case.
At companies building AI-native products, from startups to FAANG, vector database design decisions directly impact product quality, search latency, and infrastructure cost. For comprehensive preparation, see our system design interview guide and explore the AI infrastructure learning path.
1. What is a vector database and how does it differ from a traditional database?
What the interviewer is really asking: Do you understand the fundamental purpose of vector databases and can you articulate why traditional databases are insufficient for similarity search?
Answer framework:
A vector database is a specialized data store optimized for storing, indexing, and querying high-dimensional vectors (embeddings). The fundamental operation is similarity search: given a query vector, find the k most similar vectors in the database. This is fundamentally different from traditional databases that operate on exact matches, range queries, and joins.
Traditional databases index data using B-trees, hash indexes, or inverted indexes. These structures excel at exact lookups (WHERE id = 123) and range queries (WHERE price BETWEEN 10 AND 50) but cannot efficiently answer "find the 10 most similar items to this one." You could compute cosine similarity against every vector in the database, but this is O(n) per query and impractical at scale.
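To make the cost concrete, here is a minimal brute-force k-NN sketch in Python (NumPy only; the array shapes and variable names are illustrative). It computes cosine similarity against every stored vector, which is exactly the O(n)-per-query scan that ANN indexes are designed to avoid.

```python
import numpy as np

def brute_force_top_k(query: np.ndarray, corpus: np.ndarray, k: int = 10):
    """Exact cosine top-k: O(n * d) work per query, impractical at large n."""
    # Normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                                  # one pass over every stored vector
    top = np.argpartition(-scores, k)[:k]           # k best, unsorted
    idx = top[np.argsort(-scores[top])]             # sort the k best by similarity
    return idx, scores[idx]
```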
Vector databases solve this with Approximate Nearest Neighbor (ANN) indexes that trade a small amount of accuracy for dramatic speed improvements. Instead of scanning all vectors, ANN algorithms organize vectors into structures that allow pruning most of the search space. Common ANN index types include HNSW (Hierarchical Navigable Small World), IVF (Inverted File Index), and tree-based methods.
Key capabilities that distinguish vector databases: purpose-built similarity search with millisecond-level latency even at billion-vector scale, support for multiple distance metrics (cosine, Euclidean, dot product), metadata filtering combined with vector search, real-time index updates without full rebuilds, and horizontal scaling across multiple nodes.
Modern vector databases also support hybrid search (combining vector similarity with keyword matching), multi-tenancy, and integration with AI pipelines. The landscape includes purpose-built databases like Pinecone, Weaviate, and Qdrant, vector extensions for existing databases like pgvector for PostgreSQL, and vector search libraries like FAISS and Annoy.
Common mistakes: describing vector databases as just "databases that store vectors" without explaining the indexing and search differences, not mentioning the approximate nature of the search.
2. Explain the HNSW algorithm. How does it work and what are its trade-offs?
What the interviewer is really asking: Do you understand the most widely used ANN algorithm at a technical level, not just as a black box?
Answer framework:
HNSW (Hierarchical Navigable Small World) is a graph-based ANN algorithm that builds a multi-layer navigable graph for efficient approximate nearest neighbor search. It is the default index type in most vector databases because it offers an excellent balance of search quality, speed, and dynamic updateability.
The construction works as follows. Imagine a skip-list but for graph navigation. The algorithm builds multiple layers of graphs. Layer 0 (the bottom) contains all vectors. Each higher layer contains a random subset of vectors from the layer below, with exponentially fewer vectors per layer. For each vector, the algorithm connects it to its M nearest neighbors in each layer where it appears.
When inserting a new vector: (1) randomly assign it a maximum layer using an exponential distribution, (2) starting from the top layer, greedily navigate to the closest vector, (3) descend layer by layer, using the closest vector found as the entry point for the next layer, (4) at each layer where the new vector appears, collect ef_construction candidate neighbors and connect it to the best M of them.
Search follows the same top-down traversal: start at the top layer's entry point, greedily find the nearest neighbor at each layer, use it as the entry point for the next layer, and at the bottom layer do a more thorough search with a beam width of ef_search candidates.
Key parameters and their trade-offs:
M (max connections per node): Higher M means more connections, better recall, but more memory and slower insertion. Typical values: 16 to 64. Each doubling of M roughly doubles the graph's memory footprint (the vector storage itself is unchanged).
ef_construction (beam width during index building): Higher values build a better graph at the cost of slower construction. Typical values: 100 to 400. Only affects build time, not search time.
ef_search (beam width during query): Higher values increase recall at the cost of higher latency. This is the primary tuning knob for the recall-latency trade-off. Typical values: 50 to 500.
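The three parameters map directly onto the API of HNSW implementations. Below is a minimal sketch using the hnswlib library; the dimensionality, dataset size, and parameter values are illustrative starting points, not tuned recommendations.

```python
import numpy as np
import hnswlib

dim, n = 768, 100_000
vectors = np.random.rand(n, dim).astype(np.float32)   # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
# M and ef_construction are fixed at build time and shape graph quality and memory.
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(vectors, np.arange(n))

# ef_search is the runtime recall/latency knob; raise it until recall@k is acceptable.
index.set_ef(100)
labels, distances = index.knn_query(vectors[:5], k=10)
```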
Trade-offs of HNSW:
- Strengths: excellent recall at low latency (99 percent recall with sub-millisecond search on millions of vectors), supports dynamic inserts and deletes without full rebuild, no training phase required.
- Weaknesses: high memory usage because the full graph must reside in memory. For 10 million 1536-dim float32 vectors, HNSW requires approximately 60 GB for vectors plus 5 to 15 GB for the graph structure. Not ideal for extremely high-dimensional vectors (above 2000 dimensions) without dimensionality reduction.
Common mistakes: not understanding the layered structure, confusing ef_construction and ef_search, not recognizing the memory implications.
3. Explain the IVF (Inverted File Index) algorithm and when you would choose it over HNSW.
What the interviewer is really asking: Can you reason about multiple ANN strategies and articulate when partitioning-based approaches are appropriate?
Answer framework:
IVF is a partitioning-based ANN algorithm that divides the vector space into clusters (Voronoi cells) and limits the search to only the most promising clusters. It consists of two phases.
Training phase: run k-means clustering on a sample of vectors to create nlist centroids (typically sqrt(N) to 4·sqrt(N) centroids for N vectors). Each centroid defines a Voronoi cell. Assign every vector to its nearest centroid.
Search phase: given a query vector, compute distances to all centroids, select the nprobe closest centroids, and search exhaustively within those clusters. The trade-off is controlled by nprobe: higher values search more clusters giving better recall but slower search.
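The two phases correspond to an explicit train/search split in libraries such as FAISS. Here is a minimal sketch; nlist and nprobe are illustrative values, and the random data stands in for real embeddings.

```python
import numpy as np
import faiss

d, n = 768, 1_000_000
xb = np.random.rand(n, d).astype(np.float32)     # stand-in for real embeddings

nlist = 4096                                     # number of Voronoi cells (k-means centroids)
quantizer = faiss.IndexFlatL2(d)                 # used to assign vectors to centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)                                  # k-means training phase, required before add
index.add(xb)

index.nprobe = 16                                # clusters scanned per query: the recall/latency knob
distances, ids = index.search(xb[:5], 10)
```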
When to choose IVF over HNSW:
- Memory constraints. IVF can work with disk-backed storage because you only load nprobe clusters into memory per query. HNSW requires the full graph in memory. For billion-scale datasets that do not fit in RAM, IVF with disk-backed clusters is practical.
- Very large datasets with batch queries. IVF can be combined with GPU acceleration (FAISS GPU) for massive batch search throughput. GPU-accelerated IVF can search billions of vectors faster than HNSW on CPU.
- Combined with quantization. IVF pairs naturally with Product Quantization (IVF-PQ) to dramatically reduce memory usage. HNSW can also use quantization but IVF-PQ is a more established combination.
- Datasets with natural cluster structure. When vectors form well-separated clusters (different product categories, document types), IVF's cluster-based pruning is very effective.
When HNSW is better: dynamic workloads with frequent inserts and deletes (IVF requires periodic re-clustering), when you need consistently high recall above 95 percent, and when memory is not the primary constraint.
Common mistakes: not understanding that IVF requires a training phase, using too few or too many clusters, not recognizing that nprobe is the key recall-latency knob.
4. What is Product Quantization and how does it reduce memory usage for vector search?
What the interviewer is really asking: Do you understand the compression technique that makes billion-scale vector search practical, and can you explain the accuracy trade-off?
Answer framework:
Product Quantization (PQ) is a vector compression technique that reduces memory usage by 10 to 100x while maintaining reasonable search quality. It works by decomposing each vector into sub-vectors and quantizing each sub-vector independently.
The algorithm has three steps. First, split each d-dimensional vector into m sub-vectors of d/m dimensions each. For a 1536-dimensional vector with m=192, each sub-vector has 8 dimensions.
Second, for each sub-vector position, run k-means clustering on the training data to learn k* centroids (typically k*=256 so each cluster ID fits in one byte). This creates m codebooks, each with 256 entries.
Third, encode each vector by replacing each sub-vector with the index of its nearest centroid. A 1536-dimensional float32 vector (6,144 bytes) becomes 192 centroid IDs of 1 byte each (192 bytes), which is a 32x compression.
For search, use Asymmetric Distance Computation (ADC): compute the exact distance between the query sub-vectors and all centroids in each codebook (m * k* distance computations, precomputed once per query), then approximate the query-to-vector distance by summing the precomputed sub-vector distances. This enables fast distance computation directly on compressed vectors.
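FAISS exposes PQ as a standalone index type, which makes the codebook training and compressed storage visible. A minimal sketch is below; the dataset is random stand-in data and the (d, m, nbits) values mirror the 1536-dim example above.

```python
import numpy as np
import faiss

d, m, nbits = 1536, 192, 8                   # 192 sub-vectors, 2^8 = 256 centroids each -> 192 bytes/vector
xb = np.random.rand(100_000, d).astype(np.float32)

index = faiss.IndexPQ(d, m, nbits)
index.train(xb)                              # learns the m codebooks via k-means
index.add(xb)                                # stores only the 1-byte codes, not the raw floats

distances, ids = index.search(xb[:5], 10)    # ADC: distances computed against the codes
```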
The accuracy trade-off: PQ introduces quantization error because each sub-vector is approximated by its nearest centroid. The error is controlled by m (more sub-quantizers means less compression but better accuracy) and the training data quality. For 1536-dimensional vectors, m=96 gives good compression with acceptable accuracy loss, m=192 gives less compression but nearly lossless quality.
OPQ (Optimized Product Quantization) applies a rotation matrix before splitting into sub-vectors, ensuring each sub-vector captures balanced variance. This consistently improves accuracy over standard PQ by 5 to 10 percent.
Common mistakes: not understanding the sub-vector decomposition, confusing PQ with scalar quantization, not mentioning that PQ requires a training phase on representative data.
5. Compare Pinecone, Weaviate, Qdrant, and pgvector. When would you choose each?
What the interviewer is really asking: Can you make informed infrastructure decisions based on practical trade-offs rather than just hype or familiarity?
Answer framework:
Pinecone is a fully managed, serverless vector database. Choose Pinecone when you want zero operational overhead, need to ship quickly, and are willing to pay a premium for convenience. Strengths: easiest to get started, automatic scaling, built-in hybrid search with sparse-dense vectors, excellent uptime SLA. Weaknesses: vendor lock-in (no self-hosted option), limited query flexibility, can be expensive at scale (costs scale with pod usage and storage), opaque internals make debugging difficult.
Weaviate is an open-source vector database with optional managed cloud. Choose Weaviate when you need rich hybrid search (BM25 plus vector), multi-modal support (text, image, audio), or built-in ML model integration (vectorize data within Weaviate). Strengths: excellent hybrid search, GraphQL-like query API, multi-tenancy support, generative search modules that integrate LLM calls directly. Weaknesses: higher operational complexity for self-hosting, can be memory-intensive, schema management adds friction.
Qdrant is an open-source vector database written in Rust, focused on performance. Choose Qdrant when you need high throughput, advanced filtering, and production-grade performance with full control. Strengths: excellent performance benchmarks, rich filtering capabilities (nested filters, geo-filters), payload storage and indexing, memory-efficient quantization options (scalar and product quantization built-in), strong multi-tenancy via collections. Weaknesses: smaller ecosystem than Weaviate, less built-in ML integration.
pgvector is a PostgreSQL extension. Choose pgvector when your dataset is under 10 million vectors, you already use PostgreSQL, and you want to avoid adding a new database to your stack. Strengths: uses existing PostgreSQL infrastructure, ACID transactions, joins with relational data, familiar SQL interface, zero additional operational burden. Weaknesses: performance degrades significantly above 5 to 10 million vectors, limited ANN algorithm options (HNSW and IVFFlat), no built-in sharding for horizontal scaling, single-node constraint.
Decision framework:
- Prototype or small scale (under 1M vectors): pgvector or Pinecone serverless
- Production with managed infrastructure preference: Pinecone or Weaviate Cloud
- Production with self-hosting and performance focus: Qdrant
- Need hybrid search and multi-modal: Weaviate
- Existing PostgreSQL stack with moderate scale: pgvector
- Cost-sensitive with large scale: Qdrant self-hosted or Weaviate self-hosted
For a deeper comparison of these options, see our vector database comparison guide. Consider also Milvus for very large scale deployments.
Common mistakes: choosing based on hype rather than requirements, not benchmarking on your actual workload, ignoring operational costs of self-hosting.
6. What distance metrics are used for vector similarity search, and when do you use each?
What the interviewer is really asking: Do you understand the mathematical foundations of similarity search and can you select the right metric for different use cases?
Answer framework:
The three primary distance metrics are cosine similarity, Euclidean distance (L2), and dot product (inner product). The choice affects both search quality and performance.
Cosine similarity measures the angle between two vectors, ignoring magnitude. Values range from -1 (opposite) to 1 (identical direction). It is the default choice for text embeddings because most embedding models normalize their outputs to unit vectors, and cosine similarity captures semantic similarity regardless of document length. When vectors are already normalized, cosine similarity and dot product are equivalent.
Euclidean distance (L2) measures the straight-line distance between two points in the vector space. Smaller values indicate greater similarity. Use L2 when the magnitude of vectors carries meaning. For example, in image similarity search where pixel intensity matters, or in recommendation systems where the vector magnitude encodes confidence.
Dot product (inner product) combines both direction and magnitude. Use it when magnitude is meaningful and you want to reward vectors that are both similar in direction and large in magnitude. Many embedding models are trained with dot product loss (like the original word2vec), so dot product matches the training objective.
Practical considerations: if your embedding model outputs normalized vectors (most modern text embedding models do), use cosine similarity or dot product interchangeably. Some vector databases store cosine as 1 minus cosine_similarity to convert it to a distance metric where lower is better. Check your embedding model's documentation for the recommended metric, as using the wrong metric can degrade recall by 5 to 20 percent.
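The relationships between the metrics are easy to verify numerically. The short NumPy sketch below shows all three computations and the fact that cosine similarity equals the dot product once vectors are unit-normalized; the random vectors are placeholders.

```python
import numpy as np

a = np.random.rand(1536).astype(np.float32)
b = np.random.rand(1536).astype(np.float32)

l2  = float(np.linalg.norm(a - b))                                    # Euclidean distance: lower = more similar
dot = float(a @ b)                                                    # inner product: higher = more similar
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))          # cosine similarity in [-1, 1]

# After unit normalization, the dot product and cosine similarity coincide.
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert abs(cos - float(a_n @ b_n)) < 1e-6
```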
For specialized use cases: Manhattan distance (L1) is more robust to outliers in high-dimensional spaces, and Hamming distance is used for binary vectors (binary quantization).
Common mistakes: using Euclidean distance with non-normalized text embeddings, not checking the embedding model's recommended metric, confusing distance (lower is more similar) with similarity (higher is more similar).
7. How do you benchmark and evaluate vector search quality? What is recall@k and how do you measure it?
What the interviewer is really asking: Can you rigorously measure search quality rather than relying on intuition, and do you understand the accuracy-performance trade-off?
Answer framework:
Recall@k is the primary metric for evaluating ANN search quality. It measures what fraction of the true k nearest neighbors are returned by the approximate search. If the exact brute-force search returns vectors {A, B, C, D, E} as the top 5, and your ANN returns {A, B, C, F, E}, recall@5 is 4/5 = 0.80.
To measure recall@k, you need a ground-truth dataset: compute exact nearest neighbors using brute-force search on a representative sample of queries (1,000 to 10,000 queries), then compare ANN results against this ground truth.
Plot the recall-vs-latency curve by varying the search parameters (ef_search for HNSW, nprobe for IVF). This curve is the most informative artifact for vector search evaluation. The goal is to find the operating point that gives your required recall (typically above 95 percent) with acceptable latency.
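Computing recall@k itself is straightforward once you have the ground truth. A minimal sketch, assuming you have already produced exact neighbors by brute force for a sample of queries and both result sets are arrays of neighbor IDs:

```python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray, k: int) -> float:
    """Fraction of the true top-k neighbors that the ANN search returned.

    approx_ids, exact_ids: (num_queries, k) arrays of neighbor IDs per query.
    """
    hits = 0
    for approx, exact in zip(approx_ids[:, :k], exact_ids[:, :k]):
        hits += len(set(approx.tolist()) & set(exact.tolist()))
    return hits / (len(approx_ids) * k)
```

Sweep ef_search (or nprobe), record recall_at_k and latency at each setting, and plot the pairs to get the recall-vs-latency curve described above.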
Beyond recall@k, measure:
- QPS (queries per second): throughput at your target recall level
- p99 latency: tail latency matters for user-facing applications
- Memory usage: including both vectors and index structures
- Build time: how long it takes to construct the index
- Update latency: time to insert or delete a single vector
Use standard benchmarks like ANN-Benchmarks (ann-benchmarks.com) to compare algorithms and implementations on standardized datasets. But always benchmark on your own data because performance varies significantly with vector dimensionality, distribution, and dataset size.
Common mistakes: benchmarking only latency without measuring recall, using unrealistically small datasets, not measuring p99 latency which can be 10x worse than median.
8. How would you design a vector search system that scales to billions of vectors?
What the interviewer is really asking: Can you architect a distributed vector search system that handles scale beyond a single machine?
Answer framework:
Billion-scale vector search requires distribution across multiple nodes because a single machine cannot hold the data in memory. For 1 billion 1536-dimensional float32 vectors, raw storage alone is 6 TB, plus index overhead.
The architecture has three components: a query router, index shards, and a metadata store.
Sharding strategy. Partition vectors across N shards. Two approaches: random sharding (distribute vectors randomly across shards, query all shards in parallel, merge results) and cluster-based sharding (cluster vectors and assign each cluster to a shard, route queries to relevant shards only). Random sharding is simpler and gives balanced load. Cluster-based sharding reduces query fan-out but creates hot spots for popular clusters.
For random sharding, every query must fan out to all shards. Each shard returns its local top-k, and the router merges results. The merge is a simple top-k selection across N × k candidates. This is embarrassingly parallel and scales linearly with shard count.
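A scatter-gather router is only a few lines; the sketch below shows the fan-out and merge. The shard_clients objects and their search method are hypothetical stand-ins for whatever RPC layer fronts each shard, and each shard is assumed to return (score, vector_id) pairs for its local top-k.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def distributed_search(shard_clients, query_vector, k: int):
    """Fan out the query to every shard, then merge per-shard top-k into a global top-k."""
    with ThreadPoolExecutor(max_workers=len(shard_clients)) as pool:
        # Hypothetical shard API: returns [(score, vector_id), ...] for its local top-k.
        per_shard = list(pool.map(lambda c: c.search(query_vector, k), shard_clients))
    # Global top-k over N * k candidates: cheap compared to the per-shard ANN search itself.
    return heapq.nlargest(k, (hit for hits in per_shard for hit in hits))
```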
Replication. Each shard has R replicas (typically 2 to 3) for fault tolerance and read scaling. Route queries to the least-loaded replica. Use a consensus protocol for write coordination to ensure all replicas have the same vectors.
Compression. At billion scale, compression is mandatory. Use Product Quantization to reduce memory per vector from 6 KB (1536 * 4 bytes) to 192 bytes (32x compression). This makes each shard feasible to hold in memory. Alternatively, use SQ8 (scalar quantization to 8-bit) for a simpler 4x compression. Some systems use a two-level approach: compressed vectors in memory for fast approximate search, full-precision vectors on disk for re-scoring the top candidates.
Tiered storage. Keep hot data (recent vectors, frequently accessed) in memory with HNSW indexes. Move cold data to disk-based IVF indexes. Use an access-frequency tracker to promote and demote vectors between tiers.
Query optimization. Pre-filter by metadata before vector search to reduce the candidate set. If a query includes a filter like category="electronics," only search the shards or segments containing electronics vectors. This requires metadata-aware partitioning.
For practical implementation, Milvus and Qdrant support distributed deployments natively. FAISS provides the algorithmic primitives but requires building the distributed layer yourself.
Common mistakes: not accounting for index memory overhead on top of vector storage, ignoring the merge step latency in distributed queries, not implementing replication for fault tolerance.
9. What is hybrid search in vector databases and how is it implemented?
What the interviewer is really asking: Do you understand how to combine vector similarity with traditional keyword search, and why this combination is important for RAG systems?
Answer framework:
Hybrid search combines dense vector retrieval (semantic similarity) with sparse retrieval (keyword matching) in a single query. This is important because neither approach alone covers all query types. A query for "Python asyncio event loop" benefits from exact keyword matching, while "how to handle concurrent tasks in Python" benefits from semantic understanding.
Implementation approaches vary by database:
Weaviate implements hybrid search natively with a single API call. It runs BM25 and vector search in parallel and fuses results using a configurable alpha parameter (0 = pure keyword, 1 = pure vector).
Qdrant supports hybrid search by combining dense vector search with sparse vectors (SPLADE or BM25 encoded as sparse vectors). You store both dense and sparse vectors per document and query both simultaneously.
Pinecone supports sparse-dense vectors natively. Each vector can have both a dense embedding and a sparse representation (token IDs and weights from BM25 or SPLADE).
pgvector + PostgreSQL achieves hybrid search by combining pgvector's ANN search with PostgreSQL's full-text search (tsvector/tsquery) using a combined scoring function.
Fusion strategies: linear combination (alpha * vector_score + (1 - alpha) * keyword_score) requires score normalization since vector and keyword scores have different ranges. Reciprocal Rank Fusion (RRF) uses rank positions rather than raw scores, avoiding the normalization problem. Learned fusion uses a small model trained on relevance labels to combine scores optimally.
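Reciprocal Rank Fusion is simple enough to sketch directly; it needs only the ranked ID lists from each retriever, not their scores. The constant k=60 is the value commonly used in the RRF literature, and the document IDs below are placeholders.

```python
def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Fuse ranked ID lists (e.g., vector results and BM25 results) by rank, not raw score."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a dense-vector ranking with a keyword ranking.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
```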
The alpha parameter should be tuned on your evaluation dataset. Start with 0.7 favoring vector search and adjust based on query analysis. Some systems use query-dependent routing: if the query contains specific identifiers or error codes, increase keyword weight; if it is a natural language question, increase vector weight.
Common mistakes: not normalizing scores before linear combination, using a fixed alpha for all query types, implementing hybrid search at the application level rather than using database-native support.
10. How do you handle filtering in vector search efficiently?
What the interviewer is really asking: Do you understand the tension between metadata filtering and vector search performance, and the different strategies to resolve it?
Answer framework:
Filtering is the combination of vector similarity search with metadata predicates like "find the most similar documents WHERE category = 'engineering' AND date > '2025-01-01'." This seems simple but creates a fundamental tension: should you filter before search (pre-filtering), after search (post-filtering), or during search (integrated filtering)?
Post-filtering is the simplest approach: run vector search for top-K candidates (K >> k), then apply metadata filters, return the top k that pass. The problem: if the filter is very selective (only 1 percent of vectors match), you need K = 100 * k candidates to expect k results, which is slow and wasteful. Worse, you might not get k results at all.
Pre-filtering first identifies all vectors matching the metadata filter using a traditional index, then runs vector search only on those vectors. This guarantees exact filter compliance but may require building separate ANN indexes per filter value, which is impractical for high-cardinality filters.
Integrated filtering is the best approach and what modern vector databases implement. During graph traversal (HNSW) or cluster scanning (IVF), check metadata predicates as candidates are evaluated and skip non-matching vectors. This prunes the search space without requiring separate indexes. Qdrant, Weaviate, and Pinecone all support integrated filtering.
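As an example of integrated filtering, here is a sketch using the qdrant-client Python package. The collection name, field names, and query vector are illustrative, and newer client versions expose query_points for the same operation.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# The filter is evaluated during graph traversal, not applied to results afterwards.
hits = client.search(
    collection_name="documents",                 # illustrative collection name
    query_vector=[0.1] * 1536,                   # stand-in for a real query embedding
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="category", match=models.MatchValue(value="engineering")),
            models.FieldCondition(key="year", range=models.Range(gte=2025)),  # illustrative date field
        ]
    ),
    limit=10,
)
```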
Performance considerations: filtering degrades vector search quality because it effectively removes vectors from the search graph, creating gaps in the navigation structure. When the filter is very selective (matches less than 1 percent of vectors), search quality can drop significantly. Mitigation strategies include building partitioned indexes per high-cardinality filter dimension, increasing the search beam width when filters are applied, and using pre-filtering for very selective filters.
For multi-tenant applications, tenant_id is the most common filter. If tenants have very different data sizes, consider separate collections per tenant rather than a single collection with tenant_id filters. See our multi-tenant architecture guide for detailed patterns.
Common mistakes: using post-filtering and being surprised when results are fewer than requested, not indexing metadata fields used in filters, not increasing search parameters to compensate for filter selectivity.
11. What is scalar quantization and binary quantization? How do they compare to Product Quantization?
What the interviewer is really asking: Do you understand the full spectrum of quantization techniques and their trade-offs in precision, memory, and speed?
Answer framework:
Scalar Quantization (SQ) compresses each vector dimension independently. SQ8 maps each float32 value to an 8-bit integer (int8), reducing memory by 4x. The mapping uses the min and max values of each dimension as the quantization range. SQ4 uses 4-bit integers for 8x compression but with more accuracy loss.
SQ is simpler than PQ: no training required, no sub-vector decomposition, and the error is bounded by the quantization step size. Recall loss is typically 1 to 3 percent for SQ8 compared to full precision.
Binary Quantization (BQ) is the most aggressive compression: each dimension becomes a single bit (1 if positive, 0 if negative). A 1536-dim float32 vector (6,144 bytes) becomes 1536 bits (192 bytes), a 32x compression. Distance computation uses Hamming distance which is extremely fast using CPU popcount instructions.
BQ works surprisingly well when vectors are high-dimensional (above 768 dimensions) and the embedding model distributes information uniformly across dimensions. For OpenAI embeddings (1536-dim), BQ achieves 95 percent or higher recall when combined with rescoring: use binary search to find the top 100 candidates quickly, then rescore those 100 with full-precision vectors loaded from disk.
Comparison:
| Method | Compression | Recall Impact | Speed | Training Required |
|---|---|---|---|---|
| SQ8 | 4x | 1-3% loss | Moderate | No |
| SQ4 | 8x | 3-8% loss | Fast | No |
| BQ | 32x | 5-15% loss (without rescore) | Very fast | No |
| PQ (m=96) | 64x (1536-dim) | 2-5% loss | Moderate | Yes |
| PQ (m=192) | 32x (1536-dim) | 1-2% loss | Slower | Yes |
In practice, the best approach often combines quantization with rescoring: use aggressive quantization (BQ or PQ with high compression) for the initial search over the full dataset, then rescore the top 100 to 500 candidates using full-precision vectors stored on SSD. This achieves the memory savings of compression with the accuracy of full precision.
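A pure-NumPy sketch of the binary-quantization-plus-rescoring pattern is below. The corpus and query names, the shortlist size, and the use of dot-product rescoring are illustrative assumptions; real systems keep the full-precision vectors on SSD rather than in memory.

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """1 bit per dimension: the sign of each value, packed into bytes."""
    return np.packbits(vectors > 0, axis=1)

def search_with_rescore(query, corpus, corpus_codes, k=10, shortlist=200):
    # Stage 1: cheap Hamming-distance search over the binary codes.
    q_code = binarize(query[None, :])
    hamming = np.unpackbits(corpus_codes ^ q_code, axis=1).sum(axis=1)
    candidates = np.argsort(hamming)[:shortlist]
    # Stage 2: rescore the shortlist with full-precision dot products.
    scores = corpus[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

# Usage: precompute corpus_codes = binarize(corpus) once at indexing time.
```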
Qdrant supports all three quantization types natively and allows combining them with HNSW for a good balance of speed and accuracy.
Common mistakes: not considering rescoring with full-precision vectors after quantized search, applying binary quantization to low-dimensional embeddings where it degrades significantly, treating quantization as lossless.
12. How do you handle vector index updates in a production system? What happens when you need to add, update, or delete vectors?
What the interviewer is really asking: Do you understand the operational reality of maintaining a vector index that is not static but constantly changing?
Answer framework:
Vector index mutability varies significantly by algorithm and database. This operational characteristic often determines which system is appropriate for your workload.
HNSW handles mutations well. Insertions add a new node to the graph and create connections to existing nodes. This is efficient and does not require rebuilding the index. Deletions are more complex: most implementations use soft deletes (marking nodes as deleted and skipping them during search) because removing a node from the graph could disconnect portions of the graph. Periodically, the index is compacted to physically remove deleted nodes and rebuild affected connections.
IVF handles insertions but degrades over time. New vectors are assigned to the nearest centroid and added to that cluster. Over time, if the data distribution shifts, the centroids become stale and search quality degrades. Periodic re-training of the centroids (re-clustering) is needed, which requires a full rebuild.
For production systems, a common write-ahead pattern is to buffer incoming vectors in a small, brute-force-searchable segment and merge that buffer into the main index periodically; queries search both the buffer and the index, so new vectors are visible immediately without waiting for graph insertion or re-clustering.
For updates (changing a vector's value), the typical pattern is delete-then-insert: soft-delete the old vector and insert the new one. There is no in-place update because the vector's position in the index depends on its value.
For index rebuilds, use blue-green deployment: build the new index in the background, validate it against the evaluation set, then atomically swap it with the old index. This avoids downtime during rebuilds.
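The blue-green flow can be expressed as a small orchestration step. In the sketch below, build_new_index, evaluate_recall, and swap_alias are hypothetical stand-ins for your indexing job, your offline evaluation against the ground-truth query set, and your alias or traffic switch; the recall threshold is an illustrative value.

```python
def blue_green_rebuild(build_new_index, evaluate_recall, swap_alias, min_recall=0.95):
    """Sketch of a zero-downtime rebuild using hypothetical callables."""
    new_index = build_new_index()              # build in the background ("green")
    recall = evaluate_recall(new_index)        # validate against the evaluation set
    if recall < min_recall:
        raise RuntimeError(f"new index recall {recall:.3f} below threshold, keeping old index")
    swap_alias(new_index)                      # atomic cutover; keep the old ("blue") index for rollback
```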
Monitor index health metrics: recall degradation over time (compare against ground truth periodically), soft-delete ratio (rebuild when it exceeds 10 to 20 percent), and query latency trends.
Common mistakes: not planning for index rebuilds, ignoring the soft-delete overhead, not implementing blue-green deployment for zero-downtime rebuilds.
13. How would you implement vector search with access control for a multi-user application?
What the interviewer is really asking: Can you combine vector search with fine-grained authorization without sacrificing performance or security?
Answer framework:
Access control in vector search means ensuring that users only see search results for documents they are authorized to access. This is critical for enterprise applications where different teams, departments, or roles have different document access.
The naive approach is post-filtering: search for top-K (with large K) and filter out unauthorized results. This has the same problems as any post-filtering approach: unpredictable result counts, wasted computation, and potential information leakage through timing side-channels (a query returning quickly reveals that the top results are accessible).
The recommended approach uses pre-computed access control lists (ACLs) stored as metadata: at ingestion time, attach the list of authorized user or group IDs to each document's vectors, and at query time apply an integrated filter requiring that the querying user's groups intersect the document's allowed groups.
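Using the same qdrant-client filter mechanism as in the filtering question, a minimal sketch looks like the following. The allowed_groups payload field, the group names, and the query vector are illustrative assumptions; the user's groups would come from your authentication system.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

user_groups = ["eng", "eng-leads"]             # resolved from your auth system (hypothetical values)

hits = client.search(
    collection_name="documents",
    query_vector=[0.1] * 1536,                 # stand-in for a real query embedding
    query_filter=models.Filter(
        must=[models.FieldCondition(key="allowed_groups", match=models.MatchAny(any=user_groups))]
    ),
    limit=10,
)
```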
For complex permission models (hierarchical roles, row-level security), pre-compute a materialized access list per document during ingestion. When permissions change, update the metadata for affected documents. This trades ingestion complexity for query simplicity and performance.
For large organizations with complex permission hierarchies, consider document-level encryption: encrypt each chunk with a key derived from its access group, and only decrypt for authorized users. This provides defense-in-depth even if the vector database is compromised. The vector embeddings themselves do not reveal document content, but the metadata and returned content do.
Performance consideration: ACL-based filtering is effectively a metadata filter, which means the vector database must support efficient integrated filtering. If a user has access to only 0.1 percent of documents, the filter is very selective and may require larger search beam widths as discussed in the filtering question above.
For the RAG use case specifically, ensure that the LLM never sees content from documents the user cannot access, even in the retrieval stage. Every retrieved chunk must pass the authorization check before being included in the LLM context.
Common mistakes: implementing ACLs only at the application layer without database-level enforcement, not updating ACL metadata when permissions change, leaking unauthorized content through the LLM context.
14. What are the key operational considerations for running a vector database in production?
What the interviewer is really asking: Have you actually operated vector databases in production and dealt with real-world challenges beyond the happy path?
Answer framework:
Production vector database operations require attention to several areas that tutorials and documentation do not cover.
Capacity planning. Calculate memory requirements: (vector_dimensions * bytes_per_value * num_vectors) + index_overhead. For HNSW, the graph overhead is typically 20 to 40 percent of vector storage. Account for replicas: 3 replicas triple your memory needs. Plan for growth: if your corpus grows 10 percent monthly, provision for 18 months ahead.
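A back-of-the-envelope sizing calculation makes the formula concrete. The workload numbers below are illustrative assumptions, not recommendations.

```python
# Illustrative sizing for an assumed workload.
num_vectors    = 100_000_000
dims           = 1536
bytes_per_dim  = 4            # float32; would be 1 with SQ8 quantization
replicas       = 3
index_overhead = 0.30         # HNSW graph, roughly 20-40% of vector storage

vector_bytes = num_vectors * dims * bytes_per_dim
total_bytes  = vector_bytes * (1 + index_overhead) * replicas
print(f"{total_bytes / 1e12:.1f} TB across all replicas")   # ~2.4 TB for these numbers
```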
Backup and disaster recovery. Vector indexes take hours to rebuild from scratch. Implement regular snapshots of both the vectors and the index structures. For Qdrant, use the snapshot API. For pgvector, standard PostgreSQL backup tools work. Test restore procedures regularly. Measure Recovery Time Objective (RTO): how long until the system is back online after a failure.
Monitoring and alerting. Track these metrics: query latency (p50, p95, p99), recall degradation over time (run periodic evaluation against ground truth), index size and memory usage, write throughput and indexing lag, error rates and timeout rates, cache hit rates if using result caching.
Schema evolution. When you change your embedding model, all existing vectors must be re-embedded because vectors from different models are not comparable. Plan for this by maintaining the ability to re-embed the entire corpus and using blue-green deployment for the cutover. Track the embedding model version in metadata.
Cost optimization. Vector databases are memory-intensive and expensive at scale. Use quantization (SQ8 or PQ) to reduce memory 4 to 32x. Tier storage: keep recent or frequently accessed vectors in fast HNSW, archive old vectors to disk-based storage. Consider serverless options (Pinecone serverless) for spiky workloads where you pay per query rather than for provisioned capacity.
Security. Encrypt vectors at rest and in transit. Implement network-level access controls. Audit query logs for unauthorized access patterns. Remember that while embeddings do not directly contain text, they can sometimes be reversed to approximate the original content using model inversion attacks.
For comprehensive production readiness, see our infrastructure design guide and operational excellence concepts.
Common mistakes: not planning for embedding model migration, not monitoring recall over time, underestimating memory requirements by forgetting index overhead and replicas.
15. Design a complete vector search pipeline for an e-commerce product search engine.
What the interviewer is really asking: Can you apply vector database concepts to a concrete end-to-end system design with real-world constraints?
Answer framework:
Requirements: support 50 million products, handle 10,000 queries per second, return results in under 50ms, support text search ("blue running shoes"), image search (upload a photo, find similar products), and filtered search (category, price range, brand, rating).
Phase 1: Embedding pipeline. Generate product embeddings from multiple modalities. For text: embed the concatenation of title, description, and key attributes using a fine-tuned E5 or BGE model. For images: use CLIP to generate image embeddings in the same space as text, enabling cross-modal search. Store both text and image embeddings per product.
Phase 2: Vector storage architecture. For 50 million products, use Qdrant or Weaviate with HNSW indexing and SQ8 quantization. Memory estimate with SQ8: 50M * 1024 * 1 byte = 50 GB for text vectors, plus 50M * 768 * 1 byte = 38 GB for image vectors, plus index overhead approximately 30 GB. Total approximately 120 GB, feasible on a single high-memory node with replication.
Create separate collections for text and image embeddings, or use named vectors in Qdrant to store both in a single collection. Index metadata fields (category, brand, price, rating) for efficient filtered search.
Phase 3: Query routing. Implement a query classifier that determines the search strategy based on the input: text queries use hybrid search (vector plus BM25), image uploads use image vector search, filtered queries apply metadata filters with vector search, and popular category browsing uses pre-computed results from cache.
Phase 4: Ranking and personalization. Combine vector similarity with business signals: relevance score from vector search, popularity (click-through rate, conversion rate), personalization (user's past purchase and browse history), availability (boost in-stock items), and margin (slight boost for higher-margin products). Use a lightweight learning-to-rank model that combines these features into a final score.
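A simple linear blend illustrates how the signals combine before a learned model replaces the hand-set weights. The weights, field names, and example payload below are illustrative assumptions.

```python
def final_score(hit, user_affinity, weights=None):
    """Linear blend of retrieval relevance and business signals (illustrative weights;
    in production these would come from a learning-to-rank model)."""
    w = weights or {"relevance": 0.6, "popularity": 0.2, "personal": 0.15, "in_stock": 0.05}
    return (
        w["relevance"] * hit["similarity"]               # normalized score from vector/hybrid search
        + w["popularity"] * hit["ctr"]                   # normalized click-through rate
        + w["personal"] * user_affinity                  # precomputed user-product affinity in [0, 1]
        + w["in_stock"] * (1.0 if hit["in_stock"] else 0.0)
    )

# Example usage with a hypothetical search hit:
score = final_score({"similarity": 0.82, "ctr": 0.4, "in_stock": True}, user_affinity=0.3)
```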
Phase 5: Serving infrastructure. Deploy behind a load balancer with read replicas for the vector index. Implement result caching with Redis for popular queries (cache hit rate for e-commerce search is typically 30 to 50 percent). Use CDN-cached results for category browsing pages.
Phase 6: Continuous improvement. Log search queries, clicks, and conversions. Use this data to fine-tune the embedding model, improve the ranking model, and identify queries with poor results. A/B test ranking changes against conversion rate metrics.
For cost estimation and infrastructure sizing, see our pricing guide and capacity planning concepts.
Common mistakes: using a single embedding for both text and image search, not implementing filtered search efficiently, ignoring business signals in ranking, not caching popular queries.
How to Practice Vector Database Interview Questions
Build hands-on experience by implementing vector search from scratch using FAISS before using managed databases. Understand the algorithms at a mathematical level so you can explain them on a whiteboard. Then build a production-like system using Qdrant or Weaviate with real data.
Run benchmarks on your own data: measure recall@k versus latency curves for HNSW with different parameters, compare IVF-PQ against HNSW, and test how filtering affects search quality. This practical experience is visible in interviews and differentiates you from candidates who only read documentation.
Study the ANN-Benchmarks results to understand how different algorithms perform on different datasets. Read engineering blogs from Pinecone, Weaviate, and Qdrant about their architectural decisions. Understand why certain design choices were made.
For a comprehensive preparation plan, explore our system design interview guide and related learning paths. Practice system design questions by sketching vector search architectures for different use cases: recommendation engines, document search, image similarity, and anomaly detection.
Common Mistakes to Avoid
- Not understanding the approximate nature of ANN search. Vector databases trade accuracy for speed. Always measure recall@k and understand your accuracy-latency operating point.
- Ignoring memory requirements. Vector databases are memory-hungry. Calculate total memory including vectors, index overhead, metadata, and replication before selecting instance sizes.
- Choosing a database based on hype. Benchmark on your actual workload. A database that wins on benchmarks may not be best for your specific data distribution, query patterns, and operational requirements.
- Using the wrong distance metric. Check your embedding model's documentation. Using Euclidean distance when the model was trained with cosine loss can degrade recall by 10 to 20 percent.
- Not planning for embedding model migration. When you upgrade your embedding model, you need to re-embed everything. Design your pipeline for this from day one.
- Neglecting filtered search performance. If your application requires metadata filtering, test filter performance explicitly. Highly selective filters can degrade ANN quality significantly.
- Treating vector search as a solved problem. The field is evolving rapidly. Stay current with new algorithms, quantization techniques, and database features.