
Implement a function to retrieve the most relevant document from a list of documents based on a query, using cosine similarity. Optimize for time and space complexity, and explain how your approach improves retrieval efficiency.

Interview

How to structure your answer

To solve this, first precompute document vectors with a TF-IDF or word-embedding model and store them in a single matrix. Represent the query as a vector using the same model, then compute cosine similarity between the query vector and every document vector via dot products over normalized vectors. Optimize by precomputing (and pre-normalizing) the document matrix once, so no per-query work is spent rebuilding document vectors, and use an efficient library like NumPy so the n similarity computations collapse into one matrix-vector product. Select the document with the highest similarity score. This minimizes redundant computation and leverages vectorized operations for speed: after a one-time O(n·d) indexing cost, each query needs only a single vectorized O(n·d) pass.
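The steps above can be sketched in NumPy. The toy document vectors below stand in for whatever a TF-IDF or embedding model would produce (the vectorizer itself is assumed, not shown); the key idea is that document norms are computed once at indexing time, so a query costs one dot product per document:

```python
import numpy as np

# Toy precomputed document vectors -- in practice these would come from
# a TF-IDF or embedding model (assumed here, not shown).
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.3, 0.9],
], dtype=float)

# Precompute L2 norms once at indexing time, so each query only needs
# a matrix-vector product and an element-wise division.
doc_norms = np.linalg.norm(doc_vectors, axis=1)

def most_relevant(query_vec: np.ndarray) -> int:
    """Return the index of the document with the highest cosine similarity."""
    sims = doc_vectors @ query_vec / (doc_norms * np.linalg.norm(query_vec))
    return int(np.argmax(sims))

query = np.array([0.1, 0.9, 0.0])
print(most_relevant(query))  # prints 1: document 1 is closest to this query
```

The division by both norms is what turns a raw dot product into cosine similarity; skipping it would bias retrieval toward longer documents.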

Sample answer

The solution involves three key steps: (1) precompute document vectors using TF-IDF or embeddings, storing them in a matrix; (2) for a query, generate its vector using the same model; (3) compute cosine similarity via dot products and norms, selecting the maximum. Precomputing document vectors costs O(n·d) time once during indexing, and space complexity is O(n·d), where d is the vector dimension. Each query then needs only a single matrix-vector product rather than re-vectorizing all n documents, so per-query work drops from repeated vectorization plus comparison to one vectorized O(n·d) pass. Using NumPy makes that pass fast in practice, since cosine similarity is calculated with vectorized math instead of Python loops. This approach is scalable for large document collections and real-time queries.
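The indexing/query split described above can be made explicit. A minimal sketch, assuming dense vectors: normalize each document row to unit length once at indexing time, so a query is just one matrix-vector product followed by a partial sort for the top-k results (function names `build_index` and `top_k` are illustrative, not a standard API):

```python
import numpy as np

def build_index(doc_vectors: np.ndarray) -> np.ndarray:
    """Indexing step, O(n*d): normalize each row to unit length once."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return doc_vectors / norms

def top_k(index: np.ndarray, query_vec: np.ndarray, k: int = 3) -> np.ndarray:
    """Query step: one matrix-vector product, then a partial sort for top-k."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = index @ q  # cosine similarities, since all rows are unit vectors
    top = np.argpartition(-sims, min(k, len(sims) - 1))[:k]
    return top[np.argsort(-sims[top])]  # order the selected k by similarity
```

Using `argpartition` avoids fully sorting all n scores when only the best k are needed, which matters once the collection grows large.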

Key points to mention

  • cosine similarity formula
  • time complexity O(n) for retrieval
  • space optimization via sparse representations
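On the last point: TF-IDF vectors are mostly zeros, so storing only non-zero entries cuts space from O(n·d) to O(non-zeros). A minimal sketch of the idea using plain dictionaries in place of a library sparse matrix (the weights are hypothetical):

```python
from math import sqrt

# Sparse vectors as {term_index: weight} dicts: only non-zero entries
# are stored, so space scales with non-zeros, not vocabulary size.
def sparse_cosine(a: dict, b: dict) -> float:
    if len(a) > len(b):
        a, b = b, a  # iterate the smaller dict to keep the dot product cheap
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc = {0: 1.0, 5: 2.0}   # hypothetical TF-IDF weights for two terms
query = {0: 1.0}
print(sparse_cosine(doc, query))
```

In production this is what `scipy.sparse` matrices do, with the dot products still vectorized.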

Common mistakes to avoid

  • ✗ forgetting to normalize vectors
  • ✗ using brute-force O(n²) computation
  • ✗ ignoring space complexity trade-offs