🚀 AI-Powered Mock Interviews Launching Soon - Join the Waitlist for Early Access

technicalmedium

Write a Python function that reads two CSV files – patients.csv (patient_id, name, dob) and labs.csv (patient_id, visit_date, test_name, result_value, unit, reference_range_low, reference_range_high) – and returns a DataFrame of patients who had a Hemoglobin result below the lower limit of normal (LLN) in any visit, including visit_date and result_value. Handle missing values and duplicate rows.

technical screen · 3-5 minutes

How to structure your answer

Use the CIRCLES framework: Clarify, Identify, Recommend, Create, List, Execute, Summarize. 1) Clarify inputs/outputs. 2) Identify key steps: load CSVs, drop duplicates, handle NaNs, filter Hemoglobin, compare to LLN, select columns. 3) Recommend vectorized Pandas ops. 4) Create function skeleton. 5) List edge cases. 6) Execute code. 7) Summarize return type. 120‑150 words.

Sample answer

First, I import pandas and define the function signature: def get_low_hemoglobin(patients_path, labs_path). I read both CSVs using pd.read_csv, specifying dtype for patient_id to avoid type mismatches. I drop duplicate rows in labs using drop_duplicates on ['patient_id', 'visit_date', 'test_name']. I handle missing values by dropping rows where result_value or reference_range_low is NaN, as they cannot be evaluated. I filter labs for test_name == 'Hemoglobin', then create a boolean mask where result_value < reference_range_low. I merge the filtered labs with patients on patient_id to bring in patient demographics. Finally, I select patient_id, name, visit_date, result_value, and return the resulting DataFrame. I include a docstring explaining parameters and return type. This solution is fully vectorized, handles edge cases, and is ready for unit testing.

Key points to mention

  • Vectorized Pandas operations for performance
  • Handling of missing and duplicate data
  • Clear function signature and docstring
  • Return type specification

Common mistakes to avoid

  • Using loops instead of vectorized ops
  • Ignoring duplicate rows
  • Not handling NaNs before comparison
  • Returning a list or dict instead of DataFrame