Given a dataset of historical supply chain disruptions (e.g., supplier delays, natural disasters) and their impact on delivery times, develop a Python script to predict the likelihood and potential delay duration for future shipments using a machine learning model. Outline the features you would engineer from the raw data and the model selection process.
final round · 15-20 minutes
How to structure your answer
Leverage the CRISP-DM framework:
1. Business Understanding: define the prediction goals (disruption likelihood, delay duration).
2. Data Understanding: identify the raw data (event type, location, date, supplier, product, original ETA, actual delivery).
3. Data Preparation: engineer features such as 'disruption_severity' (based on event type), 'supplier_reliability_score' (historical performance), 'lead_time_variance', 'geographic_risk_index', 'seasonality_indicators', and 'product_criticality'. Handle missing values and outliers.
4. Modeling: for likelihood, use classification (Logistic Regression, Random Forest, XGBoost); for duration, use regression (Random Forest Regressor, Gradient Boosting Regressor, or Prophet if temporal patterns dominate).
5. Evaluation: use F1-score/AUC for classification and RMSE/MAE for regression; cross-validation is crucial.
6. Deployment: integrate the model into supply chain planning systems.
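The Data Preparation step above can be sketched in pandas. This is a minimal illustration, not a full pipeline: the column names (`supplier_id`, `scheduled_delivery`, `actual_delivery`) and the tiny inline dataset are assumptions for the example.

```python
import pandas as pd

# Hypothetical raw shipment log; column names and values are assumed for illustration.
df = pd.DataFrame({
    "supplier_id": ["S1", "S1", "S2", "S2", "S2"],
    "scheduled_delivery": pd.to_datetime(
        ["2023-01-10", "2023-02-10", "2023-01-15", "2023-03-01", "2023-04-01"]),
    "actual_delivery": pd.to_datetime(
        ["2023-01-12", "2023-02-10", "2023-01-25", "2023-03-05", "2023-04-01"]),
})

# Lead-time deviation in days (positive = late).
df["delay_days"] = (df["actual_delivery"] - df["scheduled_delivery"]).dt.days

# Supplier reliability: historical mean delay and late-shipment rate per supplier.
reliability = (
    df.groupby("supplier_id")["delay_days"]
      .agg(mean_delay="mean", late_rate=lambda s: (s > 0).mean())
      .reset_index()
)
df = df.merge(reliability, on="supplier_id")

# Simple seasonality indicator derived from the scheduled date.
df["month"] = df["scheduled_delivery"].dt.month
```

In a real project these aggregates would be computed on a rolling historical window up to each shipment's date, so a row never "sees" its own outcome.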
Sample answer
To predict supply chain disruptions, I'd employ a CRISP-DM methodology. First, for Data Understanding, I'd gather raw data including 'disruption_type' (e.g., natural disaster, labor strike), 'supplier_ID', 'product_SKU', 'origin_port', 'destination_port', 'scheduled_delivery_date', and 'actual_delivery_date'.
For Data Preparation, I'd engineer features: 'Supplier_Risk_Score' (based on historical delay frequency/severity), 'Geopolitical_Risk_Index' (for origin/destination), 'Seasonality_Factor' (month/quarter), 'Lead_Time_Deviation' (actual vs. planned), and 'Product_Criticality_Score'. I'd also create a binary 'Disruption_Event' target for likelihood and 'Delay_Duration_Days' for regression.
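Constructing the two targets described above might look like the following sketch; the date column names mirror the raw fields listed earlier and the sample rows are assumed for illustration.

```python
import pandas as pd

# Hypothetical shipment records; values are assumed for illustration.
df = pd.DataFrame({
    "scheduled_delivery_date": pd.to_datetime(["2023-05-01", "2023-05-10", "2023-06-01"]),
    "actual_delivery_date": pd.to_datetime(["2023-05-04", "2023-05-10", "2023-06-08"]),
})

# Delay_Duration_Days: regression target, clipped at zero so early
# arrivals count as on-time rather than "negative delays".
df["Delay_Duration_Days"] = (
    (df["actual_delivery_date"] - df["scheduled_delivery_date"]).dt.days.clip(lower=0)
)

# Disruption_Event: binary classification target (1 = shipment was delayed).
df["Disruption_Event"] = (df["Delay_Duration_Days"] > 0).astype(int)
```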
For Modeling, I'd use a two-stage approach. For disruption likelihood (classification), I'd evaluate XGBoost or Random Forest due to their robustness with mixed data types. For delay duration (regression), I'd consider Gradient Boosting Regressor or a time-series model like Prophet if temporal patterns are strong. Model selection would involve cross-validation and hyperparameter tuning. Evaluation metrics would include AUC-ROC for classification and RMSE/MAE for regression, ensuring the model generalizes well to unseen data.
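The two-stage approach can be sketched with scikit-learn. The synthetic data below stands in for the engineered features and targets (the signal structure is assumed purely so the example runs end to end); Random Forest handles the likelihood stage and Gradient Boosting the duration stage, trained only on shipments that were actually disrupted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.metrics import roc_auc_score, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for engineered features (supplier risk, seasonality, ...).
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 4))
disrupted = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
delay = np.where(disrupted == 1, np.abs(X[:, 1]) * 5 + 2, 0.0)

X_tr, X_te, y_tr, y_te, d_tr, d_te = train_test_split(
    X, disrupted, delay, test_size=0.25, random_state=0)

# Stage 1: classify disruption likelihood.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Stage 2: regress delay duration on the shipments that were disrupted.
mask = d_tr > 0
reg = GradientBoostingRegressor(random_state=0).fit(X_tr[mask], d_tr[mask])
mae = mean_absolute_error(d_te[d_te > 0], reg.predict(X_te[d_te > 0]))
```

Conditioning the regressor on disrupted shipments avoids the zero-inflated target that would otherwise dominate the duration model.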
Key points to mention
- Problem decomposition (classification for likelihood, regression for duration)
- Comprehensive feature engineering (temporal, categorical, interaction, external data)
- Model selection rationale (ensemble methods like XGBoost/LightGBM)
- Evaluation metrics appropriate for classification (F1, Precision, Recall) and regression (RMSE, MAE)
- Cross-validation strategy (time-series split)
- Handling imbalanced datasets (for disruption likelihood)
- Interpretability of the model (e.g., SHAP values for feature importance)
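The time-series cross-validation point above is worth being able to demonstrate. A minimal sketch using scikit-learn's `TimeSeriesSplit`, which only ever trains on the past (the toy feature matrix is assumed to be sorted chronologically):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy feature matrix; rows assumed sorted by shipment date.
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    # Every training index precedes every test index: no leakage from the future.
    assert train_idx.max() < test_idx.min()
```

Standard k-fold would shuffle future shipments into the training folds, inflating the validation scores.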
Common mistakes to avoid
- ✗ Not addressing data imbalance for disruption prediction, leading to a model that always predicts 'no disruption'.
- ✗ Using standard k-fold cross-validation instead of time-series cross-validation, causing data leakage.
- ✗ Overlooking the importance of external data sources for richer feature sets.
- ✗ Failing to articulate the specific evaluation metrics for both classification and regression tasks.
- ✗ Not considering model interpretability for business stakeholders.
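A simple guard against the first pitfall, the model that always predicts "no disruption", is class weighting. The sketch below is illustrative only: the ~5% disruption rate and the synthetic feature shift are assumptions, and `class_weight="balanced"` is scikit-learn's built-in reweighting (alternatives include resampling or SMOTE).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic rare-event target: ~5% of shipments disrupted (assumed rate).
rng = np.random.default_rng(0)
n = 1000
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(size=(n, 3)) + y[:, None] * 1.5  # disrupted rows shifted

# class_weight="balanced" up-weights the rare positive class during fitting,
# so the model cannot minimize loss by always predicting "no disruption".
clf = LogisticRegression(class_weight="balanced").fit(X, y)
recall = (clf.predict(X)[y == 1] == 1).mean()
```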