Predicting Insurance Costs with Machine Learning
A deep dive into how I built a data pipeline to predict medical insurance billing amounts using Python and a Decision Tree Regressor.
1. Problem Statement
Insurance companies want to estimate a customer's billing amount from attributes such as age, BMI, and smoker status.
2. Tools Used
- Python (Pandas, NumPy)
- Matplotlib & Seaborn
- DecisionTreeRegressor from scikit-learn
- Jupyter Notebook
3. Key Highlights
- Cleaned a messy, real-world dataset
- Performed EDA and feature engineering
- Applied an ML model to predict billing amounts
- Visualized results alongside model evaluation metrics
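The steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the data is synthetic, and the column names (age, bmi, children, smoker, charges), the formula used to generate charges, and the max_depth setting are all assumptions. MAE and R^2 are used here as example regression metrics.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data. Column names and the "charges" formula are
# assumptions for illustration, not the project's real dataset.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n).round(1),
    "children": rng.integers(0, 5, size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
})
df["charges"] = (
    250 * df["age"]
    + 300 * df["bmi"]
    + 20000 * (df["smoker"] == "yes")
    + rng.normal(0, 2000, size=n)
)

# Cleaning: drop rows with missing values and exact duplicates.
df = df.dropna().drop_duplicates()

# Feature engineering: encode the categorical smoker flag as 0/1.
df["smoker"] = (df["smoker"] == "yes").astype(int)

X = df[["age", "bmi", "children", "smoker"]]
y = df["charges"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a depth-limited tree and score it on the held-out split.
model = DecisionTreeRegressor(max_depth=4, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MAE: {mae:.0f}  R^2: {r2:.2f}")
```

Holding out a test split before fitting keeps the reported metrics honest: scores computed on the training rows would overstate how well the tree predicts unseen customers.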
4. Outcome
- Smokers tend to be charged significantly more for insurance, especially in higher BMI ranges.
- BMI and smoker status emerged as the key drivers of billing amounts.
- Children vs. age: people aged between X–Y tend to have more dependents.
- The entire project is on GitHub: [https://github.com/AshishSahai/Insurance-Data-Analysis]
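One way to check a smoker-by-BMI pattern like this is to bucket BMI into bands and compare mean charges per cell. The sketch below uses a handful of made-up rows purely to show the mechanics. The values, the column names, and the band edges are assumptions and do not come from the project's dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real data.
df = pd.DataFrame({
    "bmi": [22.0, 24.5, 27.0, 29.0, 31.5, 36.0, 23.0, 28.0, 33.0, 38.0],
    "smoker": ["no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "charges": [3000, 3500, 4200, 4800, 5500, 6000, 15000, 21000, 38000, 45000],
})

# Bucket BMI into bands; right=False makes each band include its lower edge.
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]
df["bmi_band"] = pd.cut(df["bmi"], bins=bins, labels=labels, right=False)

# Mean charges for every (BMI band, smoker) combination that occurs.
summary = (
    df.groupby(["bmi_band", "smoker"], observed=True)["charges"]
    .mean()
    .unstack("smoker")
)

# Difference between smokers and non-smokers within each band.
summary["gap"] = summary["yes"] - summary["no"]
print(summary)
```

If the gap column grows from the lower bands to the higher ones, that is consistent with smoking and BMI interacting rather than simply adding up.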
5. What I Learned
- ML intuition: overfitting, feature importance
- Model explainability
- Communicating results to non-technical teams
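The overfitting and feature importance points can be illustrated with a small experiment. This is a sketch on synthetic data, not the project's results: the feature names, the target formula, and the deliberately irrelevant noise_feature column are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; feature names and the target formula are assumptions.
rng = np.random.default_rng(1)
n = 400
age = rng.uniform(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n)
noise_feature = rng.normal(size=n)  # deliberately unrelated to the target
X = np.column_stack([age, bmi, smoker, noise_feature])
y = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 3000, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unconstrained tree can memorize the training rows; a shallow one cannot.
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [("deep", deep), ("shallow", shallow)]:
    train_r2 = tree.score(X_train, y_train)
    test_r2 = tree.score(X_test, y_test)
    print(f"{name}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")

# Impurity-based importances from the shallow tree; they sum to 1.
names = ["age", "bmi", "smoker", "noise_feature"]
importances = dict(zip(names, shallow.feature_importances_))
for feature, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: {value:.3f}")
```

A large gap between training and test scores for the unconstrained tree is the usual signature of overfitting. The importances are also a convenient starting point for explaining the model to a non-technical audience, since they rank the inputs by how much each one reduced prediction error in the tree.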