Iris Dataset

The Challenge

The iris dataset classification project presented a fundamental machine learning challenge: accurately classifying three iris species (Iris-setosa, Iris-versicolor, and Iris-virginica) from four botanical measurements (sepal length, sepal width, petal length, and petal width).

The dataset contained several data quality issues that had to be addressed before model development could begin. Missing values appeared in three of the four measurement features, and some records had a petal width of zero, which is biologically impossible because these values are actual flower dimensions.

The dataset also contained duplicate records and outliers that could skew model performance. The challenge required a robust data preprocessing pipeline followed by an optimized classification model that could achieve high accuracy while remaining interpretable for botanical research applications.

The Strategic Approach

I developed a comprehensive data science pipeline that prioritized data quality and model interpretability. The strategy centered on thorough data preprocessing combined with decision tree classification to create an explainable model suitable for scientific applications.

The critical insight was that different iris species showed varying degrees of separability across the four botanical features. Through extensive exploratory data analysis, I identified that petal measurements provided superior species discrimination compared to sepal measurements, with petal width being the most decisive feature.
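A minimal sketch of how this kind of separability can be checked. It assumes scikit-learn's bundled, already clean copy of the iris data as a stand-in for the project's dataset, and the helper name ranges_overlap is hypothetical. It compares the per-species value range of each feature: a feature whose ranges do not overlap for two species separates them with a single threshold.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for the project's
# (uncleaned) dataset, so exact values may differ from the original analysis.
iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
features = list(iris.feature_names)

# Per-species minimum and maximum of every measurement.
ranges = df.groupby("species")[features].agg(["min", "max"])
print(ranges)


def ranges_overlap(frame, feature, species_a, species_b):
    """Return True if the value ranges of `feature` overlap for two species."""
    a = frame.loc[frame["species"] == species_a, feature]
    b = frame.loc[frame["species"] == species_b, feature]
    return bool(a.min() <= b.max() and b.min() <= a.max())


# A feature with no overlap separates the two species with one threshold.
for feature in features:
    overlap = ranges_overlap(df, feature, "setosa", "versicolor")
    print(f"{feature}: setosa/versicolor ranges overlap = {overlap}")
```

A range comparison is only a rough first check; the pairplots mentioned later give a fuller picture of how the species clusters sit relative to each other.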

Rather than using complex ensemble methods, I chose decision trees for their interpretability, which lets botanists see exactly which measurements drive species classification. This approach balanced high accuracy with practical usability in real-world botanical identification scenarios.
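To illustrate what interpretability means in practice, here is a small sketch that prints a shallow tree as plain if/else rules. It assumes scikit-learn's bundled iris data rather than the project's dataset, and the short feature names are chosen for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the bundled iris data stands in for the project's dataset.
iris = load_iris()
# Columns 0 and 3 are sepal length and petal width.
X = iris.data[:, [0, 3]]
y = iris.target

# A shallow tree keeps the rule set small enough to read at a glance.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the fitted tree as nested threshold rules.
rules = export_text(tree, feature_names=["sepal_length", "petal_width"])
print(rules)
```

Each printed line is a threshold on one measurement, so the full decision path for any flower can be followed by hand.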

Technical Implementation

The solution reached its accuracy through systematic data preprocessing and tuned model selection:

Data Cleaning Pipeline: Implemented comprehensive data quality checks, handling missing values through median imputation based on statistical distribution analysis, removing zero values, and eliminating duplicate records
Outlier Treatment: Applied IQR-based outlier detection and capping techniques specific to each species and feature combination, preserving data integrity while limiting the influence of extreme values
Feature Engineering: Conducted correlation analysis and pairplot visualization to identify optimal feature combinations, discovering that petal width and sepal length provided the best species separation
Model Optimization: Used GridSearchCV with cross-validation to determine optimal decision tree parameters, preventing overfitting through max_depth tuning
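The cleaning and outlier steps above can be sketched roughly as follows. This is not the project's actual code: the column names, the function names clean_iris and cap_outliers_iqr, the 1.5 multiplier, the use of a global (not per-species) median, and the order of the steps are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the project's dataset may label them differently.
FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def clean_iris(df, features=FEATURES):
    """Drop zero-valued rows, impute missing values with medians, drop duplicates."""
    # Zeros are removed first so they do not distort the medians (an assumption
    # about ordering). NaN == 0 is False, so rows with missing values are kept.
    zero_rows = (df[features] == 0).any(axis=1)
    out = df[~zero_rows].copy()
    # Median imputation, computed per column over the remaining rows.
    out[features] = out[features].fillna(out[features].median())
    # Remove exact duplicate records.
    return out.drop_duplicates().reset_index(drop=True)


def cap_outliers_iqr(df, features=FEATURES, group_col="species", k=1.5):
    """Cap each feature at [Q1 - k*IQR, Q3 + k*IQR], computed per species."""
    out = df.copy()
    for col in features:
        grouped = out.groupby(group_col)[col]
        q1 = grouped.transform(lambda s: s.quantile(0.25))
        q3 = grouped.transform(lambda s: s.quantile(0.75))
        iqr = q3 - q1
        # Values outside the fences are pulled to the fence, not dropped.
        out[col] = out[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out


# Tiny made-up example with one missing value, one zero, and one duplicate.
raw = pd.DataFrame({
    "sepal_length": [5.1, 5.1, 4.9, np.nan, 5.0],
    "sepal_width": [3.5, 3.5, 3.0, 3.2, 3.4],
    "petal_length": [1.4, 1.4, 1.4, 1.3, 1.5],
    "petal_width": [0.2, 0.2, 0.0, 0.2, 0.2],
    "species": ["Iris-setosa"] * 5,
})
cleaned = cap_outliers_iqr(clean_iris(raw))
print(cleaned)
```

Capping keeps every record in the dataset, which matters with only a few dozen samples per species; dropping outlier rows would be the alternative if the extreme values were known to be measurement errors.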

Results

Primary Model Accuracy: Achieved 90% accuracy using petal width and sepal length features with a max_depth=2 decision tree
Secondary Model Performance: Demonstrated 83.3% accuracy using petal length and sepal width combination, validating feature importance hierarchy
Perfect Setosa Classification: Achieved 100% precision, recall, and F1-score for Iris-setosa species, confirming clear biological separation
Robust Generalization: Maintained consistent performance across train/test splits with balanced species distribution

The model successfully identified that petal width alone could distinguish Iris-setosa from other species with perfect accuracy, while the combination of petal width and sepal length provided optimal discrimination between all three species.
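A hedged sketch of the tuning and evaluation setup, assuming scikit-learn's bundled iris data instead of the project's cleaned dataset. The 80/20 split, the random seed, the 5-fold cross-validation, and the max_depth grid are assumptions, so the accuracy figures reported above will not necessarily be reproduced.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled iris data stands in for the project's dataset.
iris = load_iris()
# Columns 0 and 3 are sepal length and petal width, the primary feature pair.
X = iris.data[:, [0, 3]]
y = iris.target

# stratify=y keeps the three species balanced across train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Cross-validated search over tree depth; shallow trees limit overfitting.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [1, 2, 3, 4, 5]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# GridSearchCV refits the best estimator on the full training set by default.
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best parameters:", search.best_params_)
print("Test accuracy:", accuracy)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

The classification report lists precision, recall, and F1-score per species, which is the view needed to confirm per-class results such as the Iris-setosa figures.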

This project was a coding challenge for a job interview. The code can be found on GitHub.

Development Environment

Python
Pandas
NumPy
Matplotlib & Seaborn
scikit-learn
JupyterLab