Unsupervised Learning Projects
Unsupervised Learning: Customer Segmentation, Recommendation Systems, and Workforce Analytics
Three analytical projects applying unsupervised learning to business segmentation challenges. The work demonstrates pattern discovery through clustering analysis, preference modeling through collaborative filtering, and dimensionality reduction for interpretable visualization—addressing customer behavior segmentation, personalized recommendations, and employee retention risk assessment.
The complete project repository is available at the Unsupervised Learning Class Repository.
Project Scope
This work encompasses three distinct analytical applications of unsupervised learning:
- Wholesale Distribution: Identifying customer purchasing patterns across 440 B2B clients to enable targeted marketing strategies
- Restaurant Recommendations: Building a collaborative filtering system to predict user preferences from sparse rating data (1,161 ratings across 138 consumers and 127 restaurants)
- Workforce Retention: Segmenting 1,470 employees to identify distinct groups with varying attrition risk profiles (overall attrition rate of 16.1%)
Analysis 1: Wholesale Customer Segmentation
Analytical Objective
A wholesale distributor needed to understand customer purchasing behavior patterns to develop targeted marketing approaches. With 440 customers purchasing across six product categories (Fresh, Milk, Grocery, Frozen, Detergents & Paper, and Delicatessen), the challenge was identifying natural customer segments without predefined labels.
Methodology Selection
The analysis evaluated three clustering approaches to determine optimal segmentation:
- K-Means Clustering: Tested 2-15 cluster configurations, evaluating both elbow plots and silhouette scores
- Hierarchical Clustering: Applied agglomerative methods with dendrogram visualization to assess natural groupings
- DBSCAN: Explored density-based clustering to identify potential outliers
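The K-Means scan described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's code: the three artificial groups, the random seed, and the variable names are all assumptions, and the real analysis would load the 440-customer spending table instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six spending columns: three well-separated groups.
rng = np.random.default_rng(42)
centers = np.array([
    [0, 0, 0, 0, 0, 0],
    [8, 0, 0, 8, 0, 8],
    [0, 8, 8, 0, 8, 0],
], dtype=float)
X = np.vstack([c + rng.normal(scale=1.0, size=(100, 6)) for c in centers])

# Standardize so no single category dominates the distance calculation.
X_scaled = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 16):  # the 2-15 range tested in the analysis
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_  # plotted against k for the elbow method
    silhouettes[k] = silhouette_score(X_scaled, model.labels_)

# The k with the highest silhouette score is one candidate; the elbow plot
# and business interpretability still need to be weighed alongside it.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k, round(silhouettes[best_k], 3))
```

On real purchasing data the silhouette curve is usually much flatter than on this toy example, which is why the score is treated as one input rather than the deciding rule.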
Key Findings
Through systematic comparison across methods, K-Means with 3 clusters emerged as the preferred solution, balancing cluster separation (silhouette score: 0.458) with interpretability and business utility. The analysis identified three distinct customer segments:
| Segment | Size | Purchasing Profile |
|---|---|---|
| Typical Clients | 350 customers | Balanced purchasing across all categories—the mainstream customer base |
| Fresh & Frozen Focus | 53 customers | Concentrated spending on Fresh and Frozen products—likely restaurants or food service operations |
| Grocery & Paper Buyers | 37 customers | Heavy emphasis on Milk, Grocery, and Paper products—possibly retail or convenience stores |
Business Applications
The segmentation enables differentiated strategies:
- Typical Clients: Broad product education and cross-category promotions
- Fresh & Frozen Focus: Priority handling for perishables, cold chain logistics emphasis, and relationships with fresh product suppliers
- Grocery & Paper Buyers: Bulk ordering incentives and advance notifications on non-perishable inventory
Analysis 2: Restaurant Recommendation System
Analytical Challenge
The goal was to build a collaborative filtering system that predicts restaurant preferences from sparse rating data. With 1,161 ratings distributed across 138 consumers and 127 restaurants (ratings on a 0-2 scale), the challenge was handling missing data while generating meaningful recommendations.
Approach
The analysis employed matrix factorization using Truncated SVD (Singular Value Decomposition) to reduce dimensionality and capture latent preference patterns:
- Data Structure: Constructed user-item matrix with consumers as rows and restaurants as columns
- Missing Data Strategy: Applied mean imputation for unrated restaurants to enable matrix operations
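- Mean Centering: Centered ratings by subtracting the global mean, so values represent deviations from the average rating (accounting for individual rating tendencies would instead require subtracting each consumer's own mean)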
- Dimensionality Reduction: Tested component counts from 2-127, selecting 50 components based on explained variance analysis (capturing 91% of rating variance)
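The steps above can be sketched on a toy ratings table. The IDs, the ratings, and the choice of the global mean as the imputation value are assumptions made for illustration; the source does not say which mean was used for imputation, and the real matrix is 138 by 127 with 50 components.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Hypothetical long-format ratings on the 0-2 scale; IDs are made up.
ratings = pd.DataFrame({
    "user_id":  ["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4"],
    "place_id": ["r1", "r2", "r1", "r3", "r2", "r4", "r3", "r4"],
    "rating":   [2, 1, 2, 0, 1, 2, 0, 2],
})

# Consumers as rows, restaurants as columns; unrated pairs become NaN.
matrix = ratings.pivot(index="user_id", columns="place_id", values="rating")

# Assumed strategy: impute with the global mean, then subtract it,
# so every unrated cell ends up as exactly 0 in the centered matrix.
global_mean = ratings["rating"].mean()
centered = matrix.fillna(global_mean) - global_mean

# Two components here only because the toy matrix is 4 x 4.
svd = TruncatedSVD(n_components=2, random_state=42)
user_factors = svd.fit_transform(centered.to_numpy())
cumulative_variance = np.cumsum(svd.explained_variance_ratio_)

# Low-rank reconstruction, shifted back onto the original rating scale.
predicted = pd.DataFrame(
    user_factors @ svd.components_ + global_mean,
    index=matrix.index,
    columns=matrix.columns,
)
print(cumulative_variance)
print(predicted.round(2))
```

Plotting `cumulative_variance` against the number of components is one way to choose a cutoff such as the 50 components reported above.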
Validation Example
Testing the system with a new user profile (rated KFC and McDonald's Centro both as 2/2) demonstrated practical recommendation generation:
| Recommendation Rank | Restaurant | Predicted Rating | Cuisine | Price |
|---|---|---|---|---|
| 1 | Potzocalli | 0.133 | Mexican | Low |
| 2 | Chilis Cuernavaca | 0.120 | American | Medium |
| 3 | Vips | 0.090 | American | Low |
Two of the three top recommendations are American casual-dining restaurants, broadly in line with the American fast-food preferences in the test profile, although the top-ranked result is a low-priced Mexican restaurant. Content-based filtering using restaurant attributes (cuisine, price, franchise status) provided complementary recommendations based on feature similarity rather than user patterns alone.
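A content-based comparison of this kind might look like the sketch below. The restaurant names and attribute values are invented for illustration and do not come from the project data; only the general approach (one-hot encoding plus cosine similarity) follows the techniques the write-up lists.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical restaurant attributes; names and values are illustrative only.
restaurants = pd.DataFrame({
    "name":      ["Burger Spot", "Taco Stand", "Grill House", "Sushi Bar"],
    "cuisine":   ["American", "Mexican", "American", "Japanese"],
    "price":     ["Low", "Low", "Medium", "High"],
    "franchise": [True, False, True, False],
}).set_index("name")

# One-hot encode the categorical attributes; the boolean flag becomes 0/1.
features = pd.get_dummies(restaurants, columns=["cuisine", "price"]).astype(float)

# Pairwise cosine similarity between restaurant feature vectors.
similarity = pd.DataFrame(
    cosine_similarity(features),
    index=features.index,
    columns=features.index,
)

def most_similar(name, top_n=2):
    """Return the top_n restaurants most similar to `name`, excluding itself."""
    return similarity[name].drop(name).sort_values(ascending=False).head(top_n)

print(most_similar("Burger Spot"))
```

Because this relies only on item attributes, it can still rank restaurants that have few or no ratings, which is where the collaborative approach struggles.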
System Limitations
The rating matrix is sparse: only about 6.6% of the 17,526 possible consumer-restaurant pairs (138 × 127) are rated. Recommendation quality therefore depends heavily on having sufficient rating overlap between users to identify meaningful similarity patterns, and new users or restaurants with no ratings pose a cold-start problem.
Analysis 3: Employee Retention Segmentation
Business Objective
This analysis segments 1,470 employees to identify workforce groups with distinct attrition risk profiles. The company faced 16.1% overall attrition, and leadership needed to understand which employee groups required targeted retention interventions.
Analytical Approach
The analysis progressed through iterative refinement:
- Initial Assessment: First-round clustering on all features (including department affiliation) revealed department dominated the segmentation, masking other meaningful patterns
- Refined Analysis: Removed department variables to expose underlying behavioral and demographic segments
- Cluster Selection: Evaluated 2-15 cluster solutions, selecting 6 clusters balancing interpretability with segment actionability
- Dimensionality Reduction: Applied PCA for 2D visualization, with first two components explaining 48% of variance (PC1 capturing seniority/compensation, PC2 capturing job satisfaction patterns)
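The refinement loop above can be sketched as follows. The data is randomly generated and the column names are hypothetical, so the clusters themselves are meaningless; the point is only the order of operations: encode, drop the department columns, scale, cluster, then project with PCA for plotting.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the employee table; column names are hypothetical.
rng = np.random.default_rng(0)
n = 300
employees = pd.DataFrame({
    "job_level": rng.integers(1, 6, n),
    "monthly_income": rng.normal(6500, 2000, n),
    "job_satisfaction": rng.integers(1, 5, n),
    "distance_from_home": rng.integers(1, 30, n),
    "department": rng.choice(["Sales", "HR", "R&D"], n),
})

encoded = pd.get_dummies(employees, columns=["department"])

# Second pass: drop the one-hot department columns so they cannot
# dominate the distance calculation.
department_cols = [c for c in encoded.columns if c.startswith("department_")]
features = encoded.drop(columns=department_cols)

X = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(X)

# PCA is used for the 2D view only; clustering ran on all scaled features.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)

# Loadings show which original features drive each component.
loadings = pd.DataFrame(
    pca.components_.T, index=features.columns, columns=["PC1", "PC2"]
)
print(pca.explained_variance_ratio_.sum())
print(loadings.round(2))
```

Reading the loadings table is how components get interpretations such as "seniority/compensation" for PC1.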
Identified Segments and Attrition Analysis
The clustering revealed six distinct employee segments with significantly different retention profiles:
| Segment | Size | Attrition Rate | Key Characteristics |
|---|---|---|---|
| Long-Distance Commuters | 197 | 22% | Elevated commute distance, particularly severe in HR department (67% attrition) |
| Disengaged | 201 | 19% | Low satisfaction, low performance, concentrated in Sales (22%) and HR (21%) |
| High-Performing Rising Stars | 200 | 18% | Exceptional performance ratings but lower job levels and compensation—retention risk from external recruitment |
| Happy Strugglers | 304 | 16% | High job satisfaction despite lower performance and compensation levels |
| Plateaued Mid-Career | 349 | 15% | Average performance, limited advancement opportunities, particularly in Sales (21%) |
| Senior Leadership | 219 | 7% | Highest job levels and compensation, lowest attrition risk |
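As a quick consistency check, the segment sizes and rounded attrition rates in the table can be combined into a size-weighted average and compared with the reported overall rate. The snippet below uses only the figures from the table; the variable names are illustrative.

```python
import pandas as pd

# Sizes and rounded attrition rates copied from the segment table above.
segments = pd.DataFrame({
    "segment": [
        "Long-Distance Commuters", "Disengaged", "High-Performing Rising Stars",
        "Happy Strugglers", "Plateaued Mid-Career", "Senior Leadership",
    ],
    "size": [197, 201, 200, 304, 349, 219],
    "attrition_rate": [0.22, 0.19, 0.18, 0.16, 0.15, 0.07],
})

total_employees = segments["size"].sum()

# Size-weighted average of the segment rates.
implied_overall = (segments["size"] * segments["attrition_rate"]).sum() / total_employees

print(total_employees)            # 1470
print(round(implied_overall, 3))  # 0.159
```

The implied rate of about 15.9% matches the reported 16.1% to within the rounding of the per-segment percentages.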
Critical Finding: Performance and Retention
A key insight emerged from comparing attrition groups: departing employees had nearly identical performance ratings (3.16) to those staying (3.15). The company is therefore losing performers as capable as those it keeps, rather than shedding weak ones. Attrition is instead associated with other factors: a compensation gap ($2,046/month lower for leavers), career progression limitations (lower job levels), and work conditions (commute distance).
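A comparison like this one is typically a grouped mean over the attrition flag. The sketch below uses a tiny made-up sample with hypothetical column names, so its numbers are unrelated to the project's results.

```python
import pandas as pd

# Tiny made-up sample; column names and values are hypothetical.
staff = pd.DataFrame({
    "attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    "performance_rating": [3, 3, 4, 4, 3, 3],
    "monthly_income": [3000, 6000, 7000, 4000, 5000, 6000],
})

# Mean of each metric for leavers ("Yes") and stayers ("No").
profile = staff.groupby("attrition")[["performance_rating", "monthly_income"]].mean()

# Positive values mean stayers score higher than leavers on that metric.
gap = profile.loc["No"] - profile.loc["Yes"]

print(profile)
print(gap)
```

A near-zero gap on performance alongside a large gap on income is the pattern the finding above describes.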
Segment-Specific Retention Strategies
Immediate Action Required (18%+ Attrition):
- Long-Distance Commuters (22%): Implement remote work options, flexible schedules, and commuter benefits. Priority focus on HR department where attrition reaches 67%
- Disengaged (19%): Conduct stay interviews to diagnose disengagement root causes, review compensation benchmarking, audit workload distribution, and assess management quality
- High-Performing Rising Stars (18%): Create accelerated promotion tracks, implement retention bonuses, provide challenging project assignments, and establish mentorship with senior leadership
Moderate Risk:
- Happy Strugglers (16%): Invest in skills training and professional development while maintaining current satisfaction levels
- Plateaued Mid-Career (15%): Create lateral movement opportunities, provide leadership training, and offer job enrichment through special projects
Stable Group:
- Senior Leadership (7%): Maintain current retention strategies, continue competitive compensation, and ensure strategic involvement
Analytical Approach and Methodology
Across the projects, the analysis compared alternative methods and configurations: K-Means, hierarchical, and density-based clustering for segmentation, and a range of component counts for the SVD-based recommender. Solutions were selected on both statistical metrics (silhouette scores, explained variance) and business interpretability, so that the insights could drive concrete strategies rather than remain academic exercises.
Key analytical decisions included:
- Feature Selection: Iteratively refining feature sets (removing department variables in the retention analysis) when a single variable dominated the initial clustering
- Cluster Interpretation: Naming segments based on dominant characteristics rather than generic labels, enabling stakeholder understanding and action
- Validation Strategy: Using multiple metrics (inertia, silhouette scores, dendrograms) to triangulate optimal cluster counts rather than relying on single indicators
- Dimensionality Reduction: Applying PCA for visualization while maintaining analytical work in full-dimensional space to preserve information
- Business Translation: Connecting statistical findings to specific business actions (remote work policies, retention bonuses, commuter benefits)
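To illustrate the triangulation idea, the sketch below applies hierarchical clustering and DBSCAN to the same synthetic points. The data, the `eps` and `min_samples` values, and the three-cluster cut are assumptions chosen so that the toy example behaves cleanly; they are not the settings used in the projects.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import DBSCAN

# Two tight synthetic blobs plus one distant point acting as an outlier.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(40, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(40, 2))
outlier = np.array([[20.0, 20.0]])
X = np.vstack([blob_a, blob_b, outlier])

# Ward linkage builds the merge tree that a dendrogram would draw;
# large jumps in merge height suggest where to cut it.
tree = linkage(X, method="ward")
hier_labels = fcluster(tree, t=3, criterion="maxclust")

# DBSCAN assigns the label -1 to points in low-density regions (noise).
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

n_db_clusters = len(set(db_labels.tolist()) - {-1})
n_noise = int((db_labels == -1).sum())
print(n_db_clusters, n_noise)
```

When the two methods agree on the main groups, as they do on this toy data, that agreement adds some confidence to the cluster count suggested by the elbow and silhouette checks.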
Conclusion
This work demonstrates analytical pattern discovery across three business domains using unsupervised learning. The wholesale analysis identified three actionable customer segments from purchasing data. The recommendation system generated plausible restaurant recommendations from sparse ratings. The workforce analysis revealed six employee segments with attrition rates ranging from 7% to 22%, enabling targeted retention strategies.
Key analytical contributions include showing that departing and remaining employees have nearly identical performance ratings (pointing to compensation and career progression rather than performance as attrition drivers), identifying long-distance commuters as the highest-attrition segment (22%), and developing segment-specific retention recommendations that address root causes rather than symptoms.
The analytical framework applies broadly to segmentation challenges across retail, SaaS, healthcare, and financial services where understanding customer or employee subpopulations drives strategic decision-making.
Project Context and Attribution
These projects were completed as part of Maven Analytics' Data Science in Python: Unsupervised Learning course, demonstrating proficiency in clustering analysis, dimensionality reduction, and collaborative filtering applied to business segmentation challenges.
Course Information
- Course: Data Science in Python: Unsupervised Learning
- Platform: Maven Analytics
- Topics Covered: K-Means and hierarchical clustering, DBSCAN, PCA for dimensionality reduction, collaborative filtering, and cluster interpretation
Development Environment and Tools
- Python 3.x
- Jupyter Notebook
- pandas - Data manipulation and feature engineering
- NumPy - Numerical operations
- scikit-learn - K-Means, Agglomerative Clustering, DBSCAN, PCA, StandardScaler, TruncatedSVD, cosine similarity
- Matplotlib & Seaborn - Visualization (dendrograms, heatmaps, scatter plots)
- SciPy - Hierarchical clustering and dendrogram generation
Key Techniques Applied
- Elbow method and silhouette analysis for optimal cluster selection
- Feature standardization and normalization
- One-hot encoding for categorical variables
- Matrix factorization with SVD for dimensionality reduction
- Cosine similarity for user-based collaborative filtering
- PCA for 2D visualization of high-dimensional clusters
- Iterative clustering refinement based on business interpretability