Maven Music Customer Churn Analysis

Maven Music: Customer Churn Analysis and Data Preparation

Maven Music, a music streaming service, has been experiencing higher-than-usual customer churn. This project applies the complete data science workflow—from data gathering through feature engineering—to investigate churn patterns and prepare a modeling-ready dataset using three months of customer subscription and listening history data.

The complete project repository and notebook can be found at Maven Music Churn Analysis Repository

Project Objective

The project goal is to use supervised learning to predict which customers are most likely to cancel their subscriptions, using the past three months of customer data including subscription details and listening history.

Data Sources

The analysis utilizes two primary data sources:

maven_music_customers.csv: Customer subscription information including Member Since date, Subscription Plan, Subscription Rate, Discount status, and Cancellation Date
maven_music_listening_history.xlsx: Multi-sheet Excel file containing:
- Listening history records (Customer ID, Session ID, Audio Order, Audio ID, Audio Type)
- Audio details (ID, Name, Genre, Popularity)
- Session information (Session ID, Session Log In Time)

Data Cleaning and Transformation

The data cleaning process addressed multiple data quality issues:

Data Type Conversions

Member Since: Converted from string to datetime format ('%m/%d/%y')
Cancellation Date: Converted from string to datetime format
Subscription Rate: Removed '$' symbol and converted to float (e.g., "$2.99" → 2.99)

Missing Data Treatment

Subscription Plan: 5 missing values filled with "Basic (Ads)" based on analysis showing $2.99 rate corresponded to this plan
Discount?: Converted to binary (1 for "Yes", 0 for missing/NaN)

Text Cleaning

Email field: Removed "Email: " prefix from addresses
Genre field: Identified and standardized inconsistencies (e.g., "Pop" vs "Pop Music")

Feature Creation

Cancelled: Binary indicator created from Cancellation Date (1 if date exists, 0 otherwise)
Duration: Calculated subscription duration for cancelled customers (Cancellation Date - Member Since)

Exploratory Data Analysis

The EDA phase revealed several patterns in customer behavior and listening preferences:

Subscription Patterns

30 total customers in the dataset
13 customers cancelled (43% churn rate)
Subscription plans: 17 Basic (Ads), 13 Premium (No Ads)
7 customers received discounts
Cancelled customers had subscription durations ranging from 61-80 days

Listening Behavior Analysis

Analysis of 505 listening records across 90 sessions revealed:

Genre Popularity:
- Pop: 267 listens (52.9%)
- Hip Hop: 88 listens (17.4%)
- Country: 68 listens (13.5%)
- Jazz: 48 listens (9.5%)
- Comedy (podcast): 19 listens (3.8%)
- True Crime (podcast): 15 listens (3.0%)
Audio Type Distribution: 463 songs (91.7%) vs 42 podcasts (8.3%)
Session Activity: Number of sessions per customer ranged from 2 to 60

Feature Engineering for Predictive Modeling

The feature engineering process created a modeling-ready dataset with one row per customer and the following features:

Target Variable

Cancelled: Binary indicator (1 = cancelled, 0 = active)

Engineered Features

Discount? (Binary, 0/1)
Indicates whether customer received a subscription discount
Num Sessions (Integer)
Count of listening sessions per customer, calculated using:
```
num_sessions = history.groupby('Customer ID')['Session ID'].count()
```
Range: 2-60 sessions across customers
Percent Pop (Float, 0-100)
Percentage of listening history consisting of Pop music

Calculation steps:
- Created dummy variables for each genre using pd.get_dummies()
- Grouped by Customer ID and summed genre counts
- Calculated total audio items per customer
- Computed percentage: (Pop count / Total Audio) * 100
Percent Podcasts (Float, 0-100)
Percentage of listening history consisting of podcasts (Comedy and True Crime genres)

Calculated as: ((Comedy + True Crime) / Total Audio) * 100

Data Integration

The final modeling dataset was constructed through systematic merging:

Started with customer base features (Customer ID, Cancelled, Discount?)
Merged listening session counts
Integrated genre preference percentages
Result: Clean DataFrame with all numeric, non-null features ready for machine learning

Correlation Analysis and Key Findings

Correlation analysis of the modeling features revealed which variables have the strongest relationship with customer churn:

Feature Correlations with Churn

Feature	Correlation with Cancelled	Interpretation
Percent Pop	0.586	Strong positive - customers who listen to more Pop music are more likely to churn
Discount?	0.472	Moderate positive - customers with discounts have higher churn rates
Num Sessions	-0.237	Weak negative - customers with more sessions are slightly less likely to churn
Percent Podcasts	-0.035	Very weak negative - minimal relationship with churn

Key Insights

Most Predictive Features: Discount? and Percent Pop show the highest correlation with churn, making them the most valuable features for predictive modeling
Genre Preferences: Customers with higher Pop listening percentages are significantly more likely to cancel, suggesting potential content or recommendation issues for this segment
Engagement Patterns: Lower session counts correlate with higher churn, though the relationship is relatively weak
Discount Paradox: Customers receiving discounts show higher churn rates, potentially indicating these were retention attempts for already at-risk customers

Additional Feature Relationships

Percent Pop and Percent Podcasts have strong negative correlation (-0.487), indicating customers tend to specialize in one content type
Num Sessions shows weak correlation with Percent Pop (0.210), suggesting engagement level has minimal relationship with genre preference
Num Sessions and Percent Podcasts are negatively correlated (-0.495), suggesting podcast listeners may have shorter but more focused sessions

Next Steps and Extensions

The modeling-ready dataset enables several natural extensions:

Building classification models (logistic regression, decision trees, random forests) to predict churn probability
Developing customer segmentation using clustering on listening behavior patterns
Creating time-based features from session timestamps to capture temporal engagement patterns
Expanding feature set with additional behavioral metrics (session duration, skip rates, playlist usage)
A/B testing retention interventions targeted at high-risk segments

This analytical framework applies broadly to subscription-based businesses facing similar churn challenges across streaming media, SaaS platforms, and digital content services.

Conclusion

This project demonstrates a complete data preparation and EDA workflow for customer churn analysis. Starting with raw CSV and Excel data, the analysis systematically cleaned data, handled missing values, engineered behavioral features, and identified key predictors of churn.

The correlation analysis revealed that Discount? (r=0.47) and Percent Pop (r=0.59) are the strongest predictors of cancellation, providing clear direction for future predictive modeling efforts. The modeling-ready dataset—with 30 customer records and 5 numeric, non-null features—is prepared for classification algorithms such as logistic regression, decision trees, or random forests.

Beyond preparing data for modeling, the analysis surfaced actionable insights: customers who listen primarily to Pop music and those receiving discounts show significantly higher churn rates. These findings suggest targeted retention strategies could focus on improving content recommendations for Pop listeners and re-evaluating discount effectiveness.

This work exemplifies foundational data science skills: data gathering from multiple sources, systematic data cleaning, exploratory analysis with visualization, feature engineering, and preparation of modeling-ready datasets—all documented in a reproducible Jupyter notebook workflow.

Project Context and Attribution

This project serves as the final project for Maven Analytics' Data Science in Python: Data Prep & EDA course, demonstrating comprehensive skills in data gathering, cleaning, exploratory analysis, and feature engineering.

Course Information

Course: Data Science in Python: Data Prep & EDA
Platform: Maven Analytics
Instructor: Alice Zhao, Data Scientist and Author of SQL Pocket Guide (O'Reilly)
Topics Covered: Project scoping, data gathering, data cleaning, EDA, feature engineering, and data preparation for modeling

Development Environment and Tools

Python 3.11.13
Jupyter Notebook (IPython kernel)
pandas - Data manipulation, cleaning, and feature engineering
NumPy - Numerical operations and conditional logic
seaborn - Correlation heatmap visualization

Key Techniques Applied

Multi-sheet Excel file reading with pandas
Datetime parsing and manipulation
Missing value analysis and imputation
String cleaning and transformation
Dummy variable creation with pd.get_dummies()
GroupBy aggregations for feature engineering
DataFrame merging for data integration
Correlation analysis for feature selection