Music Streaming Analytics

Maven Music: Customer Churn Analysis and Data Preparation

Maven Music, a music streaming service, has been experiencing higher-than-usual customer churn. This project applies the complete data science workflow—from data gathering through feature engineering—to investigate churn patterns and prepare a modeling-ready dataset using three months of customer subscription and listening history data.

The complete project repository and notebook can be found at Maven Music Churn Analysis Repository

Project Objective

The project goal is to use supervised learning to predict which customers are most likely to cancel their subscriptions, using the past three months of customer data including subscription details and listening history.

Data Sources

The analysis utilizes two primary data sources:

  • maven_music_customers.csv: Customer subscription information including Member Since date, Subscription Plan, Subscription Rate, Discount status, and Cancellation Date
  • maven_music_listening_history.xlsx: Multi-sheet Excel file containing:
    • Listening history records (Customer ID, Session ID, Audio Order, Audio ID, Audio Type)
    • Audio details (ID, Name, Genre, Popularity)
    • Session information (Session ID, Session Log In Time)

Data Cleaning and Transformation

The data cleaning process addressed multiple data quality issues:

Data Type Conversions

  • Member Since: Converted from string to datetime format ('%m/%d/%y')
  • Cancellation Date: Converted from string to datetime format
  • Subscription Rate: Removed '$' symbol and converted to float (e.g., "$2.99" → 2.99)

Missing Data Treatment

  • Subscription Plan: 5 missing values filled with "Basic (Ads)" based on analysis showing $2.99 rate corresponded to this plan
  • Discount?: Converted to binary (1 for "Yes", 0 for missing/NaN)

Text Cleaning

  • Email field: Removed "Email: " prefix from addresses
  • Genre field: Identified and standardized inconsistencies (e.g., "Pop" vs "Pop Music")

Feature Creation

  • Cancelled: Binary indicator created from Cancellation Date (1 if date exists, 0 otherwise)
  • Duration: Calculated subscription duration for cancelled customers (Cancellation Date - Member Since)

Exploratory Data Analysis

The EDA phase revealed several patterns in customer behavior and listening preferences:

Subscription Patterns

  • 30 total customers in the dataset
  • 13 customers cancelled (43% churn rate)
  • Subscription plans: 17 Basic (Ads), 13 Premium (No Ads)
  • 7 customers received discounts
  • Cancelled customers had subscription durations ranging from 61-80 days

Listening Behavior Analysis

Analysis of 505 listening records across 90 sessions revealed:

  • Genre Popularity:
    • Pop: 267 listens (52.9%)
    • Hip Hop: 88 listens (17.4%)
    • Country: 68 listens (13.5%)
    • Jazz: 48 listens (9.5%)
    • Comedy (podcast): 19 listens (3.8%)
    • True Crime (podcast): 15 listens (3.0%)
  • Audio Type Distribution: 463 songs (91.7%) vs 42 podcasts (8.3%)
  • Session Activity: Number of sessions per customer ranged from 2 to 60

Feature Engineering for Predictive Modeling

The feature engineering process created a modeling-ready dataset with one row per customer and the following features:

Target Variable

  • Cancelled: Binary indicator (1 = cancelled, 0 = active)

Engineered Features

  1. Discount? (Binary, 0/1)

    Indicates whether customer received a subscription discount

  2. Num Sessions (Integer)

    Count of listening sessions per customer, calculated using:

    num_sessions = history.groupby('Customer ID')['Session ID'].count()

    Range: 2-60 sessions across customers

  3. Percent Pop (Float, 0-100)

    Percentage of listening history consisting of Pop music

    Calculation steps:

    • Created dummy variables for each genre using pd.get_dummies()
    • Grouped by Customer ID and summed genre counts
    • Calculated total audio items per customer
    • Computed percentage: (Pop count / Total Audio) * 100
  4. Percent Podcasts (Float, 0-100)

    Percentage of listening history consisting of podcasts (Comedy and True Crime genres)

    Calculated as: ((Comedy + True Crime) / Total Audio) * 100

Data Integration

The final modeling dataset was constructed through systematic merging:

  1. Started with customer base features (Customer ID, Cancelled, Discount?)
  2. Merged listening session counts
  3. Integrated genre preference percentages
  4. Result: Clean DataFrame with all numeric, non-null features ready for machine learning

Correlation Analysis and Key Findings

Correlation analysis of the modeling features revealed which variables have the strongest relationship with customer churn:

Feature Correlations with Churn

Feature Correlation with Cancelled Interpretation
Percent Pop 0.586 Strong positive - customers who listen to more Pop music are more likely to churn
Discount? 0.472 Moderate positive - customers with discounts have higher churn rates
Num Sessions -0.237 Weak negative - customers with more sessions are slightly less likely to churn
Percent Podcasts -0.035 Very weak negative - minimal relationship with churn

Key Insights

  • Most Predictive Features: Discount? and Percent Pop show the highest correlation with churn, making them the most valuable features for predictive modeling
  • Genre Preferences: Customers with higher Pop listening percentages are significantly more likely to cancel, suggesting potential content or recommendation issues for this segment
  • Engagement Patterns: Lower session counts correlate with higher churn, though the relationship is relatively weak
  • Discount Paradox: Customers receiving discounts show higher churn rates, potentially indicating these were retention attempts for already at-risk customers

Additional Feature Relationships

  • Percent Pop and Percent Podcasts have strong negative correlation (-0.487), indicating customers tend to specialize in one content type
  • Num Sessions shows weak correlation with Percent Pop (0.210), suggesting engagement level has minimal relationship with genre preference
  • Num Sessions and Percent Podcasts are negatively correlated (-0.495), suggesting podcast listeners may have shorter but more focused sessions

Next Steps and Extensions

The modeling-ready dataset enables several natural extensions:

  • Building classification models (logistic regression, decision trees, random forests) to predict churn probability
  • Developing customer segmentation using clustering on listening behavior patterns
  • Creating time-based features from session timestamps to capture temporal engagement patterns
  • Expanding feature set with additional behavioral metrics (session duration, skip rates, playlist usage)
  • A/B testing retention interventions targeted at high-risk segments

This analytical framework applies broadly to subscription-based businesses facing similar churn challenges across streaming media, SaaS platforms, and digital content services.

Conclusion

This project demonstrates a complete data preparation and EDA workflow for customer churn analysis. Starting with raw CSV and Excel data, the analysis systematically cleaned data, handled missing values, engineered behavioral features, and identified key predictors of churn.

The correlation analysis revealed that Discount? (r=0.47) and Percent Pop (r=0.59) are the strongest predictors of cancellation, providing clear direction for future predictive modeling efforts. The modeling-ready dataset—with 30 customer records and 5 numeric, non-null features—is prepared for classification algorithms such as logistic regression, decision trees, or random forests.

Beyond preparing data for modeling, the analysis surfaced actionable insights: customers who listen primarily to Pop music and those receiving discounts show significantly higher churn rates. These findings suggest targeted retention strategies could focus on improving content recommendations for Pop listeners and re-evaluating discount effectiveness.

This work exemplifies foundational data science skills: data gathering from multiple sources, systematic data cleaning, exploratory analysis with visualization, feature engineering, and preparation of modeling-ready datasets—all documented in a reproducible Jupyter notebook workflow.

Project Context and Attribution

This project serves as the final project for Maven Analytics' Data Science in Python: Data Prep & EDA course, demonstrating comprehensive skills in data gathering, cleaning, exploratory analysis, and feature engineering.

Course Information

  • Course: Data Science in Python: Data Prep & EDA
  • Platform: Maven Analytics
  • Instructor: Alice Zhao, Data Scientist and Author of SQL Pocket Guide (O'Reilly)
  • Topics Covered: Project scoping, data gathering, data cleaning, EDA, feature engineering, and data preparation for modeling

Development Environment and Tools

  • Python 3.11.13
  • Jupyter Notebook (IPython kernel)
  • pandas - Data manipulation, cleaning, and feature engineering
  • NumPy - Numerical operations and conditional logic
  • seaborn - Correlation heatmap visualization

Key Techniques Applied

  • Multi-sheet Excel file reading with pandas
  • Datetime parsing and manipulation
  • Missing value analysis and imputation
  • String cleaning and transformation
  • Dummy variable creation with pd.get_dummies()
  • GroupBy aggregations for feature engineering
  • DataFrame merging for data integration
  • Correlation analysis for feature selection