Spam Filter Detection

The Problem

Email marketing engagement metrics (opens and clicks) are contaminated by automated spam filter activity. Spam filters automatically open emails and click links during their scanning process, inflating engagement statistics. Customers requested a product that could separate genuine human engagement from spam filter activity to provide accurate campaign performance metrics. No competitors offered this capability.

I was hired as a Data Scientist by Act-On Software to lead this ML initiative.

Research & Technical Approach

Reviewed academic literature on bot detection in web analytics, identifying it as a problem analogous to spam filter classification. The research indicated that machine learning approaches, specifically Random Forest classifiers, outperformed rule-based methods, neural networks, and gradient-boosted classifiers at distinguishing automated from human behavior in similar domains.

Defined the problem as binary classification: each email engagement event (open or click) needed to be classified as either a human or a spam filter interaction. The model also had to handle event sequences in which both human and spam filter activity could be present.
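A minimal sketch of how a single engagement event might be represented for this task; the field names are illustrative, not the production schema:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class EngagementEvent:
        event_type: str               # "open" or "click"
        timestamp: float              # epoch seconds of the event
        user_agent: str               # raw user agent string from the request
        ip_address: str               # source IP of the request
        sender_email: str             # campaign sender address
        receiver_email: str           # recipient address
        label: Optional[int] = None   # 1 = spam filter, 0 = human, None = unlabeled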

Feature Engineering & Challenging Initial Assumptions

Engineering initially proposed features based on their assumptions about spam filter behavior, emphasizing timing patterns and IP address providers. Analysis of engagement data revealed these features had significant limitations:

  • Timing patterns: While spam filters often act immediately, legitimate users also open emails instantly (test emails, expected messages, mobile notifications). Spam filters can also delay actions, either deliberately to disguise themselves or because of processing queues, making timing alone unreliable.
  • IP address providers: Certain providers correlate strongly with spam filters, but mapping the entire IP address space to providers is impractical, and IP-to-provider mappings shift over time, requiring constant updates.

Built comparative models on Engineering's proposed features and on alternative feature sets, demonstrating through performance metrics that the initial assumptions produced inferior predictions. This evidence-based approach shifted feature selection toward data-driven decisions.
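A sketch of how such a comparison could be run, assuming a labeled pandas DataFrame `df`; the column names, feature sets, and hyperparameters below are illustrative, not the actual production features:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Hypothetical feature sets; `df` is assumed to be a labeled engagement DataFrame.
    proposed = ["seconds_since_send", "ip_provider_id"]  # timing + IP provider
    alternative = ["open_to_click_seconds", "browser_version_id",
                   "is_privacy_protected", "is_spoofed_ua"]

    for name, cols in [("proposed", proposed), ("alternative", alternative)]:
        scores = cross_val_score(
            RandomForestClassifier(n_estimators=200, random_state=0),
            df[cols], df["is_spam_filter"], cv=5, scoring="f1",
        )
        print(f"{name}: mean F1 = {scores.mean():.3f}")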

Collaborated with the Deliverability team to validate observed behavioral patterns against their customer-facing experience. Engineered features from engagement event data (see the sketch after this list), including:

  • Time between open and click events
  • User agent strings parsed into separate fields: browser name, version, privacy protection status, and spoofing indicators
  • IP address provider identification where reliably determinable
  • Sender/receiver email address matching (discovered through analysis of edge cases)
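A simplified pandas sketch of these transformations; the column names are assumptions, and the regex is a stand-in for the fuller user agent parsing described above:

    import pandas as pd

    def engineer_features(events: pd.DataFrame) -> pd.DataFrame:
        out = events.copy()
        # Time between the open and the first click on the same message
        # (assumes datetime64 columns).
        out["open_to_click_seconds"] = (
            out["first_click_at"] - out["opened_at"]
        ).dt.total_seconds()
        # Crude browser name/version extraction (e.g. "Chrome/119.0");
        # production parsing also split out privacy and spoofing fields.
        ua = out["user_agent"].str.extract(r"(?P<browser>[A-Za-z]+)/(?P<version>[\d.]+)")
        out["browser"] = ua["browser"]
        out["browser_version"] = ua["version"]
        # Customer self-tests: sender and receiver are the same address.
        out["sender_is_receiver"] = (
            out["sender_email"].str.lower() == out["receiver_email"].str.lower()
        )
        return out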

Class Imbalance and Training Set Curation

The dataset exhibited multi-faceted class imbalance:

  • Human interactions outnumbered spam filter interactions overall
  • Privacy-protected browsers dominated human interactions
  • Specific browser versions dominated spam filter interactions

Curated the training set to limit over-represented classes while ensuring sufficient representation of all interaction types. This curation enabled the model to learn subtle differences between behavioral patterns rather than simply memorizing the majority class characteristics.
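One way such curation could be implemented, capping each stratum's size while keeping every interaction type represented; the stratum definition and cap are assumptions for illustration:

    import pandas as pd

    def curate_training_set(df: pd.DataFrame, cap: int = 5000, seed: int = 0) -> pd.DataFrame:
        # Stratify by the label plus the traits that dominated each class,
        # then cap each stratum so no single pattern swamps training.
        strata = ["is_spam_filter", "is_privacy_protected", "browser_version"]
        return (
            df.groupby(strata, group_keys=False)
              .apply(lambda g: g.sample(n=min(len(g), cap), random_state=seed))
              .reset_index(drop=True)
        )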

Solving Edge Cases Through Data Analysis

During model evaluation, the Deliverability team raised concerns about customers who send test emails to themselves and immediately open them across multiple browsers and devices. These legitimate test scenarios exhibited behavioral patterns nearly identical to spam filter activity.

Developed a sender/receiver matching feature to identify when the sending and receiving email addresses were identical, allowing the model to distinguish customer testing behavior from spam filter activity. Refined the annotation process to incorporate this feature alongside existing criteria including browser spoofing indicators, privacy protection status, and browser version patterns.
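A hypothetical pre-annotation heuristic combining these criteria; the field names are illustrative, and annotators made the final call on anything the rules left ambiguous:

    from typing import Optional

    def suggest_label(event: dict) -> Optional[int]:
        # Self-test sends mimic spam filter behavior but are human activity.
        if event["sender_is_receiver"]:
            return 0  # human (customer testing)
        # A spoofed user agent is a strong spam filter signal.
        if event["is_spoofed_ua"]:
            return 1  # spam filter
        # Otherwise defer to a human annotator.
        return None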

Model Development and Selection

Evaluated Random Forest, XGBoost, and CatBoost classifiers using F1 score as the primary metric because it balances precision and recall. Random Forest achieved the best performance on the initial training data and was selected for production deployment.
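A sketch of the three-way comparison, assuming a feature matrix `X` and label vector `y` from the curated training set; the hyperparameters are placeholders:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier
    from catboost import CatBoostClassifier

    candidates = {
        "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
        "xgboost": XGBClassifier(eval_metric="logloss", random_state=0),
        "catboost": CatBoostClassifier(verbose=0, random_state=0),
    }
    for name, model in candidates.items():
        f1 = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
        print(f"{name}: F1 = {f1:.3f}")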

After one year in production, I initiated a model refresh to address spam filter evolution. Spam filter behavior changes continuously as providers update their detection mechanisms. Comparative evaluation of Random Forest and CatBoost on newly annotated data showed CatBoost better handled the test email edge cases and adapted to evolved spam filter patterns. Recommended switching to CatBoost for the production model update.

The final model achieved an F1 score of 92%.

Data Annotation Process Design

Designed and managed the data annotation process in which interns labeled training examples. Built Snowflake dashboards to monitor annotation quality and identify systematic issues (an illustrative query follows the list):

  • Class distribution tracking to ensure adequate representation of less-frequent interaction types
  • Common annotation error identification and correction workflows
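An illustrative version of the class-distribution query behind one dashboard, using the Snowflake Python Connector; the connection parameters and table/column names are assumptions:

    import snowflake.connector

    # Placeholder credentials; real values came from the deployment environment.
    conn = snowflake.connector.connect(
        account="...", user="...", password="...",
        warehouse="...", database="...", schema="...",
    )
    query = """
        SELECT label, browser, COUNT(*) AS n
        FROM annotated_events
        GROUP BY label, browser
        ORDER BY n DESC
    """
    # Review counts to confirm less-frequent interaction types stay represented.
    for label, browser, n in conn.cursor().execute(query):
        print(label, browser, n)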

Wrote white papers for product management and upper management explaining the annotation methodology and model behavior when results differed from initial expectations.

Business Impact

The model became a customer-facing product feature providing accurate email engagement reporting. This enabled customers to:

  • Identify and contact providers to request removal from spam filtering lists
  • Detect and remove spam trap addresses from their contact lists
  • Evaluate campaign effectiveness based on genuine human engagement rather than inflated metrics
  • Work with the Deliverability team to develop strategies for improving legitimate engagement rates

Development Environment

  • Python
  • NumPy
  • Pandas
  • scikit-learn
  • Random Forest
  • XGBoost
  • CatBoost
  • Matplotlib
  • Modin
  • Ray
  • MLflow
  • Snowflake
  • Snowflake Python Connector
  • Git
  • Bitbucket
  • JupyterLab
  • Visual Studio Code
  • ChatGPT
  • Claude