Contact List Data Quality Analysis
Strategic Context
The VP of Engineering at Act-On Software asked me to investigate potential AI Agent applications within the company's product suite. Through stakeholder interviews with the product team, I identified a recurring customer pain point: messy contact list data that degraded the effectiveness of email marketing campaigns.
Product management had prioritized contact list cleanup as a high-value feature. Initial discussions assumed this problem required AI solutions given its complexity and the volume of data involved. The question posed: could we build an AI-powered system to automatically clean customer contact lists?
Data Quality Investigation
I conducted systematic analysis of customer contact list data to identify specific data quality issues and their prevalence. The analysis revealed four distinct categories of problems:
Capitalization Inconsistencies
Contact names and place names exhibited inconsistent capitalization patterns: "John Smith" vs "john smith" vs "JOHN SMITH", "Seattle" vs "seattle" vs "SEATTLE". Analysis of field distributions showed this affected a significant proportion of contact records across customer databases.
Postal Abbreviation Inconsistencies
State and province names appeared in multiple formats: full names ("Washington", "California"), standard postal abbreviations ("WA", "CA"), and non-standard variations. This inconsistency complicated geographic segmentation and analysis.
Partial Location Information
Many contact records contained incomplete location data: city present but state missing, zip code present but city missing, or other combinations of partial geographic information. Cross-referencing analysis showed this missing information could often be inferred from the fields that were populated.
Data in Wrong Columns
The most challenging issue: city, state, and zip code information frequently appeared in incorrect database columns. For example, state names appearing in the city field, or zip codes appearing in address line fields. Pattern analysis revealed this was widespread but inconsistent - no single rule could reliably identify all misplaced data.
Methodology Evaluation
For each category of data quality issue, I evaluated which technical approaches could solve the problem effectively and efficiently:
Capitalization: Standard NLP Techniques
Testing demonstrated that standard NLP name capitalization libraries handled the vast majority of capitalization issues correctly. These rule-based systems understood proper noun capitalization patterns and required no training data or complex models.
Missing Location Information: Lookup Tables
Created lookup tables from postal service databases mapping cities to states, zip codes to cities and states, and other geographic relationships. When partial location information was present, table lookups could reliably fill in missing fields. This approach was deterministic, fast, and required no model training.
Postal Abbreviations: Standard References
Postal abbreviations could be normalized using standard reference tables. Similar to the capitalization problem, this required no machine learning - just systematic application of known mappings.
Wrong Column Data: LLM Required
Attempted multiple rule-based approaches to identify and relocate misplaced data. Pattern matching rules proved too brittle - they either missed legitimate cases or incorrectly flagged valid data. The problem required understanding context and semantic meaning to distinguish "Washington" the city from "Washington" the state, or to recognize when "98101" appearing in an address field was actually a zip code.
Testing showed LLMs could reliably identify misplaced data and determine the correct destination columns. The model's ability to understand semantic context made it robust where rule-based systems failed. However, this was the only component of the contact list cleanup problem that actually required AI.
Analysis and Recommendations
Documented findings in a white paper for upper management and product:
- Four distinct categories of data quality issues with quantified prevalence in customer databases
- Appropriate technical solutions for each category
- Evidence demonstrating that three of four problems could be solved with standard NLP techniques and lookup tables
- LLM application required only for wrong column detection where rule-based approaches proved inadequate
- Implementation approach and resource requirements for each component
Outcome
After reviewing the white paper, upper management and product deprioritized the contact list cleanup project. The analysis provided the technical foundation and resource requirements that informed this strategic resource allocation decision.
Development Environment
- Python
- Jupyter Lab
- Pandas
- NumPy
- spaCy
- LangChain
- Snowflake
- Snowflake Python Connector
- Git
- Bitbucket