NLP Data Extraction

The Problem

A client conducting price-per-quantity and price distribution analysis on Washington state public sales data encountered a data quality issue blocking their work. The dataset's structured quantity field had missing values for a substantial portion of records, while quantity information was embedded in unstructured product description and name fields. The client had no viable approach for extracting this information at scale.

Data Investigation and Problem Diagnosis

Examined the data to understand why standard string extraction methods would fail. The dataset aggregated records from multiple manufacturers, each using their own product naming and description conventions with no data entry enforcement. This created several challenges:

  • Inconsistent formatting: quantity values appeared as "3.5g", "3.5 grams", "3.5 G", and other variations
  • Variable spacing: no consistent spacing between numeric values and units
  • Variable positioning: quantities appeared at different locations within the text (beginning, middle, or end of descriptions)
  • Multiple manufacturers: each source had distinct formatting styles and conventions
  • No extractable patterns: the lack of standardization meant regex or positional extraction approaches would not scale

The fundamental issue was data heterogeneity from unsupervised multi-source collection. Any extraction approach needed to handle format variations without relying on positional or structural consistency.

Methodology Selection

Based on the data characteristics, selected spaCy's EntityRuler for pattern-based Named Entity Recognition. This approach could identify quantity patterns regardless of position within text while handling format variations through flexible pattern matching.

Defined extraction patterns matching numeric tokens followed by weight unit tokens (g, kg, grams, gram, mg, milligrams, ounce, ounces, oz, lb), with case-insensitive matching to handle capitalization variations. This pattern-based approach could extract quantities from any position in the text without requiring positional consistency.
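The approach above can be sketched as spaCy token patterns. This is a minimal illustration, not the exact pattern set used in the project: a two-token pattern for spaced values ("3.5 grams") plus a single-token regex pattern for values written without a space ("3.5g").

```python
# Token patterns for spaCy's EntityRuler: a number token followed by a
# weight-unit token. Unit list taken from the methodology above.
UNITS = ["g", "kg", "mg", "gram", "grams", "milligrams",
         "ounce", "ounces", "oz", "lb"]

QUANTITY_PATTERNS = [
    # Two tokens: "3.5" + "Grams" (LOWER makes the unit match case-insensitive)
    {"label": "QUANTITY",
     "pattern": [{"LIKE_NUM": True}, {"LOWER": {"IN": UNITS}}]},
    # One token: "3.5g" / "100MG" (case-insensitive regex on the raw token,
    # for values the tokenizer keeps as a single token)
    {"label": "QUANTITY",
     "pattern": [{"TEXT": {"REGEX": r"(?i)^\d+(\.\d+)?("
                                    + "|".join(UNITS) + r")$"}}]},
]
```

Because matching operates on tokens rather than character positions, the same patterns fire wherever the quantity appears in the text.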

Model Testing and Validation

Initial implementation used spaCy's pre-trained English models, following standard tutorial approaches. Tested the small, medium, and large pre-trained models; all produced poor results. The pre-trained models' statistical Named Entity Recognition component conflicted with the custom EntityRuler patterns.

Further research identified an alternative approach using blank spaCy models without pre-trained components. Implemented extraction using a blank model that relied exclusively on the custom EntityRuler patterns. Validated results through spot-checking extracted quantities against source descriptions.

The blank model approach substantially outperformed all pre-trained models, providing reliable extraction across the heterogeneous formatting in the dataset. This contradicted standard spaCy documentation recommendations but proved appropriate for this specific data problem.
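A minimal sketch of the blank-model pipeline described above. The label name and example text are illustrative, not the client's data; the pattern shown is the spaced two-token form only.

```python
import spacy

# Blank English pipeline: tokenizer only, with no pre-trained NER
# component to conflict with the rule-based patterns.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "QUANTITY",
     "pattern": [{"LIKE_NUM": True},
                 {"LOWER": {"IN": ["g", "kg", "mg", "gram", "grams",
                                   "milligrams", "ounce", "ounces",
                                   "oz", "lb"]}}]},
])

doc = nlp("Premium Indoor Flower 3.5 Grams")
quantities = [(ent.text, ent.label_) for ent in doc.ents]
print(quantities)  # the "3.5 Grams" span, tagged QUANTITY
```

With no statistical components in the pipeline, the EntityRuler's patterns are the sole source of entity annotations, which is what makes the extraction deterministic across the heterogeneous formats.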

Implementation Details

When product descriptions were null, used product names as the extraction source to maximize coverage. For records containing multiple quantity values, selected the last occurrence, which typically represented the product's actual quantity rather than contextual references.
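The fallback and last-occurrence logic can be sketched as follows. Column names and sample rows are hypothetical, chosen only to exercise both rules: a null description falling back to the product name, and a description with two quantity matches keeping the last.

```python
import pandas as pd
import spacy

# Blank pipeline with the rule-based patterns, as in the methodology above.
nlp = spacy.blank("en")
nlp.add_pipe("entity_ruler").add_patterns([
    {"label": "QUANTITY",
     "pattern": [{"LIKE_NUM": True},
                 {"LOWER": {"IN": ["g", "kg", "mg", "gram", "grams",
                                   "milligrams", "ounce", "ounces",
                                   "oz", "lb"]}}]},
])

def extract_quantity(description, name):
    """Use the description when present, else fall back to the product
    name; when several quantities match, keep the last occurrence."""
    text = description if pd.notna(description) else name
    ents = [e.text for e in nlp(text).ents if e.label_ == "QUANTITY"]
    return ents[-1] if ents else None

df = pd.DataFrame({
    "product_name": ["Brand X Gummies 100 mg", "Brand Y Flower"],
    "description": [None, "Sampler: 1 g preroll plus 3.5 g flower"],
})
df["quantity"] = [extract_quantity(d, n)
                  for d, n in zip(df["description"], df["product_name"])]
```

Rows with no extractable quantity in either field come back as `None`, which is the coverage limitation communicated to stakeholders.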

Processed the dataset through the extraction pipeline and added the structured quantity data as a new column. Communicated methodology and limitations to stakeholders, noting that not all records contained extractable quantity information and explaining the dual-source approach (description field with product name fallback).

Outcome

Delivered the enriched dataset with extracted quantity data in under one week. The structured quantity information enabled the client to complete their price-per-quantity and distribution analysis. The client confirmed the extracted data met their analytical requirements.

Development Environment

  • Python
  • Jupyter Notebook
  • spaCy
  • spaCy EntityRuler
  • Pandas
  • NumPy