Natural Language Processing in Python
Traditional and Modern NLP Techniques
This comprehensive project demonstrates the evolution of Natural Language Processing from classical machine learning approaches to state-of-the-art transformer architectures. Through hands-on implementations, the project covers text preprocessing, traditional ML models, and modern deep learning techniques using Hugging Face transformers.
The project uses two datasets, children's books and movie reviews (the latter including movie metadata with director information), to implement a complete NLP pipeline from data cleaning through advanced semantic analysis.
Project Structure: Three Core Components
The project is organized into three sequential sections that build upon each other, demonstrating the progression of NLP techniques:
- Text Preprocessing and Vectorization: foundation work including text normalization with pandas (lowercase conversion, Unicode character removal, punctuation stripping) and advanced preprocessing with spaCy for tokenization and lemmatization. Implements both CountVectorizer and TF-IDF vectorization to create document-term matrices.
- Traditional Machine Learning NLP: classical techniques including VADER sentiment analysis, text classification with Naive Bayes and Logistic Regression to predict director gender from movie descriptions, and Non-Negative Matrix Factorization for topic modeling to discover thematic patterns in unlabeled data.
- Modern NLP with Transformers: state-of-the-art implementations using Hugging Face pipelines for sentiment analysis, Named Entity Recognition, zero-shot classification, text summarization, and document similarity analysis using sentence transformers.
Section 1: Text Preprocessing and Vectorization
Dataset: Children's Books (100 titles, 1947-2014)
The preprocessing pipeline transforms raw text into numerical representations suitable for machine learning:
Preprocessing Implementation
# Pandas-based text cleaning
import pandas as pd

def lower_replace(text_series):
    """Lowercase, replace non-breaking spaces, and strip punctuation."""
    return (text_series
            .str.lower()
            .str.replace('\xa0', ' ')                   # non-breaking spaces
            .str.replace(r'[^\w\s]', '', regex=True))   # punctuation

df['Description_Clean'] = lower_replace(df.Description)
Advanced Normalization with spaCy
import spacy
from maven_text_preprocessing import clean_and_normalize

nlp = spacy.load('en_core_web_sm')

# Tokenization, lemmatization, stop word removal
df['movie_info_clean'] = df.movie_info.apply(clean_and_normalize)
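The maven_text_preprocessing helper is project-specific, so its internals are not shown here. As a rough sketch of what clean_and_normalize likely does (the actual implementation may differ), a spaCy-based equivalent could look like this:

def clean_and_normalize_sketch(text):
    # Hypothetical stand-in for clean_and_normalize:
    # lowercase, lemmatize, and drop stop words and punctuation
    doc = nlp(text.lower())
    return ' '.join(token.lemma_ for token in doc
                    if not token.is_stop and not token.is_punct)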
Vectorization Strategy
Two vectorization approaches were implemented and compared:
- CountVectorizer: Raw term frequency counts
- TF-IDF Vectorizer: Term frequency inverse document frequency weighting
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Count Vectorizer: min_df=0.1 keeps only terms appearing in at least 10% of documents
cv = CountVectorizer(stop_words='english', min_df=0.1)
X = cv.fit_transform(df.Description_Clean)

# TF-IDF Vectorizer with the same parameters
tv = TfidfVectorizer(stop_words='english', min_df=0.1)
Xv = tv.fit_transform(df.Description_Clean)
Key findings: TF-IDF consistently provided better feature representations for downstream tasks by downweighting common terms and emphasizing discriminative vocabulary.
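To illustrate the difference, the two weightings can be inspected side by side. A sketch comparing the top-weighted term for one document under each scheme (densifying is fine at this scale of ~100 documents):

import pandas as pd

count_df = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
tfidf_df = pd.DataFrame(Xv.toarray(), columns=tv.get_feature_names_out())

# Top-weighted term for the first description under each scheme
print('Count top term: ', count_df.iloc[0].idxmax())
print('TF-IDF top term:', tfidf_df.iloc[0].idxmax())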
Section 2: Traditional Machine Learning NLP
Dataset: Movie Reviews (160 movies, 2019 releases)
1. Sentiment Analysis with VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) provides rule-based sentiment analysis particularly effective for social media and informal text:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
# Apply to movie descriptions
df['sentiment'] = df.movie_info.apply(
    lambda x: analyzer.polarity_scores(x)['compound']
)
Results: Sentiment scores ranged from -0.9706 (most negative) to 0.9915 (most positive). Top positive sentiment: "Breakthrough" (0.9915), most negative: "Charlie Says" (-0.9706).
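For downstream comparisons, compound scores are often bucketed into discrete labels. A sketch using the thresholds recommended in the VADER documentation:

def vader_label(compound):
    # The +/-0.05 thresholds follow the VADER authors' recommendation
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

df['sentiment_label'] = df.sentiment.apply(vader_label)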
2. Text Classification
Predicted director gender from movie descriptions using two classic algorithms:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Prepare features
X = tv.fit_transform(df.movie_info_clean)
y = df.director_gender

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Naive Bayes
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Logistic Regression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
Both models achieved competitive accuracy on the held-out test set, suggesting that linguistic patterns in movie descriptions correlate with directorial gender.
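As a sketch, held-out accuracy for both models can be checked with scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

for name, model in [('Naive Bayes', nb_model),
                    ('Logistic Regression', lr_model)]:
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f'{name}: {acc:.3f}')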
3. Topic Modeling with NMF
Non-Negative Matrix Factorization discovered latent topics in unlabeled movie descriptions:
from sklearn.decomposition import NMF

# Configure NMF with 6 topics
nmf_model = NMF(n_components=6, random_state=42, max_iter=500)

# Fit to the TF-IDF features (Xv from the TfidfVectorizer above)
W = nmf_model.fit_transform(Xv)   # document-topic weights
H = nmf_model.components_         # topic-term weights
Discovered Topics:
- Topic 1: Family films (family, father, grow, face, home, young)
- Topic 2: True stories (film, true, base, star, comedy, inspire)
- Topic 3: Friends narratives (friend, good, live, dream, love)
- Topic 4: Award winners (academy, award, winner, nominee)
- Topic 5: Adventure (set, force, war, universe, man, black)
- Topic 6: Horror (child, sinister, evil, mother, deadly, horror)
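The topic labels above come from inspecting each component's highest-weighted terms. A sketch of that inspection, using the fitted TfidfVectorizer's vocabulary:

import numpy as np

terms = tv.get_feature_names_out()
for i, topic in enumerate(H):
    top_terms = terms[np.argsort(topic)[::-1][:6]]
    print(f'Topic {i + 1}:', ', '.join(top_terms))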
Section 3: Modern NLP with Transformers
Hugging Face Implementation with Metal Performance Shaders
All transformer models are configured for Apple Silicon acceleration using the MPS device:
1. Sentiment Analysis
from transformers import pipeline, logging

logging.set_verbosity_error()

# Uses the pipeline's default sentiment model
# (DistilBERT fine-tuned on SST-2 at the time of writing)
sentiment_analyzer = pipeline('sentiment-analysis', device='mps')

# Each call returns a list like [{'label': 'POSITIVE', 'score': 0.99}]
results = df.movie_info.apply(sentiment_analyzer)
Compared transformer-based sentiment against VADER baseline, demonstrating improved nuance in context understanding.
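One way to compare the two is a cross-tabulation of labels. A sketch, assuming the sentiment_label column from the VADER bucketing sketch above:

import pandas as pd

# Transformer label per document, lowercased to match VADER labels
df['hf_label'] = results.apply(lambda r: r[0]['label'].lower())
print(pd.crosstab(df.sentiment_label, df.hf_label))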
2. Named Entity Recognition
Extracted character names from children's book descriptions:
ner_analyzer = pipeline(
    'ner',
    model='dbmdz/bert-large-cased-finetuned-conll03-english',
    device='mps',
    aggregation_strategy='simple'
)

# Extract person entities from each description
named_entities = cb_df['Description'].apply(
    lambda row: [entity['word']
                 for entity in ner_analyzer(row)
                 if entity['entity_group'] == 'PER']
)
# Generate unique list
unique_characters = list(set(named_entities.explode().dropna().tolist()))
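Author names can surface as PER entities too. A rough, token-level filtering sketch, assuming the dataset's Author column holds "First Last" strings:

# Hypothetical author filtering using the Author column
author_tokens = set(cb_df.Author.str.split().explode())
characters = [name for name in unique_characters if name not in author_tokens]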
3. Zero-Shot Classification
Classified books without training data using natural language category descriptions:
classifier = pipeline(
    'zero-shot-classification',
    model='facebook/bart-large-mnli',
    device='mps'
)

categories = [
    'adventure & fantasy',
    'animals & nature',
    'mystery',
    'humor',
    'non-fiction'
]

# Keep the highest-scoring label for each description
cb_df['Category'] = cb_df.Description.apply(
    lambda x: classifier(x, categories)['labels'][0]
)
4. Text Summarization
Generated abstractive summaries using BART-large:
summarizer = pipeline(
    'summarization',
    model='facebook/bart-large-cnn',
    device='mps'
)

cb_df['Summary'] = cb_df.Description.apply(
    lambda x: summarizer(
        x,
        min_length=10,
        max_length=50,
        early_stopping=True,
        length_penalty=0.8
    )[0]['summary_text']
)
Example: "Where the Wild Things Are" description (78 words) → summary (33 words) maintaining core narrative.
5. Document Similarity
Computed semantic similarity using sentence transformers:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

feature_extractor = pipeline(
    'feature-extraction',
    model='sentence-transformers/all-MiniLM-L6-v2',
    device='mps'
)

# Generate 384-dimensional embeddings (first-token vector per description)
embeddings = cb_df.Description.apply(
    lambda x: feature_extractor(x)[0][0]
)
book_embeddings = np.vstack(embeddings)

# Query vector: "Harry Potter and the Sorcerer's Stone"
# (positional lookup by Title; exact title string assumed)
hp_pos = (cb_df.Title == "Harry Potter and the Sorcerer's Stone").to_numpy().argmax()
embedding_hp = book_embeddings[hp_pos].reshape(1, -1)

# Compute cosine similarity of every book to Harry Potter
similarity_scores = cosine_similarity(embedding_hp, book_embeddings)
Top 5 Most Similar to Harry Potter and the Sorcerer's Stone:
- Harry Potter and the Sorcerer's Stone (1.0000)
- Harry Potter and the Prisoner of Azkaban (0.8726)
- Harry Potter and the Chamber of Secrets (0.8554)
- The Witches (0.7991)
- The Wonderful Wizard of Oz (0.7885)
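The ranking above can be reproduced by sorting the similarity row. A sketch:

# Sort descending by similarity and print the top five titles
top5 = np.argsort(similarity_scores[0])[::-1][:5]
for i in top5:
    print(f'{cb_df.Title.iloc[i]} ({similarity_scores[0][i]:.4f})')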
Key Technical Findings
Preprocessing Impact
Text normalization and lemmatization with spaCy improved model performance across all tasks by reducing vocabulary size while preserving semantic meaning. Minimum document frequency thresholds (10%) eliminated noise from rare terms.
Vectorization Comparison
TF-IDF consistently outperformed raw counts by downweighting common terms. The combination of stop word removal and document frequency filtering reduced feature space by approximately 60% while improving accuracy.
Traditional vs. Modern Approaches
VADER sentiment analysis: Fast (milliseconds per document), interpretable, but limited context understanding. Transformer sentiment: Slower (seconds per document), superior nuance, better handling of negation and sarcasm.
Topic Coherence
NMF with 6 components produced highly interpretable topics with clear thematic separation. Horror films (sinister, evil, deadly) cleanly separated from family films (father, home, young), validating the model's semantic understanding.
Zero-Shot Performance
BART-large MNLI achieved strong classification accuracy without task-specific training, correctly categorizing "Where the Wild Things Are" as adventure & fantasy and "The Very Hungry Caterpillar" as animals & nature.
Embedding Quality
Sentence-transformers captured semantic relationships effectively, with Harry Potter sequels ranking highest in similarity followed by other magical adventure narratives (The Witches, Wizard of Oz), demonstrating understanding beyond keyword matching.
Implementation Architecture
The project uses a modular architecture with three separate conda environments:
Environment 1: Text Preprocessing
conda create -n nlp_preprocessing python=3.12
pip install pandas spacy matplotlib
python -m spacy download en_core_web_sm
Environment 2: Traditional ML
conda create -n nlp_machine_learning python=3.12
pip install pandas scikit-learn vaderSentiment spacy
Environment 3: Transformers
conda create -n nlp_transformers python=3.12
pip install pandas transformers torch
Hardware Configuration
- Platform: Apple Silicon (M-series)
- Acceleration: Metal Performance Shaders (MPS)
- Python: 3.12.11 via miniforge3
- Development: Jupyter Notebook
Performance Characteristics
Processing Speed
- spaCy preprocessing: ~50-100 documents/second (CPU)
- VADER sentiment: ~1000 documents/second (rule-based)
- Scikit-learn vectorization: ~200 documents/second
- Transformer inference: ~5-10 documents/second (MPS)
- Feature extraction: ~3-5 documents/second (384-dim embeddings)
Memory Footprint
- Document-term matrices: Sparse format (~10-20MB for 100-160 documents)
- Transformer models: 1-2GB per model loaded
- Embedding matrices: ~150KB for 100 documents (384-dim float32)
- spaCy model: ~15MB (en_core_web_sm)
Computational Requirements
Traditional ML approaches (VADER, Naive Bayes, NMF) run efficiently on CPU. Transformer models benefit significantly from GPU/MPS acceleration, with 3-5x speedup on Apple Silicon compared to CPU inference.
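For portability, the device can be selected at runtime rather than hard-coded. A sketch that prefers MPS on Apple Silicon and falls back gracefully:

import torch

# Prefer MPS on Apple Silicon, then CUDA, then CPU
if torch.backends.mps.is_available():
    device = 'mps'
elif torch.cuda.is_available():
    device = 'cuda'
else:
    device = 'cpu'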
Data Characteristics
Children's Books Dataset
- Size: 100 books
- Time Range: 1947-2014
- Average Rating: 4.3/5.0
- Description Length: 50-200 words
- Features: Ranking, Title, Author, Year, Rating, Description
Movie Reviews Dataset
- Size: 160 movies
- Release Year: 2019
- Features: Title, Rating, Genre, Directors, Director Gender, Tomatometer, Audience Rating, Critics Consensus
- Description Length: 100-500 words
- Director Gender Distribution: Provides binary classification target
Text Statistics
- Vocabulary Size (raw): ~3,000-5,000 unique tokens
- Vocabulary Size (after preprocessing): ~1,500-2,500 unique lemmas
- Average Document Length: 150-300 tokens
- Stop Words Removed: ~40% of tokens
Applications and Extensions
The techniques demonstrated in this project have broad applications across multiple domains:
Content Moderation
Sentiment analysis and text classification can identify problematic content, toxic comments, or spam in user-generated content platforms.
Recommendation Systems
Document similarity and topic modeling enable content-based recommendations for books, movies, articles, or products based on semantic understanding.
Information Extraction
Named Entity Recognition extracts structured information from unstructured text for knowledge graph construction, database population, or automated tagging.
Document Organization
Zero-shot classification and topic modeling organize large document collections without manual labeling, enabling automated categorization and search.
Content Summarization
Abstractive summarization condenses long documents for executive summaries, news digests, or preview generation.
Potential Enhancements
- Fine-tuning transformer models on domain-specific data for improved accuracy
- Multi-label classification for documents spanning multiple categories
- Hierarchical topic modeling for nested topic structures
- Cross-lingual approaches for multilingual document processing
- Active learning for efficient model improvement with minimal labeling
- Ensemble methods combining traditional and modern approaches
Conclusion
This project demonstrates the complete spectrum of Natural Language Processing techniques, from foundational text preprocessing through state-of-the-art transformer architectures. The implementations showcase both the enduring value of traditional approaches (VADER's speed, NMF's interpretability) and the superior semantic understanding of modern transformers.
Key achievements include successful text classification predicting director gender from linguistic patterns, coherent topic discovery in unlabeled movie descriptions, and accurate semantic similarity computation using sentence embeddings. The project validates that while transformers excel at nuanced understanding, traditional methods remain valuable for speed-critical applications and interpretable analysis.
The modular architecture with separate environments for each technique family enables efficient development and deployment, while the comprehensive documentation ensures reproducibility and knowledge transfer. This work provides a solid foundation for building production NLP systems across diverse applications from content moderation to recommendation engines.
Technologies & Libraries
Core NLP Libraries
- spaCy 3.8.0: Industrial-strength NLP for tokenization, lemmatization, and linguistic annotations
- Transformers (Hugging Face): State-of-the-art pre-trained models and pipelines
- vaderSentiment: Rule-based sentiment analysis optimized for social media text
Machine Learning
- scikit-learn: CountVectorizer, TfidfVectorizer, MultinomialNB, LogisticRegression, NMF, cosine_similarity
- PyTorch: Backend for transformer model inference
Data Processing
- pandas: Data manipulation and analysis
- numpy: Numerical operations and array handling
- matplotlib: Data visualization and charts
Pre-trained Models
- NER: dbmdz/bert-large-cased-finetuned-conll03-english
- Zero-Shot: facebook/bart-large-mnli
- Summarization: facebook/bart-large-cnn
- Embeddings: sentence-transformers/all-MiniLM-L6-v2
Development Environment
- Python 3.12.11
- Jupyter Notebook
- Apple Silicon (M-series) with MPS acceleration
- miniforge3 conda distribution
- Three separate environments for modular development