BERTopic — 20 Newsgroups

Overview

BERTopic is a topic modelling framework that combines sentence embeddings, dimensionality reduction, and density-based clustering to discover topics in a text corpus without requiring the number of topics to be specified in advance. This notebook is a step-by-step implementation of the full pipeline applied to the 20 Newsgroups dataset, a standard benchmark of approximately 18,800 Usenet posts across 20 discussion categories.

The notebook separates each stage of the pipeline — embeddings, UMAP reduction, HDBSCAN clustering — so the output at every step can be inspected before fitting the full model. This also makes it straightforward to substitute any component and observe the effect independently.

Dataset

The 20 Newsgroups dataset is loaded via sklearn.datasets.fetch_20newsgroups with subset='all' to combine train and test splits — the train/test distinction is irrelevant for unsupervised topic modelling. Headers, footers, and quoted replies are stripped with remove=('headers', 'footers', 'quotes'). Without this, the embedding model picks up on newsgroup routing metadata rather than actual post content, and the resulting topics reflect email infrastructure rather than discussion subjects.

Stripping sometimes leaves documents empty: posts that were pure metadata or consisted entirely of quoted replies. Empty strings yield embeddings with no meaningful content, which can cause problems for HDBSCAN, so they are removed before encoding. Of the 18,846 documents loaded, 515 were removed after stripping, leaving 18,331 for modelling.
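The loading and filtering steps described above can be sketched roughly as follows. The helper names (load_newsgroup_posts, drop_empty) are illustrative and not taken from the notebook, and treating whitespace-only posts as empty is an assumption.

```python
def load_newsgroup_posts():
    """Fetch every 20 Newsgroups post with headers, footers, and quotes stripped."""
    # Imported inside the function so drop_empty() can be used without scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    bunch = fetch_20newsgroups(
        subset="all",                             # train and test splits combined
        remove=("headers", "footers", "quotes"),  # strip routing metadata and quoted replies
    )
    return bunch.data  # list of raw post strings


def drop_empty(docs):
    """Keep only posts that still contain non-whitespace text (assumed definition of 'empty')."""
    return [doc for doc in docs if doc.strip()]


# Usage (downloads the dataset on the first call):
# docs = drop_empty(load_newsgroup_posts())
```

Comparing len() before and after drop_empty gives the loaded, removed, and retained counts.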

Pipeline

Embeddings

Documents are encoded with all-MiniLM-L6-v2, a distilled Sentence-BERT model that produces 384-dimensional embeddings. It is fast and memory-efficient and performs well on English text. The model is loaded once and the full corpus encoded in a single call with batch_size=64 and convert_to_numpy=True — BERTopic, UMAP, and HDBSCAN all expect NumPy arrays. The embeddings are stored and passed explicitly to BERTopic later, which tells BERTopic to skip its own internal encoding step. Encoding 18,331 documents takes approximately 19 seconds on Apple Silicon MPS.
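A minimal sketch of the encoding step, under a few assumptions: the function name encode_corpus and the optional model argument are hypothetical conveniences (the latter makes it easy to swap in a different encoder), and show_progress_bar=True is an added choice rather than something the notebook is known to use.

```python
def encode_corpus(docs, model=None, batch_size=64):
    """Encode all documents in one call and return the embeddings as a NumPy array."""
    if model is None:
        # Imported lazily so a stand-in or alternative encoder can be passed instead.
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional output
    return model.encode(
        docs,
        batch_size=batch_size,
        convert_to_numpy=True,
        show_progress_bar=True,
    )


# Usage: compute once, then hand the stored array to BERTopic so it skips its own encoding.
# embeddings = encode_corpus(docs)
# topics, probs = topic_model.fit_transform(docs, embeddings)
```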

Dimensionality Reduction

At 384 dimensions, the embeddings are too high-dimensional for HDBSCAN to cluster effectively: pairwise distances concentrate as dimensionality grows, so the density differences HDBSCAN relies on become hard to distinguish (the curse of dimensionality). UMAP is used to reduce the embeddings to 5 dimensions before clustering, with the cosine metric (direction matters more than magnitude for sentence embeddings) and min_dist=0.0 to produce tight clusters. UMAP is run outside BERTopic first so the intermediate output can be inspected before committing to the full pipeline. A separate, independent UMAP pass to 2 dimensions is run later for visualisation only; the 5-dimensional reduction used for clustering cannot be plotted directly.
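The two UMAP passes might be configured along these lines. Only n_components, the cosine metric, and min_dist=0.0 come from the description above; n_neighbors=15, random_state=42, the reuse of the clustering settings for the 2-D pass, and the name reduce_embeddings are all assumptions made for illustration.

```python
# Settings for the clustering reduction; n_neighbors and random_state are assumed values.
CLUSTER_UMAP = {
    "n_components": 5,
    "metric": "cosine",
    "min_dist": 0.0,
    "n_neighbors": 15,
    "random_state": 42,
}

# Independent 2-D pass for plotting only; reusing the other settings is an assumption.
PLOT_UMAP = {**CLUSTER_UMAP, "n_components": 2}


def reduce_embeddings(embeddings, params):
    """Fit a fresh UMAP model with the given parameters and return the reduced array."""
    from umap import UMAP  # provided by the umap-learn package

    return UMAP(**params).fit_transform(embeddings)


# Usage:
# reduced_5d = reduce_embeddings(embeddings, CLUSTER_UMAP)  # input to HDBSCAN
# reduced_2d = reduce_embeddings(embeddings, PLOT_UMAP)     # scatter plots only
```

Fixing random_state makes runs repeatable, at the cost of UMAP running single-threaded.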

Clustering

HDBSCAN groups documents by density rather than assigning every point to a centroid. This means the number of topics does not need to be specified in advance, topics can be any shape, and low-density points are labelled −1 (outlier) rather than forced into a cluster. min_cluster_size=15 sets the minimum number of documents required to form a topic. prediction_data=True is required by BERTopic for soft topic assignment and for outlier reduction to work. Running HDBSCAN outside BERTopic first confirms the cluster count and outlier rate before fitting the full model: 154 topics, 7,170 outliers (39.1%).
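A sketch of the standalone clustering check, assuming the Euclidean metric and the "eom" selection method (both library defaults; the notebook's actual values are not stated). The function names are hypothetical.

```python
def summarise_labels(labels):
    """Return (number of topics, number of outliers, outlier rate) for HDBSCAN labels."""
    labels = [int(label) for label in labels]
    n_outliers = labels.count(-1)          # HDBSCAN marks low-density points with -1
    n_topics = len(set(labels) - {-1})     # every other distinct label is a topic
    return n_topics, n_outliers, n_outliers / len(labels)


def cluster(reduced_embeddings, min_cluster_size=15):
    """Fit HDBSCAN on the 5-dimensional UMAP output and return the fitted clusterer."""
    from hdbscan import HDBSCAN

    clusterer = HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="euclidean",              # assumed (library default)
        cluster_selection_method="eom",  # assumed (library default)
        prediction_data=True,
    )
    clusterer.fit(reduced_embeddings)
    return clusterer


# Usage:
# clusterer = cluster(reduced_5d)
# n_topics, n_outliers, rate = summarise_labels(clusterer.labels_)
```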

Keyword Extraction

BERTopic applies c-TF-IDF (class-based TF-IDF) after clustering: it concatenates all documents in a topic into a single pseudo-document and computes TF-IDF across all pseudo-documents. This surfaces words that are frequent within a topic but rare across topics. The keyword representation is independent of the clustering step and can be updated without re-running UMAP or HDBSCAN.
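To make the weighting concrete, here is a small pure-Python illustration of the c-TF-IDF idea. It follows the formula given in the BERTopic documentation, weight = tf × log(1 + A / f), where tf is a term's share of the words in a topic, A is the average number of words per topic, and f is the term's total count across all topics. This is a simplified sketch rather than BERTopic's own implementation: it tokenises on whitespace and skips the vectoriser entirely.

```python
import math
from collections import Counter


def c_tf_idf(topic_docs):
    """Score terms per topic; topic_docs maps topic id -> list of document strings."""
    # Concatenate each topic's documents into one pseudo-document and count its terms.
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in topic_docs.items()
    }

    # f: total frequency of each term across all topics.
    total_per_term = Counter()
    for topic_counts in counts.values():
        total_per_term.update(topic_counts)

    # A: average number of words per topic.
    avg_words = sum(sum(c.values()) for c in counts.values()) / len(counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_per_term[term])
            for term, freq in topic_counts.items()
        }
    return scores


toy_topics = {
    0: ["the gpu driver crashed", "the gpu fan"],
    1: ["the team won the game", "the game"],
}
toy_scores = c_tf_idf(toy_topics)
top_terms = {topic: max(s, key=s.get) for topic, s in toy_scores.items()}
```

In this toy corpus the highest-scoring terms are "gpu" and "game", while "the" is damped because it occurs in both topics.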

Outlier Reduction

HDBSCAN's 7,170 outlier documents are not discarded. BERTopic's reduce_outliers method, when called with strategy="embeddings", reassigns each outlier to the topic whose centroid embedding is closest by cosine similarity. After reassignment, update_topics recomputes c-TF-IDF with the updated document-to-topic mapping. All 7,170 outliers were reassigned, so none remain.
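The reassignment rule can be illustrated with a small nearest-centroid sketch in plain Python. This shows the idea only and is not BERTopic's internal code; the function names are hypothetical, and the library calls in the trailing comment assume the embedding-based strategy.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reassign_outliers(embeddings, topics):
    """Move every document labelled -1 to the topic with the most similar centroid."""
    # Centroid = mean embedding of the documents already assigned to each topic.
    members = {}
    for vector, topic in zip(embeddings, topics):
        if topic != -1:
            members.setdefault(topic, []).append(vector)
    centroids = {
        topic: [sum(column) / len(vectors) for column in zip(*vectors)]
        for topic, vectors in members.items()
    }
    return [
        topic if topic != -1
        else max(centroids, key=lambda t: cosine_similarity(vector, centroids[t]))
        for vector, topic in zip(embeddings, topics)
    ]


# With a fitted BERTopic model, the corresponding library calls would look like:
# new_topics = topic_model.reduce_outliers(docs, topics, strategy="embeddings", embeddings=embeddings)
# topic_model.update_topics(docs, topics=new_topics)
```

Because every outlier is given its best-matching topic regardless of how weak the match is, no documents remain labelled -1.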

Representation Comparison

BERTopic decouples clustering from keyword representation. Once documents are assigned to topics, the c-TF-IDF keyword layer can be updated freely without re-running UMAP or HDBSCAN. Three approaches are compared in sequence:

Baseline: default CountVectorizer on raw text. Common English function words often score highly. They are frequent in every topic, and because their frequency still varies from topic to topic, the IDF component does not fully suppress them.

Stop word removal: CountVectorizer(stop_words='english') filters scikit-learn's list of 318 common English words before computing c-TF-IDF. Function words are excluded from the vocabulary entirely, allowing content words to surface.

Lemmatisation: spaCy's en_core_web_sm model reduces each word to its base form before counting, so running, runs, and ran all contribute to the score for run. The spaCy pipeline is run with the parser and NER components disabled; lemmatisation needs only the tokeniser, the tagger, the lemmatiser, and the supporting components they rely on (tok2vec and the attribute ruler in spaCy 3 pipelines). Documents are processed in batches via nlp.pipe() and the results written as space-joined strings. CountVectorizer then splits those strings using token_pattern=r'\S+' rather than a custom tokeniser callable, which avoids the UserWarning scikit-learn raises when a callable tokeniser is combined with the default token pattern. Lemmatisation adds approximately 2–3 minutes of processing time and an additional model dependency; whether the improvement in keyword quality justifies that cost depends on the application.
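The three representations could be set up roughly as below. The helper names, the batch_size of 256, and the decision not to combine lemmatisation with stop word removal are assumptions; the notebook may differ on each.

```python
def lemmatise(docs, nlp, batch_size=256):
    """Return each document as a space-joined string of lemmas."""
    return [
        " ".join(token.lemma_ for token in doc if not token.is_space)
        for doc in nlp.pipe(docs, batch_size=batch_size)
    ]


def make_vectorizers():
    """Build one CountVectorizer per representation being compared."""
    from sklearn.feature_extraction.text import CountVectorizer

    return {
        "baseline": CountVectorizer(),
        "stop_words": CountVectorizer(stop_words="english"),
        # Lemmatised input is already tokenised, so split on whitespace only.
        "lemmatised": CountVectorizer(token_pattern=r"\S+"),
    }


# Usage with a fitted BERTopic model (topic_model, docs, and topics are assumed to exist):
# import spacy
# nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# lemmatised_docs = lemmatise(docs, nlp)
# vectorizers = make_vectorizers()
# topic_model.update_topics(docs, topics=topics, vectorizer_model=vectorizers["stop_words"])
# topic_model.update_topics(lemmatised_docs, topics=topics, vectorizer_model=vectorizers["lemmatised"])
```

Each update_topics call only recomputes the keyword layer, so the topic assignments stay fixed across all three comparisons.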

Development Environment

Python 3.12+
BERTopic
sentence-transformers
UMAP
HDBSCAN
spaCy (en_core_web_sm)
scikit-learn
PyTorch (MPS / CUDA / CPU)
Jupyter

Repository

github.com/dataville/bertopic_20_newsgroups