Fixing Data Problems with Large Language Models: A Practical Guide

Software systems exist to orchestrate the flow of information. But when that data is wrong, everything breaks down. I've dealt with this mess firsthand, and I think large language models can help fix it. Here's what I've learned from my experience and two papers worth reading:

This guide will cover techinques for cleaning and preprocessing data, using LLMs to backfill missing values, fix inconsistant formatting, remove duplicates and consolidate. We also cover data integration tasks like matching schemes across different systems.

Using LLMs for Data Cleaning & Processing

Bad data ruins analysis and leads to poor decisions. Traditional cleaning techniques are too specialized and rigid, while Large Language Models (LLMs) can fix multiple problems in one sweep:

  • Missing Values: Gaps in datasets from incomplete data entry, technical errors, or system limitations.
  • Inconsistencies: Different representations of the same information (e.g., "New York" vs "NY", "123 Main St." vs "123 Main Street") that complicate aggregation and analysis.
  • Duplicate Records: Multiple entries of the same data that can skew analysis results and waste resources.

Let's look at how a single, simple prompt to an LLM can automatically detect and fix all these issues at once.

Loading code snippet...

And the results:

Loading code snippet...

Data Labeling: Inferring Categories from Text

LLMs can infer categories from text descriptions, making them useful for automated data labeling. We can either let the model choose appropriate categories based on context, or define a specific set of categories ourselves. In this example, we will constrain the model to use predefined industry labels.

Loading code snippet...

Our labeled data:

Loading code snippet...

Notice that by using the Pydantic models for structured outputs, the returned data automatically conforms to our schema. We don't need to provide additional formatting instructions or parse the response.

Data Integration: Matching Columns Across Databases

When merging data from different systems, we often need to identify which columns mean the same thing. For example, one database might call it "customer_id" while another uses "client_number" - they're the same data, just labeled differently.

Consider this common scenario: Two companies with separate sales tracking systems. Both track the same information but use different column names:

Dataset A:

ProductIDNameUnitsSold
001Laptop50
002Mouse100
003Keyboard75
004Monitor40

Dataset B:

ItemCodeProductNameQuantitySold
A1Notebook48
B2PC Mouse105
C3Key Board78
D4Screen42

Identify the columns between datasets A and B that contain the same information but use different names:

Loading code snippet...

With the columns mapped out, we can now standardize both datasets into the target schema using Python:

Loading code snippet...

The result:

Loading code snippet...

Data Integration II: Matching Identical Records

With our columns standardized, the next challenge is matching products that appear under different names across the datasets.

Looking at our merged dataset from above, we can identify several products that are the same item but named differently:

  • Laptop matches Notebook
  • Mouse matches PC Mouse
  • Keyboard matches Key Board
  • Monitor matches Screen

Next, we'll combine the matching records using the following code (which continues from our previous code):

Loading code snippet...

The merged dataset:

Loading code snippet...

The LLM merged two obvious matches:

  • Mouse (sales: 100) and PC Mouse (sales: 105) were combined into a single "Mouse" entry with 205 total sales
  • Keyboard (sales: 75) and Key Board (sales: 78) were merged into "Keyboard" with 153 total sales

Several other seemingly similar items were kept separate due to potential product distinctions:

  • Laptop vs Notebook - While often used interchangeably, "Notebook" could refer to a paper notebook product
  • Monitor vs Screen - These could represent different product categories (e.g., computer monitors vs. TV screens)

Only the most obviously identical items were combined (Mouse/PC Mouse and Keyboard/Key Board), while potentially distinct products remained separate. Of course, we could easily instruct it to merge additional pairs like Laptop/Notebook - but this default behavior of erring on the side of caution helps prevent unwanted data loss while still automating clear matches.

Want more guides and deep dives? Sign up for our newsletter.