Skip to main content

How to Compare CSV Files and Printing Differences in Python

Comparing CSV files to identify differences is a common task in data analysis and software development.

This guide demonstrates several methods for comparing CSV files in Python, ranging from simple line-by-line comparisons using sets to more sophisticated approaches using the csv-diff library and Pandas DataFrames.

Simple Line-by-Line Comparison (Ignoring CSV Structure)

The simplest (but least robust) method is to read each file line by line, treat each line as a string, and compare the sets of lines:

# csv-file-1.csv:
# Alice,Smith,500
# Bob,Smith,600
# Carl,Smith,400

# csv-file-2.csv
# Alice,Smith,500
# Bob,Smith,600
# Dan,Smith,2500

with open('csv-file-1.csv', 'r', encoding='utf-8') as file_1, \
open('csv-file-2.csv', 'r', encoding='utf-8') as file_2:

file_1_lines = file_1.readlines()
file_2_lines = file_2.readlines()

with open('differences.csv','w', encoding='utf-8') as differences_file:
for line in file_2_lines:
if line not in file_1_lines: # Check if line is present in first file
print(line) # Prints the differences
differences_file.write(line) # Write the differences to file
# Output:
# Dan,Smith,2500
  • The with open() statements open the file in read mode ('r'), and it is recommended to specify the encoding to avoid issues with file encodings.
  • The loop compares lines from file_2_lines to see if they are also in file_1_lines, using the not in operator.
  • It's important to note that this will only return the exact matches, based on the row's contents as strings.
warning

Important Limitations: This approach has significant limitations:

  • Ignores CSV Structure: It treats each line as a simple string, ignoring commas within fields, quoted fields, and header rows. This will lead to incorrect results if your CSV files have any complexity.
  • Order-Dependent: The order of lines matters. If the files have the same rows but in a different order, this method will report them as different.
  • No Column Information: You don't get any information about which columns are different.

For robust, structured comparisons, the csv-diff library is highly recommended. Install it:

pip install csv-diff

Then, use it from the command line:

csv-diff csv-file-1.csv csv-file-2.csv --key=first
  • --key=first: This is crucial. You must specify a key column (or multiple key columns) that uniquely identifies each row. Without a key, csv-diff will only compare based on row order, which is rarely what you want. Choose a column (or combination of columns) that acts like a primary key.
  • This command-line tool will show you the data that was added, removed, or rows with changed values.

You can also use it programmatically:

from csv_diff import load_csv, compare
import io # Used for reading a string as if it was a file.

# Sample CSV data (replace with your file reading)
csv_data1 = """first,last,salary
Alice,Smith,500
Bob,Smith,600
Carl,Smith,400"""

csv_data2 = """first,last,salary
Alice,Smith,500
Bob,Smith,600
Dan,Smith,2500"""

# Use io.StringIO to treat the strings as files
diff = compare(
load_csv(io.StringIO(csv_data1), key='first'),
load_csv(io.StringIO(csv_data2), key='first'),
)

print(diff)
# Output:
# {'added': [{'first': 'Dan', 'last': 'Smith', 'salary': '2500'}],
# 'removed': [{'first': 'Carl', 'last': 'Smith', 'salary': '400'}],
# 'changed': [],
# 'columns_added': [],
# 'columns_removed': []}

print(diff['added']) # Output: [{'first': 'Dan', 'last': 'Smith', 'salary': '2500'}]
print(diff['removed']) # Output: [{'first': 'Carl', 'last': 'Smith', 'salary': '400'}]

  • The load_csv reads a CSV file and loads it to a dictionary. The key parameter sets which column to use as a key.
  • The compare function returns a dictionary containing a summary of the differences between the files.
  • The example above uses io.StringIO to simulate reading from files, to make the code runnable. In a real scenario, you would use open('csv-file-1.csv').

Using Pandas compare() (for DataFrames)

Pandas provides powerful tools for comparing data. If your CSV files have the same structure (same columns), you can use the compare() method (introduced in Pandas 1.1.0):

import pandas as pd

# Assuming your CSV files have a header row
df1 = pd.read_csv('csv-file-1.csv')
df2 = pd.read_csv('csv-file-2.csv')

df3 = df1.compare(df2) # The comparison
print(df3)
# Output:
# first salary
# self other self other
# 2 Carl Dan 400.0 2500.0
  • pd.read_csv(): Reads the CSV files into Pandas DataFrames.
  • df1.compare(df2): Compares the two DataFrames and highlights the differences.
  • If you need to specify the column names explicitly, set the names attribute when calling the read_csv method:
      import pandas as pd

    df1 = pd.read_csv('csv-file-1.csv', names=['first', 'last', 'salary'])
    df2 = pd.read_csv('csv-file-2.csv', names=['first', 'last', 'salary'])
    df3 = df1.compare(df2) # The comparison
    print(df3)

Using Pandas merge() (for Identifying Unique Rows)

If you want to identify rows that are only in one file or the other (similar to a set difference), use merge() with the indicator=True option:

import pandas as pd

df1 = pd.read_csv('csv-file-1.csv')
df2 = pd.read_csv('csv-file-2.csv')
df3 = df1.merge(df2, indicator=True, how='outer')
print(df3)
# Output:
# first last salary _merge
# 0 Alice Smith 500 both
# 1 Bob Smith 600 both
# 2 Carl Smith 400 left_only
# 3 Dan Smith 2500 right_only

df3 = df3[df3['_merge'] != 'both'] # Filter out common rows
print(df3)
# Output:
# first last salary _merge
# 2 Carl Smith 400 left_only
# 3 Dan Smith 2500 right_only
  • df1.merge(df2, indicator=True, how='outer'): Performs an outer merge, adding a special column _merge that indicates whether each row came from the 'left' DataFrame (df1), the 'right' DataFrame (df2), or 'both'.
  • df3[df3['_merge'] != 'both']: Filters the result to show only rows that are not present in both DataFrames.