How to Unzip .gz Files in Python
This guide explains how to unzip .gz files (GZIP compressed files) in Python. We'll cover both extracting the contents to a new file and reading the uncompressed data directly into memory, using the built-in gzip and shutil modules.
Unzipping a .gz File to a New File (Extraction)
To unzip a .gz file and save the uncompressed contents to a new file, use the gzip and shutil modules:
import gzip
import shutil

with gzip.open('example.json.gz', 'rb') as file_in:
    with open('example.json', 'wb') as file_out:  # wb = write bytes
        shutil.copyfileobj(file_in, file_out)
        print('example.json file created')
- gzip.open('example.json.gz', 'rb'): Opens the .gz file in binary read mode ('rb'). gzip.open() handles the decompression.
- with open('example.json', 'wb') as file_out: Opens the output file in binary write mode ('wb'). Binary mode is required here, even if the uncompressed content is text, because gzip.open() in 'rb' mode returns bytes.
- shutil.copyfileobj(file_in, file_out): Copies the uncompressed data from the input file object (file_in) to the output file object (file_out). It reads and writes in chunks, so it works well even with large files.
- The code assumes there is a file named example.json.gz in the current working directory (usually the directory you run the script from), but you can use any other path instead.
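The extraction steps above can be wrapped in a small reusable function. This is a minimal sketch: the name gunzip_file and the rule of dropping the trailing .gz to pick the output name are assumptions made for illustration, not part of the gzip module. The demo creates its own sample archive in a temporary directory so it runs without any existing file.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_file(src_path, dest_path=None):
    """Decompress a .gz file to dest_path and return the output path.

    Hypothetical helper: if dest_path is omitted, the output name is the
    source name with its trailing '.gz' removed.
    """
    src = Path(src_path)
    if dest_path is None:
        if src.suffix != '.gz':
            raise ValueError(f'cannot infer output name from {src.name!r}')
        dest = src.with_suffix('')  # 'example.json.gz' -> 'example.json'
    else:
        dest = Path(dest_path)
    # Same pattern as above: binary read from gzip, binary write to disk,
    # copied in chunks by shutil.copyfileobj.
    with gzip.open(src, 'rb') as file_in, open(dest, 'wb') as file_out:
        shutil.copyfileobj(file_in, file_out)
    return dest


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        # Build a sample archive so the demo is self-contained.
        archive = Path(tmp) / 'example.json.gz'
        with gzip.open(archive, 'wb') as f:
            f.write(b'{"ok": true}')
        out = gunzip_file(archive)
        print(out.name, out.read_bytes())
```

With a real file, the call would simply be gunzip_file('example.json.gz'). Note that the original .gz file is left in place.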
Reading the Uncompressed Contents of a .gz File
If you just want to read the uncompressed data into a Python variable (without creating a new file), you can use gzip.open() and read():
import gzip

with gzip.open('example.json.gz', 'rb') as file_in:
    file_contents = file_in.read()  # Reads as bytes
print(file_contents)  # Output is a bytes object: b'...'

# If you're working with TEXT data, open the file in text mode instead:
with gzip.open('example.json.gz', 'rt', encoding='utf-8') as file_in:
    file_contents = file_in.read()  # Reads as a string directly
print(file_contents)
- file_in.read(): Reads the entire uncompressed content into the file_contents variable. For a text file, replace 'rb' with 'rt' and provide the encoding.
By default, gzip.open() in binary mode ('rb') returns bytes. If you know the file contains text, open it in text mode ('rt') and specify the correct encoding (usually UTF-8): gzip.open('example.json.gz', 'rt', encoding='utf-8'). This will decode the content to a string as you read.
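To make the difference between the two modes concrete, here is a small sketch that reads the same archive both ways. The helper names read_gz_bytes and read_gz_text are hypothetical, and the sample file is generated in a temporary directory purely for the demo.

```python
import gzip
import tempfile
from pathlib import Path


def read_gz_bytes(path):
    """Return the whole uncompressed content as bytes ('rb' mode)."""
    with gzip.open(path, 'rb') as file_in:
        return file_in.read()


def read_gz_text(path, encoding='utf-8'):
    """Return the whole uncompressed content as str ('rt' mode)."""
    with gzip.open(path, 'rt', encoding=encoding) as file_in:
        return file_in.read()


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / 'sample.txt.gz'
        with gzip.open(archive, 'wb') as f:
            f.write('café\n'.encode('utf-8'))

        raw = read_gz_bytes(archive)
        text = read_gz_text(archive)
        print(type(raw).__name__, type(text).__name__)
        # For content that uses '\n' line endings, decoding the bytes by
        # hand gives the same string that text mode returns.
        print(raw.decode('utf-8') == text)
```

One caveat: text mode also applies Python's universal newline handling, so a file with Windows-style '\r\n' line endings comes back with '\n' in text mode, while binary mode leaves the bytes unchanged.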
For line-by-line reading:
import gzip

with gzip.open('example.json.gz', 'rt', encoding='utf-8') as file_in:
    for line in file_in:
        print(line.strip())
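Iterating over the file object decompresses the data incrementally, so the whole file never has to sit in memory at once. The sketch below packages that loop as a generator; iter_gz_lines is a hypothetical name. It uses rstrip('\n') rather than strip() so that leading and trailing spaces inside a line are preserved, which is a design choice for this example and not a requirement.

```python
import gzip
import tempfile
from pathlib import Path


def iter_gz_lines(path, encoding='utf-8'):
    """Yield each line of a gzip-compressed text file without its newline."""
    with gzip.open(path, 'rt', encoding=encoding) as file_in:
        for line in file_in:
            # Remove only the line terminator; keep other whitespace intact.
            yield line.rstrip('\n')


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / 'log.txt.gz'
        with gzip.open(archive, 'wt', encoding='utf-8') as f:
            f.write('INFO start\nERROR disk full\nINFO done\n')

        # Process one line at a time, e.g. keep only the error lines.
        errors = [line for line in iter_gz_lines(archive) if line.startswith('ERROR')]
        print(errors)
```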
Reading CSV or JSON Data from .gz Files (using pandas)
If your .gz file contains structured data like CSV or JSON, the pandas library provides convenient functions to read them directly:
import gzip  # Handles the decompression
import json  # Needed for json.load() in the array example
import pandas as pd

# CSV Example:
with gzip.open('example.csv.gz', 'rt', encoding='utf-8') as file_in:  # 'rt' for text mode
    df = pd.read_csv(file_in)
print(df.head())

# JSON Example (one JSON object per line):
with gzip.open('example.json.gz', 'rt', encoding='utf-8') as file_in:
    df = pd.read_json(file_in, lines=True)  # Read line-delimited JSON
print(df.head())

# JSON Example (single JSON array):
with gzip.open('example.json.gz', 'rb') as file_in:
    data = json.load(file_in)  # Parse the JSON array into Python objects
df = pd.DataFrame(data)  # Create a DataFrame from the parsed data
print(df.head())
- CSV: pd.read_csv() can directly read from a file-like object, so we pass the gzip.open() result to it.
- JSON (lines=True): For line-delimited JSON, use pd.read_json(..., lines=True).
- JSON (Array): If the .gz file contains a single, large JSON array, use json.load() to load the data into a Python object, and then convert it into a DataFrame.
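As an alternative to opening the file with gzip.open() first, pandas readers also accept a path together with a compression argument and decompress the file themselves. The sketch below passes compression='gzip' explicitly; the default value, 'infer', selects gzip from a .gz file extension. The helper names load_gz_csv and load_gz_json_lines are hypothetical, and the sample files are generated only for the demo.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd


def load_gz_csv(path):
    """Read a gzip-compressed CSV file straight into a DataFrame."""
    return pd.read_csv(path, compression='gzip')


def load_gz_json_lines(path):
    """Read gzip-compressed, line-delimited JSON into a DataFrame."""
    return pd.read_json(path, lines=True, compression='gzip')


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = Path(tmp) / 'example.csv.gz'
        with gzip.open(csv_path, 'wt', encoding='utf-8') as f:
            f.write('name,score\nada,90\ngrace,95\n')

        json_path = Path(tmp) / 'example.jsonl.gz'
        with gzip.open(json_path, 'wt', encoding='utf-8') as f:
            f.write('{"name": "ada", "score": 90}\n{"name": "grace", "score": 95}\n')

        print(load_gz_csv(csv_path).head())
        print(load_gz_json_lines(json_path).head())
```

Opening the file with gzip.open() yourself, as in the examples above, remains useful when you need control over the encoding or want to pre-process the stream before pandas sees it.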