How to Unzip .gz Files in Python

This guide explains how to unzip .gz files (GZIP compressed files) in Python. We'll cover both extracting the contents to a new file and reading the uncompressed data directly into memory, using the built-in gzip and shutil modules.

Unzipping a .gz File to a New File (Extraction)

To unzip a .gz file and save the uncompressed contents to a new file, use the gzip and shutil modules:

import gzip
import shutil

with gzip.open('example.json.gz', 'rb') as file_in:
    with open('example.json', 'wb') as file_out:  # wb = write bytes
        shutil.copyfileobj(file_in, file_out)
print('example.json file created')
  • gzip.open('example.json.gz', 'rb'): Opens the .gz file in binary read mode ('rb'). gzip.open() handles the decompression.
  • with open('example.json', 'wb') as file_out: Opens the output file in binary write mode ('wb'). Binary mode is required here even if the uncompressed content is text, because gzip.open() in 'rb' mode yields bytes.
  • shutil.copyfileobj(file_in, file_out): This efficiently copies the uncompressed data from the input file object (file_in) to the output file object (file_out). It handles reading and writing in chunks, so it works well even with large files.
  • The code assumes there is a file named example.json.gz in the same directory as the Python script, but you can use any other path instead.
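
The extraction steps above can be wrapped in a small reusable function. The following is a minimal sketch, not a standard-library API: the name gunzip_file and its dest parameter are hypothetical, and it assumes the source file name ends in .gz when no destination is given.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dest=None):
    """Decompress a .gz file to disk and return the output path.

    gunzip_file is a hypothetical helper for illustration.
    If dest is omitted, the output name is the source name without '.gz'.
    """
    src = Path(src)
    if dest is None:
        if src.suffix != ".gz":
            raise ValueError(f"expected a .gz file, got {src.name!r}")
        dest = src.with_suffix("")  # 'example.json.gz' -> 'example.json'
    dest = Path(dest)
    # Both files are opened in binary mode; copyfileobj streams in chunks,
    # so the whole file is never held in memory at once.
    with gzip.open(src, "rb") as file_in, open(dest, "wb") as file_out:
        shutil.copyfileobj(file_in, file_out)
    return dest
```

Under these assumptions, gunzip_file('example.json.gz') would write example.json next to the compressed file and return its path.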

Reading the Uncompressed Contents of a .gz File

If you just want to read the uncompressed data into a Python variable (without creating a new file), you can use gzip.open() and read():

import gzip

with gzip.open('example.json.gz', 'rb') as file_in:
    file_contents = file_in.read()  # Reads as bytes
print(file_contents)  # Outputs a bytes object: b'...'

# If you're working with text data, open the file in text mode to decode it:
with gzip.open('example.json.gz', 'rt', encoding='utf-8') as file_in:
    file_contents = file_in.read()  # Reads as a string directly
print(file_contents)
  • file_in.read(): Reads the entire uncompressed content into memory and stores it in the file_contents variable. For a text file, replace 'rb' with 'rt' and pass the encoding.
Note

gzip.open() defaults to binary mode ('rb') and returns bytes. If you know the file contains text, open it in text mode ('rt') and specify the correct encoding (usually UTF-8): gzip.open('example.json.gz', 'rt', encoding='utf-8'). The content is then decoded to a string as you read.
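
To make the bytes-versus-text distinction concrete, here is a small sketch with two hypothetical helpers (read_gz_text and gunzip_bytes are illustrative names, not library functions). The first reads a .gz file in text mode. The second uses gzip.decompress() for compressed data that is already in memory as bytes, then decodes it by hand. It assumes the data is UTF-8 unless another encoding is passed.

```python
import gzip


def read_gz_text(path, encoding="utf-8"):
    """Return the decompressed contents of a .gz file as a str."""
    # Text mode ('rt') decodes while reading. Note that it also applies
    # universal-newline handling, so '\r\n' is returned as '\n'.
    with gzip.open(path, "rt", encoding=encoding) as file_in:
        return file_in.read()


def gunzip_bytes(blob, encoding="utf-8"):
    """Decompress gzip-compressed bytes held in memory and decode them."""
    raw = gzip.decompress(blob)  # bytes in, bytes out
    return raw.decode(encoding)  # explicit decode step
```

For content that uses plain '\n' line endings, both routes should yield the same string.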

For line-by-line reading:

import gzip

with gzip.open('example.json.gz', 'rt', encoding='utf-8') as file_in:
    for line in file_in:
        print(line.strip())
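
Line-by-line reading is most useful when each line is a self-contained record. As a sketch, assuming the file holds one JSON object per line (this layout is an assumption, and iter_json_lines is a hypothetical name), a generator can parse records one at a time without loading the whole file:

```python
import gzip
import json


def iter_json_lines(path, encoding="utf-8"):
    """Yield one parsed object per non-empty line of a gzipped file.

    Assumes line-delimited JSON; a malformed line raises json.JSONDecodeError.
    """
    with gzip.open(path, "rt", encoding=encoding) as file_in:
        for line in file_in:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because it is a generator, only one line is decompressed and parsed at a time, which keeps memory use low for large files.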

Reading CSV or JSON Data from .gz Files (using pandas)

If your .gz file contains structured data like CSV or JSON, the pandas library provides convenient functions to read them directly:

import gzip  # Decompresses the stream before pandas reads it
import json  # Needed for the JSON array example
import pandas as pd

# CSV Example:
with gzip.open('example.csv.gz', 'rt', encoding='utf-8') as file_in:  # 'rt' for text mode
    df = pd.read_csv(file_in)
print(df.head())

# JSON Example (one JSON object per line):
with gzip.open('example.json.gz', 'rt', encoding='utf-8') as file_in:
    df = pd.read_json(file_in, lines=True)  # Read line-delimited JSON
print(df.head())

# JSON Example (single JSON array):
with gzip.open('example.json.gz', 'rb') as file_in:
    data = json.load(file_in)  # Parse the JSON array into Python objects
df = pd.DataFrame(data)  # Create the DataFrame
print(df.head())
  • CSV: pd.read_csv() can directly read from a file-like object, so we pass the gzip.open() result to it.
  • JSON (lines=True): For line-delimited JSON, use pd.read_json(..., lines=True).
  • JSON (Array): If the .gz file contains a single JSON array, use json.load() to load the data into a Python object, and then convert it into a DataFrame.
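
The same file-like-object approach works without pandas. As a rough sketch for environments where pandas is not installed, the standard csv module can read from gzip.open() in text mode. The helper name read_gz_csv is hypothetical, and the sketch assumes the first row of the file is a header.

```python
import csv
import gzip


def read_gz_csv(path, encoding="utf-8"):
    """Return the rows of a gzipped CSV file as a list of dicts.

    Assumes the first row is a header; every value is returned as a string.
    """
    # newline="" lets the csv module handle line endings itself.
    with gzip.open(path, "rt", encoding=encoding, newline="") as file_in:
        return list(csv.DictReader(file_in))
```

Unlike pd.read_csv(), this does no type inference, so numeric columns need to be converted explicitly.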