How to Calculate the MD5 Hash of a File in Python

Calculating the MD5 hash of a file is a common operation for verifying file integrity or generating unique identifiers.

This guide demonstrates how to compute MD5 hashes in Python using the hashlib module, including how to handle large files efficiently and how to use the file_digest() function (Python 3.11+). We also cover alternatives using mmap and pathlib.

Calculating MD5 Hash with hashlib (Basic)

The standard way to calculate an MD5 hash is using the hashlib module:

import hashlib

file_name = 'example.txt'

with open(file_name, 'rb') as file_obj:  # Open in binary read mode
    file_contents = file_obj.read()
    md5_hash = hashlib.md5(file_contents).hexdigest()

print(md5_hash) # Output: (The MD5 hash as a hex string)
  • Important: Always open files in binary read mode ('rb') when calculating hashes. Text mode ('r') can lead to inconsistent results due to newline conversions on different operating systems.
  • file_obj.read() reads the entire file content into memory. This works well for small files, but is inefficient (and can cause MemoryError) for very large files.
  • hashlib.md5(file_contents) creates an MD5 hash object initialized with the file contents. As the sketch below shows, this is equivalent to creating an empty hash object and feeding it the same bytes via update().
  • The hexdigest() method returns the digest as a string of hexadecimal digits. Each byte is encoded as two hex characters, so the string is twice the digest length: 32 characters for MD5's 16-byte digest.
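
As a quick illustration of how the constructor and update() relate, the following sketch (using an inline byte string rather than a file) shows that hashing data in pieces produces the same digest as hashing it all at once:

import hashlib

data = b'hello world'

# One-shot: pass all bytes to the constructor.
one_shot = hashlib.md5(data).hexdigest()

# Incremental: create an empty hash object and feed it pieces.
incremental = hashlib.md5()
incremental.update(b'hello ')
incremental.update(b'world')

print(one_shot)                             # 32 hexadecimal characters
print(one_shot == incremental.hexdigest())  # Output: True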

If you want to close the file manually, you can skip the with statement. However, this approach is not recommended: if you forget to call close(), or an exception occurs before it runs, the file handle is leaked.

import hashlib

file_name = 'example.txt'

file_obj = open(file_name, 'rb') # Open in binary read mode
file_contents = file_obj.read()
md5_hash = hashlib.md5(file_contents).hexdigest()
print(md5_hash)
file_obj.close() # Manually close the file
  • Using the with statement is recommended because it guarantees the file is closed, even if an exception is raised.

Verifying the MD5 Hash

To verify the calculated hash against an expected value:

import hashlib

file_name = 'example.txt'
md5_hash_original = 'cfd2db7dd4ffe42ce26e0b57e7e8b342' # Expected hash

with open(file_name, 'rb') as file_obj:
    file_contents = file_obj.read()
    md5_hash_new = hashlib.md5(file_contents).hexdigest()
    print(md5_hash_new)

if md5_hash_original == md5_hash_new:
    print('MD5 hash verified')
else:
    print('MD5 hash verification failed')
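
If you verify hashes in more than one place, it can be convenient to wrap the comparison in a small helper. The function below (verify_md5 is a hypothetical name, not a standard API) also lowercases the expected hash, since hex digests are sometimes written in uppercase:

import hashlib

def verify_md5(file_path, expected_hash):
    """Return True if the file's MD5 hash matches expected_hash."""
    with open(file_path, 'rb') as file_obj:
        actual_hash = hashlib.md5(file_obj.read()).hexdigest()
    # hexdigest() returns lowercase, so normalize the expected value too.
    return actual_hash == expected_hash.lower()

print(verify_md5('example.txt', 'cfd2db7dd4ffe42ce26e0b57e7e8b342'))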

Handling Large Files Efficiently

For large files, reading the entire file into memory at once is inefficient. Process the file in chunks:

Using a while loop and file.read()

import hashlib

def get_md5_hash(file_path, block_size=4096):
    with open(file_path, 'rb') as file_obj:
        md5_hash = hashlib.md5()
        while chunk := file_obj.read(block_size):  # Reads file in chunks
            md5_hash.update(chunk)  # Hashes each chunk
        return md5_hash.hexdigest()

file_name = 'example.txt'
print(get_md5_hash(file_name))
  • The while loop reads the file in chunks of block_size bytes until the end of the file is reached. The := (walrus) operator used here requires Python 3.8+.
  • md5_hash.update(chunk) updates the hash object with each chunk. This is much more memory-efficient than reading the whole file at once, and it produces the same digest, as the sanity check below confirms.
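
As a quick sanity check, you can confirm that the chunked approach produces the same digest as hashing the whole file at once (this sketch assumes the get_md5_hash() function defined above and an existing example.txt):

import hashlib

file_name = 'example.txt'

# One-shot hash for comparison.
with open(file_name, 'rb') as file_obj:
    whole_file_hash = hashlib.md5(file_obj.read()).hexdigest()

print(get_md5_hash(file_name) == whole_file_hash)  # Output: True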

Using a for loop and iter()

An alternative, equivalent way to read in chunks is using iter():

import hashlib

def get_md5_hash(file_path, block_size=4096):
    with open(file_path, 'rb') as file_obj:
        md5_hash = hashlib.md5()
        for chunk in iter(lambda: file_obj.read(block_size), b''):
            md5_hash.update(chunk)
        return md5_hash.hexdigest()

file_name = 'example.txt'
print(get_md5_hash(file_name))
  • The for loop iterates over iter(lambda: file_obj.read(block_size), b''), which returns an iterator that reads the file in chunks of block_size bytes.
  • The second argument to iter() is a sentinel: the iterator calls the first argument repeatedly and stops as soon as it returns the sentinel value. In our case, read() returns the empty byte string b'' when the end of the file is reached, so the loop terminates correctly. A minimal demonstration of this two-argument form follows.
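
The two-argument form of iter() is not specific to files. As a minimal demonstration, this sketch calls a list's pop() method until it returns the sentinel value 0:

numbers = [0, 3, 2, 1]

# iter(callable, sentinel) calls the callable repeatedly and stops
# as soon as it returns the sentinel (here, 0).
for value in iter(numbers.pop, 0):
    print(value)  # Prints 1, then 2, then 3; stops at the sentinel 0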

Using hashlib.file_digest() (Python 3.11+)

Python 3.11 introduced hashlib.file_digest(), a highly efficient way to hash files:

import hashlib
import sys

# Check if the Python version is 3.11 or higher.
if sys.version_info >= (3, 11):
    file_name = 'example.txt'
    with open(file_name, 'rb') as file_obj:
        digest = hashlib.file_digest(file_obj, 'md5')  # Use 'md5' for MD5
        print(digest.hexdigest())
else:
    print("hashlib.file_digest() is only available in Python 3.11+")
  • This function is optimized for performance and handles large files efficiently. It's the recommended way to hash files in Python 3.11 and later, and it works with any algorithm hashlib supports, as shown below.
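
The second argument to file_digest() accepts any algorithm name that hashlib supports, so the same pattern works for stronger hashes such as SHA-256 (shown here under the same Python 3.11+ assumption):

import hashlib

with open('example.txt', 'rb') as file_obj:
    digest = hashlib.file_digest(file_obj, 'sha256')  # Any hashlib algorithm name works

print(digest.hexdigest())  # 64 hexadecimal characters for SHA-256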

Using mmap (Advanced)

mmap (memory mapping) can be used for even greater efficiency, especially with very large files, by directly mapping the file contents into virtual memory:

from mmap import mmap, ACCESS_READ
from hashlib import md5

file_name = 'example.txt'

with open(file_name, 'rb') as file, \
        mmap(file.fileno(), 0, access=ACCESS_READ) as mapped_file:  # Memory-map the file
    print(md5(mapped_file).hexdigest())  # Calculate MD5
  • This example creates a memory map of the file and calculates its MD5 hash without an explicit read loop.
  • This technique is more advanced and may not yield performance gains in all cases. Note also that passing a length of 0 raises ValueError for empty files, so those need special handling.

Using pathlib.Path

You can also use pathlib objects with hashlib:

import pathlib
import hashlib

file_name = 'example.txt'
path_obj = pathlib.Path(file_name)
md5_hash = hashlib.md5(path_obj.read_bytes()).hexdigest() # read_bytes() instead of open()
print(md5_hash)
  • The read_bytes() method returns the contents of the file as a bytes object.
  • This approach is clean and object-oriented, but like the basic example it reads the entire file into memory; see the sketch below for a chunked variant.
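
For large files, you can combine pathlib with the chunked reading shown earlier. A minimal sketch, reusing the get_md5_hash name from above:

import hashlib
import pathlib

def get_md5_hash(path, block_size=4096):
    md5_hash = hashlib.md5()
    with pathlib.Path(path).open('rb') as file_obj:  # Path.open() mirrors built-in open()
        while chunk := file_obj.read(block_size):
            md5_hash.update(chunk)
    return md5_hash.hexdigest()

print(get_md5_hash('example.txt'))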