How to Calculate the MD5 Hash of a File in Python
Calculating the MD5 hash of a file is a common operation for verifying file integrity or generating unique identifiers.
This guide demonstrates how to compute MD5 hashes in Python using the hashlib module, including how to handle large files efficiently and how to use the file_digest() function (Python 3.11+). We also cover alternatives using mmap and pathlib.
Calculating the MD5 Hash with hashlib (Basic)
The standard way to calculate an MD5 hash is with the hashlib module:
import hashlib

file_name = 'example.txt'

with open(file_name, 'rb') as file_obj:  # Open in binary read mode
    file_contents = file_obj.read()

md5_hash = hashlib.md5(file_contents).hexdigest()
print(md5_hash)  # Output: (the MD5 hash as a hex string)
- Important: Always open files in binary read mode ('rb') when calculating hashes. Text mode ('r') can lead to inconsistent results because of newline conversions on different operating systems.
- file_obj.read() reads the entire file content into memory. This works well for small files, but is inefficient (and can cause a MemoryError) for very large files.
- hashlib.md5(file_contents) creates an MD5 hash object initialized with the file contents; as the sketch after this list shows, this is equivalent to creating an empty hash object and feeding it data with update().
- The hexdigest() method returns the digest as a string containing only hexadecimal digits, twice the length of the binary digest.
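As a quick illustration of the relationship between the constructor and update(), the following sketch (using a b'hello world' payload as a stand-in for file contents) shows that hashing all the bytes at once and feeding them in pieces produce the same digest:

import hashlib

data = b'hello world'

# One-shot: pass all the bytes to the constructor.
one_shot = hashlib.md5(data).hexdigest()

# Incremental: create an empty hash object and feed it pieces.
md5_hash = hashlib.md5()
md5_hash.update(b'hello ')
md5_hash.update(b'world')
incremental = md5_hash.hexdigest()

print(one_shot == incremental)  # True: successive update() calls concatenate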
If you want to close the file manually, you have to avoid using the with statement. This approach is not recommended, however, because forgetting to call close() leaks the file handle:
import hashlib
file_name = 'example.txt'
file_obj = open(file_name, 'rb') # Open in binary read mode
file_contents = file_obj.read()
md5_hash = hashlib.md5(file_contents).hexdigest()
print(md5_hash)
file_obj.close() # Manually close the file
- Using the with statement is always recommended: it closes the file automatically, even if an exception is raised, and so avoids resource leaks. If you must manage the handle yourself, at least use try/finally, as sketched below.
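If you do need to manage the file handle yourself, a safer pattern than a bare close() call is try/finally, which closes the file even when an exception occurs; a minimal sketch:

import hashlib

file_name = 'example.txt'
file_obj = open(file_name, 'rb')
try:
    file_contents = file_obj.read()
    md5_hash = hashlib.md5(file_contents).hexdigest()
    print(md5_hash)
finally:
    file_obj.close()  # Runs even if read() or hashing raises an exception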
Verifying the MD5 Hash
To verify the calculated hash against an expected value:
import hashlib

file_name = 'example.txt'
md5_hash_original = 'cfd2db7dd4ffe42ce26e0b57e7e8b342'  # Expected hash

with open(file_name, 'rb') as file_obj:
    file_contents = file_obj.read()

md5_hash_new = hashlib.md5(file_contents).hexdigest()
print(md5_hash_new)

if md5_hash_original == md5_hash_new:
    print('MD5 hash verified')
else:
    print('MD5 hash verification failed')
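If you verify files in more than one place, the comparison can be wrapped in a small helper. verify_md5 below is a hypothetical name; lower-casing the expected value is a defensive step, since published checksums are sometimes written in uppercase:

import hashlib

def verify_md5(file_path, expected_hash):
    """Return True if the file's MD5 digest matches expected_hash."""
    with open(file_path, 'rb') as file_obj:
        actual_hash = hashlib.md5(file_obj.read()).hexdigest()
    return actual_hash == expected_hash.lower()

print(verify_md5('example.txt', 'cfd2db7dd4ffe42ce26e0b57e7e8b342'))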
Handling Large Files Efficiently
For large files, reading the entire file into memory at once is inefficient. Process the file in chunks:
Using a while loop and file.read()
import hashlib

def get_md5_hash(file_path, block_size=4096):
    with open(file_path, 'rb') as file_obj:
        md5_hash = hashlib.md5()
        while chunk := file_obj.read(block_size):  # Read the file in chunks
            md5_hash.update(chunk)  # Hash each chunk
        return md5_hash.hexdigest()

file_name = 'example.txt'
print(get_md5_hash(file_name))
- The while loop reads the file in chunks of block_size bytes until the end of the file is reached. The := (walrus) operator assigns each chunk and ends the loop when read() returns an empty bytes object; note that it requires Python 3.8+ (a pre-3.8 version is sketched after this list).
- md5_hash.update(chunk) updates the hash object with each chunk. This is much more memory-efficient than reading the whole file at once.
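For interpreters older than Python 3.8, where the := operator is unavailable, the same loop can be written with an explicit break; a minimal sketch:

import hashlib

def get_md5_hash(file_path, block_size=4096):
    with open(file_path, 'rb') as file_obj:
        md5_hash = hashlib.md5()
        while True:
            chunk = file_obj.read(block_size)
            if not chunk:  # Empty bytes object signals end of file
                break
            md5_hash.update(chunk)
        return md5_hash.hexdigest()

print(get_md5_hash('example.txt'))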
Using a for loop and iter()
An alternative, equivalent way to read in chunks is to use iter():
import hashlib

def get_md5_hash(file_path, block_size=4096):
    with open(file_path, 'rb') as file_obj:
        md5_hash = hashlib.md5()
        for chunk in iter(lambda: file_obj.read(block_size), b''):
            md5_hash.update(chunk)
        return md5_hash.hexdigest()

file_name = 'example.txt'
print(get_md5_hash(file_name))
- The for loop iterates over iter(lambda: file_obj.read(block_size), b''), which returns an iterator that reads the file in chunks of block_size bytes.
- The second argument of iter() is a sentinel value: the loop terminates when the callable passed as the first argument returns it. In our case, read() returns the empty byte string b'' when the end of the file is reached, so the loop terminates correctly.
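If you prefer to avoid the lambda, functools.partial builds the same zero-argument callable; a sketch of the equivalent loop:

import hashlib
from functools import partial

def get_md5_hash(file_path, block_size=4096):
    with open(file_path, 'rb') as file_obj:
        md5_hash = hashlib.md5()
        # partial(file_obj.read, block_size) is equivalent to the lambda above
        for chunk in iter(partial(file_obj.read, block_size), b''):
            md5_hash.update(chunk)
        return md5_hash.hexdigest()

print(get_md5_hash('example.txt'))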
Using hashlib.file_digest() (Python 3.11+)
Python 3.11 introduced hashlib.file_digest(), an efficient built-in way to hash a file object:
import hashlib
import sys

# Check if the Python version is 3.11 or higher.
if sys.version_info >= (3, 11):
    file_name = 'example.txt'
    with open(file_name, 'rb') as file_obj:
        digest = hashlib.file_digest(file_obj, 'md5')  # Use 'md5' for MD5
    print(digest.hexdigest())
else:
    print("hashlib.file_digest() is only available in Python 3.11+")
- This method is optimized for performance and handles large files efficiently. It's the recommended way to hash files in Python 3.11 and later.
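Because file_digest() accepts any algorithm name that hashlib recognizes, the new API and a chunked fallback for older interpreters are easy to combine in one portable helper. hash_file below is a hypothetical name, a sketch rather than a canonical implementation:

import hashlib
import sys

def hash_file(file_path, algorithm='md5', block_size=4096):
    """Hash a file with file_digest() when available, else chunked reads."""
    with open(file_path, 'rb') as file_obj:
        if sys.version_info >= (3, 11):
            return hashlib.file_digest(file_obj, algorithm).hexdigest()
        hash_obj = hashlib.new(algorithm)  # Works for 'md5', 'sha256', etc.
        for chunk in iter(lambda: file_obj.read(block_size), b''):
            hash_obj.update(chunk)
        return hash_obj.hexdigest()

print(hash_file('example.txt'))            # MD5 by default
print(hash_file('example.txt', 'sha256'))  # Any algorithm hashlib supports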
Using mmap (Advanced)
mmap (memory mapping) can offer even greater efficiency, especially with very large files, by mapping the file contents directly into virtual memory:
from mmap import mmap, ACCESS_READ
from hashlib import md5

file_name = 'example.txt'

with open(file_name, 'rb') as file, \
        mmap(file.fileno(), 0, access=ACCESS_READ) as mapped_file:  # Memory-map the file
    print(md5(mapped_file).hexdigest())  # Calculate MD5
- This example creates a memory map of the file and calculates its MD5 hash.
- This technique is more advanced, and may not lead to performance gains in all cases.
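One caveat: mmap raises a ValueError when asked to map a zero-length file, so robust code needs an explicit size check; a sketch of one way to handle it:

from mmap import mmap, ACCESS_READ
from hashlib import md5
import os

file_name = 'example.txt'

with open(file_name, 'rb') as file:
    if os.fstat(file.fileno()).st_size == 0:
        # Empty file: hash zero bytes directly instead of mapping.
        print(md5(b'').hexdigest())  # d41d8cd98f00b204e9800998ecf8427e
    else:
        with mmap(file.fileno(), 0, access=ACCESS_READ) as mapped_file:
            print(md5(mapped_file).hexdigest())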
Using pathlib.Path
You can also use pathlib objects with hashlib:
import pathlib
import hashlib
file_name = 'example.txt'
path_obj = pathlib.Path(file_name)
md5_hash = hashlib.md5(path_obj.read_bytes()).hexdigest() # read_bytes() instead of open()
print(md5_hash)
- The read_bytes() method returns the contents of the file as a bytes object.
- This approach is clean and object-oriented, but like the basic approach it reads the entire file into memory, so prefer chunked reading or file_digest() for very large files.
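pathlib also pairs nicely with hashing several files at once. The following sketch (the current directory '.' is just a placeholder) prints a checksum line for every regular file in a directory:

import pathlib
import hashlib

directory = pathlib.Path('.')  # Placeholder: point this at your directory

for path_obj in sorted(directory.iterdir()):
    if path_obj.is_file():
        digest = hashlib.md5(path_obj.read_bytes()).hexdigest()
        print(f'{digest}  {path_obj.name}')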