Compute the CRC of a file in Python

I want to calculate the CRC of a file and get output like E45A12AC. Here's my code:

#!/usr/bin/env python 
import os, sys
import zlib

def crc(fileName):
    fd = open(fileName,"rb")
    content = fd.readlines()
    fd.close()
    for eachLine in content:
        zlib.crc32(eachLine)

for eachFile in sys.argv[1:]:
    crc(eachFile)

This calculates the CRC for each line, but its output (e.g. -1767935985) is not what I want.

Hashlib works the way I want, but it computes the md5:

import hashlib
m = hashlib.md5()
for line in open('data.txt', 'rb'):
    m.update(line)
print m.hexdigest()

Is it possible to get something similar using zlib.crc32?

Answers

A slightly more compact and optimized version:

def crc(fileName):
    prev = 0
    for eachLine in open(fileName,"rb"):
        prev = zlib.crc32(eachLine, prev)
    return "%X"%(prev & 0xFFFFFFFF)

PS2: The old PS is obsolete, so I deleted it, following the suggestion in the comments. Thank you. I don't know how I missed this, but it was a really good suggestion.

A modified version of kobor42's answer, with performance improved by a factor of 2-3 by reading fixed-size chunks instead of "lines":

import zlib

def crc32(fileName):
    with open(fileName, 'rb') as fh:
        hash = 0
        while True:
            s = fh.read(65536)
            if not s:
                break
            hash = zlib.crc32(s, hash)
        return "%08X" % (hash & 0xFFFFFFFF)

It also includes leading zeroes in the returned string.
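To make the difference between the two format strings concrete, here is a small sketch. The value below is a made-up checksum chosen only because its top hex digit is zero; it is not the CRC of any particular file.

```python
# Hypothetical checksum whose most significant hex digit is zero.
value = 0x0A1B2C3D

short = "%X" % value     # no padding: the leading zero is dropped
padded = "%08X" % value  # zero-padded to eight hex digits

print(short)   # A1B2C3D
print(padded)  # 0A1B2C3D
```

With "%X" the string length varies from file to file, which is awkward if the result is compared against checksums printed by other tools.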

A hashlib-compatible interface for CRC-32:

import zlib

class crc32(object):
    name = 'crc32'
    digest_size = 4
    block_size = 1

    def __init__(self, arg=b''):
        # The default is bytes because zlib.crc32 rejects str on Python 3.
        self.__digest = 0
        self.update(arg)

    def copy(self):
        copy = super(self.__class__, self).__new__(self.__class__)
        copy.__digest = self.__digest
        return copy

    def digest(self):
        # Returns an int, unlike hashlib digests, which return bytes.
        return self.__digest

    def hexdigest(self):
        return '{:08x}'.format(self.__digest)

    def update(self, arg):
        self.__digest = zlib.crc32(arg, self.__digest) & 0xffffffff

# Now you can define hashlib.crc32 = crc32
import hashlib
hashlib.crc32 = crc32

# Python > 2.7: hashlib.algorithms += ('crc32',)
# Python > 3.2: hashlib.algorithms_available.add('crc32')
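As a rough illustration of how such an object might be used, the sketch below feeds data in two pieces, the way one would with hashlib.md5(). The class Crc32 is a trimmed-down stand-in written for this example (the name and the bytes-returning digest() are my assumptions, not part of the answer above), and it targets Python 3.

```python
import zlib

class Crc32:
    """Minimal stand-in with update/digest/hexdigest (illustrative only)."""

    def __init__(self, data=b""):
        self._crc = zlib.crc32(data)

    def update(self, data):
        # Pass the running checksum back in to continue the computation.
        self._crc = zlib.crc32(data, self._crc)

    def digest(self):
        # Four raw bytes, big-endian, mirroring how hashlib returns bytes.
        return self._crc.to_bytes(4, "big")

    def hexdigest(self):
        return "{:08x}".format(self._crc)

h = Crc32()
h.update(b"1234")
h.update(b"56789")
print(h.hexdigest())  # cbf43926
```

The printed value is the standard CRC-32 check value for the ASCII string "123456789", so feeding the data in pieces gives the same result as one call over the whole input.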

To show any integer's lowest 32 bits as 8 hexadecimal digits, without sign, you can "mask" the value by bitwise-ANDing it with a mask made of 32 bits all set to 1, then apply formatting. I.e.:

>>> x = -1767935985
>>> format(x & 0xFFFFFFFF, '08x')
'969f700f'

It's quite irrelevant whether the integer you are thus formatting comes from zlib.crc32 or any other computation whatsoever.
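One practical consequence, sketched below: the mask is idempotent, so the same formatting code is safe whether the checksum arrives signed (as in the question's output) or already unsigned (as zlib.crc32 returns it on Python 3). The helper name to_hex32 is made up for this example.

```python
def to_hex32(n):
    """Format the lowest 32 bits of n as 8 lowercase hex digits."""
    return format(n & 0xFFFFFFFF, "08x")

signed = -1767935985             # value from the question
unsigned = signed & 0xFFFFFFFF   # same bits, read as unsigned

print(unsigned)            # 2527031311
print(to_hex32(signed))    # 969f700f
print(to_hex32(unsigned))  # 969f700f
```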

Python 3.8+ (using the walrus operator):

import zlib

def crc32(filename, chunksize=65536):
    """Compute the CRC-32 checksum of the contents of the given filename"""
    with open(filename, "rb") as f:
        checksum = 0
        while (chunk := f.read(chunksize)):
            checksum = zlib.crc32(chunk, checksum)
        return checksum

chunksize is the number of bytes read from the file at a time. You get the same CRC for the same file no matter what chunksize is (as long as it is > 0), but a very small value can make the code slow, and a very large one can use too much memory.

The result is an unsigned 32-bit integer. The CRC-32 checksum of an empty file is 0.
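Both claims are easy to check. The sketch below writes a temporary file, computes its CRC with several chunk sizes, and does the same for an empty file. The helper crc32_of_file and the file names are made up for this example; the loop avoids the walrus operator so it also runs on Python versions before 3.8.

```python
import os
import tempfile
import zlib

def crc32_of_file(path, chunksize):
    """Chunked CRC-32 of a file (illustrative helper)."""
    checksum = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            checksum = zlib.crc32(chunk, checksum)
    return checksum

with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.bin")
    empty_path = os.path.join(tmp, "empty.bin")
    with open(data_path, "wb") as f:
        f.write(b"123456789" * 1000)
    open(empty_path, "wb").close()

    # A set collapses equal checksums, so one element means all sizes agree.
    results = {crc32_of_file(data_path, n) for n in (1, 7, 4096, 65536)}
    empty_crc = crc32_of_file(empty_path, 65536)

print(len(results))  # 1
print(empty_crc)     # 0
```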

Edited to include Altren's solution below.

A modified and more compact version of CrouZ's answer, with slightly improved performance, using a for loop and file buffering:

def forLoopCrc(fpath):
    """With for loop and buffer."""
    crc = 0
    with open(fpath, 'rb', 65536) as ins:
        for x in range(int((os.stat(fpath).st_size / 65536)) + 1):
            crc = zlib.crc32(ins.read(65536), crc)
    return '%08X' % (crc & 0xFFFFFFFF)

Results on a 6700K with an HDD:

(Note: Retested multiple times and it was consistently faster.)

Warming up the machine...
Finished.

Beginning tests...
File size: 90288KB
Test cycles: 500

With for loop and buffer.
Result 45.24728019630359 

CrouZ solution
Result 45.433838356097894 

kobor42 solution
Result 104.16215688703986 

Altren solution
Result 101.7247863946586  

Tested in Python 3.6.4 x64 using the script below:

import os, timeit, zlib, random, binascii

def forLoopCrc(fpath):
    """With for loop and buffer."""
    crc = 0
    with open(fpath, 'rb', 65536) as ins:
        for x in range(int((os.stat(fpath).st_size / 65536)) + 1):
            crc = zlib.crc32(ins.read(65536), crc)
    return '%08X' % (crc & 0xFFFFFFFF)

def crc32(fileName):
    """CrouZ solution"""
    with open(fileName, 'rb') as fh:
        hash = 0
        while True:
            s = fh.read(65536)
            if not s:
                break
            hash = zlib.crc32(s, hash)
        return "%08X" % (hash & 0xFFFFFFFF)

def crc(fileName):
    """kobor42 solution"""
    prev = 0
    for eachLine in open(fileName,"rb"):
        prev = zlib.crc32(eachLine, prev)
    return "%X"%(prev & 0xFFFFFFFF)

def crc32altren(filename):
    """Altren solution"""
    buf = open(filename, 'rb').read()
    hash = binascii.crc32(buf) & 0xFFFFFFFF
    return "%08X" % hash

fpath = r'D:\test\test.dat'
tests = {forLoopCrc: 'With for loop and buffer.',
         crc32: 'CrouZ solution', crc: 'kobor42 solution',
         crc32altren: 'Altren solution'}
count = 500

# CPU, HDD warmup
randomItm = [x for x in tests.keys()]
random.shuffle(randomItm)
print('\nWarming up the machine...')
for c in range(count):
    randomItm[0](fpath)
print('Finished.\n')

# Begin test
print('Beginning tests...\nFile size: %dKB\nTest cycles: %d\n' % (
    os.stat(fpath).st_size/1024, count))
for x in tests:
    print(tests[x])
    start_time = timeit.default_timer()
    for c in range(count):
        x(fpath)
    print('Result', timeit.default_timer() - start_time, '\n')

It is slightly faster (about 0.4% in the run above), which I attribute to for loops being faster than while loops in CPython.

Merging the two code samples above gives:

try:
    fd = open(decompressedFile,"rb")
except IOError:
    logging.error("Unable to open the file in readmode:" + decompressedFile)
    return 4
eachLine = fd.readline()
prev = 0
while eachLine:
    prev = zlib.crc32(eachLine, prev)
    eachLine = fd.readline()
fd.close()

There is a faster and more compact way to compute the CRC, using binascii:

import binascii

def crc32(filename):
    buf = open(filename, 'rb').read()
    hash = binascii.crc32(buf) & 0xFFFFFFFF
    return "%08X" % hash
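Note that this version reads the whole file into memory at once. As a hedged aside, binascii.crc32 computes the same checksum as zlib.crc32 and also accepts a running value as its second argument, so it can be fed in pieces as well. The sketch below uses an in-memory byte string rather than a file to keep it self-contained.

```python
import binascii
import zlib

data = b"123456789"

# Both modules implement the same CRC-32.
same = binascii.crc32(data) == zlib.crc32(data)
print(same)  # True

# Feed the data four bytes at a time, carrying the running checksum.
running = 0
for i in range(0, len(data), 4):
    running = binascii.crc32(data[i:i + 4], running)

result = "%08X" % (running & 0xFFFFFFFF)
print(result)  # CBF43926
```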

You can use base64 to get output like [ERD45FTR]. zlib.crc32 also accepts a running value, so the checksum can be updated line by line.

import os, sys
import zlib
import base64

def crc(fileName):
    fd = open(fileName, "rb")
    content = fd.readlines()
    fd.close()
    prev = 0
    for eachLine in content:
        # Pass the previous value back in to continue the checksum.
        prev = zlib.crc32(eachLine, prev)
    return prev

for eachFile in sys.argv[1:]:
    # Encodes the decimal string of the checksum, not its raw bytes.
    print(base64.b64encode(str(crc(eachFile)).encode()))
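The code above base64-encodes the decimal text of the checksum, so the output length varies. A possible alternative, sketched here under my own assumptions (it is not what the answer does), is to pack the checksum into its four raw bytes first, which always yields an eight-character base64 string.

```python
import base64
import struct
import zlib

checksum = zlib.crc32(b"123456789") & 0xFFFFFFFF

# ">I" packs an unsigned 32-bit integer in big-endian byte order.
raw = struct.pack(">I", checksum)

# Four input bytes always encode to eight base64 characters (with padding).
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)

# Decoding recovers the original integer.
decoded = struct.unpack(">I", base64.b64decode(encoded))[0]
print(decoded == checksum)  # True
```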

Solution:

import os, sys
import zlib

def crc(fileName, excludeLine="", includeLine=""):
    # excludeLine should be bytes when given, because the file is read in binary mode.
    try:
        fd = open(fileName, "rb")
    except IOError:
        print("Unable to open the file in read mode: " + fileName)
        return
    eachLine = fd.readline()
    prev = 0
    while eachLine:
        # Skip excluded lines, but still advance to the next line.
        if not (excludeLine and eachLine.startswith(excludeLine)):
            prev = zlib.crc32(eachLine, prev)
        eachLine = fd.readline()
    fd.close()
    return format(prev & 0xFFFFFFFF, '08x')  # returns 8 digits crc

for eachFile in sys.argv[1:]:
    print(crc(eachFile))

I don't really know what the excludeLine="" and includeLine="" parameters are for...




