Sanity checking files before working with them
Sunday, October 9th, 2011
Dealing with lots of files, especially large files, can be very time consuming, not only in the processing but also in the transfer of the data itself. It may take a few rounds to fully grasp how expensive it is, in personnel time, to process the same data multiple times. Data integrity is further complicated by file transfers that end early but throw no error. sftp, ftp, and so on can all suffer from this issue.
Have you ever noticed, when downloading software from a web site, that you are also given an MD5 of the file you’re downloading? I would normally ignore the MD5 because I figured that if the file downloaded, it likely downloaded correctly and I could use it. Most of the time this worked well, until I started processing very large data sets in the 50-150 GB range. When a single bit of processing on those files can take over a day, the cost of a corrupted file is pretty big. The solution itself is fairly easy: use the MD5 sum of the file. To generate the MD5 sums of a number of files, you can use the following command:
md5sum * > my_md5_sum_file.txt
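As a complement to the command above, here is a minimal sketch of checking files against a saved checksum list with md5sum -c, which is available in GNU coreutils. The temporary directories and file names are made up for illustration, and the list is written outside the data directory so that it does not end up inside its own listing on a later run.

```shell
#!/bin/sh
# Throwaway directories for the demo; the names are illustrative only.
data=$(mktemp -d)
out=$(mktemp -d)
printf 'first\n'  > "$data/file1.txt"
printf 'second\n' > "$data/file2.txt"

# Generate the checksum list for every file in the data directory.
( cd "$data" && md5sum * ) > "$out/my_md5_sum_file.txt"

# Later (for example after a transfer), re-check each file against the list.
# md5sum -c exits non-zero if any file is missing or has a different checksum.
if ( cd "$data" && md5sum -c "$out/my_md5_sum_file.txt" ) > "$out/check.log" 2>&1; then
    result="ok"
else
    result="corrupt"
fi
echo "$result"
```

In a real transfer you would copy the checksum list along with the data and run the -c step in the destination directory.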
The one issue you can run into is nested directories: md5sum * only looks at the entries in the current directory and reports an error for each subdirectory instead of descending into it. To cover the whole tree, you can do:
find . -type f -name ‘
The benefit of the above command is that it keeps the directory structure in mind when doing the md5sum. The goal is to run it on both the source directory and the directory you’re copying the data to.
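One way the recursive step could look is sketched below. This is not necessarily the exact command used in this post; it assumes GNU find and md5sum, and the function name make_manifest and the demo paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: checksum every regular file under a directory.
# $1 = directory to scan, $2 = absolute path of the output list.
make_manifest() {
    # Running from inside the directory keeps the recorded paths relative
    # (./nested/b.txt), so lists from two machines are comparable.
    ( cd "$1" && find . -type f -exec md5sum {} + ) > "$2"
}

# Demo on a throwaway tree with one nested directory.
demo=$(mktemp -d)
mkdir -p "$demo/src/nested"
printf 'alpha\n' > "$demo/src/a.txt"
printf 'beta\n'  > "$demo/src/nested/b.txt"

# The list is written outside the scanned directory so it does not list itself.
make_manifest "$demo/src" "$demo/src_md5.txt"
cat "$demo/src_md5.txt"
```

Each output line holds a checksum followed by the file's path relative to the scanned directory.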
At this point, we want to be able to “diff” the two txt files. Of course, that’s not really possible yet, because find may list the files in a different order on each side. We can fix that by sorting the output, with something like:
find . -type f -name ‘
From this point, we can diff the two text files, as long as the directory structure is consistent between the two locations.
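The whole comparison can be sketched end to end as follows. It is an illustration under assumptions rather than the exact commands from this post: it assumes GNU find, md5sum, sort and diff, the function name sorted_manifest and all paths are hypothetical, and the "destination" is a local directory with one deliberately truncated file standing in for a bad transfer.

```shell
#!/bin/sh
# Hypothetical helper: checksum list sorted by path, so ordering is stable.
# LC_ALL=C keeps the sort order identical across machines with different locales.
sorted_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + | LC_ALL=C sort -k 2 ) > "$2"
}

work=$(mktemp -d)
mkdir -p "$work/src/sub" "$work/dst/sub"
printf 'one\n' > "$work/src/a.dat"
printf 'two\n' > "$work/src/sub/b.dat"
cp "$work/src/a.dat" "$work/dst/a.dat"
printf 'tw' > "$work/dst/sub/b.dat"   # simulate a transfer that stopped early

sorted_manifest "$work/src" "$work/src_md5.txt"
sorted_manifest "$work/dst" "$work/dst_md5.txt"

# diff exits 0 when the two lists are identical and 1 when they differ.
if diff "$work/src_md5.txt" "$work/dst_md5.txt" > "$work/report.txt"; then
    status="match"
else
    status="mismatch"
fi
echo "$status"
cat "$work/report.txt"
```

Any line that shows up in the diff report names a file that needs to be transferred again.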





