Today I encountered a problem I hadn’t thought much about before. How can I tell if the contents of two files are the same? If we’re directly comparing two files, that should be pretty simple. Given 3 files, how can we tell? Simply, we’d read the contents of the file and figure out if those objects have equivalency.
Let’s make 3 files.
test1.txt
and test2.txt
will contain the string “THIS IS SOME TEXT.”
test3.txt
will contain “THIS IS SOME OTHER TEXT.”
1 2 3 4 5 6 |
|
What do we expect this program to output? We expect line 5 to evaluate the true
and line 6 to evaluate to false
, which it does.
This is a good solution but it is not scalable. What if instead of only comparing three files, we wanted to compare a file many times larger across hundreds of thousands of files? That would be a nightmare. So we need to find a more efficient way to do this.
Why would we need to do this you ask? Well, if we’re maintaining a file server or database we’d need a quick way to eliminate duplicate files to keep the server lean and prevent confusion later down the line. Another common application for needing to check file equivalency is for checking your data’s integrity during transmission or storage.
How is this done? By creating something called a checksum. Wireshark provides this summary:
A checksum is basically a calculated summary of such a data portion. Network data transmissions often produce errors, such as toggled, missing or duplicated bits. As a result, the data received might not be identical to the data transmitted, which is obviously a bad thing. Because of these transmission errors, network protocols very often use checksums to detect such errors. The transmitter will calculate a checksum of the data and transmits the data together with the checksum. The receiver will calculate the checksum of the received data with the same algorithm as the transmitter. If the received and calculated checksums don’t match a transmission error has occurred.
In other words, data transmitted over a network is being spell checked as it is copied.
Checksums and Ruby Data Structures
What’s another reason for this? Storing a bunch of files in memory gets expensive very quickly. If files all have different names, then the only way to search for duplicate values is by reading the contents of a file and then comparing it to all the values stored in memory, like we did in the first example. What if instead we just stored a checksum, a smaller digital fingerprint of the file’s contents? Then we have any number of ways to store, search, or compare our data.
Ruby doesn’t natively support hashing algorithms, but fortunately the Digest
module and the MD5
hashing algorithim are built into the standard library so all we have to do is require them.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
Running this program results in this: => {"81e3a7e854d334e82f75a2bcdbe6a3da"=>[#<File:test1.txt>, #<File:test2.txt>], "32b2eccab2dcc035c50820d0943e5b94"=>[#<File:test3.txt>]}
So even though these were three different files, our checksum algorithm was able to determine that the first two files have equivalent values. What’s the application for this?
Searching through a hash for a key is fast - much faster than iterating through an array. So if we wanted to find duplicate files, instead of using the file name (an intuitive choice) for the key, we could store this checksum value as the key. In a sense, the checksum is both the key AND the value. With its place reserved in memory, all we’d have to do is check to see if the new file’s checksum exists as a key in our hash.
Consider this program. We initialize a Checker
class with two files, test1.txt
and test3.txt
. Then we run our unique?
function on test2.txt
. Remember, files 1 and 2 have the same contents. We now have very small fingerprints of 1 and 3 stored in memory, and instead of reading their entire contents to check them against our new file, we simply create a fingerprint for the new file and compare it to our current set of fingerprints.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
|
Since the checksum value already exists in the hash, the check.unique?(two)
returns false.