August 15, 2006 Douglas White dwhite@nist.gov www.nsrl.nist.gov This is a snapshot of our experimental code. Please read the notice in the Perl scripts. We created a sorted list of unique MD5 values from the RDS 2.13 release. We mapped those 32-char ASCII strings into 16-byte integers and concatenated them to form a binary file "rds213-md5.bin" . This is the most compact form we have found that preserves the order and allows perfect matching. The binary file can be used to create an ASCII list if needed. The perl script "binsrch.pl" can be used to query the binary MD5 collection, to determine if an MD5 is known in the NSRL. "perl binsrch.pl -h" provides help for the script. At this time, the Bloom filters with which we are experimenting are stored in roughly 512MB files. The files have one 512 byte "header" section, a 2^32 bit (512MiB) Bloom bit map section, and may contain data after the bit map. A blank Bloom file is provided in a zip file; you can unzip "blank.bf.zip" to begin work with a blank filter. You may also run the script "init-bf-file.pl" to create a blank filter. To view the header information, you can run "dd if=blank.bf bs=512 count=1" and you can read the content. If you modify the filter bit map, you should update the header information by using the "mod-bf.pl" script. (again, -h provides help) The script "add-md5-32-16.pl" is used to add the information about MD5 hash values to the Bloom filter bit map. You may add one or more MD5s via the command line, or specify a bulk load file. Ten bulk load files are provided, "loadMD5.aa" through "loadMD5.aj". Each file has 10,000 MD5 values from RDS 2.13, for a total of 100,000 values - just under 1% of the RDS 2.13 contents. You may run the "foo.sh" script to bulk load all 100,000 values. Each set of 10,000 takes under a minute on a decent computer. The resulting Bloom filter file after this loading has been done is provided in a zip file. You can extract "rds213-md5.bf.zip" to obtain that file. This file has had the header information updated to reflect the added bit map content. To determine if an MD5 value is known in the Bloom filter, you can use the "query-md5-32-16.pl" script. Again, it has command like or bulk file capability. Two examples you may try are "perl query-md5-32-16.pl -f rds213-md5.bf -i testknown" and "perl query-md5-32-16.pl -f rds213-md5.bf -i testunknown -U" The "testknown" and "testunknown" files were created from the results of "head loadMD5.ah" and some manipulation. The add and query code for SHA-1 values are included here. The SHA-1 implementation varies in the number of Bloom vectors used - 16 vectors for MD5, 20 vectors for SHA-1. The effects of changing the Bloom key size, vector count and number of inputs can be seen by using the "bf-math.pl" script. =================================== Binary tree search vs. Bloom filter search In brief, there are certain benefits and pitfalls to both of these methods; some include: Binary tree + exact matches, no false pos. nor neg. + easy to delete information + no overhead calculation - rebuild entire file to insert information - distribution file grows over time - maximum number of seeks grows with content growth for 10^n items, max seek = n * log(10) / log(2) e.g. RDS 2.13 10^7 items = 23 seek maximum - unknowns always need max seeks Bloom filter + easy to update information (bitwise OR) + maximum number of seeks is constant (number of vectors) + unknowns average max/2 seeks + distribution file never grows - some overhead calculation for Bloom vectors - may have false positives (no false negatives) - very hard to delete information Our math shows that a Bloom filter with 35-bit keys, using 20 vectors can store 1,000,000,000 SHA-1 values with a 1-in-100,000,000 false positive rate, and be stored in 4GiB. For comparison: binary tree 20GB file 15 seek average for knowns 30 seeks necessary for all unknowns bloom filter 4GiB file 20 seeks necessary for all knowns 10 seek average for unknowns If less than 80% of the tests are for unknowns in this case, then the Bloom filter is more efficient with respect to seeks. =================================== 7570 Aug 15 05:00 add-md5-32-16.pl 8690 Aug 15 05:01 add-sha-32-20.pl 3164 Aug 14 15:09 bf-math.pl 7062 Aug 15 04:42 binsrch.pl 521377 Aug 14 19:16 blank.bf.zip 650 Aug 15 03:50 foo.sh 2666 Aug 14 13:56 init-bf-file.pl 330000 Aug 15 03:47 loadMD5.aa 330000 Aug 15 03:47 loadMD5.ab 330000 Aug 15 03:47 loadMD5.ac 330000 Aug 15 03:47 loadMD5.ad 330000 Aug 15 03:47 loadMD5.ae 330000 Aug 15 03:47 loadMD5.af 330000 Aug 15 03:47 loadMD5.ag 330000 Aug 15 03:47 loadMD5.ah 330000 Aug 15 03:47 loadMD5.ai 330000 Aug 15 03:47 loadMD5.aj 5187 Aug 14 13:59 mod-bf.pl 5689 Aug 15 04:49 query-md5-32-16.pl 6698 Aug 15 04:58 query-sha-32-20.pl 3852279 Aug 15 05:22 rds213-md5.bf.zip 179152528 Aug 15 04:11 rds213-md5.bin 330 Aug 15 04:18 testknown 330 Aug 15 04:27 testunknown MD5(add-md5-32-16.pl)= eaeed71b7f54ee1fee1abeb1d3e0f9db MD5(add-sha-32-20.pl)= 82c1e0c826017e056f811e333d247272 MD5(bf-math.pl)= 2e7b018e47f670943ab54702b8b9ef62 MD5(binsrch.pl)= 69b6e77fa2c4976268cfb9f6a6d4b35d MD5(blank.bf.zip)= 1e745d9f5d9a31afe468482f5f06f37c MD5(foo.sh)= 70906aaf7f74bf29e14421bfb2aa0fda MD5(init-bf-file.pl)= aeb7f1406d51ea99df59d27d35cb3287 MD5(loadMD5.aa)= 3b7f96ca1d3a44e4000327eec53af0fa MD5(loadMD5.ab)= a339694fc77baec435b513c994749963 MD5(loadMD5.ac)= 2dc611e380b2d61a6fa383cf3d67e8bc MD5(loadMD5.ad)= ec4dbb8f0cc20957bd0b3465dee3356b MD5(loadMD5.ae)= 105ef19671591c0ccfcd0975937697a0 MD5(loadMD5.af)= c75145e9e7b65823728784aeeeb23e5a MD5(loadMD5.ag)= 19bcce862818d6fa8ffd76b842a9ad13 MD5(loadMD5.ah)= 1e15a1b08e3ba6d11f789d7243d12a1a MD5(loadMD5.ai)= ef7fac6aa7cc50fb13449eebc4b3f058 MD5(loadMD5.aj)= 5b13cd07a3de8b77dbe519aa51aad13f MD5(mod-bf.pl)= 773d27b02f50b693c57af8dc7d2506f2 MD5(query-md5-32-16.pl)= 6c32f1bb41eaac8e8061992bec81f96f MD5(query-sha-32-20.pl)= ecbe16cb32038d1cb53a8959041ab181 MD5(rds213-md5.bf.zip)= d5b4a0f9e9a1c201a7fb2dfbe67249a0 MD5(rds213-md5.bin)= 5a9e1ac4cb16c37e17bf20dba42609e9 MD5(testknown)= c4633ba2f8c8efb9cca6c4e5acf7fe3d MD5(testunknown)= f447e2fc79fcf7d29080204d006d32a0 SHA1(add-md5-32-16.pl)= e2b3a12a1d2cafb9f7bf55029c2ccaba91af2f83 SHA1(add-sha-32-20.pl)= d2c7ec3ec72adfbf0d27a49392e90230ea57639e SHA1(bf-math.pl)= 6bf28f9a61d4b0b6a2fa7e14f099a066adf2d749 SHA1(binsrch.pl)= 2d8cb2db644197dbd25e57255ff373705aa214cf SHA1(blank.bf.zip)= ae9465d17df0913e43ae40e0c4d58ee514880cff SHA1(foo.sh)= 7199bc93035741d745ae5539de0aa1e8a2571f03 SHA1(init-bf-file.pl)= b85718033d7434655f62e4724bb49203693db845 SHA1(loadMD5.aa)= c014a6ff2029c68972259879579144cdee2fcf73 SHA1(loadMD5.ab)= c2498493e51fe4dcae050973eb64fe972f41f122 SHA1(loadMD5.ac)= 8559d419b33bed3a3cc95e26570c87eb59e1b2f4 SHA1(loadMD5.ad)= 88fcc7a4dc39060276b813faa2989c5b8d3308a8 SHA1(loadMD5.ae)= 0c54d15a2b882552d00b85455f69bd43adb60ac0 SHA1(loadMD5.af)= 378deeed404e80f20f7df06d14539f51fcd90c38 SHA1(loadMD5.ag)= 8b10d43add02b228445ac78edffcbfcd04f1683e SHA1(loadMD5.ah)= eebf107b10f33400d58988ff0517f7e9f867703a SHA1(loadMD5.ai)= 5ec023818dc254d0ed5937ad8a31ccf55d387359 SHA1(loadMD5.aj)= c3b4129d6e1ecca94dbd16cf727c462fcffa0f02 SHA1(mod-bf.pl)= eaebd819bb819c2330bdc8e0b6aadfa6fee98bef SHA1(query-md5-32-16.pl)= caf85badac016554cf52b94cdfde9f332b419a91 SHA1(query-sha-32-20.pl)= 3c8b4768b12aad4cb6ed6ebfce4c1a57905de1fa SHA1(rds213-md5.bf.zip)= b5aae2bb20db1e8aefdf1ef1932dd04deee83338 SHA1(rds213-md5.bin)= d4375aadbea26e8598081a3428b5bf299c3a6b98 SHA1(testknown)= fb3f1014d6fec8d99cc26c949fb969eef00c4725 SHA1(testunknown)= 07b7a61ab707cfa88160cbc8855a31242269285c