Dan Mares wrote an insightful article, "Using File Hashes to Reduce Forensic Analysis," for SC Magazine, published in May 2002. The article compared the sets of file hashes from the NDIC Hashkeeper project and the NIST National Software Reference Library (NSRL) project. Mares succinctly described a scenario in which an analyst could save time by not processing 'known' files, and outlined a hashkeeper paradigm. He also described the concept of known file versions (the files on the distribution media, or the files resulting from the install process), which had a bearing on the metrics in his comparison table.
The comparison tables in the original article were generated with a Hashkeeper data set on the order of 700,000 hashes and an NSRL data set on the order of 1,000,000 hashes.
Version 1.1 of the NSRL data set had a nearly balanced combination of recent Microsoft, Solaris, Linux, Oracle and Adobe software, popular games, and clip art collections. The NSRL at that time had neither Windows 98 nor NT, which is clearly apparent in Mares' second table.
Mares found "NDIC usually included file hashes after the programs had been 'installed,' while NIST apparently included the hashes of the 'uninstalled' files as found in the .CAB files."
This was true for the NSRL at that time. The focus was on automatically harvesting the file hashes from the media, which included recursively hashing the contents of .CAB, .ZIP, .TAR and similar archive files. We expected that many of the files would not change, and therefore that their hashes would not change. After NSRL 1.1 was released, we were able to hash a wider collection of earlier Microsoft operating systems and other popular applications, and we specifically performed installations to gather metrics. Here are tables similar to Mares' using the NSRL 1.2 data set and Hashkeeper data set 001-243.
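The recursive harvesting of archive contents described above can be illustrated with a minimal Python sketch. It is not the NSRL's actual tooling: the function names are hypothetical, and it handles only ZIP and TAR archives because the Python standard library has no .CAB reader. Each item is hashed as-is, and if it turns out to be an archive, every file inside it is hashed in turn.

```python
import hashlib
import io
import tarfile
import zipfile
from typing import Dict, Optional


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as upper-case hex."""
    return hashlib.md5(data).hexdigest().upper()


def harvest_hashes(name: str, data: bytes,
                   found: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Hash `data`, then recurse into it if it is a ZIP or TAR archive.

    Hypothetical helper: returns a mapping of path -> MD5, where nested
    members are recorded as "archive/member".
    """
    if found is None:
        found = {}
    found[name] = md5_hex(data)  # the archive itself is also a file

    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        buffer.seek(0)
        with zipfile.ZipFile(buffer) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    harvest_hashes(f"{name}/{info.filename}",
                                   archive.read(info), found)
        return found

    buffer.seek(0)
    try:
        with tarfile.open(fileobj=buffer) as archive:
            for member in archive.getmembers():
                if member.isfile():
                    extracted = archive.extractfile(member)
                    harvest_hashes(f"{name}/{member.name}",
                                   extracted.read(), found)
    except (tarfile.TarError, EOFError):
        pass  # not a TAR archive: the top-level hash is all we record
    return found
```

Reading whole files into memory keeps the sketch short; a production harvester would stream large files and would need a separate extractor for .CAB content.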
Using Hashkeeper 001-243 and NSRL 1.2 (June 2002):
| Source | Unique MD5s listed in data file | MD5s in Hashkeeper NOT in NSRL | MD5s in NSRL NOT in Hashkeeper | Common to Both |
|---|---|---|---|---|
| NSRL | 4,022,258 | | 3,777,082 | 245,176 |
| Hashkeeper | 766,854 | 411,962 | | 245,176 |
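The counts in a table like this one are plain set operations on the two lists of MD5 values. The following Python sketch shows one way to compute them; the function names and the toy hash values are illustrative assumptions, not the procedure or data actually used for the table.

```python
from typing import Dict, Iterable, Set


def normalize(hashes: Iterable[str]) -> Set[str]:
    """Collapse duplicates and case differences so each MD5 counts once."""
    return {h.strip().upper() for h in hashes if h.strip()}


def compare_hash_sets(hashkeeper: Iterable[str],
                      nsrl: Iterable[str]) -> Dict[str, int]:
    """Return the unique, exclusive and shared counts for two hash lists."""
    hk = normalize(hashkeeper)
    ns = normalize(nsrl)
    return {
        "unique_hashkeeper": len(hk),
        "unique_nsrl": len(ns),
        "hashkeeper_not_in_nsrl": len(hk - ns),   # set difference
        "nsrl_not_in_hashkeeper": len(ns - hk),
        "common": len(hk & ns),                   # set intersection
    }


# Toy values standing in for real MD5 strings.
print(compare_hash_sets(["aa", "bb", "cc"], ["BB", "CC", "DD", "EE"]))
# {'unique_hashkeeper': 3, 'unique_nsrl': 4, 'hashkeeper_not_in_nsrl': 1,
#  'nsrl_not_in_hashkeeper': 2, 'common': 2}
```

Normalizing case before comparing matters because the same digest may be stored in upper case in one data set and lower case in the other.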
| OS/Apps | Files installed on HD | HD Files not in Hashkeeper | HD Files not in NSRL | Files on distribution CD(s) |
|---|---|---|---|---|
| Virgin Win 98 | 4,266 | 142 (3%) | 297 (7%) | 18,662 |
| Virgin NT4 WS | 1,659 | 1,211 (72%) | 239 (14%) | 17,904 |
| Virgin Win 2Kpro | 5,963 | 783 (13%) | 839 (14%) | 16,539 |
| Virgin Win ME | 5,169 | 2,973 (57%) | 383 (7%) | 11,512 |
| Win 98+Office 2K | 23,464 | 313 (1%) | 596 (2%) | 43,327 |
| Win ME+Office 2K | 24,112 | 3,119 (13%) | 526 (2%) | 32,758 |
| NIST PC #1 W2K | 18,048 | 13,137 (72%) | 11,839 (65%) | N/A |
| NIST PC #2 W2K | 59,135 | 46,277 (79%) | 47,124 (80%) | N/A |
| NIST PC #3 WNT | 14,186 | 7,543 (53%) | 6,618 (46%) | N/A |
| NIST PC #4 W98 | 16,397 | 8,360 (51%) | 7,404 (45%) | N/A |
| NIST PC #5 W98 | 34,220 | 8,366 (25%) | 8,667 (25%) | N/A |
| Lower percentage is better | | | | |
In the second table, we compare the Hashkeeper and NSRL data sets to four virgin operating system installations, to two virgin OSes plus MS Office, and to five NIST PCs in daily use.
A Windows 98 distribution was taken off the NSRL shelf and installed. This distribution had gone through our hashing process. We hashed the installation PC's hard drive, and compared the installation hashes to the Hashkeeper and NSRL sets. Both Hashkeeper and NSRL identified over 90% of the installed files.
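The comparison step can be sketched in a few lines of Python. This assumes the hard drive has already been hashed into a mapping of file path to MD5 and that the reference set is held as a set of upper-case MD5 strings; both function names and the sample paths and digests are hypothetical.

```python
from typing import Dict, List, Set


def unknown_files(installed: Dict[str, str], known: Set[str]) -> List[str]:
    """Return the paths whose MD5 does not appear in the reference set.

    `known` is assumed to contain upper-case hex digests.
    """
    return sorted(path for path, md5 in installed.items()
                  if md5.upper() not in known)


def percent_unknown(unknown_count: int, installed_count: int) -> float:
    """Share of installed files that the reference set did not identify."""
    if installed_count == 0:
        return 0.0
    return 100.0 * unknown_count / installed_count


# Hypothetical three-file installation checked against a two-hash reference set.
installed = {"C:/WINDOWS/a.dll": "aa11", "C:/WINDOWS/b.exe": "BB22",
             "C:/WINDOWS/user.dat": "cc33"}
print(unknown_files(installed, {"AA11", "BB22"}))  # ['C:/WINDOWS/user.dat']

# The Windows 98 row of the table: 142 and 297 unknown out of 4,266 files.
print(round(percent_unknown(142, 4266), 1))  # 3.3
print(round(percent_unknown(297, 4266), 1))  # 7.0
```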
Similar processes were used with a Windows NT4 workstation distribution, a Windows ME distribution and a Windows 2000 distribution, with the results shown above.
On two of the virgin systems, we then installed MS Office with a distribution from our shelves.
Further investigation of the 'unknown' files remaining in the virgin installations may allow us to identify why they are unknown: for example, distribution files that were changed, or new files that were created.
While it is useful to compare a 'reference' hashed medium with an installed counterpart, we also wanted to get a feel for a more real-world application of the hash sets. We identified five PCs on the floor of our building that were in daily use. These ranged from a machine used mainly as an email console with few applications, to a code developer's workstation, to a manager's computer that contained a proportionately large amount of data versus applications, including many non-commercial, NIST-specific applications not hashed by the NSRL.
In the five NIST PCs, the NSRL 'knew' 20% to 75% of the files, and Hashkeeper 'knew' 21% to 75% of the files. The results here are slightly better with the NSRL, though with these data sets an investigator might not notice the numerical difference. The range itself, a span of 55 percentage points, is to be expected, as the data sets are relatively small compared to the software universe and are focused on popular software.
One could conclude that the Hashkeeper and NSRL data sets perform similarly at this time. Given that performance, one would then need to weigh other tangible aspects of the data sets: Hashkeeper may reflect installation hashes better than the NSRL and may react to user feedback more quickly, while the NSRL is based on traceable distribution media and has the capability to add millions of hashes per quarterly release.
"This information is made available through NIST facilities. However, the views expressed and the decisions reported do not necessarily connote NIST agreement with, or endorsement of them. Further, NIST does not endorse any commercial products that may be mentioned. Please address comments about this page to nsrl@nist.gov."