Update to "Using File Hashes to Reduce Forensic Analysis"

Douglas White, July 2002

Dan Mares wrote an insightful article, "Using File Hashes to Reduce Forensic Analysis," published in SC Magazine in May 2002. The article compared the sets of file hashes from the NDIC Hashkeeper project and the NIST National Software Reference Library (NSRL) project. Mares succinctly described a scenario in which an analyst could save time by not processing 'known' files, and outlined a hashkeeper paradigm. The concept of known file versions - the files on the distribution media, or the files resulting from the install process - was described and had a bearing on the metrics in a comparison table.

The original article had comparison tables, generated with a Hashkeeper data set on the order of 700,000 hashes and an NSRL data set on the order of 1,000,000 hashes.

Version 1.1 of the NSRL data set had a nearly balanced combination of late Microsoft, Solaris, Linux, Oracle, Adobe, popular games and clip art collections. The NSRL at that time had neither Windows 98 nor NT, an omission that is clearly apparent in Mares' second table.

Mares found "NDIC usually included file hashes after the programs had been 'installed,' while NIST apparently included the hashes of the 'uninstalled' files as found in the .CAB files."

This was true for the NSRL at that time. The focus was on automatically harvesting the file hashes from the media, including recursively hashing the contents of .CAB, .ZIP, .TAR and similar archive files. We expected that many of the files would not change during installation, and that their hashes would therefore not change either. After NSRL 1.1 was released, we were able to hash a wider collection of earlier Microsoft operating systems and other popular applications, and we specifically performed installations to gather metrics. Here are tables similar to Mares' using the NSRL 1.2 data set and Hashkeeper data set 001-243.
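The recursive harvesting described above can be illustrated with a minimal Python sketch. It is not the NSRL's actual tooling: the function name `harvest_hashes` is hypothetical, and only ZIP and TAR containers are handled because the Python standard library has no .CAB reader. Each file is hashed, and any file that is itself an archive is opened and its members are hashed in turn.

```python
import hashlib
import io
import tarfile
import zipfile


def harvest_hashes(name, data, out=None):
    """Record the MD5 of `data`, then recurse into it if it is a ZIP or TAR archive.

    `name` is a display path; nested members are recorded as "archive/member".
    Returns a dict mapping display path -> lowercase MD5 hex digest.
    """
    if out is None:
        out = {}
    out[name] = hashlib.md5(data).hexdigest()

    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    harvest_hashes(f"{name}/{info.filename}", zf.read(info), out)
        return out

    try:
        # tarfile.open also handles gzip, bzip2 and xz compressed tarballs.
        with tarfile.open(fileobj=io.BytesIO(data)) as tf:
            for member in tf.getmembers():
                if member.isfile():
                    content = tf.extractfile(member).read()
                    harvest_hashes(f"{name}/{member.name}", content, out)
    except tarfile.ReadError:
        pass  # not a TAR archive: treat as an ordinary file
    return out
```

Reading whole files into memory keeps the sketch short; a production harvester would stream large files and would need a separate extractor for .CAB content.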

Using Hashkeeper 001-243 and NSRL 1.2 (June 2002):

Source       Unique MD5s listed   MD5s in Hashkeeper   MD5s in NSRL        Common
             in data file         NOT in NSRL          NOT in Hashkeeper   to Both
NSRL         4,022,258            -                    3,777,082           245,176
Hashkeeper   766,854              411,962              -                   245,176
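The columns of the first table correspond to ordinary set operations on the two collections of MD5 values. The following Python sketch shows one way such counts could be produced; the function name `compare_hash_sets` and the toy hash strings are illustrative assumptions, not part of either project's tooling. Hashes are lowercased first so that differences in hex letter case do not create false mismatches.

```python
def compare_hash_sets(hashes_a, hashes_b):
    """Return the counts reported in a two-source comparison table."""
    a = {h.strip().lower() for h in hashes_a}
    b = {h.strip().lower() for h in hashes_b}
    return {
        "unique_a": len(a),        # unique MD5s listed in source A
        "unique_b": len(b),        # unique MD5s listed in source B
        "only_in_a": len(a - b),   # in A, NOT in B
        "only_in_b": len(b - a),   # in B, NOT in A
        "common": len(a & b),      # common to both
    }


# Toy data only; real inputs would be the MD5 columns of each data file.
counts = compare_hash_sets(["AA", "bb", "cc", "aa"], ["bb", "dd"])
print(counts)
```

By construction, `unique_a` equals `only_in_a + common` when every count is derived from the same deduplicated sets.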

OS/Apps            Files installed   HD Files not     HD Files not     Files on
                   on HD             in Hashkeeper    in NSRL          distribution CD(s)
Virgin Win 98      4,266             142 (3%)         297 (7%)         18,662
Virgin NT4 WS      1,659             1,211 (72%)      239 (14%)        17,904
Virgin Win 2Kpro   5,963             783 (13%)        839 (14%)        16,539
Virgin Win ME      5,169             2,973 (57%)      383 (7%)         11,512
Win 98+Office 2K   23,464            313 (1%)         596 (2%)         43,327
Win ME+Office 2K   24,112            3,119 (13%)      526 (2%)         32,758
NIST PC #1 W2K     18,048            13,137 (72%)     11,839 (65%)     N/A
NIST PC #2 W2K     59,135            46,277 (79%)     47,124 (80%)     N/A
NIST PC #3 WNT     14,186            7,543 (53%)      6,618 (46%)      N/A
NIST PC #4 W98     16,397            8,360 (51%)      7,404 (45%)      N/A
NIST PC #5 W98     34,220            8,366 (25%)      8,667 (25%)      N/A

Lower percentage is better.

In the second table, we compare the Hashkeeper and NSRL data sets to four virgin operating system installations, to two virgin OSes plus MS Office, and to five NIST PCs in daily use.

A Windows 98 distribution was taken off the NSRL shelf and installed. This distribution had gone through our hashing process. We hashed the installation PC's hard drive, and compared the installation hashes to the Hashkeeper and NSRL sets. Both Hashkeeper and NSRL identified over 90% of the installed files.
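The comparison step described here (hash every file on the installed drive, then look each hash up in a reference set) might look like the following Python sketch. The names `md5_of_file`, `split_known_unknown` and `percent_unknown` are hypothetical, and the sketch assumes the reference hashes have already been loaded into a set of lowercase hex strings.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_known_unknown(root, known_md5s):
    """Walk `root` and sort file paths by whether their MD5 is in `known_md5s`."""
    known, unknown = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if md5_of_file(path) in known_md5s:
                known.append(path)
            else:
                unknown.append(path)
    return known, unknown


def percent_unknown(known, unknown):
    """Share of files not found in the reference set (lower is better)."""
    total = len(known) + len(unknown)
    return 100.0 * len(unknown) / total if total else 0.0
```

Only the files in the `unknown` list would then need an analyst's attention, which is the time saving Mares described.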

Similar processes were used with a Windows NT4 workstation distribution, a Windows ME distribution and a Windows 2000 distribution with the results shown above.

On two of the virgin systems, we then installed MS Office with a distribution from our shelves.

Further investigation of the 'unknown' files remaining in the virgin installations may allow us to identify the reason they are unknown - distribution files that were changed, or new files that were created, etc.

While it is useful to compare a 'reference' hashed medium with an installed counterpart, we also wanted to get a feel for a more real-world application of the hash sets. We identified five PCs on the floor of our building that were in daily use. These ranged from a PC used mainly as an email console with few applications, to a code developer's workstation, to a manager's computer that contained a proportionately large amount of data versus applications, including many non-commercial, NIST-specific applications not hashed by the NSRL.

In the five NIST PCs, the NSRL 'knew' 20% to 75% of the files, and Hashkeeper 'knew' 21% to 75% of the files. The results here are slightly better with the NSRL, though with these data sets an investigator might not notice the numerical difference. The range itself, a span of about 55 percentage points, is to be expected, as the data sets are relatively small compared to the software universe and are focused on popular software.
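The 'knew' ranges quoted above follow from the second table by subtracting each 'not in' percentage from 100. A short Python check, using the percentages as printed in the table for NIST PCs #1 through #5 (the helper name `known_range` is illustrative):

```python
# Percent of hard-drive files NOT found in each data set, NIST PCs #1-#5,
# copied from the second table.
not_in_hashkeeper = [72, 79, 53, 51, 25]
not_in_nsrl = [65, 80, 46, 45, 25]


def known_range(unknown_percents):
    """Convert 'unknown' percentages to 'known' and return (lowest, highest)."""
    known = [100 - p for p in unknown_percents]
    return min(known), max(known)


print(known_range(not_in_nsrl))        # (20, 75)
print(known_range(not_in_hashkeeper))  # (21, 75)
```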

One could draw the conclusion that the Hashkeeper and NSRL data sets perform similarly at this time. Given that performance, one would then need to weigh other tangible aspects of the data sets: Hashkeeper may reflect installation hashes better than the NSRL and may react to user feedback more quickly, while the NSRL is based on traceable distribution media and has the capability to add millions of hashes per quarterly release.


National Institute of Standards and Technology
ATTN: NSRL Project
100 Bureau Drive, Stop 8970
Gaithersburg, MD 20899-8970 USA
E-mail: nsrl@nist.gov Phone: 301-975-3262 FAX: 301-948-6213

"This information is made available through NIST facilities. However, the views expressed and the decisions reported do not necessarily connote NIST agreement with, or endorsement of them. Further, NIST does not endorse any commercial products that may be mentioned. Please address comments about this page to nsrl@nist.gov."