Digital forensics XML and the DFXML toolset

Digital Investigation 8 (2012) 161–174



Simson Garfinkel*

Naval Postgraduate School, 900 N. Glebe, Arlington, VA 22203, USA

article info

Article history: Received 6 May 2011; received in revised form 18 November 2011; accepted 25 November 2011.

Keywords: Digital forensics, XML, DFXML, Forensic tools, Forensic tool validation, Forensic automation

abstract

Digital Forensics XML (DFXML) is an XML language that enables the exchange of structured forensic information. DFXML can represent the provenance of data subject to forensic investigation, document the presence and location of file systems, files, Microsoft Windows Registry entries, JPEG EXIFs, and other technical information of interest to the forensic analyst. DFXML can also document the specific tools and processing techniques that were used to produce the results, making it possible to automatically reprocess forensic information as tools are improved. This article presents the motivation, design, and use of DFXML. It also discusses tools that have been created that both ingest and emit DFXML files.

Published by Elsevier Ltd.

1. Introduction

Digital Forensics XML (DFXML) is an XML language designed to represent a wide range of forensic information and forensic processing results. By matching its abstractions to the needs of forensics tools and analysts, DFXML allows the sharing of structured information between independent tools and organizations. Since the initial work in 2007, DFXML has been used to archive the results of forensic processing steps, reducing the need for re-processing digital evidence, and as an interchange format, allowing labeled forensic information to be shared between research collaborators. DFXML is also the basis of a Python module (dfxml.py) that makes it easy to create sophisticated forensic processing programs (or "scripts") with little effort.

Forensic tools can be readily modified to emit and consume DFXML as an alternative data representation format. For example, the PhotoRec carver (Grenier, 2011) and the md5deep hashing application (Kornblum, 2011) were both modified to produce DFXML files. The DFXML output contains the files identified, their physical location within the disk image (in the case of PhotoRec), and their

* Corresponding author. Tel.: +1 617 876 6111. E-mail address: slgarfin@nps.edu.

1742-2876/$ – see front matter Published by Elsevier Ltd. doi:10.1016/j.diin.2011.11.002

cryptographic hashes. Because these programs now both emit compatible DFXML, their output can be processed by a common set of tools.
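Because both tools describe recovered files as DFXML fileobject elements, a single consumer can process output from either. The following sketch illustrates the idea with a small, synthetic DFXML document; the element names follow the examples given later in this article, and the sample values are hypothetical:

```python
# A minimal sketch of tool-agnostic DFXML consumption. Both md5deep and
# PhotoRec describe recovered files as <fileobject> elements, so one
# parser can serve both. The sample document below is synthetic.
import xml.etree.ElementTree as ET

SAMPLE_DFXML = """<?xml version="1.0"?>
<dfxml version="1.0">
  <fileobject>
    <filename>IMG_0044.JPG</filename>
    <filesize>102400</filesize>
    <hashdigest type="md5">9e107d9d372bb6826bd81d3542a419d6</hashdigest>
  </fileobject>
</dfxml>"""

def localname(tag):
    """Strip any XML namespace so the parser works with either tool's output."""
    return tag.rsplit('}', 1)[-1]

def iter_fileobjects(xml_text):
    """Yield (filename, size, md5) for every fileobject in a DFXML document."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if localname(elem.tag) != 'fileobject':
            continue
        fname = size = md5 = None
        for child in elem:
            name = localname(child.tag)
            if name == 'filename':
                fname = child.text
            elif name == 'filesize':
                size = int(child.text)
            elif name == 'hashdigest' and child.get('type') == 'md5':
                md5 = child.text
        yield fname, size, md5

records = list(iter_fileobjects(SAMPLE_DFXML))
print(records)  # [('IMG_0044.JPG', 102400, '9e107d9d372bb6826bd81d3542a419d6')]
```

Ignoring namespaces, as the helper does here, keeps the consumer robust to minor variations between emitting tools.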

DFXML can also document provenance, including the computer on which the application program was compiled, the linked libraries, and the runtime environment. Such provenance can be useful both in research and in preparing courtroom testimony.

DFXML's minimal use of XML features means that the forensic abstractions, APIs and representations described in this paper can be readily migrated to other object-based serializations, including JSON (Zyp and Court, 2010), Protocol Buffers (Google, 2011) and the SQL schema implemented in SleuthKit 3.2 (Carrier, 2010). Indeed, it is possible to readily convert between all four formats.
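The conversion to an object-based serialization such as JSON is largely mechanical, because DFXML uses only nested elements and attributes. The sketch below shows one possible mapping for a fileobject element; the attribute-prefixing convention is an assumption, not part of any DFXML specification:

```python
# Sketch: converting a DFXML fileobject element into JSON. Attributes
# are kept with an '@' prefix and leaf text becomes the value; this
# convention is illustrative, not mandated by DFXML.
import json
import xml.etree.ElementTree as ET

FILEOBJECT = """<fileobject>
  <filename>IMG_0044.JPG</filename>
  <filesize>102400</filesize>
  <hashdigest type="md5">9e107d9d372bb6826bd81d3542a419d6</hashdigest>
</fileobject>"""

def element_to_dict(elem):
    """Recursively convert an XML element to a plain dict. Note: repeated
    sibling tags would clobber each other in this simplified version."""
    children = list(elem)
    if not children and not elem.attrib:
        return elem.text
    d = {'@' + k: v for k, v in elem.attrib.items()}
    if not children:
        d['#text'] = elem.text
        return d
    for child in children:
        d[child.tag] = element_to_dict(child)
    return d

obj = element_to_dict(ET.fromstring(FILEOBJECT))
print(json.dumps(obj, indent=2))
```

The reverse direction (JSON back to XML) is equally mechanical, which is what makes round-tripping between the four formats practical.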

1.1. The need for DFXML

Today's digital forensic tools lack composability. Instead of being designed with the Unix approach of tools that can be connected together to solve big problems, most commonly used forensic tools are monolithic systems designed to ingest a small number of data types (typically disk images and hash sets) and produce a limited set of output types (typically individual files and final reports).


This is true both of tools with limited functionality (e.g., simple file carvers), as well as complex GUI-based tools that include integrated scripting languages. The lack of composability complicates automation and tool validation efforts, and in the process has subtly limited the progress of digital forensics research.

Although there are existing file formats and a few XML languages used in digital forensics today, they are confined to specific applications and limited domains. The lack of standardized abstractions makes it difficult to compare results produced by different tools and algorithms. This lack of standardization similarly impacts tool developers, who must frequently re-implement functions in their own tools that already exist in others.

1.2. Specific uses for DFXML

DFXML improves composability by providing a language for describing common forensic processes (e.g., cryptographic hashing), forensic work products (e.g., the location of files on a hard drive), and metadata (e.g., file names and timestamps).

Various prototype DFXML implementations have been used by the author since 2007 for a variety of purposes:

A tool based on SleuthKit called fiwalk (§5.1) ingests disk images and reports the location and associated file system metadata of each file in the disk image. This tool was used by students for Masters' theses (Migletz, 2008; Huynh, 2008), and a project that applied machine learning to computer forensics (Garfinkel et al., 2010).

A DFXML file was created for each disk image in a corpus of more than 2000 disk images acquired around the world (Garfinkel et al., 2009). Each DFXML file contains information regarding the disk's purchase, physical characteristics, imaging process, allocated and deleted files, and metadata extracted from those files (e.g., Microsoft Office document properties, extracted JPEG EXIF information, etc.).

The DFXML Python module (§4.1) makes it possible to write small programs that perform complex forensic processing on DFXML files (Garfinkel, 2009). In contrast, the learning curve for tools such as EnCase EnScript (Guidance Software, 2007) and SleuthKit (Carrier, 2010) can be quite steep.

The XML files make it dramatically easier to share data with other organizations. In some cases it has only been necessary to share the XML files, rather than the disk images themselves. This is more efficient, as the files are much smaller than the disk images, and helps protect the privacy of the data subjects.

The XML format makes it easy to identify and redact personal information. The resulting redacted XML files can be shared without the need for Institutional Review Board (IRB) or Ethics Board approval; they can even be published on the Internet.

Finally, because the DFXML files record which version of which tool produced each file, it is easy to have tools automatically reprocess disk images when the toolset improves.

1.3. Contributions

This paper makes several specific contributions to the field of digital forensics. First, it describes the motivation and design goals for DFXML. Second, the paper presents specific examples of how DFXML can be used to describe forensic artifacts. These examples make it easy for developers of today's forensic tools to adapt their tools to emit and ingest DFXML as a complement to their current file formats. Next, it presents an API that allows for the rapid prototyping and development of forensic applications. Finally, it describes how the DFXML abstractions can be used as a building block for creating new automated forensic processes.

2. Prior work

Although file formats, abstractions, and XML are all used in digital forensics today, they are rarely themselves the subject of study. Mainly, these topics arise when practitioners discover that they cannot share information with one another, or even between different tools, because data are stored in different formats.

2.1. Digital evidence containers

Broadly speaking, digital evidence containers are files designed to hold digital evidence. Most common are disk image files that hold sector-for-sector copies of hard drives and other mass storage devices. The simplest disk image is a raw format (also called dd format after the Unix dd program).

Modern disk image formats can use lossless compression and de-duplication to decrease the amount of storage space required, while still allowing the regeneration of the original disk image. Although disk image formats such as Norton Ghost, VMWare VMDK, Apple DMG and Microsoft WIM have been used for years within the IT community, forensic practitioners have mostly standardized on the Expert Witness Format (EWF) used by Guidance Software's EnCase program. (The format is also known as the .E01 format after the file extension.) EWF includes limited support for representing metadata such as the date that a disk image was acquired and the name of the examiner who performed the acquisition, as well as a free-format "notes" field, but does not support the representation of structured forensic information.

Kloet et al. (2008) presented an open source implementation of EWF in C; Allen presented an EWF implementation in Java (Allen, 2011a) and C# (Allen, 2011b). These open source implementations make it possible to read any sector of a disk image in EWF format as well as the limited metadata that accompanies the disk image. Of course, these implementations must be combined with software such as SleuthKit (Carrier, 2010) in order to extract individual files from the disk image.

Turner proposed a "wrapper" or metaformat called "Digital Evidence Bags" (DEB) to store digital forensic evidence from disparate sources (Turner, 2005). The DEB consists of a directory that includes a tag file, one or more index files, and one or more bag files. The tag file is a text file


that contains metadata such as the name and organization of the forensic examiner, hashes for the contained information, and data definitions. Turner created a variety of prototype tools, including a Digital Evidence Bag Viewer and a Selective Imager.

Cohen, Garfinkel and Schatz introduced AFF4 (Cohen et al., 2009), a redesign of Garfinkel's Advanced Forensic Format (AFF) (Garfinkel, 2006). Both AFF and AFF4 store disk images and associated metadata. AFF4 uses a flat RDF schema to store this auxiliary information. Although the RDF schema can be used to store file and file system metadata, this is not frequently done in practice, and tools to create such RDF files are not generally available.

2.2. Representing registry information

There has been considerable forensic research aimed at recovering allocated data from Windows Registry hive files (Howell, 2009) and from unallocated space inside the hive (Thomassen, 2008; Tang et al., 2009).

Because of limitations of the ASCII-based registry file format defined by Microsoft's RegEdit tool, several developers created tools for extracting Registry entries from hive files and representing the resultant information as XML (Rodriguez, 2003; Shayne, 2001; Jones, 2009).

The National Institute of Standards and Technology's WIRED project has developed a program called reg-diff.rb, which ingests two ASCII files generated by RegEdit and produces an XML file describing the differences (Dima, 2006).

2.3. File system metadata standards

File system metadata is the name given to information within a file system other than file contents, including file names, timestamps, access control lists and disk labels. File system metadata is widely used in computer forensics as the primary tool for navigating file system information and reconstructing event timelines.

To date there has been little effort to develop standard descriptions of file system metadata. The Coroner's Toolkit (Farmer and Venema, 2005) introduced a "body file" format containing 16 entries for each file including file name, size, MAC times, allocation status, and other metadata that can be recovered from a file system. Individual fields were separated by pipe symbols (|) to allow for easy parsing by programs written in Perl. Body files were designed for moving data from one tool to another in the Toolkit, but not for data archiving or exchange between examiners. Carrier preserved the file format in SleuthKit 2.0 but modified it in SleuthKit 3.0 by reducing the number of fields to 11, rendering old files incompatible with the new tools and vice versa.

2.4. File metadata and extracted features

The Electronic Discovery Reference Model (EDRM) (Socha, 2011) is an XML-based data interchange format for describing metadata of interest to e-discovery practitioners, including the Microsoft proprietary metadata fields embedded within Word and PowerPoint office files, and the To:, From: and Subject: fields of email messages. EDRM does not describe the physical location of a file on a hard drive or the MD5 hash values of individual sectors.

The National Information Exchange Model is an effort by the US Department of Justice, the US Department of Homeland Security, and the US Department of Health and Human Services to create standardized data models for the sharing of structured information between different federal agencies. Of interest to forensics practitioners is the Terrorist Watchlist Person Data Exchange Standard, which provides a schema for describing identity information (US Department of Justice and US Department of Homeland Security, 2011).

2.5. XML languages for computer security

Frazier (2010) of MANDIANT developed Indicators of Compromise (IOCs), an XML-based language designed to express signatures of malware such as files with a particular MD5 hash value, file length, or the existence of particular registry entries. There is a free editor for manipulating IOC files. MANDIANT has a tool that can use these IOCs to scan for malware and the so-called "Advanced Persistent Threat."

MITRE's "Making Security Measurable" project has developed three XML languages for describing items of importance to computer security practitioners and researchers. The project includes the Open Vulnerability and Assessment Language (OVAL™), the Common Event Expression (CEE™), and the Malware Attribution Enumeration and Characterization (MAEC™) languages.

Both MANDIANT's IOC and MITRE's MAEC are similar to DFXML in that they can describe file names and file system properties. Both are able to express items not envisioned by DFXML; IOC can even contain conditional logic. But both lack the ability to express specific features of forensic interest, including hash values that correspond to specific byte runs within an object, the ability to specify the physical location on a piece of media, and the ability to specify a variety of file system attributes such as allocation status.

2.6. XMLs for media forensics

There has been limited work developing XML languages specifically for digital forensics.

Alink et al. presented XIRAF (an XML Information Retrieval Approach to digital forensics) at NLPXML 2006 (Alink et al., 2006b) and DFRWS 2006 (Alink et al., 2006a). The authors stressed the importance of having "a clean separation between feature extraction and analysis" and the importance of having "a single, XML-based output format for forensic analysis tools." XIRAF stores XML documents in an XML-aware database; examiners conduct forensic investigations through the use of XML queries.

Levine and Liberatore (2009) presented DEX (Digital Evidence Exchange) at DFRWS 2009; DEX had the goals of making it possible to reproduce the original evidence from the XML description, and of enabling tool comparison and validation. DEX made extensive use of XML attributes that required complex parsing rules. The authors released a DEX tool written in Java under a BSD-like license.

Grenier designed an XML log file for the PhotoRec (Grenier, 2011) carver. Grenier did not implement his original design, but instead graciously accepted patches from the author of the present article and incorporated DFXML into PhotoRec 6.12.

3. Digital forensic abstractions and digital forensics XML

Today the most common ways for forensic practitioners to exchange forensic data are disk images and text files. For example, an investigator might give an analyst a disk image of a captured USB drive and an ASCII list of MD5 hash values and ask if any of the files in the list are on the drive. Although this approach works in practice, it does not lend itself to evolutionary growth. For example, there is no standard way to annotate that list of MD5 hash values with SHA1 hash values, similarity digests, or classification levels. Instead, every person that wishes to annotate a list needs to develop their own ad-hoc format, and every tool that would interpret such a list needs to be able to handle such formats. Analysts, most of whom cannot program, spend a lot of time in Microsoft Excel adding and removing columns to overcome the diversity of formats that have evolved in recent years.

Other areas of information technology have successfully outgrown similar exercises in babble. For example, the growth of the World Wide Web is often attributed to the development of the HTML and HTTP standards, which made it possible for different groups to write software that interoperated without prior arrangement. Clearly, the Web also owes its birth to POSIX, TCP/IP, and the Berkeley Sockets API.

Digital forensics can similarly benefit from standardized abstractions, representations and interfaces. Such abstractions can leverage existing concepts and further enable digital forensics processes, allowing tools, practitioners and organizations to communicate more efficiently about forensic processes, while simultaneously providing an evolutionary path to exchanging increasingly sophisticated representations.

3.1. Example 1: using DFXML to describe file locations

Consider a JPEG file on a FAT32 SD card. Agreed upon abstractions, conventions and standards allow the SD card to be moved from a digital camera to a PC running Windows or a Macintosh running MacOS. These computers can use the same name to access the same sequence of bytes that make up the JPEG file, and when desktop computers display the file on their computer screens, the pictures look virtually indistinguishable.

Forensic tools do not enjoy the same level of interoperability when it comes to describing deleted JPEGs or carving artifacts that might be found on the same SD card. The only way to determine whether a deleted file recovered by SleuthKit and one recovered by EnCase are the same is to compare the files byte-by-byte or to compare the sector numbers from which the deleted files were recovered. Other approaches, such as comparing hash values of the two files, may not be satisfactory, as there are now multiple documented cases of different files that have the same MD5 hash value (Diaz, 2005; Selinger, 2009; Microsoft, 2008). Another disadvantage of using hash value comparison is that file similarities may be inadvertently obscured. This can happen because the length of a carved file cannot be unambiguously determined. If two carvers identify the same file with the same starting point but the lengths are off by one byte, a hash value comparison will report that the files are different, while a byte-run comparison will report that one file is a subset of the other.

File systems have an advantage over forensic tools: Whereas standards and convention clearly define the mapping between an allocated file and a set of disk blocks, "undelete" is not a well-defined operation. Different tools undelete differently, because the information on the hard drive required to perform the undelete operation may be incomplete, ambiguous, or contradictory. CarvFS attempts to solve this problem through the use of file names interpreted by the file system as pointers to specific disk blocks (Meijer, 2011). But CarvFS is limited to representing the location of data on the drive; attempts to encode other information in the file names would result in prohibitively long names, and such encoding would ultimately result in names with structured attributes similar to what has been developed for DFXML.

An alternative approach employed by DFXML is to create a high-level language for describing where on a disk a file's content resides within a forensic disk image. For example, a JPEG file split into three pieces can be described as a set of three byte runs, each with a logical offset within the file, a physical offset within the disk image, and a length, as shown in Fig. 1.
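The mapping is simple enough that extracting a fragmented file from a raw image is a few lines of code. The sketch below uses a synthetic three-run example; the attribute names (file_offset, img_offset, len) follow the description above and this article's figures, though exact names have varied across tool versions:

```python
# Sketch: reassembling a fragmented file from a raw disk image using a
# byte_runs description. The runs and the 32-byte "disk image" below
# are synthetic; attribute names follow this article's examples.
import xml.etree.ElementTree as ET

BYTE_RUNS = """<byte_runs>
  <byte_run file_offset="0" img_offset="10" len="4"/>
  <byte_run file_offset="4" img_offset="20" len="4"/>
  <byte_run file_offset="8" img_offset="2"  len="3"/>
</byte_runs>"""

# A tiny stand-in for a disk image, with the file's fragments scattered in it.
disk_image = bytearray(32)
disk_image[10:14] = b'ABCD'
disk_image[20:24] = b'EFGH'
disk_image[2:5] = b'IJK'

def extract(byte_runs_xml, image):
    """Copy each run from its physical (img_offset) location in the image
    to its logical (file_offset) position in the reassembled file."""
    runs = ET.fromstring(byte_runs_xml)
    total = max(int(r.get('file_offset')) + int(r.get('len')) for r in runs)
    out = bytearray(total)
    for r in runs:
        fo, po, ln = (int(r.get(a)) for a in ('file_offset', 'img_offset', 'len'))
        out[fo:fo + ln] = image[po:po + ln]
    return bytes(out)

print(extract(BYTE_RUNS, disk_image))  # b'ABCDEFGHIJK'
```

Because every extent is expressed in bytes relative to the image, two tools that emit the same byte runs have provably recovered the same content, regardless of how each tool found it.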

The byte_runs approach is readily extended to describe logical byte runs that are zero-filled (and thus do not appear on the physical media) by replacing the img_offset attribute with a fill="0" attribute. Likewise, NTFS compression is represented with the attributes transform="NTFS_DECOMPRESS" and raw_len="155".

DFXML expresses all sizes and extents in bytes, as runs do not necessarily start on sector boundaries (for example, small NTFS files are resident within the MFT) and because the sector numbers cannot be interpreted without knowing the sector size, extrinsic information that may be missing or incorrect.

It is straightforward to modify existing programs to generate the byte_runs tag. Once these modifications are made, it is trivial to compare the output of different versions of a program for regression testing, or to compare the results of processing the same data with different tools for conformance testing and certification.

The complete XML element for the JPEG in question (taken from Garfinkel et al. (2009)) appears in Fig. 2.

Fig. 1. Each byte_run XML tag specifies a mapping of logical bytes in a file to a physical location within a disk image. They can be combined in the byte_runs tag to specify fragments of a fragmented file.

3.2. Example 2: using DFXML for hash lists

While today it is common to distribute a set of file hashes as a tab-delimited file containing file names and MD5 hash values, a DFXML file of hashes can be expanded to include SHA1 and SHA256 hashes, descriptions of each file, classification levels, partial hashes of key sectors, and even the email address of an individual who should be contacted if the file is encountered. The use of XML means that adding such fields does not impact older programs that do not expect such data. As such, DFXML makes it possible to gradually evolve interchange formats, giving researchers and practitioners the ability to put increasingly sophisticated analysis results or new annotations in their interchange and archive files.
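The evolution argument can be sketched directly: a hash-list entry built with an MD5 digest is later annotated with a SHA-256 digest and a contact address, and a consumer that only understands MD5 is unaffected. The contact element and sample values here are hypothetical illustrations, not part of any published schema:

```python
# Sketch: a DFXML-style hash-list entry that is later extended with new
# fields. Element names beyond filename/hashdigest, and all sample
# values, are hypothetical.
import hashlib
import xml.etree.ElementTree as ET

data = b'hello, dfxml'

fo = ET.Element('fileobject')
ET.SubElement(fo, 'filename').text = 'note.txt'
md5 = ET.SubElement(fo, 'hashdigest', type='md5')
md5.text = hashlib.md5(data).hexdigest()

# Later annotation: add a SHA-256 digest and a point of contact. An old
# consumer that only looks for type="md5" simply ignores both additions.
sha = ET.SubElement(fo, 'hashdigest', type='sha256')
sha.text = hashlib.sha256(data).hexdigest()
ET.SubElement(fo, 'contact').text = 'analyst@example.org'

digests = {h.get('type'): h.text for h in fo.iter('hashdigest')}
print(sorted(digests))  # ['md5', 'sha256']
```

A tab-delimited format offers no equivalent: inserting a column silently breaks every existing consumer that parses by position.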

3.3. Goals for DFXML

Previous efforts aimed at developing new formats for computer forensics have largely failed. For example, DFRWS launched a project in 2007 to create standardized abstractions for digital evidence containers; this project was abandoned within a year due to the lack of support and funding (Common Digital Evidence Storage Format Working Group, 2007). Based on the DFRWS experience, it seems reasonable that any effort to create an XML language for digital forensics should be envisioned as a low-cost project that nevertheless can produce significant savings or provide new capabilities. The following goals are compatible with such financial realities:

1. Complement existing forensic formats. Rather than replacing existing formats, the new language should augment them. This is accomplished by making it easy to convert between legacy and new formats, and by developing techniques so that the new formats can be used to annotate legacy data.

2. Be easy to generate. It must be easy to modify existing tools to generate the new representations. An open source C and C++ library aids in the modification process.

3. Be easy to ingest. Likewise, it must be easy to modify existing tools to read and process DFXML. An open source DFXML Python module based on the Python SAX XML parser makes it possible to efficiently read and process very large DFXML files (see Section 4).

4. Provide for human readability. A forensic analyst with no training should be able to look at a conforming DFXML file and make sense of it without the need for a special viewer. To this end, many tools produce DFXML that is pretty-printed.

5. Be free, open, and extensible. Both the representation and reference implementation must be available for all to use, without a license fee. Developers should be able to add new tags without the need for central coordination (accomplishable through the use of XML namespaces).

6. Provide for scalability. The representation must be usable at both ends of the forensic scale. Small amounts of information must have short descriptions, while it must be possible to efficiently process XML documents tens of gigabytes in size (which might result from processing multi-terabyte drives). As such, it must be possible to process DFXML using event-based XML parsers (e.g., Python Software Foundation (2010); Cameron et al. (2008); Zhang and van Engelen (2006)), rather than requiring the use of tree-based parsers such as those based on the Document Object Model.

7. Adhere to existing practices and standards. Where possible, DFXML should follow existing standards rather than inventing new ones. Where multiple, conflicting

Fig. 2. The completed XML element for IMG_0044.JPG. Notice that the create and modify times are accurate to 2 s, while the access time is only accurate to one day. All times are given without a UTC offset, since FAT32 file systems store time in local time. (Linebreaks and pretty-printing added for legibility.)
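The streaming requirement of goal 6 can be sketched with Python's standard library: iterparse visits each element as its close tag is read, and clearing each completed fileobject keeps memory use flat no matter how large the document. The miniature document below stands in for a multi-gigabyte one:

```python
# Sketch: event-based processing of a DFXML document, per goal 6.
# Elements are handled and discarded as they complete, so memory use
# stays bounded regardless of document size.
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a very large DFXML document: 1000 fileobjects.
big_doc = io.BytesIO(
    b"<dfxml>" +
    b"".join(b"<fileobject><filesize>%d</filesize></fileobject>" % n
             for n in range(1000)) +
    b"</dfxml>")

total = 0
for event, elem in ET.iterparse(big_doc, events=('end',)):
    if elem.tag == 'fileobject':
        total += int(elem.find('filesize').text)
        elem.clear()  # discard the completed subtree

print(total)  # 499500
```

A DOM-style parser would instead hold all one thousand (or one billion) fileobjects in memory at once before any processing could begin.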
