Working with Binary Files in Java




Introduction

Java contains an extensive array of classes for file access. A series of readers, writers and filters make up the interface to the physical file system of the computer. The advantage of this system of classes is that the programmer is freed from the overhead of dealing with the physical layout of files. The main disadvantage is the other side of the same design: the programmer is isolated from the physical details of how a file is stored. Java programs have a distinct, well-defined way of storing data in files. Unfortunately, this complicates matters when dealing with files created by other languages.

This article presents a reusable class that deals with binary files. Methods are provided which allow the programmer to read a variety of standard numeric and string formats. Additional methods are provided which take into account signed/unsigned, little/big-endian storage as well as file alignment. Using this class the programmer can read nearly any sort of binary file. An example program is provided that will read the header from a GIF file.

One of the first problems to overcome is reading an unsigned byte. Java's integer types are signed, with char as the only exception, so a byte read from a file arrives as a value from -128 to 127. To do the arithmetic later required to combine bytes into larger data types, the bytes must be unsigned. A protected method is provided to read bytes in unsigned form. It masks off all but the least significant eight bits and returns the result as a short, giving a value from 0 to 255. This is done with the following lines of code:

protected short readUnsignedByte() throws java.io.IOException
{
    return (short) (_file.readByte() & 0xff);
}
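To make the effect of the mask concrete, here is a minimal standalone sketch. The class and method names (UnsignedByteDemo, toUnsigned) are illustrative only and are not part of the BinaryFile class.

```java
class UnsignedByteDemo {
    /** Widens a signed Java byte to its unsigned value in the range 0..255. */
    static int toUnsigned(byte b) {
        // b is promoted to int (with sign extension) first; the mask keeps only the low eight bits
        return b & 0xff;
    }

    public static void main(String[] args) {
        byte raw = (byte) 0xC8;              // bit pattern 1100 1000
        System.out.println(raw);             // -56 when interpreted as signed
        System.out.println(toUnsigned(raw)); // 200 when interpreted as unsigned
    }
}
```

Without the mask, a byte such as 0xC8 would contribute a negative value to any later shift-and-combine step.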

Using the BinaryFile Class

The BinaryFile class can be seen in BinaryFile.java. To use it, create a RandomAccessFile object for the file that you would like to work with. This file can be opened for read or write access. Then construct a BinaryFile object, passing your RandomAccessFile object to the constructor. The following two lines prepare to read from and write to a file called “test.dat”.

RandomAccessFile file = new RandomAccessFile("test.dat", "rw");
BinaryFile bin = new BinaryFile(file);
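Since the underlying RandomAccessFile holds an operating system file handle, it is worth closing it reliably. The following is a minimal sketch, using only the standard RandomAccessFile class, of opening a file in a try-with-resources block, writing a few bytes, rewinding, and reading them back. The class name, the roundTrip helper, and the temporary file name are assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

class RandomAccessDemo {
    /** Writes the low byte of each value to a scratch file, rewinds, and reads the bytes back unsigned. */
    static int[] roundTrip(int... values) {
        try {
            File tmp = File.createTempFile("binaryfile-demo", ".dat"); // hypothetical scratch file
            tmp.deleteOnExit();
            try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
                for (int v : values) {
                    file.write(v); // write(int) stores only the low eight bits
                }
                file.seek(0);      // rewind before reading
                int[] result = new int[values.length];
                for (int i = 0; i < result.length; i++) {
                    result[i] = file.readUnsignedByte();
                }
                return result;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (int v : roundTrip(1, 200, 255)) {
            System.out.println(v);
        }
    }
}
```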

Once this is complete you can call the various methods provided to access different data types. The methods to access the various data types are prefixed with either read or write, followed by the type. For example, the method to read a fixed length string is readFixedString. The complete class is shown in Listing 1.

Listing 1: Reading Java Binary Files (BinaryFile.java)

import java.io.*;

/**
 * @author Jeff Heaton
 * @version 1.0
 */
class BinaryFile
{
    /**
     * Use this constant to specify big-endian integers.
     */
    public static final short BIG_ENDIAN = 1;

    /**
     * Use this constant to specify little-endian integers.
     */
    public static final short LITTLE_ENDIAN = 2;

    /**
     * The underlying file.
     */
    protected RandomAccessFile _file;

    /**
     * Are we in LITTLE_ENDIAN or BIG_ENDIAN mode.
     */
    protected short _endian;

    /**
     * Are we reading signed or unsigned numbers.
     */
    protected boolean _signed;

    /**
     * The constructor. Use to specify the underlying file.
     *
     * @param f The file to read/write from/to.
     */
    public BinaryFile(RandomAccessFile f)
    {
        _file = f;
        _endian = LITTLE_ENDIAN;
        _signed = false;
    }

    /**
     * Set the endian mode for reading integers.
     *
     * @param i Specify either LITTLE_ENDIAN or BIG_ENDIAN.
     * @exception java.lang.Exception Will be thrown if this method is
     * not passed either BinaryFile.LITTLE_ENDIAN or BinaryFile.BIG_ENDIAN.
     */
    public void setEndian(short i) throws Exception
    {
        if ((i == BIG_ENDIAN) || (i == LITTLE_ENDIAN))
            _endian = i;
        else
            throw (new Exception(
                "Must be BinaryFile.LITTLE_ENDIAN or BinaryFile.BIG_ENDIAN"));
    }

    /**
     * Returns the endian mode. Will be either BIG_ENDIAN or LITTLE_ENDIAN.
     *
     * @return BIG_ENDIAN or LITTLE_ENDIAN to specify the current endian mode.
     */
    public int getEndian()
    {
        return _endian;
    }

    /**
     * Sets the signed or unsigned mode for integers. true for signed, false for unsigned.
     *
     * @param b True if numbers are to be read/written as signed, false if unsigned.
     */
    public void setSigned(boolean b)
    {
        _signed = b;
    }

    /**
     * Returns the signed mode.
     *
     * @return Returns true for signed, false for unsigned.
     */
    public boolean getSigned()
    {
        return _signed;
    }

    /**
     * Reads a fixed length ASCII string.
     *
     * @param length How long of a string to read.
     * @return The string that was read.
     * @exception java.io.IOException If an IO exception occurs.
     */
    public String readFixedString(int length) throws java.io.IOException
    {
        StringBuilder rtn = new StringBuilder();
        for (int i = 0; i < length; i++)
            // read unsigned so that bytes above 0x7f are not sign-extended into the char
            rtn.append((char) readUnsignedByte());
        return rtn.toString();
    }

    /**
     * Writes a fixed length ASCII string. Will truncate the string if it does not fit in the specified buffer.
     *
     * @param str The string to be written.
     * @param length The length of the area to write to. Should be larger than the length of the string being written.
     * @exception java.io.IOException If an IO exception occurs.
     */
    public void writeFixedString(String str, int length)
        throws java.io.IOException
    {
        int i;

        // trim the string back some if needed
        if (str.length() > length)
            str = str.substring(0, length);

        // write the string
        for (i = 0; i < str.length(); i++)
            _file.write(str.charAt(i));

        // pad the remaining space with zeros if needed
        i = length - str.length();
        while ((i--) > 0)
            _file.write(0);
    }

    /**
     * Reads a string that stores one length byte before the string.
     * This string can be up to 255 characters long. Pascal stores strings this way.
     *
     * @return The string that was read.
     * @exception java.io.IOException If an IO exception occurs.
     */
    public String readLengthPrefixString() throws java.io.IOException
    {
        short len = readUnsignedByte();
        return readFixedString(len);
    }

    /**
     * Writes a string that is prefixed by a single byte that specifies the length of the string.
     * This is how Pascal usually stores strings. Strings longer than 255 characters are truncated.
     *
     * @param str The string to be written.
     * @exception java.io.IOException If an IO exception occurs.
     */
    public void writeLengthPrefixString(String str) throws java.io.IOException
    {
        // a single length byte cannot describe more than 255 characters
        if (str.length() > 255)
            str = str.substring(0, 255);
        writeByte((short) str.length());
        for (int i = 0; i < str.length(); i++)
            _file.write(str.charAt(i));
    }

    /**
     * Reads a fixed length string that is zero (NULL) terminated. This is a type of string used by C/C++. For example char str[80].
     *
     * @param length The length of the string.
     * @return The string that was read.
     * @exception java.io.IOException If an IO exception occurs.
     */
    public String readFixedZeroString(int length) throws java.io.IOException
    {
        String rtn = readFixedString(length);
        int i = rtn.indexOf(0);
        if (i != -1)
            rtn = rtn.substring(0, i);
        return rtn;
    }

    /**
     * Writes a fixed length string that is zero terminated. This is the format generally used by C/C++ for string storage.
     *
     * @param str The string to be written.
     * @param length The length of the buffer to receive the string.
     * @exception java.io.IOException If an IO exception occurs.
     */
    public void writeFixedZeroString(String str, int length)
        throws java.io.IOException
    {
        // leave room for the terminating zero
        if (length > 0 && str.length() >= length)
            str = str.substring(0, length - 1);
        writeFixedString(str, length);
    }

    /**
     * Reads an unlimited length zero (NULL) terminated string.
     *
     * @return The string that was read.
     * @exception java.io.IOException If an IO exception occurs.
     */
    public String readZeroString() throws java.io.IOException
    {
        StringBuilder rtn = new StringBuilder();
        int ch;

        // read() returns -1 at end of file, so stop there as well as at the terminator
        while ((ch = _file.read()) > 0)
            rtn.append((char) ch);
        return rtn.toString();
    }

    /**
     * Writes an unlimited zero (NULL) terminated string to the file.
     *
     * @param str The string to be written.
     * @exception java.io.IOException If an IO exception occurs.
     */
    public void writeZeroString(String str) throws java.io.IOException
    {
        for (int i = 0; i < str.length(); i++)
            _file.write(str.charAt(i));
        writeByte((byte) 0);
    }

    /**
     * Internal function used to read an unsigned byte. External classes should use the readByte function.
     *
     * @return The byte, unsigned, as a short.
     * @exception java.io.IOException If an IO exception occurs.
     */
    protected short readUnsignedByte() throws java.io.IOException
    {
        return (short) (_file.readByte() & 0xff);
    }

    /**
     * Reads an 8-bit byte. Can be signed or unsigned depending on the signed property.
     *
     * @return A byte stored in a short.
     * @exception java.io.IOException If an IO exception occurs.
     */
    public short readByte() throws java.io.IOException
    {
        if (_signed)
            return (short) _file.readByte();
        else
            return (short) _file.readUnsignedByte();
    }

    /**
     * Writes a single byte to the file.
     *
     * @param b The byte to be written.
     * @exception java.io.IOException If an IO exception occurs.
     */
    public void writeByte(short b) throws java.io.IOException
    {
        _file.write(b & 0xff);
    }

    /**
     * Reads a 16-bit word. Can be signed or unsigned depending on the signed property.
     * Can be little or big endian depending on the endian property.
     *
     * @return A word stored in an int.
     * @exception java.io.IOException If an IO exception occurs.
     */
    public int readWord() throws java.io.IOException
    {
        short a, b;
        int result;

        a = readUnsignedByte();
        b = readUnsignedByte();

        if (_endian == BIG_ENDIAN)
            result = ((a << 8) | b);
................
            _file.write((int) (d & 0xff00) >> 8);
            _file.write((int) (d & 0xff0000) >> 16);
            _file.write((int) (d & 0xff000000) >> 24);
        }
    }

    /**
     * Allows the file to be aligned to a specified byte boundary.
     * For example, if 4 (a double word) is specified, the file pointer will be
     * moved to the next double word boundary.
     *
     * @param a The byte-boundary to align to.
     * @exception java.io.IOException If an IO exception occurs.
     */
    public void align(int a) throws java.io.IOException
    {
        if ((_file.getFilePointer() % a) > 0)
        {
            long pos = _file.getFilePointer() / a;
            _file.seek((pos + 1) * a);
        }
    }
}
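The arithmetic behind the align method can be checked in isolation. The sketch below is not part of the listing; the class and method names (AlignDemo, alignUp) are illustrative, and it works on a plain position value instead of a file pointer.

```java
class AlignDemo {
    /** Rounds pos up to the next multiple of boundary; positions already aligned are unchanged. */
    static long alignUp(long pos, int boundary) {
        if (boundary <= 0) {
            throw new IllegalArgumentException("boundary must be positive");
        }
        long remainder = pos % boundary;
        return remainder == 0 ? pos : pos + (boundary - remainder);
    }

    public static void main(String[] args) {
        System.out.println(alignUp(5, 4)); // 8: moved to the next double word boundary
        System.out.println(alignUp(8, 4)); // 8: already aligned, so unchanged
    }
}
```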

String Datatypes

There are many ways that strings are commonly stored in a binary file. The BinaryFile class supports four different string formats. The null-terminated and fixed-width null-terminated types used by C/C++ are supported, as are plain fixed-width strings and the length-prefixed strings used by Pascal.

Null terminated strings are commonly used with C/C++ and other languages. In this format the characters of the string are stored one by one, with an ending zero character. This allows strings to be of any length. Strings stored in this format can contain any character, except for the zero character. Two types of null-terminated strings are supported.

The readZeroString and writeZeroString methods are used to read and write null-terminated strings. This is an unlimited length string that ends with a null (character 0). The readZeroString method accepts no parameters and returns a String object. The writeZeroString method accepts a String object to be written.
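The null-terminated layout can be sketched on a plain byte array, independent of any file. The names below (ZeroStringDemo, encode, decode) are illustrative, and the sketch assumes single-byte characters (ISO-8859-1). Note that the decoder also stops at the end of the data, since input may end before a terminator is found.

```java
import java.nio.charset.StandardCharsets;

class ZeroStringDemo {
    /** Encodes s as single-byte characters followed by one terminating zero byte. */
    static byte[] encode(String s) {
        byte[] chars = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = new byte[chars.length + 1]; // the last element stays 0 and acts as the terminator
        System.arraycopy(chars, 0, out, 0, chars.length);
        return out;
    }

    /** Decodes from offset up to (not including) the first zero byte, or to the end of the data. */
    static String decode(byte[] data, int offset) {
        int end = offset;
        while (end < data.length && data[end] != 0) {
            end++;
        }
        return new String(data, offset, end - offset, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] stored = encode("Hello");
        System.out.println(stored.length);     // 6: five characters plus the terminator
        System.out.println(decode(stored, 0)); // Hello
    }
}
```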

The readFixedZeroString and writeFixedZeroString methods are used to read and write fixed-length null-terminated strings. This is the type of string most commonly used by the C/C++ programming language. The amount of memory held by this sort of string is fixed, but the length of the string can vary from zero up to one less than the number of bytes reserved for it, because one byte is needed for the terminator. In C/C++ this type of string is written as:

char str[80];

This means that the str variable occupies eighty bytes. But its length can vary from zero to seventy-nine. No matter how long this string is, it is always stored to a disk file as exactly eighty bytes.
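As a rough illustration of this layout, the sketch below packs a string into a fixed-size field while always reserving one byte for the terminator, then reads it back. The names (FixedZeroStringDemo, encode, decode) are hypothetical, and single-byte characters are assumed.

```java
import java.nio.charset.StandardCharsets;

class FixedZeroStringDemo {
    /** Packs s into exactly size bytes, keeping at least one trailing zero as the terminator. */
    static byte[] encode(String s, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must leave room for the terminator");
        }
        byte[] chars = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = new byte[size];              // zero-filled by default
        int n = Math.min(chars.length, size - 1); // truncate so the last byte stays zero
        System.arraycopy(chars, 0, out, 0, n);
        return out;
    }

    /** Returns the characters before the first zero byte in the field. */
    static String decode(byte[] field) {
        int end = 0;
        while (end < field.length && field[end] != 0) {
            end++;
        }
        return new String(field, 0, end, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] field = encode("hello", 80);
        System.out.println(field.length);  // 80, regardless of the string length
        System.out.println(decode(field)); // hello
    }
}
```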

The Pascal language uses length-prefixed strings. The Macintosh operating system is based on Pascal strings, and as a result length-prefixed strings are commonly found in files generated on the Macintosh platform. The readLengthPrefixString and writeLengthPrefixString methods are used to read and write length-prefixed strings. The writeLengthPrefixString method accepts a string and writes it out to the file. The readLengthPrefixString method returns a String object read from the file. Length-prefixed strings occupy their length plus one byte in memory, and because the length is stored in a single byte they can hold at most 255 characters.
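A length-prefixed string can be sketched on a byte array as follows. The names (PascalStringDemo, encode, decode) are illustrative and not part of BinaryFile; single-byte characters are assumed, and this sketch rejects strings over 255 characters instead of truncating them. The length byte must be read as unsigned, otherwise lengths above 127 would come out negative.

```java
import java.nio.charset.StandardCharsets;

class PascalStringDemo {
    /** Encodes s as one unsigned length byte followed by the characters. */
    static byte[] encode(String s) {
        byte[] chars = s.getBytes(StandardCharsets.ISO_8859_1);
        if (chars.length > 255) {
            throw new IllegalArgumentException("at most 255 characters fit behind a one-byte length");
        }
        byte[] out = new byte[chars.length + 1];
        out[0] = (byte) chars.length;
        System.arraycopy(chars, 0, out, 1, chars.length);
        return out;
    }

    /** Decodes a length-prefixed string that starts at offset. */
    static String decode(byte[] data, int offset) {
        int length = data[offset] & 0xff; // mask so lengths of 128..255 are not read as negative
        return new String(data, offset + 1, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] stored = encode("Hi");
        System.out.println(stored.length);     // 3: the length byte plus two characters
        System.out.println(decode(stored, 0)); // Hi
    }
}
```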

The last, and simplest, string type supported by the BinaryFile class is the fixed-width string. A fixed-width string is simply an area of memory reserved for the string. The string occupies the beginning bytes of this buffer, and any remaining space is padded with either zeros or spaces, depending on the program that wrote the file. It is not unusual to have to trim a string just read in from this format. The readFixedString and writeFixedString methods are used to read and write fixed-width strings. The readFixedString method accepts a parameter to specify the length of the string and returns a String object read from the file. The writeFixedString method accepts a String object and a length parameter. The String object is then written to the file. If the string is longer than the specified length, it is truncated. If it is shorter, the remaining bytes are padded with zeros.
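The truncate-or-pad behaviour, and the trim that is often needed after reading, can be sketched like this. The names (FixedWidthDemo, encode, decode) and the choice to strip both zero bytes and spaces are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FixedWidthDemo {
    /** Writes s into a field of exactly width bytes, truncating it or padding with padByte. */
    static byte[] encode(String s, int width, byte padByte) {
        byte[] out = new byte[width];
        Arrays.fill(out, padByte);
        byte[] chars = s.getBytes(StandardCharsets.ISO_8859_1);
        System.arraycopy(chars, 0, out, 0, Math.min(chars.length, width));
        return out;
    }

    /** Reads the whole field, then strips trailing zero bytes and spaces. */
    static String decode(byte[] field) {
        int end = field.length;
        while (end > 0 && (field[end - 1] == 0 || field[end - 1] == ' ')) {
            end--;
        }
        return new String(field, 0, end, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decode(encode("abc", 8, (byte) ' ')));  // abc
        System.out.println(decode(encode("abcdef", 4, (byte) 0))); // abcd
    }
}
```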

Numeric Datatypes

In Jonathan Swift’s Gulliver’s Travels the nations of Lilliput and Blefuscu find themselves at war over which end of a hardboiled egg to cut before eating. Lilliput preferred the Little-Endian approach of starting with the little end of the egg, whereas Blefuscu preferred to start with the large end. An inane controversy indeed, but one that mirrors our own computer industry. When an integer occupies more than one byte, it is necessary to decide which byte to store first. Take for example the number 1025. This number has to be stored in two bytes. The high-order byte is four and the low-order byte is one, because the integer division of 1025 by 256 is four, with a remainder of one. So we have the bytes four and one. Is this stored as 01 04 or as 04 01? Computer scientists call the two notations little-endian and big-endian respectively, the same words Swift used to describe the dilemma of the Lilliputians. The two systems can be seen in figure one.
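The 1025 example can be worked through in code. The sketch below uses illustrative names (EndianDemo, toBytes16, fromBytes16) that are not part of BinaryFile, and cross-checks the little-endian result against the standard java.nio.ByteBuffer class.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
    /** Splits a 16-bit value into two bytes in the requested order. */
    static byte[] toBytes16(int value, boolean bigEndian) {
        byte high = (byte) ((value >> 8) & 0xff);
        byte low = (byte) (value & 0xff);
        return bigEndian ? new byte[] {high, low} : new byte[] {low, high};
    }

    /** Reassembles an unsigned 16-bit value from two bytes stored in the given order. */
    static int fromBytes16(byte[] bytes, boolean bigEndian) {
        int first = bytes[0] & 0xff;
        int second = bytes[1] & 0xff;
        return bigEndian ? (first << 8) | second : (second << 8) | first;
    }

    public static void main(String[] args) {
        byte[] little = toBytes16(1025, false); // 01 04
        byte[] big = toBytes16(1025, true);     // 04 01
        System.out.println(fromBytes16(little, false)); // 1025
        System.out.println(fromBytes16(big, true));     // 1025
        // Cross-check against the standard library's view of little-endian order.
        short viaNio = ByteBuffer.wrap(little).order(ByteOrder.LITTLE_ENDIAN).getShort();
        System.out.println(viaNio);                     // 1025
    }
}
```

Reading the same two bytes with the wrong order would give 260 (1 * 256 + 4) instead of 1025, which is why the byte order of a file format has to be known in advance.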

So which one is predominant in the industry? Unfortunately it’s a near dead heat. Intel x86 processors, and therefore most PCs, are little-endian. Most of the UNIX variants and the Internet standards are big-endian. Motorola 680x0 microprocessors (and therefore Macintoshes), Hewlett-Packard PA-RISC, and Sun SuperSPARC processors are big-endian. The Silicon Graphics MIPS and IBM/Motorola PowerPC processors support both little-endian and big-endian operation. As a result, the binary file class presented in this article handles both standards.

To accommodate both little-endian and big-endian numbers, integers are first read in byte by byte and then converted into the correct data type. For numbers that are four bytes long, the next four bytes from the file are read into the variables a, b, c and d. Then, to convert from big-endian or little-endian order, the following equation is used.

result = ((a << 24) | (b << 16) | (c << 8) | d);
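A minimal sketch of the four-byte assembly described above follows. It is not the article's own code: the names (DWordDemo, unsignedDWord, signedDWord) are hypothetical, a, b, c and d are assumed to be unsigned byte values in file order, and a long is used so that the unsigned result is not confused with a negative int.

```java
class DWordDemo {
    /** Combines four unsigned byte values (0..255), given in file order, into an unsigned 32-bit value. */
    static long unsignedDWord(int a, int b, int c, int d, boolean bigEndian) {
        if (bigEndian) {
            // the first byte in the file is the most significant
            return ((long) a << 24) | (b << 16) | (c << 8) | d;
        }
        // little-endian: the first byte in the file is the least significant
        return ((long) d << 24) | (c << 16) | (b << 8) | a;
    }

    /** Reinterprets the same 32 bits as a signed two's-complement value. */
    static int signedDWord(int a, int b, int c, int d, boolean bigEndian) {
        return (int) unsignedDWord(a, b, c, d, bigEndian);
    }

    public static void main(String[] args) {
        System.out.println(unsignedDWord(0, 0, 4, 1, true));          // 1025
        System.out.println(unsignedDWord(1, 4, 0, 0, false));         // 1025
        System.out.println(unsignedDWord(255, 255, 255, 255, true));  // 4294967295
        System.out.println(signedDWord(255, 255, 255, 255, true));    // -1
    }
}
```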
................
