Compiler-Driven Memory Management for Improving I/O ...



A CASE STUDY OF COMPILER DRIVEN I/O MINIMIZATION

W. Zhang, T. Beaubouef

Department of Computer Science and Industrial Technology

Southeastern Louisiana University

Hammond, LA 70402

(985) 549-2189

wzhang@selu.edu

J. Feng, E. Leiss

Department of Computer Science

University of Houston

Houston, TX 77204

(713) 743-3350

jennyfeng@uh.edu, eleiss@uh.edu

Abstract

Most scientific programs have input and output data sets too large to fit in main memory, requiring either out-of-core programming or reliance on virtual memory management (VMM). VMM is often not an effective approach because it can cause substantial performance degradation. In contrast, compiler-driven I/O management allows a program's data sets to be retrieved explicitly in parts. In this paper a modified Comanche (Compiler Managed Cache) system is presented.

Introduction

The speed gap between processors and disks continues to increase as VLSI technology advances at a tremendous rate while disk technology improves comparatively slowly. As a result, disk I/O has become a serious bottleneck for many high-performance computer systems. Hence, it is critically important to be able to construct I/O-minimal programs [5].

The size and complexity of applications, in both the scientific and the commercial world, have increased. When the data sets required by an application exceed the capacity of main memory, the computation becomes an out-of-core computation. Processing out-of-core data requires staging it in smaller granules that fit in main memory. Since all the data required for the computation must be fetched from files on disk, improving disk I/O becomes extremely important. Much research has been done on virtual memory management (VMM) and other related operating system (OS) concepts, on I/O subsystem hardware, and on parallel file systems. Each contributes to some degree to I/O performance, but they all lack a global view of application behavior, which limits their effectiveness.

Parallel I/O is a cost-effective way to address some I/O issues. The wide availability of inexpensive, powerful PC clusters with high-speed networks makes parallel I/O a viable approach. Parallel I/O subsystems have increased the I/O capabilities of parallel machines significantly, but much improvement is still needed to keep pace with CPU performance. The variety of I/O architectures (private disks, shared disks, or a combination of both) makes it difficult to design optimization techniques that reduce the I/O cost. The problem has become more severe as the size and complexity of applications have increased tremendously [7].

OS designers handle I/O activity via virtual memory management (VMM). Research on this approach considers smart virtual memory, techniques that reshape data reference patterns to exploit the given hardware facilities and system software, and replacement policies. Overall, these techniques assume a considerable amount of help from the hardware. The VMM systems currently provided for scalable computer systems typically offer poor performance for scientific applications when an application's working data set does not fit in main memory [1].

A number of run-time libraries for out-of-core computations and a few file interfaces have been proposed, among them SIO [6], MPI-IO [2], and an extension of the traditional Unix file I/O interface for handling parallel accesses to parallel disk subsystems [3]. Parallel file systems and run-time libraries for out-of-core computations deliver considerable I/O performance, but they require much effort from the user; moreover, they are not portable across the wide variety of parallel machines with different disk subsystems.

The difficulty of handling out-of-core data and writing an efficient out-of-core program limits the performance of high performance computers. Out-of-core programs often do not perform well when they rely on the VMM system. There is a clear need for compiler-directed explicit I/O for out-of-core computations.

The design and implementation of compilers for high-performance systems require a thorough understanding of the target architecture to deliver the highest level of performance; in particular, it is important to know the I/O behavior of the memory hierarchy. In this paper we concentrate on the compiler-based approach to the I/O problem. The main rationale behind this approach is that the compiler has unique information about the data needs of the program: it can examine the size and shape of the data and the overall access pattern of the application. Compiler-driven I/O management should generate code to restructure out-of-core data, the computation, and the management of memory resources. A compiler combined with a user-level runtime system can replace and outperform standard virtual memory management for out-of-core problems [7]. A compiler that chooses a good combination of file layouts on disk and loop transformations is successful at optimizing programs that depend on disk-resident data on distributed-memory machines [4].

Writing an efficient out-of-core version of a program is a difficult task; it requires a combination of memory management, out-of-core compilation, and the interaction of file layouts on disk with loop transformations. Comanche [8] is based on a compiler-managed cache that resides in main memory. The compiler has unique information about the data access patterns of the program and the size and shape of its arrays; note that it is exactly this information that is typically not available to the operating system, which is customarily charged with the management of I/O. Comanche restructures the out-of-core program to move data more efficiently between main memory and disk.

It is important to maximize the locality of data references in order to reduce I/O transfers between layers of the memory hierarchy. We can take advantage of program locality by focusing optimization on loops. One way to do so is to transform the program and its data sets together so that the locality between the two is maintained. During initial testing of Comanche it was observed that memory-mapping a large file has a significant impact on system resources. We are mainly interested in how to access large data sets, not in how to manipulate the data once they are in memory: in our experiments, the major part of the execution time was spent on I/O rather than on calculation.

COMANCHE

The compiler runtime I/O management system used here is called Comanche (an acronym for Compiler Managed Cache) [8]. It is a compiler combined with a user-level runtime system that effectively replaces virtual memory management by allowing direct control over which pages are retained in the active memory set. It is written in the C programming language and uses the standard I/O library calls.

The Comanche runtime system supports a simple application program interface (API). The standard entity in the Comanche runtime system is a two-dimensional array. Higher dimensions can be supported, or else the accesses can be translated into two-dimensional accesses. The runtime system supports APIs to map and unmap files of arbitrary two-dimensional data. In these experiments, data files are constructed and filled with random double-precision floating-point values. Data are assumed to be in row-major layout on disk.
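For concreteness, a test file of this kind can be produced with ordinary C library calls alone. The sketch below is illustrative rather than part of Comanche; the file name is arbitrary, and the dimensions are chosen to match the experiments reported later.

/* Illustrative sketch (not part of Comanche): write a 3600 x 3600
   matrix of random doubles in [-1, +1] to disk in row-major order. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int nrows = 3600, nelems = 3600;
    FILE *fp = fopen("matrix.dat", "wb");   /* file name is arbitrary */
    if (fp == NULL)
        return 1;
    for (int i = 0; i < nrows; i++) {
        for (int j = 0; j < nelems; j++) {
            double v = 2.0 * rand() / RAND_MAX - 1.0;
            fwrite(&v, sizeof v, 1, fp);
        }
    }
    fclose(fp);
    return 0;
}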

Two structures are declared in the Comanche system, modified here to allow a section of a row, rather than the whole row, to be attached to a buffer. One (subrow_s) is a block structure used to buffer a subrow of data; the other (array_s) holds the information about the matrix on disk and has buffers to hold several subrows in memory. Comanche provides several other functions; the two major ones, comanche_attach and comanche_release, perform the subrow mapping.

#include <stdio.h>   /* for FILE */

/* Buffer holding one subrow of the on-disk matrix. */
typedef struct subrow_s {
    double *data;        /* the buffered subrow elements        */
    int     wflag;       /* write flag: modified since attach?  */
    int     refcnt;      /* reference count of attaches         */
    int     subrowindex; /* index of this subrow in the matrix  */
} Row;

/* Descriptor of a matrix on disk, with its in-memory subrow buffers. */
typedef struct array_s {
    int   nrows, nelems, elesize;   /* rows, elements per row, element size  */
    int   subrows, nsubelems;       /* subrows per row, elements per subrow  */
    FILE *fp;                       /* backing data file                     */
    long  offset;                   /* offset of the matrix within the file  */
    char *name;                     /* file name                             */
    int  *map;                      /* maps subrow indices to buffer slots   */
    int   nbuffs, buffsize;         /* number of buffers and their size      */
    Row **buffers;                  /* the resident subrow buffers           */
    int   victim, wflag, total_mem; /* replacement victim, write flag, memory */
} Matrix;

Figure 1 Data Structures of the API

When a data set is too large to fit into memory, VMM maps the array into a one-dimensional vector and cuts that vector into fixed-size pages; Comanche instead takes a row of the array and cuts it into subrows. The goal is that only useful data are read into the buffers. In this way the limited buffers store more useful data, reducing dead weight and decreasing page faults. Through the functions provided by Comanche, the I/O behavior is under the control of the resulting out-of-core program.
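The exact translation is defined in [8] and is not reproduced here; the fragment below is only a plausible sketch of how a logical element (i, j) could be mapped onto a subrow index and an offset, using the nelems and nsubelems fields of the Matrix structure in Figure 1 (m, i, and j stand for a Matrix pointer and a logical element position).

/* Plausible sketch (not Comanche's actual code) of translating a
   logical element (i, j) into a subrow index and an offset within
   that subrow, assuming nelems is a multiple of nsubelems. */
int subrows_per_row = m->nelems / m->nsubelems;
int subrow_index    = i * subrows_per_row + j / m->nsubelems;
int offset          = j % m->nsubelems;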

The comanche_attach function tells the runtime system that a block is to be mapped into memory and returns an address pointing to the cached data; comanche_release informs the runtime system that the block is no longer needed. The runtime system will not reclaim attached memory as long as it remains attached, no matter how much time has passed since the buffer was last referenced. The out-of-core program manages the lifetime of mapped data and ensures that the attach operations do not over-fill the available memory before the data's subsequent release.
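The exact signatures of comanche_attach and comanche_release are given in [8] and are not reproduced here; the sketch below assumes plausible forms of them simply to show the discipline an out-of-core program follows: attach a subrow before touching it, release it as soon as it is no longer needed. Here m, first_subrow, and last_subrow are placeholders for the matrix and the subrow range being processed.

/* Usage sketch with assumed signatures (the real API is in [8]):
   double *comanche_attach(Matrix *m, int subrow);
   void    comanche_release(Matrix *m, int subrow);               */
double sum = 0.0;
for (int s = first_subrow; s <= last_subrow; s++) {
    double *data = comanche_attach(m, s);   /* pin subrow in memory  */
    for (int k = 0; k < m->nsubelems; k++)
        sum += data[k];
    comanche_release(m, s);                 /* buffer may be reclaimed */
}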

TESTING

Consider a square on which there are many equal-sized “tall and lean” rhombuses as illustrated in Figure 2.

Figure 2 Rhombus

This example task traverses each rhombus, calculates the sum of all the elements on it, and writes the total to a file. Here the side of the square is many times larger than the rhombus's width, so each rhombus strides across all the rows but straddles a very limited number of columns. The traversal begins at the apex element of the first rhombus and proceeds row by row until it reaches the last row of the square, adding up all the elements that lie on the rhombus. After finishing the first rhombus, it must go back to the first row of the square, choose the next element (the apex of the second rhombus), and repeat the whole process until the rightmost rhombus is finished. An important observation is that the traversal needs almost every row, even if just for one or two elements, and must then go back and reference other elements of those rows again after finishing the previous pass.

The rotation method for optimizing the rhombus problem allocates two groups of subrows in the buffers, assuming that the width of the rhombus is smaller than the length of a subrow. The purpose is that, at any time, the elements of each rhombus are in the buffers. As illustrated in Figure 3, for each row on which the rhombuses have elements there are two groups of subrows. They keep moving forward toward the next column while rotating their positions. Whenever the rhombus reference is about to move out of subrow #1 into subrow #2, subrow #1 is shifted to the position right after subrow #2, and similarly for the subrows on the next rows, such as #3, #4, #5, #6, and so on. Every row on which rhombuses have elements repeats the same process. This movement gives the illusion that each rhombus is in the buffers all the time, since the buffers move along as the rhombus references move. The buffer movement is always one step ahead of the rhombus reference movement, which guarantees that the buffers can always supply the data that are about to be referenced. At the end of the program, all subrows in the buffers are released. A code sketch of this rotation follows Figure 3.

Figure 3 Rhombus Optimization
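Using the assumed attach/release signatures from the previous section, the rotation for a single matrix row might look like the fragment below. Everything here is a sketch, not Comanche's actual code: j0 and j1 stand for the first and last columns the rhombus touches in this row, row_base for the index of the row's first subrow, and sum accumulates the rhombus total.

/* Rotation sketch for one matrix row (assumed signatures as before).
   Two subrows stay attached; when the reference column j moves past
   the left one, it is released and the next subrow ahead is attached,
   keeping the buffers one step ahead of the references. */
int left  = j0 / m->nsubelems;
int right = left + 1;
double *a = comanche_attach(m, row_base + left);
double *b = comanche_attach(m, row_base + right);
for (int j = j0; j <= j1; j++) {
    if (j / m->nsubelems > left) {               /* reference left subrow 'left' */
        comanche_release(m, row_base + left);    /* rotate the stale subrow out  */
        left = right;
        right++;
        a = b;
        b = comanche_attach(m, row_base + right);
    }
    sum += (j / m->nsubelems == left ? a : b)[j % m->nsubelems];
}
comanche_release(m, row_base + left);
comanche_release(m, row_base + right);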

Experiments

The algorithm discussed in the previous section was implemented in an out-of-core program written in C. The C compiler generates working code using the Comanche runtime system. A set of experiments was run on a double-precision matrix of size 3600 x 3600 for the rhombus application. The total data set size is 3600 x 3600 x 8 bytes, approximately 103 MB. Data files were initialized with random values from the interval (-1, +1).

For each test case, the result shown in Table 4 is the average of five runs. For the multitasking case, two applications were executed simultaneously. The time for executing the whole program is measured in seconds. The ratio is the time of the virtual-memory version divided by the time of the Comanche version; values greater than one favor Comanche, while values less than one favor VMM. Table 4 shows that system performance improves dramatically under Comanche.

|Execution Type |Data Set Size (MB) |Virtual Memory (s) |Comanche (s) |Ratio (V/C) |
|Out-of-core    |103                |287.4              |28.6         |10.05       |
|Multitasked    |103                |619.1              |28.9         |21.42       |

Table 4 Rhombus out-of-core and multitasking experiments

Conclusion

Data sets arising in scientific and engineering programs are often too large to fit into the available memory, so data must be transferred to and from secondary storage, such as disk. These I/O transfers can quickly dominate the execution time of the program, so it becomes critically important to minimize the I/O between main memory and disk. The main issue with VMM is its lack of information about the application's actual memory needs and access patterns; as the application continues to fault, the VMM system starts thrashing. Comanche uses the source code to determine the maximum amount of data needed at any moment in time; this becomes the working set for the application.

This paper describes one of many test cases. Optimization techniques based on the modified Comanche API data structures were applied. The experimental data demonstrate that the Comanche system achieves much better I/O performance than VMM, especially when several out-of-core applications are multitasked.

Comanche could be enhanced by using mathematical expressions to represent data access patterns and the assumptions underlying each pattern. An obvious goal is the ability to choose an optimization method, and to determine how to apply it, based on the mathematical expression defining the access pattern.

REFERENCES

1. Brezany, P., Choudhary, A., and Dang, M. Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors. Proceedings of LCR98: 4th Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, Pittsburgh, PA, May 1998.

2. Corbett, P., Feitelson, D., Fineberg, S., Hsu, Y., Nitzberg, B., Prost, J., Snir, M., Traversat, B., and Wong, P. Overview of the MPI-IO Parallel I/O Interface. 3rd Workshop on I/O in Parallel and Distributed Systems, IPPS '95, Santa Barbara, CA, April 1995.

3. Kotz, D. Multiprocessor File System Interfaces. Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems, 1993.

4. Leiss, E. L. Out-of-core Programming: Memory and I/O Management, Annual Progress Review 6, W. M. Keck Research Computation Lab, Univ. of Houston, 1990.

5. Leiss, E. L. Parallel and Vector Computing: A Practical Introduction. McGraw-Hill, Inc. New York, 1995.

6. The Scalable I/O Low-level API: A Portable Programming Interface for Parallel File Systems. Presentation at Supercomputing '96, Philadelphia, PA, 1996.

7. High Performance Computing and Communications: Grand Challenges 1993 Report. A Report by the Committee on Physical, Mathematical, and Engineering Sciences, Federal Coordinating Council for Science, Engineering, and Technology.

8. Robinson, E. M. Compiler Driven I/O Management. Ph.D. Dissertation, Department of Computer Science, University of Houston, 1998.

