An Exploration of the MPEG algorithm using Latency ...



An Exploration of the MPEG Algorithm using Latency Insensitive Design

By Trevor Meyerowitz {tcm@eecs.berkeley.edu}

Mentored by: Luca Carloni {lcarloni@eecs.berkeley.edu}

EE249 Final Report

Professor Alberto Sangiovanni-Vincentelli {alberto@eecs.berkeley.edu}

December 10, 1999

Abstract:

This is the partial exploration of the latency insensitive design techniques developed by Luca Carloni to the application of MPEG-2 Decode. Upon further examination this has gone from a simple case study in hardware design to a difficult and interesting hardware-software partitioning problem.

1. Background:

The 1998 International Technical Roadmap for Semiconductors [1] predicts that in 2005 the average microprocessor will contain 200 million transistors fabricated in 0.1 micron technology with a chip area of roughly 550 mm2. In a chip this size the global clock, running at 2 GHz, will require multiple cycles to cross the chip, making interconnect delay increasingly important. Additionally, SOC (System On a Chip) systems will become more prevalent in the future. These designs are often constructed connecting different IP (intellectual property) blocks. Given the interconnect delays and the unfamiliarity with the IP blocks, an approach that reliably handles the interconnect delay is needed. One such approach is that of correct by construction [2] utilizing the relay stations. The relay station idea has been implemented and has been tested on a microprocessor, but more tests on a data-oriented problem are needed. MPEG is an ideal algorithm because it is a highly data-dependant algorithm and is used in the popular applications DVD and HDTV.

1.1 Relay Stations

The main idea behind latency insensitive design is to provide a rigid structure to allow signals to propagate across the chip in more than one clock-cycle in a predictable fashion. The IP-cores are surrounded by logic that allows them to stall and to stall other blocks. These blocks are connected by an arbitrary number of data storage elements known as relay stations (see Figure 1[1]). These stations have 3 inputs: the input data, a signal that indicates if it’s void, and a stop signal from the block following it. If the stop signal is received then the block stalls and transmits a stop signal to the previous block. If data is void or the block is stalled then the void output is transmitted as true. The cores are surrounded by structures called shells that facilitate the interaction of the blocks with the relay stations. The shells are written to ensure that if they are waiting for a non-void input or are receiving a stop signal. The shells stall the cores by gating the clock when it receives the stall signal. Using this methodology ensures that blocks don’t execute until all of the inputs have been received, effectively making the blocks in the system insensitive to latency.

1.2 MPEG-2 Decode [3]

The MPEG-2 standard is increasing in importance as bandwidth and storage have each reached the point where video can be distributed digitally. The DVD and HDTV standards are based around the MPEG-2 standard. The standard allows for a variety of video resolutions, color depths, and compression schemes. To provide backward compatibility, the standard provides support for interlaced video, and the MPEG-1 standard is a subset of MPEG-2.

Aside from the header, the file is broken up into 4 types of blocks: D-Blocks, I-Blocks, P-Blocks, and B-Blocks. The D-Blocks contain non-image data. The I-Blocks contain full images. The P and B blocks respectively contain images dependant upon previous and future blocks. For the following explanation please refer to Figure 2 below[2]. After reading the header, the decoder sets about decoding the frames in 8 by 8 slices. First the Run-Length and Variable-Length Coding is decoded, which boils down to a de-clustering the zeroes and performing an inverse Huffman decode. Next the inverse quantize is applied, which multiplies the matrix by a weight constant and then by a weighting matrix. Next the block is reordered using the inverse zigzag function. Then an IDCT (Inverse Discrete Cosine Transform) is performed on the block. Finally the slices are arranged into the block, which is then fed into the addition unit that performs motion estimation, and combines the block with its dependant blocks. All of this is repeated until the movie has been fully decoded.

The decoder is much simpler than the encoder because it only decodes the data, and doesn’t decide the best compression scheme. Additionally, the encoder must include a decoder to properly encode the P and B blocks. Given the wide variability and additional complexity of the encoder, only a decoder was considered for the purposes of this project.

2. Implementation

2.1 Reference Models

We referred to 2 separate models as a basis for this project. The first model is a hardware-software implementation of an MPEG encoder designed by the Nippon Telegraph and Telephone group [5]. It provides a good background on the MPEG model as well as an indication of a reasonable partitioning between hardware and software and what the performance requirements are for each block. The partitioning information was soon found to be incredibly useful.

We then used a C program implementation of an MPEG-2 decoder written by the MPEG Software Simulation Group [6], as our baseline model to translate into hardware. Some of the code was a relatively straightforward translation into hardware, whereas other parts used multiple pointers and memory accesses implying that they are best suited for software.

2.2 Partitioning

Currently the 3 most suitable blocks have been implemented in Verilog [4]. These blocks are the IDCT, Inverse Zigzag, and the Inverse Quantization. All of these share a fixed input size and function. The only other block that might be straightforward to implement is the inverse VLC/RLC block. The other blocks extensively used pointers and memory access,

2.3 Verilog Implementation

Each of these blocks was derived directly from the software code. Then each was tested against the corresponding code to ensure correctness. The only significant obstacle in this area was overcoming the differences in how Verilog and C handle negative numbers and shift operations. Once this was finished the blocks were connected and then there operation was tested. Then a shell was written for each block and tested. Finally the three blocks were wrapped in their shells and connected with relay stations in the manner shown in Figure 3. The execution times of the non-relay station and relay station systems are shown in Figure 4. This shows an initial overhead of 4.5 cycles (1 cycle for each relay station and 0.5 for each shell). Then for each block after that there is an overhead of 1 cycle. There shouldn’t be any overhead here. The additional overhead is the result of a currently unresolved coding error.

| |# clock cycles for 1 block of|# clock cycles for 2 blocks |

| |data |of data |

|W/O RS’s |1.5 |2.5 |

|W/ RS’s |6 |8 |

Figure 4: Results

3. Future Work and Conclusions:

3.1 Future Work

Much remains to be done on this project. In the immediate future, the current work should be integrated with the PLI so that true co-simulation can take place. In addition, the current Verilog needs to be revised to be synthesizeable, and to reflect reasonable performance (current every block completes its task in 1 clock cycle, and all of the data is passed from block to block in 1 clock cycle). Once this is done further performance analysis is warranted to see what the performance penalty of the latency insensitive design is. Finally synthesizing the design to find the area penalty would also be an interesting exercise.

3.2 Conclusions

When this project started it appeared to be a fairly straightforward task of taking a C program and turning it into a Verilog implementation. Upon further examination, I found that it would be both difficult and inefficient to move the entire algorithm into hardware, and that the problem is actually a hardware software co-design problem. This opens up the issue of how the software and hardware should interact, especially how software should interact with the relay station. So, although it was somewhat disheartening not to have finished porting MPEG to hardware, the stage is set for a very interesting case study. One hurdle to this might be simulation time, [7] says that their RTL simulation of the encoding of 1 frame takes about 7 hours, it remains to be seen if MPEG decode will be similarly slow.

4. References:

1. Sematech, The 1998 International Technical Roadmap for Semiconductor, or

2. L.P. Carloni, K.L. McMillan, A. Saldanha and A.L. Sangiovanni-Vincentelli,

A Methodology for Correct-by-Construction Latency Insensitive Design

The Proceedings of the International Conference on Computer-Aided Design, 1999.

3. John Watkinson. MPEG-2. Focal Press, Woburn, MA, first edition, 1999

4. Samir Palnitkar, Verilog HDL: A Guide to Digital Design and Synthesis, SunSoft, Mountain View, CA, 1996

5. T. Knodo et al., Two Chip MPEG-2 Video Encoder, IEEE Micro, Vol. 16, No. 2, Apr. 1996, pp. 51-58

6. MPEG Software Simulation Group, (software program)MPEG-2 decoder, 1996

7. Mitsuo Ikeda et al., An MPEG-2 Video Encoder LSI with Scalability for HDTV based on Three-layer Cooperative architecture. Nippon Telegraph and Telephone Corporation, IEEE 1999

-----------------------

[1] Figure taken from: [1]

[2] Figure taken from [3]

-----------------------

Figure 1. Sketch of Relay Station1

Stop_In

Stop_Out

Data_out,

Void_out

Data_in,

Void_in

control

Figure 3: Implementation

Figure 2: MPEG-2 Decoder Block Diagram2

Memory Bus

Decoded

Picture

Backward Predictor

Forward Predictor

Picture Store (P)

Picture Store (I)

Input Stream

Add Picture

Inverse ZigZag

Inverse DCT

Inverse Quantize

VLC/RLC

Decode

Header

Decode

Relay Station

Relay Station

Relay Station

Inverse DCT

Inverse ZigZag

Inverse Quantize

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download