Programming for the Cell Broadband Engine Architecture on the PlayStation 3

By Faisal Rabbani

Table Of Contents

CBEA - Cell Broadband Engine Architecture

The PPE - Power Processing Element

The SPE - Synergistic Processing Elements

Language Extension Examples for CBEA

Vector Data Types

Signals and Mailboxes

MFC Direct Memory Access

SPU Intrinsic Calls

Working with IBM Cell BE SDK 2.1

Supported Platforms

SDK Components

Installing the SDK

Additional Commands

The SDK and Yellow Dog Linux 5.02 (and PS3)

Updating The GNU C Library

The “Hello World” App

The Next Step

CBEA - Cell Broadband Engine Architecture

The Cell Broadband Engine is the first of a new family of microprocessors conforming to the Cell Broadband Engine Architecture (CBEA). The CBEA is a new architecture that extends the 64-bit PowerPC Architecture. The CBEA and the Cell Broadband Engine are the result of a collaboration between Sony, Toshiba, and IBM (known as STI, formally started in early 2001).

The Cell Broadband Engine microprocessor comprises nine processors operating on a shared, coherent memory. The processors are specialized into two types: the PowerPC Processor Element (PPE) and the Synergistic Processor Element (SPE). The Cell Broadband Engine has one PPE and eight SPEs.

Note that on the PlayStation 3 one of the SPEs is disabled and another is locked, leaving six SPEs available to the programmer. The locked SPE is dedicated to running the Game Operating System, and the disabled SPE improves chip yields (i.e. Cell BE chips do not have to be discarded if one of the SPEs is or becomes defective).

The PPE - Power Processing Element

The PPE is a general-purpose, dual-threaded, 64-bit RISC processor that conforms to the PowerPC Architecture (version 2.02) and supports the Vector/SIMD Multimedia Extension (VMX). Programs written for the PowerPC 970 processor can run on the Cell Broadband Engine without modification.

The PPE can run both 32-bit and 64-bit operating systems and applications. It has a 32 KB L1 instruction cache, a 32 KB L1 data cache, a unified 512 KB L2 cache for instructions and data, and thirty-two 128-bit vector registers. Each vector register can hold 16 8-bit components, 8 16-bit components, or 4 32-bit components.

The SPE - Synergistic Processing Elements

Each of the eight SPEs is a 128-bit RISC processor specialized for computationally intensive, data-rich SIMD applications. An SPE consists of the SPU (Synergistic Processing Unit) and the MFC (Memory Flow Controller).

The SPU deals with instruction control and execution. It includes a single register file with 128 registers (each one 128 bits wide), a unified (instructions and data) 256 KB local store (LS) and a DMA (Direct Memory Access) interface.

The MFC contains a DMA controller that supports DMA transfers. Programs running on the SPU, the PPE, or another SPU, use the MFC’s DMA transfers to move instructions and data between the SPU’s LS and main storage. Each DMA transfer can be up to 16 KB in size, but the SPU can issue DMA-list commands that can represent up to 2048 DMA transfers (each one up to 16 KB in size).

Because the MFC executes DMA commands autonomously while the SPU continues to execute instructions, DMA transfers can be scheduled to overlap with computation and hide memory latency.

Note: storage of data and instructions in the Cell Broadband Engine is big-endian.

Language Extension Examples for CBEA

Vector Data Types

Though the PPU (the PowerPC Processing Unit of the PPE) has built-in support for VMX (Vector/SIMD Multimedia Extension) instructions, certain vector types are available only on the SPU and others only on the PPU. The CBEA language extensions support the following vector declarations for the SPU, the PPU, or both:

• vector unsigned char: 16 8-bit unsigned chars (both)

• vector signed char: 16 8-bit signed chars (both)

• vector unsigned short: 8 16-bit unsigned half-words (both)

• vector signed short: 8 16-bit signed half-words (both)

• vector unsigned int: 4 32-bit unsigned words (both)

• vector signed int: 4 32-bit signed words (both)

• vector unsigned long long: 2 64-bit unsigned double-words (SPU only)

• vector signed long long: 2 64-bit signed double-words (SPU only)

• vector float: 4 32-bit single-precision floats (both)

• vector double: 2 64-bit double-precision floats (SPU only)

• vector bool char: 16 8-bit booleans, 0 (false) or 255 (true) (PPU only)

• vector bool short: 8 16-bit booleans, 0 (false) or 65535 (true) (PPU only)

• vector bool int: 4 32-bit booleans, 0 (false) or 2^32 - 1 (true) (PPU only)

• vector pixel: 8 16-bit unsigned half-words, 1/5/5/5 pixel format (PPU only)

Both PPU VMX instructions and SPU vector instructions are supported by C/C++ language extensions that define vector data types and vector intrinsics (intrinsics are commands in the form of C-language function calls mapped to one or more inline-assembly instructions).

However, these extensions differ between the PPU and the SPU. For example, given vectors v1, v2, and v3 of the same data type:

1) The vector addition intrinsic (which supports short, int and float) on the PPU looks like:

• v3 = vec_add( v1, v2 )

And the addition intrinsic on the SPU looks like:

• v3 = spu_add( v1, v2 )

2) Whereas the general vector multiply intrinsic (which operates on float and double vectors) exists only for the SPU, in the form:

• v3 = spu_mul( v1, v2 )

Signals and Mailboxes

In addition to DMA transfers, the PPE can use signals to send information to an SPE, and mailboxes to exchange information with the SPEs in both directions.

Signals are information sent on the signal-notification channels. These channels are inbound (to an SPE) registers only. Each SPE has two 32-bit signal-notification registers, each of which has a corresponding memory-mapped I/O (MMIO) register into which the signal-notification data is written by the sending processor.

Mailboxes are queues in an SPE’s MFC used for exchanging 32-bit messages between the SPE and the PPE or other devices. Two mailboxes (the SPU Write Outbound Mailbox and the SPU Write Outbound Interrupt Mailbox) are provided for sending messages from the SPE. One mailbox (the SPU Read Inbound Mailbox) is provided for sending messages to the SPE.

In the following example, the SPU reads the channel count register to check for inbound mailbox messages before invoking a (blocking) read command to read the message from the register. The SPU then writes to the outbound channel the same message incremented by one:

if( spu_readchcnt( SPU_RdInMbox ) )
{
    unsigned int var = spu_readch( SPU_RdInMbox );
    spu_writech( SPU_WrOutMbox, ++var );
}

MFC Direct Memory Access

SPEs rely on asynchronous DMA transfers to hide memory latency and transfer overhead by moving information in parallel with synergistic processor unit (SPU) computation. In the following code, the SPE issues a DMA GET command to receive 4 KB (4096 bytes) of information, and performs computations while waiting for the DMA transfer to complete. (The SPU receives the effective address via mailbox.)

unsigned int tag_id = 31;

unsigned int tag_mask = 1 << tag_id;

...

    if( spe_context_run( p->ctx, &p->entry, p->runflags,
                         p->argp, p->envp, &p->stop_info ) < 0 )
        puts( "Failed running context" );

    pthread_exit( NULL );
}

int main()
{
    int i, num_spus;

    /* Determine number of SPEs */
    num_spus = spe_cpu_info_get( SPE_COUNT_USABLE_SPES, -1 );
    if( num_spus > MAX_SPUS )
        num_spus = MAX_SPUS;

    for( i = 0; i < num_spus; i++ )
    {
        /* Create SPE context */
        thread_args[i].ctx = spe_context_create( 0, NULL );
        if( NULL == thread_args[i].ctx )
        {
            puts( "Failed creating context" );
            return 0;
        }

        /* Load program into context */
        if( spe_program_load( thread_args[i].ctx, &spu_main ) )
        {
            puts( "Failed loading program" );
            return 0;
        }

        thread_args[i].entry = SPE_DEFAULT_ENTRY;
        thread_args[i].runflags = 0;
        thread_args[i].argp = NULL;
        thread_args[i].envp = NULL;

        /* Create thread for each context */
        if( pthread_create( &threads[i], NULL,
                            &thread_function, &thread_args[i] ) )
        {
            puts( "Failed creating threads" );
            return 0;
        }
    }

    /* Wait for each thread to join/terminate */
    for( i = 0; i < num_spus; i++ )
    {
        if( pthread_join( threads[i], NULL ) )
            puts( "Failed joining threads" );
    }

    puts( "\nThe program has successfully executed.\n" );
    return 0;
}

7) Copy the following source into a file labeled Makefile and save it in this directory as well:

DIRS := spu

PROGRAM_ppu := main

IMPORTS := spu/lib_spu_main.a -lspe2 -lpthread

include /opt/ibm/cell-sdk/prototype/make.footer

8) Compile and run using the following commands:

• make

• ./main

The output of the “make” command will look similar to this:

make[1]: Entering directory `/home/celluser/Projects/hello/spu'

/usr/bin/ccache /opt/cell/toolchain/bin/spu-gcc -W -Wall -Winline -Wno-main -I. -I /opt/cell/sysroot/usr/spu/include -I /opt/cell/sysroot/opt/cell/sdk/usr/spu/include -O3 -c spu_main.c

/usr/bin/ccache /opt/cell/toolchain/bin/spu-gcc -o spu_main spu_main.o -Wl,-N

/opt/cell/toolchain/bin/ppu-embedspu -m32 spu_main spu_main spu_main-embed.o

/opt/cell/toolchain/bin/ppu-ar -qcs lib_spu_main.a spu_main-embed.o

make[1]: Leaving directory `/home/celluser/Projects/hello/spu'

/usr/bin/ccache /opt/cell/toolchain/bin/ppu32-gcc -W -Wall -Winline -m32 -I. -I /opt/cell/sysroot/usr/include -I /opt/cell/sysroot/opt/cell/sdk/usr/include -mabi=altivec -maltivec -O3 -c main.c

/usr/bin/ccache /opt/cell/toolchain/bin/ppu32-gcc -o main main.o -L/opt/cell/sysroot/usr/lib -L/opt/cell/sysroot/opt/cell/sdk/usr/lib -m32 -Wl,-m,elf32ppc -R/opt/cell/sdk/usr/lib spu/lib_spu_main.a -lspe2 -lpthread

And the output of the “./main” command will look like:

Hello Cell (0x1820008)!

Hello Cell (0x1820688)!

Hello Cell (0x1820900)!

Hello Cell (0x1820b78)!

Hello Cell (0x1820df0)!

Hello Cell (0x1821068)!

The program has successfully executed.

The SPU code is self-explanatory and the makefiles are simple. The PPU code is multi-threaded. Because the PPU invokes each SPU program through a blocking function call (“spe_context_run()” blocks the calling thread until the SPE program terminates), each instance of the SPE code in this example is invoked from a separate thread.

In this program, the PPE determines the number of available SPEs by calling “spe_cpu_info_get()”. For every available SPE, the PPE creates a context using “spe_context_create()”, loads the externally linked SPE program into the context using “spe_program_load()”, and then runs the context in a separate thread by calling “spe_context_run()”. These functions are declared in the “libspe2.h” header file.

Additionally, the PPE sample code can access PPU intrinsics by including the appropriate header files in the PPE source file:

#include // for low-level OS instructions

#include // for VMX intrinsics

And the SPE sample code can access SPU intrinsics by including the following header files in the SPE source file:

#include // for vector/channel manipulation

#include // for MFC composite intrinsic calls

The Next Step

IBM’s developerWorks website contains a wealth of CBEA resources, such as tutorials, articles, technical documents, and sample code, all available for free.
