ELEG 652 Principles of Parallel Computer Problem Set # 3



ELEG 652 Principles of Parallel Computer Problem Set # 3

Issued: Monday, October 30, 2005

Due: Monday, November 20, 2005

Please begin your answer to every problem on a new sheet of paper. Be as concise and clear as you can. Make an effort to be legible. To avoid misplacement of the various components of your assignment~ make sure that all the sheets are stapled together. You may discuss problems with your classmates, but all solutions must be written up independently.

The POSIX Threads Laboratory

Exercise 0:

For this lab, you will need the following:

• Make sure that you have access to stimpy.eecis.udel.edu. If you don’t, then you can use the machine called Copland (copland.udel.edu) but be aware that your results may be a little bit skewed

• Change to the bash shell

• Get a small sample of a Pthread program. It can be downloaded with this lab’s files.

• Type cc –xO3 sample2.c –o sample -lpthread

• Run the program by typing “./sample”. If you receive “The sum is: 210 on 4 threads” as output, then you are done with this step.

Exercise 1:

The Systolic Parallel Exercise

Introduction

In Computer Science, there are many types of data structures. The tree data structure can be considered as one of the most interesting one. A tree in computer science can be represented in many forms. The most common way to represent a tree data structure is with a recursive structure with two or more children. However, the simplest representation of a tree is usually a linear array. In this case, a tree can be recovered from an array with the following formulas:

[pic]

Figure 1: Formulas to recover a tree from a Linear Array.

Where the “ith” array element represents a node in your tree data structure and its children can be recovered with the above formulas. An example of how the logical structure of a tree would look like when it is recovered from the array is given by figure 2.

[pic]

Figure 2: The process of recovering a tree from a linear array.

Main Exercise

Now, using this simple representation of a tree, we are going to implement a very simple tree operation on all its nodes. Each node will update its value by adding its old value with its parent’s. This means that the root will not update its value (since it doesn’t have a parent), but its left and right children will inherent the root’s value and add it to their own. This process continues down the tree in a level by level fashion. This process is illustrated in figure 3.

[pic]

Figure 3: Creating a wave of updates down the tree.

This process (the update of all nodes in the tree) happens N times in a loop. You need to parallelize this process with Pthreads[1]. The following pseudo code is provided:

[pic]

Figure 4: Pseudo code for updating the nodes of the array / tree structure. N is the amount of repetitions and SIZE is the number of elements in the array.

Since you need to take advantage of parallelism, consider very carefully how this code works and its dependencies (i.e. you cannot update the child of a parent until that parent has been updated by its parent and so on). You also have to consider that you need to make the code “fair” for all threads. Moreover, this code implies a special “order”[2] to the operations that synchronization (mutex based) can assure but it is very difficult to achieve.

1. Code this method with Pthreads API

2. Compare the performance of the parallel and serial versions with different number of threads and data sizes.

Normal Exercises

1. The goal of this assignment is to gain experience in fine tuning code for performance on a particular platform. At this point we are still working with single processors.

The link is where you will go to retrieve files used in this assignment. In this link, you should download the files matrix.c.

Your assignment, basically, is to make this code run as fast as possible on one platform: a Sun Ultra-10. You should consider both compiler switches and code changes. Write a report describing your efforts. You should show evidence of a comprehensive search, e.g., describe each optimization or code change you try and why you think it will improve performance, and then show the actual results.

First, compile the code with base optimization, using cc (Sun's compiler) on the Sun. You are also encouraged to try GCC.

Experiment with various compiler options. See which combination produces the best results. For SUN's cc compiler, the options are available in the materials you can get from TA. If you also determine to try gcc, you should read the man page of gcc.

Now, experiment with modifying the code itself, e.g., to make better use of the memory hierarchy. Again, try out various compiler options to improve your results. You are required to get at least 180 MFLOPS for matrix multiply of 1024 by 1024 size matrix on the SUN workstation. To get higher performance than 280 MFLOPS is encouraged. And please put the highest number of MFLOPS you get on the cover of your homework report.

In order to make meaningful comparisons of results, I would like you to produce your numbers on one machine types only: the Suns in 133.

The machines in 133 are:

Magnemite pikachu lickitung wigglytuff magneton zapdos

rattata snorlax jolteon electabuzz raichu jigglypuff

tauros electrode voltorb pidgey

All hostnames end with .acad.ece.udel.edu (e.g., pidgey.acad.ece.udel.edu).

You can SSH into these machines from any other machine. Be sure to type ``finger'' to see if anyone else is on the machine. Try to find unoccupied machines as this will give more meaningful numbers (and you will get them faster).

For completing this assignment, you should read the materials you copied from TA, and you may need to read the contents about memory hierarchy in textbooks.

2. Consider the following conditions proposed as sufficient conditions for SC:

• Every processor issues memory requests in the order specified by the program.

• After a read or write operation is issued, the issuing processor waits for the operation to complete before issue its next operation.

• Before a processor Pj can return a value written by another processor Pi,, all operations that were performed with respect to Pi before it issued the store must also be performed with respect to Pj. (In other words, All operations “visible” to Pi must also be “visible” to Pj)

Are these conditions indeed sufficient to guarantee SC executions? If so, say why? and say why the conditions that were listed in the chapter are indeed sufficient (Culler, 289) in this case.

3. Below is a multiprocessor system M with:

• Two processor P1 and P2;

• Memory location x;

• Registers R1, R2 in P2;

• Initial Conditions: x = 0;

• The following program P is executing on M:

P1 P2

x ( 1 R1 ( x

x ( 2 R2 ( x

Below is a list of all potential combinations of values that R1 and R2 may have after the program P finish its execution:

R1: 000111222

R2: 012012012

But of course not every combination in the list is legal under a specific memory architecture/model.

Assuming that multiprocessor M follows SC model, Mr. Y makes the following claim

R1: 000111222

R2: 012012012

A : oxooxoxoo

Where “o” is possible and “x” is impossible

a. Do you agree? Why or Why not?

b. Can you outline your claim? (give the cases that are possible and impossible under SC model)

4. The Multiprocessor M and the program P are the same for this part. However, the memory model description is different. Please list all the possible cases under the following conditions:

• The network between processor and memory is such that two write operation from the same processor can not arrive out of order

• The network between processor and memory is such that two read operations from the same processor can not arrive out of order.

5. The MESI question: A multiprocessor system is executing the following code:

[pic]

Processor 1 will execute 6 instructions and Processor 2 will execute 7 instructions at the same

time. The following assumptions about the system are given:

a) All loads and stores have no delay so they execute in one cycle each.

b) Processor 1 and 2 executes instructions at the same frequency with no interruption from OS or other programs. Thus, instructions in processor 1 will execute in the same cycle as processor 2. For example, the load instruction (LD) from Processor 1 will execute at the same time as the move instruction (MOV) from Processor 2 in cycle 1.

c) Operations in registers DO NOT change the state of cache only load and stores can do that.

d) Load and stores (and pretty much all operations in these codes) are considered atomic with respect to each other.

e) All cache lines and blocks are initially in an invalid state (I).

Your job in this question is to fill the following table with the correct information about the state of the cache in Processor 1 and Processor 2 during the execution of these codes.

[pic]

In this table, “Pn” (where n is either 1 or 2) represent the processor. The “Curr OP” and “Next OP” are the current instruction and next instruction respectively. The “Status” column is the state of the cache line housing the variable AFTER the current instruction has finished. The “Flush?” Column is a simple no or yes column that tells us if the cache line value has been sent to the main memory and who sent this copy (either P1 or P2). An “add R”, “mul R” or “inc R” relates to instructions that operate on registers. The “ld mem” and “st mem” are load and store instructions operating on address _a. You need to fill out the “Status” columns and the “Flush?” column assuming that this machine is using an implementation of the MESI protocol with cache to cache enable.

6. Define the following terms:

a. Determinacy;

b. Well-formed dataflow graphs;

c. Well-behaved ness and well-behaved dataflow graphs;

d. Well-formed dataflow graphs;

7. Synchronization Cost Question: Imagine a machine in which a bus transaction to memory takes around 145 clock cycles to complete. Now, this machine has 20 processors which have the same frequency. A small program is run such that a lock is trying to be accessed. Times for cache misses, read, loads and the critical region time (in which the lock is held by any given processor) can be safely ignored since they are extremely small. The code for the small code is given below:

loop: ll r5, 0, r1 ; Load Link a word from memory

xor r0, r4, r5; Lock or unlock the variable

; The time of this instruction is

; minimal and shouldn’t be added

; to the calculation

sc r0, 0, r1 ; Store conditional the value if it

; hasn’t changed

bne- loop ; Loop back if the Store

; conditional failed

• Assuming that you don’t cache the copies of the lock, provide the overhead (in cycles) of all processors trying to access this lock at the same time. Explain carefully this process and how this can be calculated with a formula. Your formula should assume the use of load linked and store conditional for checking the state of the lock.

• Now give a short explanation of how this can be improved if the lock is keep spinning in cache instead of going to main memory to fetch it.

8. Construct a static dataflow graph to compute the following conditional expression:

1. if x+y > 0 then x*x+y*y else x*(x-y) endif

2. if x > 5 then x*x

else if x > 0 then x + 5

else x -7

endif

-----------------------

[1] Hint: Consider that levels are independent after an iteration of the tree traversing loop (which means after level i has finished and level i+1 has begun, level i-1 can begin to work on the elements of the next iteration of the N loop) and that all sub-trees are not dependent to each other.

[2] Hint: Inter-thread / Process Communication is achieved in POSIX threads by the use of conditional variables; signal and wait).

-----------------------

Processor 1: (a2+1)

LD R1, _a

MUL R6, R1, R1

ADD R1, R1, 1

ST R1, _a

INC R3

LD R7, _a

Processor 2: (a+2)a

MOV R2, 5

ST R2, _a

INC R2

INC R2

LD R6, _a

MUL R2, R2, R6

ST R2, _a

2*i + 1

2*i + 2

i

Left Child

Right Child

Parent / Array index

0

1

2

3

4

5

6

7

8

9

Original Array

0

1

2

3

4

5

6

7

8

9

0

1

2

3

4

5

6

7

8

9

0

1

2

4

5

7

8

7

8

9

0

1

2

4

5

7

8

11

12

14

0

1

2

4

5

7

8

11

12

14

Array after 1 time step

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download