University of Utah



The Face Replace Story

FaceReplace was the final project for the University of Utah’s Fall 2010 offering of PPCP: Practical Parallel and Concurrent Programming. The course is designed to teach students how to write parallel and concurrent programs using the .NET Framework and the C# and F# languages, with an emphasis on the language extensions and libraries added in .NET 4.0.

FaceReplace demonstrates several methods of parallel/concurrent programming with the .NET Framework (especially the TPL, or Task Parallel Library, and the Pipeline and Producer/Consumer patterns). It is also a useful example for learning the tools available for developing and testing PC (parallel/concurrent) programs, such as Alpaca and the tools included with Visual Studio 2010 (such as the profiler and debugger).

The main topics discussed in this paper are tasks, threads, and the concurrent pipeline stages of the algorithm. Later sections cover what was found by Alpaca and the Visual Studio profiler.

Figure 1-- the FaceReplace GUI

What the application does

On the application window, there are two picture boxes to load images into. The left image is the source image, which contains one or more faces. On the right is the target image, also containing one or more faces; this is where the new faces from the source image will be drawn. Before any drawing occurs, every face in the source image must be compared to every face in the target image, and the closest matching source faces are drawn onto the target image. There is no limit on the number of faces, which are defined by rectangles; more faces means more work for each task and more concurrency. Rectangles can either be loaded from a file or created by clicking and dragging on the image. The fine details of the algorithm will be discussed later, but this is a high-level view of how it works.

This would be a fairly simple algorithm to code sequentially, but getting it all to work concurrently with tasks is more of a challenge.

Tasks and Threads

Before we dive into the guts of the algorithm, let's discuss some of the individual parts in more detail. First up, threads and tasks. We never directly create our own threads in the FaceReplace application, but a general understanding of threads is needed to grasp what a task is and how tasks work. When the application is run, a process is started that begins executing the code. Processes have their own memory space and are separate from one another. Inside a process are threads: sets of executable instructions. Each thread shares the memory space of the process it runs in, but has its own stack. At any given instant, a processor can be executing at most one thread, so multiple processors means multiple threads can run simultaneously.

A task represents executable code and is executed by a thread, but before this execution can begin, the task must be assigned to a thread. This is handled by the TaskScheduler class, provided by the .NET Framework; unless otherwise coded, the default task scheduler is used. When a task is started, it does not immediately begin executing. It is first added to the scheduler's queue of tasks and then assigned to an available thread; once that thread runs, the task is executed. If no threads are available, the task must wait until one becomes free, which happens when a thread finishes executing its current task. Threads that can have tasks assigned to them are found in a thread pool, conveniently provided by the ThreadPool class. By default, tasks are executed by threads from this pool, and depending on the number of tasks, the pool automatically creates and destroys threads for optimization.
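The flow just described can be sketched in a few lines. This is a minimal example we wrote for illustration (the class and variable names are ours, not part of FaceReplace): starting a task queues it with the default scheduler, which hands it to a thread-pool thread.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class TaskDemo
{
    static void Main()
    {
        // Starting a task queues it with the default TaskScheduler,
        // which assigns it to an available thread-pool thread.
        Task<int> t = Task.Factory.StartNew(() =>
        {
            // This lambda runs on a ThreadPool thread, not the main thread.
            return Thread.CurrentThread.IsThreadPoolThread ? 42 : -1;
        });

        // Reading Result blocks the calling thread until the task finishes.
        Console.WriteLine(t.Result); // prints 42
    }
}
```

Note that the main thread never creates or manages a thread itself; the scheduler and pool do all of that behind the scenes.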

Producer-Consumer Pattern with Tasks

Our face replace algorithm uses basic producer-consumer and pipeline patterns. Using the BlockingCollection class together with tasks makes these patterns relatively simple to design. A BlockingCollection is essentially a thread-safe list: only one thread can modify it at a time, which guarantees that an item in the collection can be removed at most once even when the collection is accessed concurrently by multiple threads. Let's look at some code that implements a very general producer-consumer pattern using tasks and a BlockingCollection.

static void Main(string[] args)
{
    BlockingCollection<string> items = new BlockingCollection<string>();
    int numOfConsumers = 2;
    var tasks = new Task[numOfConsumers + 1];

    // Producer: adds 1000 strings, then marks the collection complete.
    Task task1 = Task.Factory.StartNew(() =>
    {
        for (int i = 0; i < 1000; i++)
        {
            items.Add("String Number " + i);
            Console.WriteLine("Added String Number " + i);
        }
        items.CompleteAdding();
    });
    tasks[0] = task1;

    // Consumers: take items until the collection is marked complete.
    for (int i = 0; i < numOfConsumers; i++)
    {
        Task task2 = Task.Factory.StartNew(() =>
        {
            foreach (string s in items.GetConsumingEnumerable())
            {
                Console.WriteLine(s);
            }
        });
        tasks[i + 1] = task2;
    }

    Task.WaitAll(tasks);
}

Notice that each task accesses the same collection concurrently, but because it's a BlockingCollection, the read and write operations on the collection are thread-safe. Other methods we use for thread safety include locks and methods from the Interlocked class. Once each of these tasks is assigned to a thread, they run concurrently with one another: two of the tasks take items from the BlockingCollection while the first task adds items to it. The WaitAll method blocks the main thread (the thread that made the WaitAll call) until all tasks have finished running. This is one of the reasons tasks are more manageable than raw threads: their status is known. While our tasks are running, they must be able to notify the UI of what they are currently working on so it can be visualized. This is discussed, along with more on our algorithm, in the next section.
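The Interlocked class mentioned above deserves a quick illustration. This is a small example of ours, not FaceReplace code: several tasks increment one shared counter, and Interlocked.Increment makes each increment a single atomic step instead of a racy read-modify-write.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class InterlockedDemo
{
    static void Main()
    {
        int counter = 0;
        var tasks = new Task[4];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Factory.StartNew(() =>
            {
                for (int j = 0; j < 100000; j++)
                {
                    // counter++ alone would be a read-modify-write race;
                    // Interlocked.Increment makes it one atomic operation.
                    Interlocked.Increment(ref counter);
                }
            });
        }
        Task.WaitAll(tasks);
        Console.WriteLine(counter); // always 400000
    }
}
```

With a plain `counter++` here, the final value would usually come out below 400000 because concurrent increments overwrite each other.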

Face Replace Algorithm (each stage is a task in Data Flow Diagram)

Stage 1 -- produceSourceRegions and produceTargetRegions

The first stage of the face replace pipeline consists of two independent tasks: one producing an ImageRegion (a class we created, located in ImageRegion.cs, that holds data about a face such as its rectangle, image, and heuristic) for each of the source rectangles, and one doing the same for the target rectangles. Once an ImageRegion has been produced, it is added to a BlockingCollection. These tasks run concurrently with each other, and each consists of just one worker. We refer to workers as tasks that are working on the same thing, such as the two tasks in the code sample above that were both writing out strings; they both execute the same code. The tasks at this first stage of the algorithm are both producers and consumers: they are given some data and use it to produce items for the next stage. In our case they consume rectangles and produce ImageRegions for the second stage. Right after the tasks in the first stage begin, the second stage tasks are started.

The type we use to store images is the Bitmap class, which is essentially the pixel data of an image. The next code segment is the part of our stage one task that creates each individual face image from the rectangles. Each stage one task runs this same code.

Bitmap b = new Bitmap(r.Width, r.Height);
using (Graphics g = Graphics.FromImage(b))
{
    g.DrawImage(img, 0, 0, r, GraphicsUnit.Pixel);
}

This code is inside a for loop that goes through the list of either the source or target rectangles; the variable r is the current rectangle inside the loop. We first create a Bitmap instance that is the exact width and height of the current rectangle. We then initialize a Graphics object created from our Bitmap. The Graphics class has methods for drawing onto images, which is exactly what we do here: we use the DrawImage method to draw the face specified by the rectangle onto our bitmap. The first parameter, img, is either the target or source image, depending on the task. The next two parameters are the x and y location of the upper-left corner where the drawing starts; we specify the top-left corner of our bitmap. The next parameter is the source rectangle, the portion of img that will be drawn to our bitmap; this rectangle lies somewhere inside the image. The last parameter is the unit of measurement used for the rectangle parameter; we use pixel measurements.

Stage 2 -- calculateSourceHeuristics and calculateTargetHeuristics

Unlike the previous stage, there are multiple workers for each image at this stage of the pipeline. Because there are multiple workers, we use some methods from the PatternFactory class we created (located in ArchitecturalPatterns, PatternFactory.cs) to simplify things. As before, the workers on the source image are independent of the workers on the target image. Each worker, or task, takes an ImageRegion from the collection and computes the heuristic of the region: the average RGB color values. It is not the best face-comparison heuristic, but for our purposes it works just fine. We simply use a double for loop to iterate over each pixel in the image; the GetPixel method of the Bitmap class returns the RGB color values of a pixel. The two output collections of this stage become the two inputs for the next stage.
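The average-RGB heuristic just described can be sketched as follows. This is an illustrative version we wrote for this paper (the method name and structure are ours; the real code lives in the stage two worker tasks):

```csharp
using System;
using System.Drawing;

class HeuristicDemo
{
    // Sketch of the stage two heuristic: average the R, G, and B values
    // over every pixel in a face bitmap using a double for loop.
    static Color AverageColor(Bitmap b)
    {
        long totalRed = 0, totalGreen = 0, totalBlue = 0;
        for (int y = 0; y < b.Height; y++)
        {
            for (int x = 0; x < b.Width; x++)
            {
                Color c = b.GetPixel(x, y); // RGB values of one pixel
                totalRed += c.R;
                totalGreen += c.G;
                totalBlue += c.B;
            }
        }
        int n = b.Width * b.Height;
        return Color.FromArgb((int)(totalRed / n),
                              (int)(totalGreen / n),
                              (int)(totalBlue / n));
    }

    static void Main()
    {
        // A 2x2 bitmap: two pure red pixels and two pure blue pixels.
        var b = new Bitmap(2, 2);
        b.SetPixel(0, 0, Color.FromArgb(255, 0, 0));
        b.SetPixel(1, 0, Color.FromArgb(255, 0, 0));
        b.SetPixel(0, 1, Color.FromArgb(0, 0, 255));
        b.SetPixel(1, 1, Color.FromArgb(0, 0, 255));
        Console.WriteLine(AverageColor(b)); // averages to R=127, G=0, B=127
    }
}
```

GetPixel is convenient but slow, one reason this stage dominates the run time, as the profiling section later shows.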

The user can select how many workers to use for each image. With more than one worker selected, you can visualize the stage two workers quickly catching up to the stage one worker. This happens only because we tell the threads executing our tasks to sleep after each region, to allow a good visualization of what is happening; otherwise it is too fast to see. Sleeping allows our stage two workers to catch up. Without the UI, stage two does not catch stage one, even with eight workers. For this reason, we do not use multiple workers for the first stage.

Stage 3A -- produceRegionPairs

The first part of stage three is responsible for producing all of the pairs of source and target image regions; each pair consists of one source region and one target region. As target regions have their heuristics computed and are added to the target output collection in stage two, they are taken from that collection using the GetConsumingEnumerable method, which pulls items from a BlockingCollection until the collection has been marked complete, and added to a separate list inside this stage's task. This separate list is needed because we must loop through every target region once for each source region to create the pairs, and that is not possible with a BlockingCollection: you cannot access its elements without removing them, so looping over a collection of this type repeatedly will not work. Once the list is populated with every target region, the task loops through all the source regions using GetConsumingEnumerable and, inside this loop, loops through each target region in the newly populated list, creating each pair.
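The buffering pattern above can be shown in miniature. This is a generic sketch of ours (the real task works on ImageRegions; here plain ints stand in for them): targets are drained into a List so they can be enumerated once per source.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

class PairDemo
{
    // Sketch of the stage 3A pattern: buffer the targets into a List,
    // then pair every source with every buffered target.
    static void ProducePairs<T>(BlockingCollection<T> sources,
                                BlockingCollection<T> targets,
                                BlockingCollection<Tuple<T, T>> pairs)
    {
        var targetList = new List<T>();
        foreach (T t in targets.GetConsumingEnumerable())
            targetList.Add(t);                 // drain stage two's target output

        foreach (T s in sources.GetConsumingEnumerable())
            foreach (T t in targetList)
                pairs.Add(Tuple.Create(s, t)); // one pair per combination

        pairs.CompleteAdding();
    }

    static void Main()
    {
        var sources = new BlockingCollection<int>();
        var targets = new BlockingCollection<int>();
        var pairs = new BlockingCollection<Tuple<int, int>>();

        foreach (int s in new[] { 1, 2 }) sources.Add(s);
        foreach (int t in new[] { 10, 20, 30 }) targets.Add(t);
        sources.CompleteAdding();
        targets.CompleteAdding();

        ProducePairs(sources, targets, pairs);
        Console.WriteLine(pairs.Count); // 2 sources x 3 targets = 6 pairs
    }
}
```

In FaceReplace the sources arrive while the pairing loop is running, so the nested loop overlaps with stage two rather than waiting for it to finish entirely.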

These pairs are added to the output collection of this stage. Unlike the previous stages, there is nothing visual happening on the UI side for this stage.

Stage 3B-- calculateSourceToTargetScores

This stage could easily have been part of stage 3A, but we separated it to make the pipeline longer and to keep things more organized. The input of this stage is the output of stage 3A, the collection of pairs. The user selects one to eight workers for this stage. As pairs come in, their scores are computed and they are added to a new output collection. The scores are the difference in color values between the two image regions.
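One plausible reading of "difference in color values" is a distance between the two regions' average-RGB heuristics. This sketch is ours, not the exact formula in the stage 3B code; a lower score means a closer match.

```csharp
using System;
using System.Drawing;

class ScoreDemo
{
    // Illustrative score: Euclidean distance between two average colors.
    // (An assumption for illustration; the shipped formula may differ.)
    static double Score(Color a, Color b)
    {
        int dr = a.R - b.R, dg = a.G - b.G, db = a.B - b.B;
        return Math.Sqrt(dr * dr + dg * dg + db * db);
    }

    static void Main()
    {
        // Identical average colors score 0, the best possible match.
        Console.WriteLine(Score(Color.FromArgb(10, 20, 30),
                                Color.FromArgb(10, 20, 30))); // prints 0
    }
}
```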

The user can see which images are currently being scored, but if there are multiple workers for this stage, image regions might overlap with each other if different workers' pairs contain the same target or source image region.

Stage 4-- findBestMatchesAndImage

This is the final stage of our face replace algorithm. The input is the output of stage 3B, the scored pairs. As pairs come in, they are taken one by one using GetConsumingEnumerable, added to a new array, and sorted by ascending score; a lower score means a better match between two image regions. The task at this stage draws each matched source region to the rectangle of the target region it matched with. The total number of faces drawn to the target image is the minimum of the number of target regions and the number of source regions, and any source or target region can be matched at most once.
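The matching rule above amounts to a greedy pass over the sorted pairs. This is a sketch of ours with hypothetical int identifiers standing in for ImageRegions: walk the pairs in ascending score order and accept a pair only when both of its regions are still unmatched.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class MatchDemo
{
    // Greedy matching: best-scored pair first, each region used at most once.
    static List<Tuple<int, int>> Match(List<Tuple<int, int, double>> scored)
    {
        var matches = new List<Tuple<int, int>>();
        var usedSources = new HashSet<int>();
        var usedTargets = new HashSet<int>();
        foreach (var p in scored.OrderBy(p => p.Item3)) // ascending score
        {
            if (!usedSources.Contains(p.Item1) && !usedTargets.Contains(p.Item2))
            {
                matches.Add(Tuple.Create(p.Item1, p.Item2));
                usedSources.Add(p.Item1);
                usedTargets.Add(p.Item2);
            }
        }
        return matches;
    }

    static void Main()
    {
        // (source, target, score) triples for 2 sources and 2 targets.
        var scored = new List<Tuple<int, int, double>>
        {
            Tuple.Create(0, 0, 5.0),
            Tuple.Create(0, 1, 1.0),
            Tuple.Create(1, 0, 2.0),
            Tuple.Create(1, 1, 0.5)  // best pair, matched first
        };
        // min(2 sources, 2 targets) = 2 matches survive.
        Console.WriteLine(Match(scored).Count); // prints 2
    }
}
```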

There is nothing to be visualized at this stage, but once all the drawing is finished, regions that were matched will be green while every other region is red. Hovering the mouse over a green region will show which region in the opposite image it was matched with.

Updating the UI

As mentioned earlier, our tasks must notify the UI of what they are currently doing. Each stage keeps track of the regions it is currently working on with either an array of ImageRegions or, in the case of stage one, a single ImageRegion variable. The class that implements the face replace algorithm has an event (called WorkitemProcessed) that is responsible for notifying the UI when it must redraw and for calling the Thread.Sleep method to slow things down so everything can be visualized. The stages that interact with the UI trigger this event after every item that gets processed.

While our tasks are running and updating the regions they are working on, a timer is ticking in our UI class, triggering a tick event every twenty milliseconds. The timer is part of Windows Forms, and its tick interval can be set to any number of milliseconds. The thread that updates this timer is separate from the thread that starts the face replace algorithm; the latter is a background worker, also part of Windows Forms. Every time the tick event (refreshUITimer_tick in our code) is triggered, it checks whether any of the stages have triggered the WorkitemProcessed event. If any have, the regions being worked on are redrawn; otherwise nothing gets redrawn on the current tick and we return out of the method.

Profiling Data

With Visual Studio 2010 Premium and above, there is a flexible profiling tool for analyzing different metrics of a program created in Visual Studio. You can find detailed information on running this tool in its documentation, especially the 'Threads View' topic; please at least skim it to become familiar with the basic features of the tool. Below is a screenshot of the Threads view of the Concurrency Visualizer. We will begin this section by taking a general look at one of the profiles included with the FaceReplace.zip package. For this section, we are using the 'Profiling' project, which is basically an .exe test driver for the FaceReplace.Core DLL (which is in the FaceReplace.Core project).

Here is a checklist of things you will need to work through this section:

The FaceReplace package -- can be downloaded from here:

Visual Studio 2010 Premium or Ultimate

A suitable workstation -- should have a 2 GHz or faster processor, 2 GB of RAM, and at least 2 GB of free disk space

Once you have these items, perform the following steps:

Unzip FaceReplace.zip

Open the FaceReplace solution (FaceReplace.sln, located in the top level of the directory structure you just unzipped)

Build the solution (Menu: Build->Build Solution)

Now you are ready to follow along while we explore several profiles that were built using the Visual Studio profiler, and especially the Concurrency Visualizer tools.

Now, if you look to the top of the Solution Explorer, you will see a 'Solution Items' folder -- expand this, and then double-click the item under it, 'Performance1.psess'. This is a profiler session.

Now, you should see a 'Performance Explorer' window open. This will have two items under 'Performance1', 'Reports' and 'Targets'. Underneath 'Reports' are the generated profiles that we are going to explore.

To begin, we will explore the report '1_1_1_1800.vsp' (named for the 1 worker per image in stage 2, calculating heuristics; the 1 worker in stage 3B, calculating the source-to-target scores; and the 1800 rectangles processed). So go ahead and double-click this now.

At this point, you should have a tab open in the main Visual Studio window area with the title 'Concurrency Profiling Report':

[pic]

There are many areas of information provided by the profiler, but we are going to focus our attention on the 'Threads' view and the 'CPU Utilization' areas. We will begin in the 'Threads' area -- so click on 'Threads' in the top center of the view. Now you will have a view of the 'Threads' window. For now, slide the 'Zoom' slider, at the top of the window, all the way to the left. This will give us the full 'bird's eye' view of the threads.

Our purpose in examining this view is to see the general state of our threads throughout the execution of FaceReplace. The horizontal axis represents time and is displayed in different units depending on the level of zoom you have selected; running vertically are the threads, which are labeled on the left. In addition to the threads belonging to our FaceReplace process, there are also two disk channels, for input and output. The color of a thread bar at any instant of time indicates the general state of the thread, ranging from a few types of execution to several different states of being blocked.

Since we are studying tasks and the threads they are assigned to, we may safely ignore all threads that aren't 'CLR Worker Threads'; those are the threads our tasks are mapped to, where the bulk of the action happens and our algorithm manifests onto silicon. So, while holding down the Ctrl key, left-click each thread name other than a 'CLR Worker Thread', then right-click and choose 'Hide'. This will simplify things slightly for this walk through the thread events.

Now that we've eliminated the threads from our view that we're not interested in, let's spend some time identifying what these threads are doing, and what tasks, if any, are mapped to them at any particular segment of time.

Identify Threads

The first thread (when the view is sorted by the default category, start time) is numbered 4860. As you can see by the vertical stripes of color across its timeline, it begins rather actively (judging by the predominance of greenish stripes) and soon becomes blocked for a long stretch on synchronization. What does this mean? Click on the right half of this thread's stripe, the pink part, and observe the bottom of the profile window. There are four tabs here; we are interested in the 'Current stack' tab:

This shows that 'Synchronization' thread states generally occur when a thread is in a wait state, such as waiting on a lock or semaphore (or, in this case, WaitForSingleObject). The tab shows the stack leading up to the block; the depth of non-user callee functions displayed can be changed in the Visual Studio 'Tools->Options' menu, under the 'Performance Tools' section.

You may also click on the next tab in the set, 'Unblocking stack'. This will show the stack of the thread that signaled this blocked thread. However, in this case, it was probably unblocked by the system when it was time to shut down the FaceReplace process, and no stack is shown.

This is fun to explore, but we got sidetracked trying to understand the reason thread 4860 exists. Now, let's zoom in on the more interesting part, the first section with all the green stripes: [pic]

Zoom in by sliding the 'Zoom' slider to the right until it is just short of center (and the time scale shows about 350 milliseconds), then find one of the first green segments, at 215.14 ms (you will likely have to use the horizontal slider to get there). If you click on it, you can discover what this thread is used for: in the bottom window, with the 'Current stack' tab selected, you will see a rather deep stack, at the bottom of which is the call from FaceReplace that created task 1, 'ProduceImageRegionsStageTask'.

The next thread, 1012, has a very similar pattern in vertical stripes of thread state. If you can manage to click on the segment at 218.89 ms or so, you will find a similar stack displayed for it as the first thread, beginning with a call to create a 'ProduceImageRegionsStageTask'.

The third thread (4200) is harder to identify from its early execution segments. However, if you click on the large stretch of green starting at about 244ms, you can find a stack trace that includes mention of the 'CalculateHeuristicsStageTask'. This must be a thread that is executing stage 2. The next thread down the list, 4468, seems similar and has a tell-tale stack trace at 264.12ms.

The next thread, 288, does not even start until about 262 ms and doesn't seem to execute much (except for some thread-pool-related calls) until later in the program's execution. If you look at the synchronization segment at 5.4455 ms, you will see that it creates a thread pool timer (kernelbase.dll!_CreateThreadpoolTimer); this appears to be the thread pool manager. It is also interesting to note that this synchronization segment lasts until after 2 seconds of execution, at which point the thread wakes up for .46 ms, only to return to a synchronization state that holds for about 24 seconds! This long 'WaitForMultipleObjects' segment is signaled (as the 'Unblocking stack' tab shows) by thread 3932 when it tries to run 'CreateProduceSourceToTargetRegionPairs', which is task 3a. [pic] The red segment is the highlighted segment at 2268 ms, and the small bar connecting it to the thread below it (3932) is the signal.

Going down the list, thread 3932 (running task 3a, we deduced above) seems to be mostly in a synchronization state, save for periodic, razor thin green stripes a little under 100 times a second. Curious! However, if you click on one of these synchronization segments, such as the one at 2926ms, you will see that this thread is calling the 'GetConsumingEnumerable' that is being produced by thread 4468! These thin stripes of execution happen when there is actually something to consume. Here is a clue that stage 2 is starving stage 3a.

[pic]

Figure 4-- Profile 1_1_1_1800.vsp

Thread 2200 begins with an execution state for which we can get little information, other than that it starts late, at 263 ms, with an execution section that has no stack information. Next on its timeline is a synchronization section with a very simple stack trace: it enters a critical section, period. This critical section is linked to thread 3932, and the 'Unblocking stack' shows some very mysterious calls (we do not know the reason for the link to 3932 either). This is followed by another execution segment at 263.55 ms, again lacking a stack trace. The next segment is again the entry to a critical section, but this one is signaled by the next thread, 6616; the 'Unblocking stack' shows the same strange stack as before. Next we have some long periods of sleep, interrupted by brief execution periods lacking stack traces; the time from sleep to brief execution is about 1/30 of a second. However, the synchronization period (a call to WaitForSingleObjectEx) at 1327 ms is signaled by thread 4860, one of the threads working for a stage one task. It is signaled right after 4860 performs the write to stdout indicating that task 1 is complete; this event is at 1739.70 in 4860 and is marked with a magenta I/O event/state. Perhaps this is an inactive thread pool thread that was signaled by the completion of task 1. It is mysterious what this thread did when it was signaled!

The next thread in order of start time is thread 6616. This thread actually starts out in the synchronization state (by calling System.Threading.Monitor.Wait) and stays there until 26 seconds into the program, when it is signaled by thread 3932, which we learned carried task 3a. It appears to be signaled by a call to BlockingCollection.Add, so is this the thread running task 3b, 'calculateSourceToTargetScores'? One mysterious part is how this thread got assigned to any task and would know to be signaled by an addition to a blocking collection when it has been asleep the whole time; just click on the first segment of the thread to see what we mean. It becomes more apparent with further exploration: the synchronization segment at 26507 ms appears to be unblocked by a task, and this is a thread that is running the task 'StartPipelineStageWorkers'. It is a task that starts other tasks . . .

Thread 3212 seems to do little other than synchronize. It receives a wakeup call from 1012 (a stage 1 task thread) but then goes back to sleep, and then calls WaitForSingleObjectEx for the rest of its life. This must be an extra thread pool thread.

Finally, thread 5980 leads a very short life. It doesn't even come to life until 26 seconds into execution, and there it begins with a WaitForSingleObjectEx. It is then signaled by thread 3932, the thread that is executing (at that point) task 3b, 'calculateSourceToTargetScores'. Perhaps another excess pool thread.

Now that we have a good idea how to identify our threads with regards to tasks we created in FaceReplace, let's step back a minute and try to understand a little more of the behavior that is being exhibited in the trace of these thread states across the program lifeline.

Understanding Program Behavior

We are starting out with a rather large rectangle specification file (that specifies rectangles in our images to process) so that we may stretch out the amount of work processed by our program and identify its features more easily under the profiler thread view (as smaller specified work regions run very quickly). We have four different profiles that we'll examine, and they are provided in the project so that you may follow along with our analysis and exploration. We have already examined the basic anatomy of FaceReplace as the profile trace reveals in the threads view; now we will examine the consequences of changing work load, adjusting the number of workers used per stage, and even forcing the thread pool (that keeps a reserve of live threads waiting to serve) to create more threads at startup.

We'll first figure out how the pieces are working together in our first profile, 1_1_1_1800.vsp (which you should already have loaded in the profile analyzer).

1800 rectangles of size 50 by 50

Stage 2: 1 Worker per image Stage 3B: 1 Worker, 1_1_1_1800.vsp

To begin this section, follow the instructions above for loading the profile and hiding the non-CLR threads; then move the 'Zoom' slider fully to the left to have the complete view of the thread timelines without needing to scroll in either direction (if your monitor permits).

Here is a small table for your reference of the tasks we deduced were assigned to which threads:

TID    Task   Notes
4860   1
1012   1
4200   2
4468   2
288    ?      Thread pool mgr
3932   3
2200   ?      Extra pool thread
6616          StartPipelineStageWorkers
3212   ?      Extra pool thread
5980   ?      Extra pool thread

Now that we've identified the threads, it is fairly clear how the program is behaving. It also helps to examine the Data Flow Diagram above, because it ties a task stage with the previous and next one by the data that they each produce.

We know that the top two threads on the chart, 4860 and 1012, are doing the work of task 1. Obviously, this stage of the program's execution is a slim part of the whole. You can also see that stage 1 provides data to stage 2 through a BlockingCollection, and that there is a high level of concurrency between the two stages: there is a lot of green going on in stage 2 (threads 4200 and 4468) at the same points in time as in stage 1.

However, it is clear that stage 2 is completely dominating the time of FaceReplace's execution! Of about 31.5 total seconds of execution for this profile, roughly 26 seconds of that is stage 2.

From a performance standpoint, our bottleneck is stage 2.

Let's attempt to improve our performance, now that we know where we're spending all our time.

1800 rectangles of size 50 by 50

Stage 2: 2 Workers per image Stage 3B: 1 Worker, 2_2_1_1800.vsp

[To set up for the analysis of this profile, double-click the profile file 2_2_1_1800.vsp in the Performance Explorer and hide the non-CLR threads as described at the top of the Profiling Data section]

As mentioned earlier, most of the execution time of our algorithm is spent in stage two, so we would expect the biggest improvement in speed to come from adding more workers to that stage. But how much improvement should we expect? One can estimate this using Amdahl's law: speedup = 1 / ((1 - p) + p/s), where p is the fraction of the program's run time being parallelized and s is the number of worker threads. Stage two takes up about 82% of the execution time, so with two workers (per image) the expected speedup is about 1.69: 1 / ((1 - 0.82) + 0.82/2). With one worker per stage, the algorithm's execution time was a little under thirty-two seconds, as shown by figure 4, so our expected execution time with two workers on stage two should be about 18 1/2 seconds. The next image shows the results.
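The Amdahl's law arithmetic above is easy to check in a few lines (a throwaway calculation, not FaceReplace code; the 31.5-second baseline is the total from the first profile):

```csharp
using System;

class Amdahl
{
    static void Main()
    {
        // Amdahl's law: speedup = 1 / ((1 - p) + p / s)
        double p = 0.82;   // fraction of run time spent in stage two
        double s = 2.0;    // workers per image in stage two
        double speedup = 1.0 / ((1.0 - p) + p / s);
        Console.WriteLine(speedup.ToString("F2"));          // 1.69
        Console.WriteLine((31.5 / speedup).ToString("F1")); // ~18.6 s expected
    }
}
```

As s grows, the (1 - p) serial term dominates: even with infinitely many stage two workers, the best possible speedup here is 1 / 0.18, about 5.6x, which foreshadows the diminishing returns seen with eight workers below.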

[pic]

We came close to the expected time of 18 1/2 seconds -- perhaps not bad considering we did not run our experiments in a closely controlled execution environment. There are two workers for each image in stage two (threads 4756, 5020, 6620, and 6352). In the first profile, stage two took about 26 seconds to execute with one worker per image; with two workers per image, the execution time was about 16 seconds. In theory, we would expect the execution time to be 50% of what it was before, because we have twice as many workers. But because of the overhead of tasks and threads, and the non-deterministic nature of a machine running many uncontrolled processes, we do not achieve perfect results.

[pic]Figure 5-- CPU Utilization, 2_2_1_1800

1800 rectangles of size 50 by 50

Stage 2: 8 Workers per image Stage 3B: 1 Worker, 8_8_1_1800.vsp

[To set up for the analysis of this profile, double-click the profile file 8_8_1_1800.vsp in the Performance Explorer and hide the non-CLR threads as described at the top of the Profiling Data section]

We tried running the algorithm with eight workers on each image in stage two to see if the improvement in speed was anywhere near eight times. It ended up being only slightly faster than two workers per image.

[pic]

There are two reasons for the lack of improvement. First, as mentioned before, there is overhead when dealing with more tasks and threads. And second, it looks like there are only nine CLR threads that are available when the program starts, one of which is used for something other than our algorithm. This means that only eight threads are working on stage two, as opposed to the sixteen that we wanted.

But what if we wanted more than eight worker threads to execute our tasks? We tried two things that would hopefully allow us to use as many worker threads as we would like. First we tried using more rectangles of larger size, hoping that more work would trigger the thread pool to provide more worker threads.

This is one line of thought! However, there is another factor at work here. Let's zoom in on an area of the thread chart, say 500 to 1500 ms. Take a look at figure 6, Oversubscription of threads. Notice the tiling of green execution segments and yellow preemptions? In order to give all our threads time, the scheduler must preempt threads in large chunks to give time elsewhere. This also shows in our CPU utilization: we are utilizing the cores better, hence the slight speedup, but at the cost of too many context switches. Compare the CPU utilization of the previous profile, 2_2_1_1800, with this one. We have oversaturated our processors and passed the point of diminishing returns.

[pic]

Figure 7 -- CPU Utilization, 8_8_1_1800

With this many threads running at the same time, a larger percentage of the segments are yellow preemption segments. There is also more synchronization time, most likely from threads competing for locks. Preemption occurs either when a thread is paused in favor of a higher-priority thread or when its execution quantum expires and another thread resumes. All of our worker threads have the same priority, so most likely their quanta are expiring. Even with the extra preemption and synchronization time, however, this run is slightly faster than using the default minimum number of threads and the thread pool's own algorithm for creating new ones.

Parallel For Loop

In an attempt to speed up our algorithm, we turned the outer loop of the double for loop that calculates the color averages into a parallel for loop. This initially seemed like a great idea, but it ended up hurting the speed of the algorithm. The first parallel for loop we tried is shown below. It contains data races: multiple processors read from and write to our color totals at the same time. The img variable is a bitmap of an ImageRegion; the three color totals are integers.

ParallelOptions po = new ParallelOptions();
po.MaxDegreeOfParallelism = 8;

Parallel.For(0, sz.Height, po, (y) =>
{
    for (int x = 0; x < sz.Width; x++)
    {
        Color color;
        lock (img)
            color = img.GetPixel(x, y);
        totalRed += color.R;    // data race: unsynchronized shared write
        totalGreen += color.G;  // data race
        totalBlue += color.B;   // data race
    }
});

The ParallelOptions class lets you set the maximum degree of parallelism, the maximum number of logical cores that will execute the loop. The first parameter of Parallel.For is the starting value of the loop, the second is the exclusive upper bound, then comes our ParallelOptions object, and finally the delegate containing the work to be done. If run on a multi-core machine, this delegate is executed in parallel by up to eight cores, depending on how many cores the machine has. Ideally, each core executes an equal share of the iterations, minimizing the time any core sits idle.

Even though this parallel loop has data races, we still wanted to compare the algorithm's speed with before. With a maximum degree of parallelism of two, run-time increased by about ten percent, and with a maximum degree of eight, it increased by nearly forty percent. There are two reasons the parallel for loop did not help. First, the lock inside the inner loop causes a lot of synchronization time, with multiple threads competing for it. Second, there is not enough work inside the loop body to compensate for the overhead of the parallel for loop itself. To test whether more work would overcome that overhead, we added busy work to the inner loop, both here and in the original sequential double for loop. The busy work is just an empty for loop inside the inner loop.
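The busy-work placement can be sketched as follows, assuming the loop structure above (busyIters is a hypothetical name for the iteration count varied in the timing runs):

```
Parallel.For(0, sz.Height, po, (y) =>
{
    for (int x = 0; x < sz.Width; x++)
    {
        Color color;
        lock (img)
            color = img.GetPixel(x, y);

        // Busy work: an empty loop that only adds CPU time per pixel.
        for (int i = 0; i < busyIters; i++) { }

        totalRed += color.R;
        totalGreen += color.G;
        totalBlue += color.B;
    }
});
```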

First, with a busy loop of one thousand iterations and a maximum degree of parallelism of eight, we measured a speedup of about 50%. Increasing the busy loop to five thousand iterations gave a speedup of more than two and a half times. With more and more iterations of the busy loop, the speedup would most likely continue to approach eight times. The lesson is that a parallel for loop pays off only when each iteration does enough work; when it does not, as in our method of calculating color averages, a parallel for loop is a bad choice. A more practical use in our case would be to parallelize the loop over all of the images and, inside it, compute the color values with a regular double for loop.
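That restructuring might look like the following sketch, where images, averages, and ComputeAverages are hypothetical names standing in for our image list and the existing sequential averaging code:

```
// Parallelize the coarse-grained outer loop over whole images; each
// iteration now does enough work to amortize the parallel-loop overhead.
Parallel.For(0, images.Count, (i) =>
{
    // Inside, the plain sequential double for loop over the pixels.
    // Each iteration writes only to its own slot, so no locks on the
    // totals are needed.
    averages[i] = ComputeAverages(images[i]);
});
```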

A thread-safe implementation of the parallel for loop from before is given below.

object redLock = new object();
object greenLock = new object();
object blueLock = new object();

Parallel.For(0, sz.Height, po, (y) =>
{
    for (int x = 0; x < sz.Width; x++)
    {
        Color color;
        lock (img)
            color = img.GetPixel(x, y);
        lock (redLock)
            totalRed += color.R;
        lock (greenLock)
            totalGreen += color.G;
        lock (blueLock)
            totalBlue += color.B;
    }
});

This version does not have any data races, but with this many locks the threads spend most of their time waiting on one another, so there is very little actual parallelism and the loop runs much slower.
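A lock-free alternative, not used in FaceReplace but idiomatic with the TPL, is the Parallel.For overload that gives each worker thread its own partial sums and merges them once per thread at the end. A sketch under the same assumptions as the loops above (the long[] layout for the thread-local sums and the long totals are illustrative choices):

```
long totalRed = 0, totalGreen = 0, totalBlue = 0;

Parallel.For(0, sz.Height,
    // localInit: each worker thread starts with its own zeroed sums.
    () => new long[3],
    // body: accumulate into the thread-local array; no locks on the
    // totals, only the existing lock protecting the bitmap itself.
    (y, state, local) =>
    {
        for (int x = 0; x < sz.Width; x++)
        {
            Color color;
            lock (img)
                color = img.GetPixel(x, y);
            local[0] += color.R;
            local[1] += color.G;
            local[2] += color.B;
        }
        return local;
    },
    // localFinally: merge each thread's partial sums exactly once.
    (local) =>
    {
        Interlocked.Add(ref totalRed, local[0]);
        Interlocked.Add(ref totalGreen, local[1]);
        Interlocked.Add(ref totalBlue, local[2]);
    });
```

Because the merge runs once per worker thread rather than once per pixel, contention is negligible regardless of image size.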

-----------------------

Figure 3-- Solution Explorer

Figure 2-- Data Flow Diagram

Figure 6-- Oversubscription of threads
