July, 2008
by Claire Cates
Today, processors are getting faster, but not at the rate that they were in past years. Since processor speed is not increasing as rapidly as in the past, the primary way to increase the speed of an application is to take advantage of the multi-core and multi-processor machines that are now the norm. To accomplish this, an application must become multi-threaded. Unfortunately, threading an application can also cause problems and can even cause programs to run more slowly.
This is the second of a three part series. Threads - the Good was published in the May edition of MeasureIT.
Most applications in the world today are still not threaded. Why? The answer is simple; it is difficult to write threaded code and; therefore, the development and debug cycles are lengthened.
There are several things that must be considered when writing code that will take advantage of parallelization:
The first thing a developer needs to determine is what, if any, part of the application should be threaded. The best way to accomplish this is to run the application under a code profiler to determine any hotspots in the system. Once the hotspots are identified, ask these questions about the application:
Looking at the code in the application may also help in the process of determining what to parallelize.
A coding example of Task Parallelism is:
Since AnalyzeA, AnalyzeB and AnalyzeC are independent, these three actions could be performed in parallel.
A coding example of Data Parallelism is:
Since Process data is working on different areas of the array, each iteration of the loop could be executed in a separate thread.
A coding example of PipeLine Parallelism is:
The data must first be obtained, then analyzed, then updated. One thread could be just getting each data block. The next thread takes that data block and does the analysis and finally the output from the analysis is updated back into the database.
Complex applications can be a combination of multiple techniques; therefore, do not limit the types of parallelism to look for when searching the application code. The thread diagram for a multi-level parallelized system can become complex but, when the application is programmed correctly, it can provide the users with a fast, efficient, fully functional application.
After determining areas in the code that are candidates for parallelization, look at the amount of code that is going to be executed and compute how much time the application is likely to spend in that section of code. Does that amount of time greatly exceed the amount of time it takes to create and destroy a thread along with the time needed to communicate between the threads? Creating and destroying a thread can be very expensive and overwhelm any parallel savings. If the amount of time spent in the code to be threaded is minimal, then threading that section of code could actually slow the application down. It is essential to pick the portions of the code where threading improves performance.
Once it is determined that it would be beneficial to thread the section of code, the developer needs to determine the optimum number of threads to create. This, in itself, can be complicated. How many other threads are running in the application? Should the developer create more threads than the number of processors? The application as a whole should not create too many threads because if all the threads are ready to run at the same time, then thread thrashing can occur. Adding addition threads in this instance will only slow the system down. Remember the number of threads that can run at any one time is limited by the number of processors. Context switching (the OS putting one thread in a wait state so that another thread can run) can be expensive. Furthermore, memory and IO caches can be destroyed which could cause unnecessary delays when the thread once again gets a chance to execute. An application should only have more threads than processors if some of the threads don't execute often and are normally in a wait state. A developer should also realize that the application may be sharing the machine with other applications; therefore, the application may not want to use all CPUs and other resources.
The number of threads in an application should be flexible and it is actually best in many circumstances if the amount is decided at run time. If an application creates a fixed amount of threads and the data being processed is small, then the overhead of the thread creation, deletion and communication can nullify any threading benefits. If the application is written to determine the number of threads based on the data, the application can use the threads to gain performance and not affect overall performance when the data load is light. In light data load instances, it might be best for the application to not create any threads and to perform all work on the single thread.
After the developer has determined what to thread and how many threads should be created for optimum performance, the developer next needs to determine how the threads will communicate. Will the threads communicate between worker threads, or will the communication be handled by a manager thread which then talks with the worker threads? Determining the best way to thread depends on what is being threaded and the threading model used.
Since threads run asynchronously, no two runs of the application can guarantee that the two threads will be executed in the same order. Threaded code must therefore serialize any access to changeable shared resources in order to maintain the consistency of the shared data. If 2 or more threads are accessing updatable shared data, program crashes or incorrect results can occur. A data race or race condition occurs when multiple threads access a shared resource and the value of the shared resource is dependent on the execution order of the two threads.
Allocated memory, stack variables and global variables are the most common examples of a shared resource, but a developer must also consider files or other I/O devices as possible shared resources.
A crash can occur if two threads are executing at the same time and one thread is in the middle of removing an item from a linked list while another thread is searching that list.
Thread A is removing item from a linked list
The variable item had the value to be deleted.
Thread B is searching thru the linked list
Another reason shared memory accesses need to be serialized is to avoid a data race where results are unpredictable.
If both Thread A and Thread B are sharing the variable cnt, and each thread at some time needs to increment , a race condition could occur if the increment was not properly serialized.
To increment a value:
The result of cnt after both thread 1 and thread 2 could be different depending on the order of execution.
In both cases let's assume cnt is 0 before the process starts
Case 1:
Case 2:
The value of cnt would be 2 when Case 1 finished but would be 1 in Case 2 because the entire process of incrementing the value was not serialized. An application should not produce results that are unpredictable.
A developer should first try to minimize the shared data between threads if at all possible to avoid these scenarios. If it is not possible to alleviate all instances of shared data, a developer must serialize all shared accesses. This is easier said than done. Every line of code that is executed by any thread must be checked to determine which shared resources can be written to. This can be complicated if the thread contains an abundance of code. It can be further complicated if the code calls subroutines or functions that are not written by the developer but use common shared resources. That code could be
It is essential that the developer determines that all executed code is thread safe. If the code called by a threaded application is not thread safe, then no matter how well the developer did threading in his own code, the application will still not perform correctly. So, what does thread safe mean? In order to make code thread safe, the developer must serialize access to shared resources.
Before a developer starts looking at how to serialize their non-thread safe code, they should look to see if an alterative local to the thread is available. Could a copy of the data be given to each thread, with the changes later updated in the master copy of the data? If so, this might be considered. If not, then the developer needs to determine how to serialize the code that accesses the shared data.
The mechanism to perform the synchronization can be any one of the following
An atomic operation is completed entirely before another operation is allowed to commence. A developer could replace the increment in the earlier example with an atomic increment to avoid the problem. In the atomic increment the process of the increment and storing the value into memory cannot be interrupted.
A critical section is a section of code that accesses a shared resource and that shared resource must not be accessed by any other thread while the critical section is being executed. A regular lock / mutex / semaphore can be used to control access to a critical section and can be used to bracket the critical section as follows:
EVERY place in the code where the shared resource is being used MUST be bracketed by a lock get/release. The lock prevents another thread from executing a piece of code that is similarly locked until the first thread releases the lock. The thread trying to obtain the lock that is currently being used will be put into a wait state until the thread holding the lock releases it. Therefore, the critical sections surrounding the shared resource will be guaranteed to execute serially and without interruption.
A Read/Write lock is similar to a regular lock, but the lock can be obtained either in Read or Write mode. Multiple threads can each obtain a Read lock without any of the threads being put into a wait state. It is only when a thread tries to obtain a write lock that a thread or threads will be put into a wait state. An example where a read/ write lock might be useful is an in-memory data base. If the majority of the access to the database is to read the data, then the data values are not changing and all threads can grab the data without having to serialize the threads, but when the data needs to be updated, that thread must request a write lock. That thread will be put into a wait state until all currently held read locks are released. The write lock will then be obtained and held during the update. During this timeframe, all threads requesting a read or write lock using the same lock will be put into a wait state. Once the update has completed, the lock is released and any waiting threads will be allowed to proceed with their execution. A R/W lock is most efficient if the number of reads far exceeds the number of writes.
Finally, a developer could use a spin lock. A spin lock is when a thread waits in a loop until the resource becomes available. If the amount of time to wait for the resource is typically VERY short, spin locks are cheaper because the thread does not have to go into a wait state and possibly be removed from a processor or incur the overhead of executing a context switch. But, if the amount of time a thread has to wait is not small or is not known, a spin lock should not be used. A spin lock does not release the processor for other threads to use, instead the processor is 100% busy -- doing nothing.
When deciding how much code should be locked, there are 2 important things to consider. First, the critical section of code should not be too large or other threads currently trying to access the lock will be put into a wait state and become idle. Suddenly the threaded portion of the code becomes serialized. If the critical section is too large, the application may not be candidate for threading since there is little natural concurrency. On the other hand, if the locks are obtained too often, the overhead of obtaining and releasing the locks will nullify the benefit of threading.
As a developer is analyzing code, one should move coding statements that require bracketing by a lock close to other statements also requiring the same lock to reduce the number of lock accesses. For instance, let's say code in an application is represented as the blocks in Figure 6 where the green represents code that can always run in parallel and red represents code that must be serialized. If the developer puts a lock obtain and release around each red block, the locking overhead could detrimentally affect performance. Instead, if the developer moved some of the code that must be serialized and placed it with other code that must be serialized, then the reduction in the number of lock instances will increase performance.
Once the threaded application has been written, it must be tested. This brings in another set of problems. The application should be tested on many different classes of machines. What happens if the application is run on a single processor machine, a dual processor machine, or a multi-core machine? What happens when the code is run under a different OS? Is the performance acceptable on all classes of machines and each OS? The tester needs to realize that the application must be executed multiple times with the same data because multi-threaded code paths are non-deterministic and can therefore produce inconsistent results and execution times. Even running the test multiple times does not uncover all the problems. Errors and performance problems may not be noticed until the system is load tested or even deployed to a customer site. Unfortunately, even if a problem is seen, it may not be easily repeatable.
There are some threading tools like Intel's Thread Checker that can help in testing threaded code. These should be incorporated into the testing environment so that as many thread coding problems are fixed before an application is shipped to the customer. Once code is shipped to a customer, uncovering problems may be very difficult.
If a problem is detected and the code must be debugged, another set of problems occur. Because the developer can not guarantee that the code will be executed in the same sequence, the developer may not be able to recreate the problem. This is especially true when trying to debug areas of the code that were not serialized properly.
A threaded application should be written so that it could be executed serially, if needed - with only one thread. If the tester or developer determines that there is a problem with the application's results, the first thing that should be done is to run the single threaded version of the application and see if it produces correct results. If it does not, this single threaded application will be much easier to debug than the multi-threaded version.
If the errors only occur on the multi-threaded version, then the developer will need to use a debugger that handles multiple threads so that the state of each thread can be examined. Unfortunately, the debugger itself often perturbs the system enough that the errors cannot be reproduced while using the debugger. Almost every form of debugging from instrumenting the code to putting in printf statements can perturb the system enough so that the threading errors no longer can be reproduced.
The developer may also be advised to run 3rd party threading tools like Intel's Thread Checker to see if it uncovers any threading issues or a code profiler like Rational Quantify, Intel's Vtune or Intel's Thread Profiler to determine performance problems. Finally, the developer needs patience since the failures and performance problems may be hard to reproduce and may be even harder to fix.
It is understandable why many software developers shy away from threading their code, given the fact that threaded code is harder to write, can introduce new scenarios that can make the application run incorrectly, is harder to test, and is harder to debug.
The last section, Threads-The Ugly will appear in the August issue.
Developing Multithreaded Applications: A Platform Consistent Approach
http://cache-www.intel.com/cd/00/00/05/15/51534_developing_multithreaded_applications.pdf
Multithreading Programming Guide
http://docs.sun.com/app/docs/doc/816-5137
Dr. Dobbs: Performance Analysis and Multicore Processors : March 30, 2006
http://www.ddj.com/dept/64bit/184417069
Dr Dobbs: Multitasking Alternatives and the Perils of Preemption: Sept 14, 2006
http://www.ddj.com/dept/embedded/193000965
Data Placement in Threaded Programs: from the Intel Software Network
Threading Methodology: Principles and Practices
http://cache-www.intel.com/cd/00/00/21/93/219349_threadingmethodology.pdf
Intel Software Developer Webinar Series Fall 2007
W. Brown, R. Malveau, H. McCormick III, T. Mowbray, "AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis", John Wiley, New York(1998).
C. Smith, L.Williams, "More New Software AntiPatterns," Proceedings of 2003 Computer Measurement Group.