August, 2008
by Claire Cates
Today, processors are getting faster, but not at the rate that they were in past years. Since processor speed is not increasing as rapidly, the primary way to increase the speed of an application is to take advantage of the multi-core and multi-processor machines that are now the norm. To accomplish this, an application must become multi-threaded. Unfortunately, threading an application can also cause problems and can even cause programs to run more slowly.
This is the last of a three part series published in MeasureIT. Threads - the Good can be found in the May issue and Threads - The Bad is in the July issue.
After completing a new threaded application or finishing the threading of an already existing application, it is time to run the application to test how much faster it executes. Unfortunately, performance may not meet expectations. The application may actually execute slower than when single threaded or it may even lock up. These results constitute the ugly and are often caused by the use of anti-patterns.
A software anti-pattern occurs in the design of an algorithm when the intent of the algorithm is to perform a job in a beneficial way, yet the actual algorithm is ineffective. Anti-patterns are common practices that are repeated by many developers. Performance anti-patterns are also common solutions to problems, but their use (or misuse) produces negative performance consequences. There are several common anti-patterns that can be found in typical software systems. Locating and removing these common anti-patterns can significantly improve a systems performance.
The one lane bridge anti-pattern occurs when a resource is needed by multiple threads at the same instant, yet that resource must be serialized and can only be used by one thread at a time. All other threads must wait for the thread that is holding the resource to release it. Basically, this is the same as a four lane highway having to merge into one lane for a bridge. If the traffic is minimal, then there is no slow down, but if all four lanes of traffic are heavy, a major slowdown will occur at the bridge. Figure 7 shows 3 threads waiting as the first thread runs.
In applications, the traffic lanes are threads and the bridge could be a shared memory data area or file. A developer typically adds synchronization calls to the program to prevent two or more threads from updating the shared resource in an unpredictable way, as was discussed in the bad. [See July MeasureIT] But, if these synchronization calls are a common occurrence or if the code that is bracketed by the synchronization constructs is too large, then one thread can force one or more threads into a wait state. The wait may not be noticeable if only 2 threads are running. Yet, if many threads are sharing the same resource, it is more likely that one of those threads will block another thread. Lock contention will increase as the number of threads increases because the probability of one thread holding the lock increases with each added thread. This can also occur if the application is built to scale to the number of processors on a machine. As the number of processors increases, an application running on a 4, 8, or 32 processor machine could have one thread that blocks many of the threads that could be executing. If only one thread is allowed to run on multiprocessor machine and all the other threads are in a wait state, the extra processors will be idle and parallelism has not been achieved. Furthermore, if the one lane bridge scenario persists, a traffic jam can occur which will further degrade performance.
As seen in Figure 8, eight threads will lead to a further backlog since only one thread can run at a time. An extended backlog occurs because even with eight threads, more transactions are coming in than the eight threads can handle and a traffic jam occurs in the transaction queue feeding the threads.
In order to avoid this scenario, the developer should avoid using shared resources in common threads or, if that is not possible, should make the section of code that must be serialized as small as possible.
The excessive memory allocation anti-pattern can cause a one lane bridge. A developer must remember that memory allocations and frees require synchronization. If the memory manager used does not contain its own locking mechanism, then the developer must add serialization constructs to protect these routines. Further, an application should be written to avoid too many calls to memory routines since these routines will serialize the application and reduce performance. If the application requires a large number of memory routine calls, then the developer should try to find a memory allocator that has thread local pools (pools not used by other threads).
When a system spawns multiple threads in order to allow multiple processors to work on different areas of the data, the developer MUST make sure each chunk of data sent to the threads is distributed equally; otherwise, one processor may end up spending a much longer amount of time processing its chunk of data while all the other threads have completed their execution. The remaining threads will have to wait for this one thread to complete. The thread that takes the longest will be the thread controlling the total time of the entire system. This is known as the unbalanced processing anti-pattern.
In figure 9, 4 threads are running but thread 1 accounts for the majority of the execution time. To understand that, consider an application using an algorithm to determine all prime numbers between 1 and 8000. The application is running on an 8-processor machine; an example of a bad algorithm would be one that spawns 8 threads and asks each thread to take a range of 1000 numbers. There are many more prime numbers between 1 and 1000 than between 7000 and 8000. One of these threads will take much longer to finish than the other.
The larger the packets of data given to the threads, the more likely the threads will get into an unbalanced situation. A better algorithm would be to parcel off the data in much smaller packets and have the thread request new data when it finished analyzing its current range. For instance, the packets could be sized in ranges of 100. Then if one thread is spending 5 times as much time on one range of 100, another thread can process 5 packets of 100 numbers during the same time.
Determining, the size of the data is critical. If the data packets are too large, the threads can become unbalanced. If the data packet size is too small, too much time will be spent on the communication of the data between the controller and the worker threads.
If the thread model is task based, an unbalanced situation can also occur when the tasks assigned to the different threads have vastly different execution times. Again, one thread will be the determining factor for the overall performance of the application.
Some developers think that the more threads the better. Actually, having more threads can hurt performance. Adding threads can be like taking aspirin. A few aspirin will make you feel better, too many could kill you.
A few threads may improve the system, but too many threads can and will slow the system down. First, thread overhead can be costly. Thread overhead includes the creation and destruction of a thread, but it also includes the context switching an operating system has to perform. If a large number of threads are created but very little work is given to each thread, then the overhead of the thread creation and deletion can overwhelm the savings of spreading the processing across multiple threads. If threads are created and destroyed often, a developer may want to consider a pool of threads that just wait for work that is assigned to them via a controller thread. Pooling threads will save the time required to create and destroy the threads since these threads are used repeatedly.
Even with thread pooling, the developer must still consider the amount of work given to the worker thread; otherwise, the worker threads can become unbalanced or spend too much time communicating with the controller thread. A delicate balance of just the right amount of work to each thread must be achieved while making sure that the work a thread is given to process justifies the thread's existence.
Further, if the number of threads far exceeds the number of processors (oversubscription), there will be a lot of context switching between executing and waiting threads. The system overhead caused by context switching overwhelms the performance gained by the threads. Not only does the context switch add to the execution time but it can further increase execution time since both memory and I/O caches could become dirty. If a cache becomes dirty, the cache will need to be refreshed once the thread regains control. Refreshing the caches can be quite expensive and if this occurs on each context switch, the performance hit can be substantial.
Threading can cause the memory caches to become dirty requiring cache refreshes after a context switch. This is especially true when there are more threads than processors. Unfortunately, this is not the only memory cache problem that can occur; false sharing can also occur. False sharing occurs when two or more threads are accessing different pieces of data in the same cache line. This can occur when private data elements used by two different threads are in the same local memory area. It is called false sharing because the data is not really shared but the two threads share a common cache line. A cache line's interactions can be seen in figure 6.
If Thread A writes to its data area, the cache line for Thread B is invalidated (becomes dirty) even though the data for Thread B has not changed and is, in fact, clean. Thread B will lose performance because now the cache line will have to be fetched from main memory. If the data is aligned poorly and false sharing happens often, a major performance hit will occur.
To avoid false sharing, a developer should first separate data that can be written and data that can be read. One thread reading data will not destroy cache lines from other threads reading data. It is only when data is written, that false sharing can occur; therefore, all data that can be written should be placed in the same local area to minimize the interleaving of read and write data. Also, a developer should know typical cache line sizes for the systems on which the application will run and make sure that commonly used written data is aligned to the cache line boundary. If the data is aligned to a cache boundary (typically 64 or 128 bytes), the OS will not place the data area in a cache line shared by another thread.
A further consideration occurs when a master thread parcels out portions of a single data area that will be consumed by multiple worker threads. The data should be grouped and assigned to the worker threads so that no two threads can have data that could share the same cache line. For instance, if an application is using matrices, the developer should determine how the OS stores arrays, i.e. matrices. Knowing if the data is stored in row or column order is essential --because if the worker threads are given data in the wrong order, the access of the data in the array could cause false sharing. If the data is stored so that row 1, column 1 is the first item in memory followed by row 1 column 2, then false sharing could occur if each thread was asked to look at a particular column of data.
Another potential problem that can cause a system to fail is a deadlock. This occurs when two or more threads are accessing and attempting to obtain multiple locks. For instance, assume Thread A holds lock 1 and then requests lock 2. Unfortunately, lock 2 is being held by Thread B, so Thread A must go into a wait state. Thread B continues to run and then requests lock 1. Lock 1 is being held by Thread A. Each thread is holding a lock the other thread needs and neither thread can run until it obtains the needed lock; therefore, the system is deadlocked and no processing can be performed.
Deadlocks typically occur when multiple locks are obtained out of order. If all threads obtain a set of locks in the same order, this type of deadlock will not occur. So, in the above instance both Thread A and Thread B must obtain lock 1 before trying to obtain lock 2.
Deadlocks can be complicated by the fact that more than two threads may be the culprit. For instance, assume three threads are running concurrently
If Thread C then attempts to get lock 2, Thread C will go into a wait state. Next, Thread B runs and tries to get lock 1, it too will go into a wait state. Finally, Thread A runs and tries to get lock 3, and again, the system deadlocks. Of course, this deadlock is less obvious. The more threads and locks that are used, the harder it will be to discover potential deadlocks in the code.
Deadlocks can also occur when one thread terminates while holding a lock. As a thread terminates, it MUST release all locks or semaphores that are being held otherwise any thread requesting the lock will remain in a perpetual wait state.
There are tools in the market that can help uncover the problems described above. Rational Quantify, Intel Vtune, Intel Thread Profiler and HP Caliper are just a few that can be used. Each tool has its strengths and weaknesses, but all can help a developer determine why their threaded code is not performing as expected. It is imperative to test the application using profilers and debuggers. Things to look for are:
Given the fact that multi-core machines are now the norm and that processors are not increasing in performance at the rate they did during the 80's and 90's, applications must become threaded in order to increase their throughput and performance. Threading an application is not easy so a developer must not jump into the task without learning how to develop threaded code. Even seasoned developers should use as many tools as possible in order to avoid the common threading problems and to make sure the code that has been written performs adequately and scales well.
Developing Multithreaded Applications: A Platform Consistent Approach
http://cache-www.intel.com/cd/00/00/05/15/51534_developing_multithreaded_applications.pdf
Multithreading Programming Guide
http://docs.sun.com/app/docs/doc/816-5137
Dr. Dobbs: Performance Analysis and Multicore Processors : March 30, 2006
http://www.ddj.com/dept/64bit/184417069
Dr Dobbs: Multitasking Alternatives and the Perils of Preemption: Sept 14, 2006
http://www.ddj.com/dept/embedded/193000965
Data Placement in Threaded Programs: from the Intel Software Network
Threading Methodology: Principles and Practices
http://cache-www.intel.com/cd/00/00/21/93/219349_threadingmethodology.pdf
Intel Software Developer Webinar Series Fall 2007
W. Brown, R. Malveau, H. McCormick III, T. Mowbray, "AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis", John Wiley, New York(1998).
C. Smith, L.Williams, "More New Software AntiPatterns," Proceedings of 2003 Computer Measurement Group.