Threads - The Good, The Bad, and The Ugly

May, 2008
by Claire Cates

Related Papers
Case Study Of Table-Top Sizing With Workload-Specific Estimates Of The Multi-Processor Effect
James W. McGalliard

Case Study Of Table-Top Sizing With Workload-Specific Estimates Of CPU Cache Performance
James W. McGalliard

Fellowship - Simulation And Modeling Of A Simultaneous Multithreading Processor
Dean M. Tullsen

VTAM Hash Function Evaluation and Tuning Using Simulation Modeling
Hossein H. Ghannad

To Normalize or Denormalize, That is the Question
Marsha Hanus

A Minimalist Approach to DB2 Application Tuning
Frank J. Ingrassia

See more
Join CMG

Today, processors are getting faster, but not at the rate that they were in past years. Since processor speed is not increasing as rapidly as in the past, the primary way to increase the speed of an application is to take advantage of the multi-core and multi-processor machines that are now the norm. To accomplish this, an application must become multi-threaded. Unfortunately, threading an application can also cause problems and can even cause programs to run more slowly.

Introduction: What is a Thread?

A process does not share resources with other processes running on a system and therefore processes are completely independent. For example, a word processor and a spreadsheet running on the same computer are two processes. The two processes do not share or interact with each other. Alternatively, threads within one process share that process' memory and other resources. A thread does have its own stack and a copy of the registers including the program counter and each thread within a process executes independently of other threads.

On a multi-core or multi-processor machine, each thread (or process) can run on a different processor and therefore one can complete more work in a given amount of time. Threads can also run on a single processor machine, but only one thread will be executing at a time. When an application has more threads than processors, the operating system will context switch between the different threads to give each thread a chance to execute. Having more threads than processors is not necessarily a bad thing and in many instances can be beneficial. Furthermore, a developer may not know what class of machine the application will be executed on so an application should be written to take advantage of the increasing number of cores that are becoming available on new hardware.

Why Use Threads? (The Good)

There are four main reasons why an application should be threaded:

  1. To improve application responsiveness
  2. To separate disjoint tasks within an application
  3. To improve application throughput
  4. To improve performance via parallelism

Many related events in the world can occur asynchronously; similarly, components within a software application may also be able to execute asynchronously. An example of reason 1 might be an interactive application which processes keyboard or mouse input in one thread and executes CPU bound tasks in an independent thread. Unlike legacy systems where the keyboard locked when a user selected a task, today's systems allow continuous input while other tasks are processing. Threading this type of application can be beneficial even on a single processor machine simply because the number of context switches (when the OS must let the other thread run) will normally be low and therefore the delays will not be noticed by the user. The user gets the advantage of instantaneous interaction with the GUI without greatly hurting the performance of other work running on the machine

Even on single processor machines, applications often have disjoint tasks that are being performed. These tasks might best be handled by individual threads. One example of reason 2 (to separate disjoint tasks within an application) could be an application that is processing work but is also waiting on a socket to accept more data. One thread simply listens on the socket while the other thread performs the processing of the data. Another example of reason 2 could be an application that is I/O-intensive yet must process the data that is read. One thread could be reading data from the disk while another thread is CPU-bound and is processing the data that has been read. Therefore, even on a single processor machine, threading some applications can increase performance. On multi-core machines, the CPU-bound portion of the application could be further threaded to improve performance.

An example of reason 3 (to improve application throughput) might be a web site that supports concurrent multiple users. One user does not have to wait until the first user has finished; all users can access the website simultaneously. Each user would be executing in a separate thread so that the work that they perform does not interfere with the work that other users are performing. Again, this type of threading can be beneficial even if there are minimal cores, especially since the majority of the user's time would be in reading the web page that was downloaded and not executing in the server searching for and downloading the page (i.e. while one user is reading the page another user can be using the processor to download a page).

An example of reason 4 (to improve performance via parallelism) can best be shown by looking at how to build a wooden table.

There are four main steps to building the table.

  1. Cut the table top and sand it. Assume that this takes 15 minutes.
  2. Turn the four legs in order to shape and smooth them. Assume this takes 30 minutes for all four legs.
  3. Assemble the table. Assume this takes 10 minutes.
  4. Stain the table. Assume this takes 10 minutes.

One table will take one person 65 minutes to build. Four tables would take this same person 260 minutes to build. Now let's assume we have four people and only one set of equipment. One person can run the saw and sander, one person knows how to turn the legs, one person knows how to assemble the table and one person knows how to stain the table. We could build the four tables in less time. Adding four people to this project will not cut the time by ¼; instead the new time to complete the 4 tables would be 140 minutes. In Figure 1, you can see that, the four tables can be built faster, but three of the employees are idle over half the time. Each block represents five minutes of time. Each color represents a different table.

Figure 1: Four People Using One Set of Equipment

You could teach employee 1 to not only cut and sand the table tops but also how to assemble and stain the tables. This would not reduce the total time but it would save two salaries and therefore the tables would be cheaper.

Figure 2:  Two People with One Set of Equipment

Alternatively, if you had four sets of each tool used, and four employees, and each employee knew how to perform all aspects of building a table, you would reduce the amount of time to build four tables by ¼, but this would make the tables very expensive and would take a lot of training for each of the employees.

Figure 3: Four People, Four Sets of Tools, Each Person Performs All Jobs

Next, we might consider the longest job in the system, the turning of the legs, and see how we could speed up this process. If we hired one more employee capable of turning the legs, and we bought an extra lathe, then we could cut the time for building four tables to 80 minutes and most of the employees would be 100% busy.

Figure 4:   Five People and Two Lathes

The same reasoning can be used with threading. Common types of program parallelism are data parallelism, task parallelism and pipeline parallelism.

Data Parallelism refers to distributing the data being processed across multiple threads. Each thread is performing the same function, but on a different set of data. The final result can be achieved when all threads processing the data have completed. In the above example, the turning of the legs is the portion of the activity that was data parallelized.

Task Parallelism refers to disjoint tasks being run on multiple threads. As long as the one task does not depend on the results of the other task, the 2 tasks can be run concurrently. In the above example, the different portions of the wood working can be done at the same time.

Pipeline parallelism comes into play when one task has a dependency on the output of a previous task. Basically, this is an assembly line. Each thread is dedicated to one particular task. Thread A starts and when it finishes computing iteration 1, these results are sent to Thread B. Thread B then starts its process on iteration 1 while Thread A starts a new computation - iteration 2. This process continues until all iterations are completed. The serial portion of this code occurs in Thread A iteration 1 and in Thread B, on the last iteration. In the above example, the assembling and staining must occur after the wood has been processed, but the workers staining and assembling can be working as the wood for the next table is being processed.

You can see that in the above example, the employees are equivalent to the different threads that can execute on a machine; the tasks these employees perform are thus equivalent to the code that is executed by the thread. Adding threads to the system can speed up an application, but it will not cut the total time by the number of threads. Furthermore, if the number of processors on the machine is less than the number of threads, then threads will have to share that processor and the execution time will increase.

Parallelism Threading Examples

Many common computer science tasks can benefit from parallelization.

Example 1 (Data):
An application that must perform a sort on a large amount of data could invoke threads to each sort a portion of the data. The sorted lists can be then merged to produce the final sorted list.

Example 2 (Data):
Adding two matrices is faster with two or more threads. One thread could be adding the top portion of the matrices while the other thread is adding the lower portions of the matrices

Example 3 (Data):
Using threads to help render graphics can be advantageous. When working with images, each thread can take a portion of the image. When working with ray tracing, each thread can work on computing the ray tracing for a portion of the picture.

Example 4 (Task):
An application that has its data already in memory could be performing a record by record print in one thread while performing computations on the data in a separate thread.

Example 5 (Pipeline):
One thread could be accepting the input for a customer's order. Another thread could process the order. The order can't be processed until the customer's order has been input.

Finally you can have multi-level parallelism. For example, a customer database could have a thread that is simply reading the customer data, several threads that are currently sorting the data after it has been read and another thread that produces the reports from the final sorted data. There could be other threads accepting customer input and processing those orders and another thread that simply updates an inventory database. In this case all versions of the parallelism are incorporated into the system As you can see threading models of an application can become quite complex.

How Much Can Threading Speed Up An Application?

Unfortunately, only portions of an application can be threaded. The portions that can be threaded must represent a significant part of the application's execution time in order to improve the performance of the system.

If the execution paths are drawn, an application that is threaded will look like this.

Figure 5:  Execution Path

As shown in Figure 5, part of the application path remains serialized and runs on only one thread. Other parts of the application can be threaded, but these areas should only be threaded if they are hot spots or take up a majority of the execution time. Remember, even if you have infinite processors, these processors will not decrease the execution time of the serial portions of an application. It is the serial portion of the application that ultimately limits the performance improvements that can be obtained. The execution path shown in Figure 5 is simple; in many applications some of the threads spawned to perform a task spawn other threads. An execution path picture of a complex application can be daunting.

Other criteria that should be considered when threading an application:

  • If an application is I/O-bound, threading the code may not make the application any faster since a shared disk itself is serialized. Yet, moving the I/O to its own thread while other processing occurs in alternate threads can increase performance. Furthermore, if the data is spread across multiple disks, it may then be advantageous to use multiple I/O threads as long as each thread is working off an individual disk.
  • If an application is memory bound, then threading the code will probably only speed up the threaded parts by about 50%. The memory cache can throttle down the performance if the cache is constantly being updated.
  • An application that is CPU-intensive is the best candidate for threading.

Taking Advantage of Multi-Cores and Multiple Processors

You've probably heard of Moore's law - that the number of transistors that can be placed on a processor is doubling every two years. Computers and applications running on these machines have been counting on the increased speed from these new processors. The increase in transistors has continued over the past several decades but the size of a transistor has become so small that it is approaching atomic sizes. There is a limit to how small silicon technology can become. The slowdown in the increase in transistors has already started and to compensate for this slow down, chip manufactures are starting to make multi-core chips and they hope that future performance increases will come from applications taking advantage of these multiple cores.

Today, even low end computers are dual or quad core and on the higher end machines you will find 8, 16, 32 or more processors. Since the processor speed is not improving as fast as in the past, the only way to increase the performance of an application is for the application to rely on threads so that the application can be parallelized to take advantage of all the cores available. Once an application has been threaded, it can continue to achieve performance gains as new multi-core chips are introduced. A single-threaded application will only execute on one processor; so, a single-threaded application will only use ¼ of the processing power available on a quad core machine. Users are not going to be happy if they buy new hardware and the applications are not executing any faster and the machine processors are under-utilized.

If an application is parallelized to use all processors available, then when the application is executed on a higher end machines with more processors, the application will be able to run more users, process more data, or execute faster without changes to the application. Therefore, users can again expect performance improvements simply by buying new hardware.

The June issue will contain "Threads: The Bad", the second in this three part series on threads.


References

Developing Multithreaded Applications: A Platform Consistent Approach
http://cache-www.intel.com/cd/00/00/05/15/51534_developing_multithreaded_applications.pdf

Multithreading Programming Guide
http://docs.sun.com/app/docs/doc/816-5137

Dr. Dobbs: Performance Analysis and Multicore Processors : March 30, 2006
http://www.ddj.com/dept/64bit/184417069

Dr Dobbs: Multitasking Alternatives and the Perils of Preemption: Sept 14, 2006
http://www.ddj.com/dept/embedded/193000965

Data Placement in Threaded Programs: from the Intel Software Network

Threading Methodology: Principles and Practices
http://cache-www.intel.com/cd/00/00/21/93/219349_threadingmethodology.pdf

Intel Software Developer Webinar Series Fall 2007

W. Brown, R. Malveau, H. McCormick III, T. Mowbray, "AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis", John Wiley, New York(1998).

C. Smith, L.Williams, "More New Software AntiPatterns," Proceedings of 2003 Computer Measurement Group.