Amdahl's law can be used to calculate how much a computation can be sped up by running part of it in parallel. Amdahl's law is named after Gene Amdahl, who presented the law in 1967. Most developers working with parallel or concurrent systems have an intuitive feel for potential speedup, even without knowing Amdahl's law. Regardless, Amdahl's law may still be useful to know.
I will first explain Amdahl's law mathematically, and then proceed to illustrate Amdahl's law using diagrams.
A program (or algorithm) which can be parallelized can be split up into two parts:
A part which cannot be parallelized
A part which can be parallelized
Imagine a program that processes files from disk. A small part of that program may scan the directory and create a list of files internally in memory. After that, each file is passed to a separate thread for processing. The part that scans the directory and creates the file list cannot be parallelized, but processing the files can.
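The file-processing program described above could be sketched as follows. This is only an illustration under assumptions: the function names `list_files`, `count_bytes` and `process_directory` are hypothetical, and counting bytes stands in for whatever per-file work the real program would do.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def list_files(directory):
    """Serial part: scan the directory and build the file list in memory."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


def count_bytes(path):
    """Parallelizable part: hypothetical per-file work (here, counting bytes)."""
    with open(path, "rb") as handle:
        return len(handle.read())


def process_directory(directory, workers=4):
    """Scan once serially, then hand each file to a worker thread."""
    files = list_files(directory)                    # non-parallelizable part
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = list(pool.map(count_bytes, files))   # parallelizable part
    return dict(zip(files, sizes))


if __name__ == "__main__":
    # Build a throwaway directory so the sketch can run anywhere.
    with tempfile.TemporaryDirectory() as directory:
        for index in range(3):
            with open(os.path.join(directory, f"file{index}.txt"), "wb") as out:
                out.write(b"x" * (index + 1))
        for path, size in process_directory(directory).items():
            print(os.path.basename(path), size)
```

The directory scan runs exactly once no matter how many workers are used, which is why it plays the role of the non-parallelizable part in the discussion that follows.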
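The total time taken to execute the program in serial (not in parallel) is called T. The time T includes the time of both the non-parallelizable and parallelizable parts. The non-parallelizable part is called B. The parallelizable part is referred to as T - B. The following list sums up these definitions:
T = Total time of serial execution
B = Total time of the non-parallelizable part
T - B = Total time of the parallelizable part (when executed serially, not in parallel)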
From this follows that:
T = B + (T-B)
It may look a bit strange at first that the parallelizable part of the program does not have its own symbol in the equation. However, since the parallelizable part of the program can be expressed using the total time T and B (the non-parallelizable part), the equation has actually been reduced conceptually, meaning that it contains fewer distinct variables in this form.
It is the parallelizable part, T - B, that can be sped up by executing it in parallel. How much it can be sped up depends on how many threads or CPUs you apply to execute it. The number of threads or CPUs is called N. The fastest the parallelizable part can be executed is thus:
(T - B) / N
Another way to write this is:
(1/N) * (T - B)
Wikipedia uses this version in case you read about Amdahl's law there.
According to Amdahl's law, the total execution time of the program when the parallelizable part is executed using N threads or CPUs is thus:
T(N) = B + (T - B) / N
T(N) means the total execution time with a parallelization factor of N. Thus, T could be written T(1), meaning the total execution time with a parallelization factor of 1. Using T(1) instead of T, Amdahl's law looks like this:
T(N) = B + ( T(1) - B ) / N
It still means the same though.
To better understand Amdahl's law, let's go through a calculation example. The total time to execute a program is set to 1. The non-parallelizable part of the program is 40%, which out of a total time of 1 is equal to 0.4. The parallelizable part is thus equal to 1 - 0.4 = 0.6.
The execution time of the program with a parallelization factor of 2 (2 threads or CPUs executing the parallelizable part, so N is 2) would be:
T(2) = 0.4 + ( 1 - 0.4 ) / 2
= 0.4 + 0.6 / 2
= 0.4 + 0.3
= 0.7
Making the same calculation with a parallelization factor of 5 instead of 2 would look like this:
T(5) = 0.4 + ( 1 - 0.4 ) / 5
= 0.4 + 0.6 / 5
= 0.4 + 0.12
= 0.52
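The two calculations above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `execution_time` and its parameter names are chosen here for illustration and simply mirror the symbols B, N and T from the text.

```python
def execution_time(b, n, t=1.0):
    """Amdahl's law: T(N) = B + (T - B) / N.

    b: time of the non-parallelizable part
    n: parallelization factor (number of threads or CPUs)
    t: total serial execution time, T(1)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= b <= t:
        raise ValueError("b must lie between 0 and t")
    return b + (t - b) / n


print(f"T(2) = {execution_time(0.4, 2):.2f}")  # T(2) = 0.70
print(f"T(5) = {execution_time(0.4, 5):.2f}")  # T(5) = 0.52
```

Note that however large N becomes, the result never drops below B, which is 0.4 in this example.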
To better understand Amdahl's law I will try to illustrate how the law is derived.
First of all, a program can be broken up into a non-parallelizable part B, and a parallelizable part 1-B, as illustrated by this diagram:
The line with the delimiters at the top is the total time T(1).
Here you see the execution time with a parallelization factor of 2:
Here you see the execution time with a parallelization factor of 3:
From Amdahl's law it follows naturally that the parallelizable part can be executed faster by throwing hardware at it: more threads or CPUs. The non-parallelizable part, however, can only be executed faster by optimizing the code. Thus, you can increase the speed and parallelizability of your program by optimizing the non-parallelizable part. You might even change the algorithm to have a smaller non-parallelizable part in general, by moving some of the work into the parallelizable part (if possible).
If you optimize the sequential part of a program you can also use Amdahl's law to calculate the execution time of the program after the optimization. If the non-parallelizable part B is optimized by a factor of O, then Amdahl's law looks like this:
T(O,N) = B / O + (1 - B / O) / N
Remember, the non-parallelizable part of the program now takes B / O time, so the parallelizable part takes 1 - B / O time.
If B is 0.4, O is 2 and N is 5, then the calculation looks like this:
T(2,5) = 0.4 / 2 + (1 - 0.4 / 2) / 5
= 0.2 + (1 - 0.4 / 2) / 5
= 0.2 + (1 - 0.2) / 5
= 0.2 + 0.8 / 5
= 0.2 + 0.16
= 0.36
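The same calculation can be sketched in Python. The function name `optimized_execution_time` is hypothetical; it follows the formula exactly as given in the text, including its assumption that everything other than the optimized serial part B / O counts as parallelizable.

```python
def optimized_execution_time(b, o, n):
    """T(O,N) = B / O + (1 - B / O) / N, with the original serial time set to 1.

    b: non-parallelizable fraction of the original program
    o: factor by which the non-parallelizable part is optimized
    n: parallelization factor (number of threads or CPUs)
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must be a fraction between 0 and 1")
    if o <= 0 or n < 1:
        raise ValueError("o must be positive and n at least 1")
    serial = b / o                 # optimized non-parallelizable part
    parallel = (1.0 - serial) / n  # remainder, spread over n threads or CPUs
    return serial + parallel


print(f"T(2,5) = {optimized_execution_time(0.4, 2, 5):.2f}")  # T(2,5) = 0.36
```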
So far we have only used Amdahl's law to calculate the execution time of a program or algorithm after optimization or parallelization. We can also use Amdahl's law to calculate the speedup, meaning how much faster the new algorithm or program is than the old version.
If the time of the old version of the program or algorithm is T, then the speedup will be
Speedup = T / T(O,N)
We often set T to 1 just to calculate the execution time and speedup as a fraction of the old time. The equation then looks like this:
Speedup = 1 / T(O,N)
If we insert the Amdahl's law calculation instead of T(O,N), we get the following formula:
Speedup = 1 / ( B / O + (1 - B / O) / N )
With B = 0.4, O = 2 and N = 5, the calculation becomes:
Speedup = 1 / ( 0.4 / 2 + (1 - 0.4 / 2) / 5)
= 1 / ( 0.2 + (1 - 0.4 / 2) / 5)
= 1 / ( 0.2 + (1 - 0.2) / 5 )
= 1 / ( 0.2 + 0.8 / 5 )
= 1 / ( 0.2 + 0.16 )
= 1 / 0.36
= 2.77777 ...
That means that if you optimize the non-parallelizable (sequential) part by a factor of 2, and parallelize the parallelizable part by a factor of 5, the new optimized version of the program or algorithm would run a maximum of 2.77777 times faster than the old version.
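The speedup formula is easy to explore numerically. The sketch below uses a hypothetical function `speedup` that implements the formula from the text with the old execution time set to 1. It also prints the value the formula approaches as N grows, which is O / B because the parallel term vanishes.

```python
def speedup(b, o, n):
    """Speedup = 1 / (B / O + (1 - B / O) / N), with the old time set to 1."""
    if not 0.0 < b <= 1.0:
        raise ValueError("b must be a fraction in (0, 1]")
    if o <= 0 or n < 1:
        raise ValueError("o must be positive and n at least 1")
    serial = b / o
    return 1.0 / (serial + (1.0 - serial) / n)


print(f"Speedup for B=0.4, O=2, N=5: {speedup(0.4, 2, 5):.2f}")  # 2.78

# As N grows, the speedup approaches O / B, which is 5 in this example.
for n in (1, 2, 5, 10, 100, 1000):
    print(f"N = {n:>4}: speedup = {speedup(0.4, 2, n):.3f}")
```

Even with 1000 threads the result stays below 5, because the optimized sequential part still takes 0.2 of the original time.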
While Amdahl's law enables you to calculate the theoretic speedup of parallelization of an algorithm, don't rely too heavily on such calculations. In practice, many other factors may come into play when you optimize or parallelize an algorithm.
The speed of memory, CPU cache memory, disks, network cards etc. (if disk or network are used) may be a limiting factor too. If a new version of the algorithm is parallelized, but leads to a lot more CPU cache misses, you may not even get the desired N-fold speedup from using N CPUs. The same is true if you end up saturating the memory bus, disk, network card or network connection.
My recommendation would be to use Amdahl's law to get an idea about where to optimize, but to use measurements to determine the real speedup of the optimization. Remember, sometimes a highly serialized sequential (single CPU) algorithm may outperform a parallel algorithm, simply because the sequential version has no coordination overhead (breaking down work and building the total again), and because a single CPU algorithm may conform better with how the underlying hardware works (CPU pipelines, CPU cache etc.).
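Measuring the real speedup can be as simple as timing the serial and the parallel version of the same work. The following is a minimal sketch under assumptions: `fake_task` is a hypothetical stand-in that just sleeps, imitating I/O-bound work, and the helper names are made up for this example. With CPU-bound work in standard CPython, threads would typically not give a speedup like this, so a real measurement should use the actual workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_task(_):
    """Hypothetical unit of work; sleeping imitates waiting on disk or network."""
    time.sleep(0.02)


def timed(fn):
    """Return the wall-clock seconds taken by calling fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_serial(n_tasks):
    for index in range(n_tasks):
        fake_task(index)


def run_parallel(n_tasks, workers):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_task, range(n_tasks)))


def measured_speedup(n_tasks=8, workers=4):
    """Measured speedup = serial time / parallel time for the same tasks."""
    t_serial = timed(lambda: run_serial(n_tasks))
    t_parallel = timed(lambda: run_parallel(n_tasks, workers))
    return t_serial / t_parallel


if __name__ == "__main__":
    print(f"measured speedup with 4 workers: {measured_speedup():.2f}")
```

The measured number will usually come out a little below the ideal factor because creating threads and handing out tasks takes time, which is the coordination overhead mentioned above.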
In computer architecture, Amdahl's law (or Amdahl's argument[1]) is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. Specifically, it states that "the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used".[2] It is named after computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967.
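If the non-parallelizable part is optimized by a factor of O, then, with B, N and T = 1 as defined above,
T(O,N) = B / O + (1 - B) / N
It follows from Amdahl's law that the speedup due to parallelism is given by
S(O,N) = T(O,1) / T(O,N) = ( B / O + (1 - B) ) / ( B / O + (1 - B) / N )
Next, we consider the case wherein the non-parallelizable part is reduced by a factor of O', and the parallelizable part is correspondingly increased. Then
T'(O',N) = B / O' + (1 - B / O') / N
It follows from Amdahl's law that the speedup due to parallelism is given by
S'(O',N) = T'(O',1) / T'(O',N) = 1 / ( B / O' + (1 - B / O') / N )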
The derivation above is in agreement with Jakob Jenkov's analysis of the execution time vs. speedup tradeoff.
Amdahl's law is often conflated with the law of diminishing returns, whereas only a special case of applying Amdahl's law demonstrates the law of diminishing returns. If one picks optimally (in terms of the achieved speedup) what is to be improved, then one will see monotonically decreasing improvements as one improves. If, however, one picks non-optimally, after improving a sub-optimal component and moving on to improve a more optimal component, one can see an increase in the return. Note that it is often rational to improve a system in an order that is "non-optimal" in this sense, given that some improvements are more difficult or require larger development time than others.
Amdahl's law does represent the law of diminishing returns if one considers what sort of return one gets by adding more processors to a machine that is running a fixed-size computation which will use all available processors to their capacity. Each new processor added to the system will add less usable power than the previous one. Each time one doubles the number of processors, the speedup ratio will diminish, as the total throughput heads toward the limit of 1/(1 − p), where p is the fraction of the execution time that can be parallelized.
This analysis neglects other potential bottlenecks such as memory bandwidth and I/O bandwidth. If these resources do not scale with the number of processors, then merely adding processors provides even lower returns.
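The diminishing return from doubling the processor count can be checked numerically. The sketch below assumes the standard form of Amdahl's law with a parallelizable fraction p and n processors. The names `amdahl_speedup` and `doubling_table` are made up for this illustration, and p = 0.95 is just an example value.

```python
def amdahl_speedup(p, n):
    """Speedup with parallelizable fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)


def doubling_table(p, max_doublings=10):
    """Rows of (processors, speedup, speedup gained relative to half as many)."""
    rows = []
    n = 1
    previous = amdahl_speedup(p, n)
    for _ in range(max_doublings):
        n *= 2
        current = amdahl_speedup(p, n)
        rows.append((n, current, current / previous))
        previous = current
    return rows


if __name__ == "__main__":
    p = 0.95
    print(f"limit 1/(1 - p) = {1.0 / (1.0 - p):.1f}")
    for n, s, ratio in doubling_table(p):
        print(f"n = {n:>4}  speedup = {s:6.2f}  ratio to previous = {ratio:.3f}")
```

The last column shrinks toward 1 with every doubling, while the speedup itself creeps toward the limit without reaching it.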
An implication of Amdahl's law is that to speed up real applications which have both serial and parallel portions, heterogeneous computing techniques are required.
Amdahl's law is often used in parallel computing to predict the theoretical speedup when using multiple processors. For example, suppose a program needs 20 hours to complete using a single thread, and a one-hour portion of the program cannot be parallelized, so that only the remaining 19 hours (p = 0.95) of execution time can be parallelized. Then, regardless of how many threads are devoted to a parallelized execution of this program, the execution time cannot be less than one hour. Hence, the theoretical speedup is limited to at most 20 times the single-thread performance:
1 / (1 - p) = 1 / (1 - 0.95) = 20
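The 20-hour example can be worked through in code. This sketch assumes the parallelizable 19 hours divide evenly across the threads with no overhead, and the function names are invented for the illustration.

```python
SERIAL_HOURS = 1.0     # portion that cannot be parallelized
PARALLEL_HOURS = 19.0  # portion that can be parallelized (p = 0.95)


def execution_hours(threads):
    """Idealized run time when the parallel portion is split across threads."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return SERIAL_HOURS + PARALLEL_HOURS / threads


def speedup_over_single_thread(threads):
    """How many times faster the run is than with a single thread."""
    return execution_hours(1) / execution_hours(threads)


for threads in (1, 2, 19, 100, 10000):
    print(
        f"{threads:>5} threads: {execution_hours(threads):6.3f} h, "
        f"speedup {speedup_over_single_thread(threads):5.2f}"
    )
```

With 19 threads the run takes 2 hours, a speedup of 10. Going from 19 to 10000 threads only brings the run time from 2 hours down to just over 1 hour, so the speedup never reaches 20.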
Amdahl's law can be formulated in the following way:
S_latency(s) = 1 / ( (1 - p) + p / s )
where
S_latency is the theoretical speedup of the execution of the whole task;
s is the speedup of the part of the task that benefits from improved system resources;
p is the proportion of execution time that the part benefiting from improved resources originally occupied.
Furthermore,
S_latency(s) ≤ 1 / (1 - p)
lim (s → ∞) S_latency(s) = 1 / (1 - p)
shows that the theoretical speedup of the execution of the whole task increases with the improvement of the resources of the system and that, regardless of the magnitude of the improvement, the theoretical speedup is always limited by the part of the task that cannot benefit from the improvement.
Amdahl's law applies only to the cases where the problem size is fixed. In practice, as more computing resources become available, they tend to get used on larger problems (larger datasets), and the time spent in the parallelizable part often grows much faster than the inherently serial work. In this case, Gustafson's law gives a less pessimistic and more realistic assessment of the parallel performance.
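To see why Gustafson's law is described as less pessimistic, the two formulas can be put side by side. This is a rough sketch that uses the commonly stated form of Gustafson's law, S = (1 - p) + p * n, which is not spelled out in the text above. Note the assumption that p means slightly different things in the two laws: for Amdahl it is the parallelizable fraction of the single-processor run, for Gustafson it is the fraction of the parallel run spent in parallelizable work.

```python
def amdahl_speedup(p, n):
    """Fixed problem size: speedup is capped at 1 / (1 - p)."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(p, n):
    """Problem size grows with n: scaled speedup (1 - p) + p * n."""
    return (1.0 - p) + p * n


p = 0.95
for n in (1, 10, 100, 1000):
    print(
        f"n = {n:>4}  Amdahl = {amdahl_speedup(p, n):6.2f}  "
        f"Gustafson = {gustafson_speedup(p, n):7.2f}"
    )
```

Under the fixed-size assumption the speedup levels off below 20, while the scaled speedup keeps growing with n because the parallelizable work grows along with the machine.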
A task executed by a system whose resources are improved compared to an initial similar system can be split up into two parts:
a part that does not benefit from the improvement of the resources of the system;
a part that benefits from the improvement of the resources of the system.
An example is a computer program that processes files. A part of that program may scan the directory of the disk and create a list of files internally in memory. After that, another part of the program passes each file to a separate thread for processing. The part that scans the directory and creates the file list cannot be sped up on a parallel computer, but the part that processes the files can.
The execution time of the whole task before the improvement of the resources of the system is denoted as T. It includes the execution time of the part that would not benefit from the improvement of the resources and the execution time of the one that would benefit from it. The fraction of the execution time of the task that would benefit from the improvement of the resources is denoted by p. The one concerning the part that would not benefit from it is therefore 1 - p. Then:
T = (1 - p) T + p T
It is the execution of the part that benefits from the improvement of the resources that is accelerated by the factor s after the improvement of the resources. Consequently, the execution time of the part that does not benefit from it remains the same, while the part that benefits from it becomes:
(p / s) T
The theoretical execution time T(s) of the whole task after the improvement of the resources is then:
T(s) = (1 - p) T + (p / s) T
Amdahl's law gives the theoretical speedup in latency of the execution of the whole task at fixed workload W, which yields
S_latency(s) = T / T(s) = 1 / ( (1 - p) + p / s )
If 30% of the execution time may be the subject of a speedup, p will be 0.3; if the improvement makes the affected part twice as fast, s will be 2. Amdahl's law states that the overall speedup of applying the improvement will be:
S_latency = 1 / ( (1 - p) + p / s ) = 1 / ( 1 - 0.3 + 0.3 / 2 ) ≈ 1.18
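The example with p = 0.3 and s = 2 can be checked with a short function. This is a minimal sketch; the name `latency_speedup` is made up here and simply implements the general formula in terms of p and s.

```python
def latency_speedup(p, s):
    """Overall speedup when a fraction p of the time is made s times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


print(f"{latency_speedup(0.3, 2):.2f}")  # 1.18
```

Doubling the speed of 30% of the work therefore makes the whole task only about 18% faster in terms of speedup factor.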
For example, assume that we are given a serial task which is split into four consecutive parts, whose fractions of execution time are p1 = 0.11, p2 = 0.18, p3 = 0.23, and p4 = 0.48 respectively. Then we are told that the 1st part is not sped up, so s1 = 1, while the 2nd part is sped up 5 times, so s2 = 5, the 3rd part is sped up 20 times, so s3 = 20, and the 4th part is sped up 1.6 times, so s4 = 1.6. By using Amdahl's law, the overall speedup is
S_latency = 1 / ( p1 / s1 + p2 / s2 + p3 / s3 + p4 / s4 ) = 1 / ( 0.11 / 1 + 0.18 / 5 + 0.23 / 20 + 0.48 / 1.6 ) ≈ 2.19
Notice how the 5 times and 20 times speedup on the 2nd and 3rd parts respectively don't have much effect on the overall speedup when the 4th part (48% of the execution time) is accelerated by only 1.6 times.
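The four-part example generalizes to any number of consecutive parts: the new time is the sum of each part's fraction divided by its own speedup. The sketch below assumes the fractions add up to 1, and the function name `multi_part_speedup` is hypothetical.

```python
import math


def multi_part_speedup(fractions, speedups):
    """Overall speedup = 1 / sum(p_i / s_i) for consecutive serial parts."""
    if len(fractions) != len(speedups):
        raise ValueError("need one speedup per part")
    if not math.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must add up to 1")
    new_time = sum(p / s for p, s in zip(fractions, speedups))
    return 1.0 / new_time


fractions = [0.11, 0.18, 0.23, 0.48]
speedups = [1, 5, 20, 1.6]
print(f"{multi_part_speedup(fractions, speedups):.2f}")  # 2.19

# Time left in each part after the improvement, as a share of the old total.
for p, s in zip(fractions, speedups):
    print(f"p = {p:.2f}, s = {s:>4}: remaining time {p / s:.4f}")
```

The per-part breakdown shows that the 4th part still accounts for 0.3 of the original time after its modest speedup, which is why it dominates the result.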
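For example, with a serial program in two parts A and B for which TA = 3 s and TB = 1 s,
if part B is made to run 5 times faster, that is s = 5 and p = TB / (TA + TB) = 0.25, then
S_latency = 1 / ( 1 - 0.25 + 0.25 / 5 ) = 1.25
if part A is made to run 2 times faster, that is s = 2 and p = TA / (TA + TB) = 0.75, then
S_latency = 1 / ( 1 - 0.75 + 0.75 / 2 ) = 1.60
Therefore, making part A run 2 times faster is better than making part B run 5 times faster. The percentage improvement in speed can be calculated as
percentage improvement = 100 * (1 - 1 / S_latency)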
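The comparison between the two parts can be sketched in Python as follows. The helper names are made up for this example, and the percentage improvement is taken to be the share of the original run time that is saved, 100 * (1 - 1 / speedup), which is an assumption about the intended formula.

```python
def latency_speedup(p, s):
    """Overall speedup when a fraction p of the time is made s times faster."""
    return 1.0 / ((1.0 - p) + p / s)


def percentage_improvement(speedup):
    """Share of the original run time saved, in percent."""
    return 100.0 * (1.0 - 1.0 / speedup)


T_A = 3.0  # seconds spent in part A
T_B = 1.0  # seconds spent in part B
total = T_A + T_B

speedup_a = latency_speedup(T_A / total, 2)  # part A made 2 times faster
speedup_b = latency_speedup(T_B / total, 5)  # part B made 5 times faster

print(f"A x2: speedup {speedup_a:.2f}, {percentage_improvement(speedup_a):.1f}% faster")
print(f"B x5: speedup {speedup_b:.2f}, {percentage_improvement(speedup_b):.1f}% faster")
```

Under these assumptions, doubling the speed of part A saves 37.5% of the run time, while making part B five times faster saves only 20%, because part A accounts for three quarters of the original time.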