Some Data about Speed of Multiplication and Addition
1. Running 34,359,738,367 (0x7ffffffff) times
program | real time (s) | user (s) | system (s) | instructions retired | cycles elapsed |
---|---|---|---|---|---|
mul | 10.80 | 10.77 | 0.03 | 206,174,586,404 | 34,460,405,906 |
add | 10.81 | 10.77 | 0.02 | 206,174,961,103 | 34,452,388,733 |
add | 10.80 | 10.76 | 0.03 | 206,174,196,589 | 34,463,690,145 |
mul | 10.81 | 10.77 | 0.03 | 206,174,331,889 | 34,466,289,391 |
2. Running 137,438,953,471 (0x1fffffffff) times
program | real time (s) | user (s) | system (s) | instructions retired | cycles elapsed |
---|---|---|---|---|---|
add | 43.12 | 42.98 | 0.12 | 824,684,047,224 | 137,845,540,756 |
mul | 43.26 | 43.05 | 0.19 | 824,706,262,533 | 138,088,583,489 |
mul | 43.11 | 42.99 | 0.09 | 824,678,165,222 | 137,763,496,059 |
add | 43.11 | 42.99 | 0.10 | 824,680,759,525 | 137,775,698,560 |
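As a rough cross-check on both tables, the totals can be divided by the loop count to get per-iteration figures. The sketch below does this in C; the helper names per_iteration and report are made up for illustration, and the inputs are meant to be copied from the table rows above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: average count per loop iteration. */
static double per_iteration(double total, double iterations) {
    assert(iterations > 0.0);
    return total / iterations;
}

/* Prints instructions and cycles per iteration for one table row. */
static void report(const char *label, double iterations,
                   double instructions, double cycles) {
    printf("%s: %.4f instructions/iteration, %.4f cycles/iteration\n",
           label,
           per_iteration(instructions, iterations),
           per_iteration(cycles, iterations));
}
```

For example, report("mul", 34359738367.0, 206174586404.0, 34460405906.0) works out to roughly 6 instructions and just over 1 cycle per iteration, and the rows of the second table give nearly the same per-iteration figures.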
3. Code
add.s
.include "debug.s"
mul.s
.include "debug.s"
What the two programs have in common
- Same registers
- w1, w2: operand registers;
- x1: destination register;
- x3: index register;
- x0: kernel parameter;
- Same statements
- w1 = 1, w2 = 2;
- w4 = w1, w5 = w2;
- same loop logic;
- w1 and w2 are reset to their initial values;
- same printReg:
- store x0-x18 and lr (general registers) to stack;
- put the parameters from x0 to x3;
- put x1-x3 to stack;
- invoke _printf;
- load x0-x18 and lr (general registers) from stack;
- return 0;
The only difference between the two programs is the instruction under test: add in add.s and smull in mul.s.
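Since the listings are not reproduced in full here, the following C sketch only mirrors the loop logic described in the list above. It is a reconstruction under assumptions, not the author's assembly: the function names run_add and run_mul are hypothetical, the printReg call is left out, and volatile is used only to discourage the C compiler from folding the loop away. The widening multiply corresponds to what smull does (signed 32-bit by 32-bit into a 64-bit result).

```c
#include <stdint.h>

/* Sketch of the add loop: operands start at 1 and 2, are saved,
   and are restored after every operation. */
static int64_t run_add(uint64_t iterations) {
    volatile int32_t a = 1, b = 2;        /* w1, w2: operands */
    const int32_t a0 = a, b0 = b;         /* w4, w5: saved initial values */
    volatile int64_t dest = 0;            /* x1: destination */
    for (uint64_t i = 0; i < iterations; i++) {   /* x3: loop index */
        dest = (int64_t)a + (int64_t)b;   /* add */
        a = a0;                           /* reset operands */
        b = b0;
    }
    return dest;
}

/* Same loop with a widening signed multiply in place of the add. */
static int64_t run_mul(uint64_t iterations) {
    volatile int32_t a = 1, b = 2;
    const int32_t a0 = a, b0 = b;
    volatile int64_t dest = 0;
    for (uint64_t i = 0; i < iterations; i++) {
        dest = (int64_t)a * (int64_t)b;   /* smull */
        a = a0;
        b = b0;
    }
    return dest;
}
```

With any non-zero iteration count, run_add returns 3 and run_mul returns 2. A compiled C loop will not necessarily produce the same instruction sequence as hand-written assembly, which is one reason to measure the assembly directly.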
4. First Conclusion
These two tables illustrate the performance difference between multiplication and addition instructions for integer arithmetic. Overall, they suggest that addition instructions execute slightly faster than multiplication instructions for integers.
First, consider the first table. Its records are listed in the order the runs were made, and both programs were compiled only once. The timings of all four runs are very close: real time is about 10.80s and system time is approximately 0.03s. However, add finishes in as little as 10.76s of user time, which is 0.01s faster than mul. As for instructions retired, mul retires 206,174,586,404 instructions in its first run, about 375,000 fewer than add does in its first run. When executed again, both retire fewer instructions: about 255,000 fewer for mul and about 765,000 fewer for add. Each program's cycle count is similar across its two executions, and in both rounds mul needs more cycles than add to complete its task.
Turning to the second table, time, instructions retired and cycles elapsed all grow roughly fourfold because the loop count is four times larger. Real time in all four records is about 43s and system time is between 0.09s and 0.19s. In the first round (i.e. the first two records), add needs 42.98s of user time, which is 0.07s less than mul. With the optimisation of the system, they reach the same performance in the second round, 42.99s. Each run retires about 824 billion instructions and takes about 138 billion cycles to finish the task.
All in all, once the processor's optimisations take effect, mul and add reach similar performance. In the first round, however, add performs better, before any such optimisation. Thus we can conclude that addition instructions execute faster than multiplication instructions.
5. Code Improvement
I learned that the C programming language provides an efficient library for measuring the clock and time used by a program. Therefore I updated the code to get rid of the time command. Measuring the clock and time inside the code is more accurate and elegant than using the time command provided by UNIX. Here are the two updated programs.
add.s
.include "debug.s"
mul.s
.include "debug.s"
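For readers who want to see what in-code timing might look like, here is a minimal C sketch using clock() from <time.h>. It is an assumption that this is the kind of library call meant in this section; the updated assembly may call a different routine. The names cpu_seconds and busy_loop are hypothetical, and busy_loop merely stands in for the add or mul loop.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical workload standing in for the add or mul loop. */
static void busy_loop(uint64_t iterations) {
    volatile uint64_t sink = 0;
    for (uint64_t i = 0; i < iterations; i++) {
        sink = sink + 1;
    }
}

/* Returns the processor time, in seconds, spent inside work(iterations),
   or -1.0 if the clock is unavailable. */
static double cpu_seconds(void (*work)(uint64_t), uint64_t iterations) {
    clock_t start = clock();
    if (start == (clock_t)-1) {
        return -1.0;
    }
    work(iterations);
    clock_t end = clock();
    if (end == (clock_t)-1) {
        return -1.0;
    }
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A call such as cpu_seconds(busy_loop, 1000000000ULL) times only the loop itself, so process start-up and exit are excluded from the measurement. Note that clock() reports processor time, which is closer to the user and system columns of the time command than to its real time.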
About this Post
This post is written by Chen Li, licensed under CC BY-NC 4.0.