July 16, 2022

Some Data about Speed of Multiplication and Addition

Some Data about Speed of Multiplication and Addition

1. Running 34,359,738,367 (0x7ffffffff) times

program real time user system instruction retired cycles elapsed
mul 10.80 10.77 0.03 206,174,586,404 34,460,405,906
add 10.81 10.77 0.02 206,174,961,103 34,452,388,733
add 10.80 10.76 0.03 206,174,196,589 34,463,690,145
mul 10.81 10.77 0.03 206,174,331,889 34,466,289,391

2. Running 137,438,953,471 (0x1fffffffff) times

program real time user system instruction retired cycles elapsed
add 43.12 42.98 0.12 824,684,047,224 137,845,540,756
mul 43.26 43.05 0.19 824,706,262,533 138,088,583,489
mul 43.11 42.99 0.09 824,678,165,222 137,763,496,059
add 43.11 42.99 0.10 824,680,759,525 137,775,698,560

3. Code

add.s

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
.include "debug.s"
.global main
.p2align 2

main:
mov w1, #1
mov w2, #2
mov w4, w1
mov w5, w2
mov x3, #0xffff
movk x3, #0xffff, lsl #16
movk x3, #0x0007, lsl #32
movk x3, #0x0000, lsl #48
loop: subs x3, x3, #1
b.eq print
add w1, w1, w2
mov w1, w4
mov w2, w5
b loop
print: printReg 1

// return 0
mov x0, #0
ret

mul.s

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
.include "debug.s"
.global main
.p2align 2

main:
mov w1, #1
mov w2, #2
mov w4, w1
mov w5, w2
mov x3, #0xffff
movk x3, #0xffff, lsl #16
movk x3, #0x0007, lsl #32
movk x3, #0x0000, lsl #48
loop: subs x3, x3, #1
b.eq print
smull x1, w1, w2
mov w1, w4
mov w2, w5
b loop
print: printReg 1

// return 0
mov x0, #0
ret

Same points in these two programs

The only difference between these two programs is add and smull.

4. First Conclusion

These two given tables illustrate the performance difference between multiplication and addition instructions for integers arithmetic. Overall, it proves that addition instructions execute faster tha multiplication instructions for integers.

Firstly take a look at the first table. This table sorted those records by time and these two programs compiled just once. Time usage of these executions are very near. The total time is about 10.80s and system time is 0.03s approximately. However, add can be completed by 10.76s at least, which is 0.01s faster than mul can. As for instruction retired, mul has 206,174,586,404 retired instructions, which is above 400,000 less than add has. If they execute again, they both can get about 400,000 retired instructions decrease. They both have similar cycle elapsed in their two executions. mul needs more cycles than add in order to complete its task.

Regarding the second tables, time, instruction retired and cycle elapsed all have a very huge growth because the loop round number is much larger. The total time in four recordings is about 43s and system time is 0.10s approximately. In first round(i.e. the first two records), add needs 42.98s to run, which is 0.07s less than mul needs. With the optimisation of system, they can have the same performance, 42.99s. They need 840 billions retired instruction and about 137 cycles to finish the task.

All in all, with the improvement made by processor, mul and add can have a similar performance. But at first time, add performs better without any optimisation. Thus we can get the conclusion that addition instructions execute faster than multiplications.

5. Code Improve

I learn that C programming language provides efficient library to measure the clock and time used in the program. Therefore I update the code in order to get rid of the time command. All in all, using in-code method to measure the clock and time used is more accuracy and elegant than using time command provided by UNIX. Here are the two updated codes.

add.s

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
.include "debug.s"
.global main
.p2align 2

main:
mov w1, #1
mov w2, #2
mov w4, w1
mov w5, w2
mov x3, #0xffff
movk x3, #0xffff, lsl #16
movk x3, #0x0007, lsl #32
movk x3, #0x0000, lsl #48

// start clock
stp fp, lr, [sp, #-16]!
bl _clock
ldp fp, lr, [sp], #16
mov x8, x0
str x8, [sp, #-16]!

// do add for x3-1 times
loop: subs x3, x3, #1
b.eq after
add w1, w1, w2
mov w1, w4
mov w2, w5
b loop

// end clock
after: stp fp, lr, [sp, #-16]!
bl _clock
ldp fp, lr, [sp], #16
mov x9, x0
str x9, [sp, #8]

// calculate
mov x10, #0x4240
movk x10, #0xf, lsl #16
ldp x8, x9, [sp]
ldr x7, [sp]
scvtf d8, x8
scvtf d9, x9
scvtf d10, x10
fsub d8, d9, d8
fdiv d8, d8, d10
fmov x8, d8

// results
printStr "Add instructions speed test"
printStr "Start Clock:"
printReg 7
printStr "End Clock:"
printReg 9
printStr "Time used:"
printDouble 8
printStr ""


// return 0
mov x0, #0
ret

mul.s

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
.include "debug.s"
.global main
.p2align 2

main:
mov w1, #1
mov w2, #2
mov w4, w1
mov w5, w2
mov x3, #0xffff
movk x3, #0xffff, lsl #16
movk x3, #0x0007, lsl #32
movk x3, #0x0000, lsl #48

// start clock
stp fp, lr, [sp, #-16]!
bl _clock
ldp fp, lr, [sp], #16
mov x8, x0
str x8, [sp, #-16]!

// do add for x3-1 times
loop: subs x3, x3, #1
b.eq after
smull x1, w1, w2
mov w1, w4
mov w2, w5
b loop

// end clock
after: stp fp, lr, [sp, #-16]!
bl _clock
ldp fp, lr, [sp], #16
mov x9, x0
str x9, [sp, #8]

// calculate
mov x10, #0x4240
movk x10, #0xf, lsl #16
ldp x8, x9, [sp]
ldr x7, [sp]
scvtf d8, x8
scvtf d9, x9
scvtf d10, x10
fsub d8, d9, d8
fdiv d8, d8, d10
fmov x8, d8

// results
printStr "Mul instructions speed test"
printStr "Start Clock:"
printReg 7
printStr "End Clock:"
printReg 9
printStr "Time used:"
printDouble 8
printStr ""


// return 0
mov x0, #0
ret

About this Post

This post is written by Chen Li, licensed under CC BY-NC 4.0.

#experience