Horner's Method
Suppose you want to calculate the value of the degree-3 polynomial ax^3 + bx^2 + cx + d for a particular x.
The really inefficient way is to compute ax^3, then bx^2, then cx, then d, and then add all the intermediate values together. You do a lot of multiplications and need a bunch of space to write down all your temporary values. Even if you compute x^3 using the value of x^2 to avoid redundant work, you still have to do five multiplications: x*x, x^2 * x, a*x^3, b*x^2, and c*x.
A much better way is Horner's method, which notes that the polynomial above is equivalent to d + x*(c + x*(b + x*a)). Now we don't need to store more than one intermediate result, and we do only three multiplications total.
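As a sketch, here's Horner's method for a polynomial of arbitrary degree in C (the horner function and its coefficient layout are my own illustration, not something from the original post):

```c
#include <stdio.h>

/* Evaluate a polynomial by Horner's method.
 * coeffs[0] is the highest-degree coefficient, coeffs[n] is the constant
 * term, so n is the degree. One multiply and one add per coefficient. */
long horner(const long *coeffs, int n, long x) {
    long result = coeffs[0];
    for (int i = 1; i <= n; i++)
        result = result * x + coeffs[i];
    return result;
}
```

For example, evaluating 2x^3 - x^2 + 3x + 5 at x = 4 would be `horner((long[]){2, -1, 3, 5}, 3, 4)`, doing three multiplications instead of five.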
Compiler Implementation
Using the Compiler Explorer at https://godbolt.org, we can write code that calculates the polynomial value and see what the compiler does with it. Let's write the code in the naive way:
// compute ax^3 + bx^2 + cx + d
int poly(int a, int b, int c, int d, int x) {
    return a * x * x * x + b * x * x + c * x + d;
}
and gcc 8.1 with -O2 or -O3, on the x86-64 architecture, transforms it into Horner's method instead:
poly(int, int, int, int, int):
imul edi, r8d
add esi, edi
imul esi, r8d
lea eax, [rdx+rsi]
imul eax, r8d
add eax, ecx
ret
The assembly here uses Intel syntax, which puts the destination on the left, so 'add esi, edi' means 'add edi to the value in esi, and put the result in esi'.
In the System V AMD64 ABI calling convention (used by Linux and other Unix-like OSes, including macOS), the arguments are passed in the registers rdi (a), rsi (b), rdx (c), rcx (d), and r8 (x). This code uses the 32-bit versions of those registers, since the arguments are ints. So the first line is equivalent to 'a = a*x' (overwriting 'a', which we don't need any more). Then we add that to 'b', multiply by 'x' again, add 'c', multiply by 'x' again, and finally add 'd'. The return value goes in the rax/eax register.
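Written back out as C, with one statement per instruction, the sequence looks like this (poly_steps is a hypothetical name, used only to mirror the assembly):

```c
/* The same sequence of operations as the gcc output above,
 * one C statement per instruction: Horner's method from the inside out. */
int poly_steps(int a, int b, int c, int d, int x) {
    int t = a * x;   /* imul edi, r8d        : a*x            */
    t = b + t;       /* add  esi, edi        : b + a*x        */
    t = t * x;       /* imul esi, r8d        : (b + a*x)*x    */
    t = c + t;       /* lea  eax, [rdx+rsi]  : c + ...        */
    t = t * x;       /* imul eax, r8d        : (c + ...)*x    */
    t = t + d;       /* add  eax, ecx        : d + ...        */
    return t;        /* result is in eax     */
}
```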
The only less-than-straightforward choice on the compiler's part is the lea instruction, which calculates an address (by adding its arguments) rather than doing a straightforward add. One reason a compiler may choose lea is that address calculation uses a separate functional unit from ordinary addition (see https://stackoverflow.com/questions/6323027/lea-or-add-instruction). In this context that may not make much difference, since we need the result right away, so it can't execute in parallel. A more compelling reason is that on x86 the add instruction takes only two arguments, while lea can add two registers and put the result in a third. Since we need to get the result into eax somehow to return it, choosing lea gives the compiler the opportunity to avoid a move later.
Other examples
PowerPC architecture
gcc 6.3.0 on PowerPC implements the same optimization using just mullw and add instructions:
poly(int, int, int, int, int):
mullw 3,3,7
add 3,3,4
mullw 3,3,7
add 3,3,5
mullw 3,3,7
add 3,3,6
extsw 3,3
blr
.long 0
.byte 0,9,0,0,0,0,0,0
ARM architecture
The ARM64 architecture has a combined multiply-and-add instruction, so it's even more straightforward with MSVC on ARM:
|poly| PROC
madd w8,w0,w4,w1
madd w9,w8,w4,w2
madd w0,w9,w4,w3
ret
ENDP
Swift
What about Swift? Here's the equivalent function:
func square(a: Int, b: Int, c: Int, d: Int, x: Int) -> Int {
    return a * x * x * x + b * x * x + c * x + d
}
and here's the compiled output from swiftc 4.1.2 on x86-64, with the -O flag:
output.square(a: Swift.Int, b: Swift.Int, c: Swift.Int, d: Swift.Int, x: Swift.Int) -> Swift.Int:
push rbp
mov rbp, rsp
imul rdi, r8
jo .LBB1_10
imul rdi, r8
jo .LBB1_11
imul rdi, r8
jo .LBB1_12
imul rsi, r8
jo .LBB1_13
imul rsi, r8
jo .LBB1_14
add rdi, rsi
jo .LBB1_15
imul rdx, r8
jo .LBB1_16
add rdi, rdx
jo .LBB1_17
add rdi, rcx
jo .LBB1_18
mov rax, rdi
pop rbp
ret
.LBB1_10:
ud2
.LBB1_11:
ud2
.LBB1_12:
ud2
.LBB1_13:
ud2
.LBB1_14:
ud2
.LBB1_15:
ud2
.LBB1_16:
ud2
.LBB1_17:
ud2
.LBB1_18:
ud2
Just counting the number of multiplications tells us that the Swift compiler did not implement Horner's method, and the presence of all those overflow checks might be a reason why. In Swift, multiplication overflow is a runtime error; that is, if you multiply two 64-bit numbers whose product does not fit into 64 bits, the program will terminate. In C or C++ it'll just keep going silently. So each native add or multiply has to be checked, and (probably) the compiler is not allowed to reorder things so that the overflow happens in a different place or does not happen at all.
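For comparison, here's roughly what that checked arithmetic looks like written out by hand in C, using the GCC/Clang __builtin_mul_overflow and __builtin_add_overflow intrinsics (the helper names are mine, and this is a sketch of the semantics, not the code the Swift compiler actually emits):

```c
#include <stdint.h>
#include <stdlib.h>

/* Trap-on-overflow multiply, roughly mirroring Swift's semantics:
 * abort() plays the role of Swift's runtime trap (the ud2 above). */
static int64_t checked_mul(int64_t a, int64_t b) {
    int64_t r;
    if (__builtin_mul_overflow(a, b, &r))
        abort();
    return r;
}

static int64_t checked_add(int64_t a, int64_t b) {
    int64_t r;
    if (__builtin_add_overflow(a, b, &r))
        abort();
    return r;
}

/* a*x^3 + b*x^2 + c*x + d with every operation checked, term by term.
 * Because each intermediate product must trap at exactly this point,
 * the compiler can't freely rewrite it into Horner form. */
int64_t poly_checked(int64_t a, int64_t b, int64_t c, int64_t d, int64_t x) {
    int64_t ax3 = checked_mul(checked_mul(checked_mul(a, x), x), x);
    int64_t bx2 = checked_mul(checked_mul(b, x), x);
    int64_t cx  = checked_mul(c, x);
    return checked_add(checked_add(checked_add(ax3, bx2), cx), d);
}
```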
Rust
Rust version of the function:
pub fn poly(a: i64, b: i64, c: i64, d: i64, x: i64) -> i64 {
    a * x * x * x + b * x * x + c * x + d
}
rustc 1.26.0 with the -O flag performs Horner's method as well, though for some reason the compiler included a stack prologue and epilogue even though this function does not use the stack. (I don't know whether that is just a missed optimization, or whether Rust strictly requires that RBP be saved for some reason, such as exception handling.)