Horner's Method
Suppose you want to calculate the value of the degree-3 polynomial ax^3 + bx^2 + cx + d for a particular x.
The really inefficient way is to compute ax^3, then bx^2, then cx, then d, and then add all the intermediate values together. You do a lot of multiplications and need a bunch of space to write down all your temporary values. Even if you compute x^3 using the value of x^2 to avoid redundant work, you still have to do five multiplications: x*x, x^2 * x, a*x^3, b*x^2, and c*x.
A much better way is Horner's method, which notes that the polynomial above is equivalent to d + x*(c + x*(b + x*a)). Now we don't need to store more than one intermediate result, and we do only three multiplications total.
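As a sketch, here's Horner's method for a polynomial of arbitrary degree in C (the horner function and its coefficient layout are my own illustration, not something from the original post):

```c
#include <stdio.h>

/* Evaluate a polynomial by Horner's method.
 * coeffs[0] is the highest-degree coefficient, coeffs[n] is the constant
 * term, so n is the degree. One multiply and one add per coefficient. */
long horner(const long *coeffs, int n, long x) {
    long result = coeffs[0];
    for (int i = 1; i <= n; i++)
        result = result * x + coeffs[i];
    return result;
}
```

For example, evaluating 2x^3 - x^2 + 3x + 5 at x = 4 would be `horner((long[]){2, -1, 3, 5}, 3, 4)`, doing three multiplications instead of five.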
Compiler Implementation
Using the Compiler Explorer at https://godbolt.org, we can write code that calculates the polynomial value and see what the compiler does with it. Let's write the code in the naive way:
// compute ax^3 + bx^2 + cx + d
int poly(int a, int b, int c, int d, int x) {
    return a * x * x * x + b * x * x + c * x + d;
}
and gcc 8.1 with -O2 or -O3, on the x86-64 architecture, transforms it into Horner's method instead:
poly(int, int, int, int, int):
imul edi, r8d
add esi, edi
imul esi, r8d
lea eax, [rdx+rsi]
imul eax, r8d
add eax, ecx
ret
The assembly here uses Intel syntax, which puts the destination on the left, so 'add esi, edi' means 'add edi to the value in esi, and put the result in esi'.
In the System V AMD64 ABI calling convention (used by Linux and other Unix-like OSes, including macOS), the arguments are passed in the registers rdi (a), rsi (b), rdx (c), rcx (d), and r8 (x). This code uses the 32-bit versions of those registers, since the arguments are ints. So the first line is equivalent to 'a = a*x' (overwriting 'a', which we don't need any more). Then we add that to 'b', multiply by 'x' again, add 'c', multiply by 'x' again, and finally add 'd'. The return value goes in the rax/eax register.
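Written back out as C, with one statement per instruction, the sequence looks like this (poly_steps is a hypothetical name, used only to mirror the assembly):

```c
/* The same sequence of operations as the gcc output above,
 * one C statement per instruction: Horner's method from the inside out. */
int poly_steps(int a, int b, int c, int d, int x) {
    int t = a * x;   /* imul edi, r8d        : a*x            */
    t = b + t;       /* add  esi, edi        : b + a*x        */
    t = t * x;       /* imul esi, r8d        : (b + a*x)*x    */
    t = c + t;       /* lea  eax, [rdx+rsi]  : c + ...        */
    t = t * x;       /* imul eax, r8d        : (c + ...)*x    */
    t = t + d;       /* add  eax, ecx        : d + ...        */
    return t;        /* result is in eax     */
}
```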
The only less-than-straightforward choice on the compiler's part is the lea instruction, which calculates an address (by adding its arguments) rather than doing a straightforward add. One reason a compiler may choose lea is that address calculation uses a separate functional unit from ordinary addition (see https://stackoverflow.com/questions/6323027/lea-or-add-instruction). In this context that may not make much difference, since we need the result right away, so it can't execute in parallel. A more compelling reason is that on x86 the add instruction takes only two arguments, while lea can add two registers and put the result in a third. Since we need to get the result into eax somehow to return it, choosing lea gives the compiler the opportunity to avoid a move later.
Other examples
PowerPC architecture
gcc 6.3.0 on PowerPC implements the same optimization using just mullw and add instructions:
poly(int, int, int, int, int):
mullw 3,3,7
add 3,3,4
mullw 3,3,7
add 3,3,5
mullw 3,3,7
add 3,3,6
extsw 3,3
blr
.long 0
.byte 0,9,0,0,0,0,0,0
ARM architecture
The ARM64 architecture has a combined multiply-and-add instruction, so it's even more straightforward with MSVC on ARM:
|poly| PROC
madd w8,w0,w4,w1
madd w9,w8,w4,w2
madd w0,w9,w4,w3
ret
ENDP
Swift
What about Swift? Here's the equivalent function:
func square(a: Int, b: Int, c: Int, d: Int, x: Int) -> Int {
    return a * x * x * x + b * x * x + c * x + d
}
and here's the compiled output from swiftc 4.1.2 on x86-64, with the -O flag:
output.square(a: Swift.Int, b: Swift.Int, c: Swift.Int, d: Swift.Int, x: Swift.Int) -> Swift.Int:
push rbp
mov rbp, rsp
imul rdi, r8
jo .LBB1_10
imul rdi, r8
jo .LBB1_11
imul rdi, r8
jo .LBB1_12
imul rsi, r8
jo .LBB1_13
imul rsi, r8
jo .LBB1_14
add rdi, rsi
jo .LBB1_15
imul rdx, r8
jo .LBB1_16
add rdi, rdx
jo .LBB1_17
add rdi, rcx
jo .LBB1_18
mov rax, rdi
pop rbp
ret
.LBB1_10:
ud2
.LBB1_11:
ud2
.LBB1_12:
ud2
.LBB1_13:
ud2
.LBB1_14:
ud2
.LBB1_15:
ud2
.LBB1_16:
ud2
.LBB1_17:
ud2
.LBB1_18:
ud2
Just counting the number of multiplications tells us that the Swift compiler did not implement Horner's method, and the presence of all those overflow checks might be a reason why. In Swift, multiplication overflow is a runtime error; that is, if you multiply two 64-bit numbers whose product does not fit into 64 bits, the program will terminate. In C or C++ it'll just keep going silently. So each native add or multiply has to be checked, and (probably) the compiler is not allowed to reorder things so that the overflow happens in a different place or does not happen at all.
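For comparison, here's roughly what that checked arithmetic looks like written out by hand in C, using the GCC/Clang __builtin_mul_overflow and __builtin_add_overflow intrinsics (the helper names are mine, and this is a sketch of the semantics, not the code the Swift compiler actually emits):

```c
#include <stdint.h>
#include <stdlib.h>

/* Trap-on-overflow multiply, roughly mirroring Swift's semantics:
 * abort() plays the role of Swift's runtime trap (the ud2 above). */
static int64_t checked_mul(int64_t a, int64_t b) {
    int64_t r;
    if (__builtin_mul_overflow(a, b, &r))
        abort();
    return r;
}

static int64_t checked_add(int64_t a, int64_t b) {
    int64_t r;
    if (__builtin_add_overflow(a, b, &r))
        abort();
    return r;
}

/* a*x^3 + b*x^2 + c*x + d with every operation checked, term by term.
 * Because each intermediate product must trap at exactly this point,
 * the compiler can't freely rewrite it into Horner form. */
int64_t poly_checked(int64_t a, int64_t b, int64_t c, int64_t d, int64_t x) {
    int64_t ax3 = checked_mul(checked_mul(checked_mul(a, x), x), x);
    int64_t bx2 = checked_mul(checked_mul(b, x), x);
    int64_t cx  = checked_mul(c, x);
    return checked_add(checked_add(checked_add(ax3, bx2), cx), d);
}
```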
Rust
Rust version of the function:
pub fn poly(a: i64, b: i64, c: i64, d: i64, x: i64) -> i64 {
    a * x * x * x + b * x * x + c * x + d
}
rustc 1.26.0 with the -O flag performs Horner's method as well, though for some reason the compiler included a stack prologue and epilogue even though this function does not use the stack. (I don't know whether that is just a missed optimization, or whether Rust strictly requires that RBP be saved for some reason, such as exception handling.)