Author Topic: Division Is Slow  (Read 3107 times)

Offline Hooman

  • Administrator
  • Hero Member
  • *****
  • Posts: 4955
Division Is Slow
« on: August 27, 2015, 09:29:03 AM »
In response to this:
Quote
I didn't know that division actually took longer for the computer to do than multiplication, addition or subtraction. Good to know.
I thought I'd include some information on how slow.


Addition and subtraction are quite easy; even early chips handled them quickly. An add took 3 clock cycles on an 8086, assuming the operands were already in registers. Timings aren't so clear cut on newer chips. Starting with the early Pentiums, processors became superscalar (issuing multiple instructions from a single thread per clock cycle), so issuing 2, 3, or 4 additions in a single clock cycle is now possible, depending on the CPU model. The Pentium also had pipelining (multiple instructions at various stages of execution flowing through the processor at the same time), so execution time overlapped with surrounding instructions. This made total execution time less relevant; latency became the typical measure instead. If the next instruction depends on the result of the current one, the CPU may have to wait for the current result before the next instruction can execute. Addition remained simple, with a result latency of 1 clock cycle.
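
To illustrate latency versus throughput, here's a minimal C++ sketch (the function names are mine, and real gains depend heavily on the compiler and CPU):
Code:
#include <cstddef>

// One long dependency chain: each add must wait for the previous
// result, so the loop runs at the adder's latency.
long long sumSerial(const int* a, size_t n) {
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];        // every add depends on the last one
    return sum;
}

// Two independent chains: a superscalar CPU can issue both adds
// in the same cycle, since neither depends on the other.
long long sumPaired(const int* a, size_t n) {
    long long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];         // chain 0
        s1 += a[i + 1];     // chain 1, independent of chain 0
    }
    if (i < n) s0 += a[i];  // odd-length tail
    return s0 + s1;
}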

Multiplication was slow on early chips, and missing entirely on others (RISC chips, 8-bit microprocessors, older mainframes), which required a software multiply routine. If you want to try it, one simple and often used multiplication algorithm can be implemented in terms of repeated shifting and adding. Some early chips implemented exactly this algorithm in hardware, reusing the same adder for each loop iteration, though other algorithms can be more efficient. As chips grew in complexity, dedicated hardware multipliers were built that could perform the multiplication in much less time. The 8086 could multiply two 16-bit integers in about 118-154 clock cycles. A Pentium, with increased hardware support, could issue 1 multiply per clock cycle, with a result latency of 4 clock cycles.
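
For the curious, here's a minimal C++ sketch of the shift-and-add idea (the function name is mine):
Code:
#include <cstdint>

// Software multiply via repeated shift-and-add: the same loop some
// early chips implemented in hardware by reusing their adder.
uint32_t shiftAddMultiply(uint32_t a, uint32_t b) {
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1)          // if the low bit of b is set,
            product += a;   // add the shifted multiplicand
        a <<= 1;            // next bit of b has twice the weight
        b >>= 1;            // consume one bit of the multiplier
    }
    return product;         // a * b, modulo 2^32
}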

Division could be done on an 8086 in 144-184 clock cycles. The nature of division has made it resistant to speedup by simply throwing more hardware at it. A Pentium still took about 40 clock cycles per divide, and since divides were not pipelined, throughput was also only about 1 division per 40 clock cycles.

Agner Fog has compiled tables of x86 Instruction Timings for various processors, based on empirical testing.
Here's another source for x86 Instruction Timings.

Offline Sirbomber

  • Hero Member
  • *****
  • Posts: 3238
Re: Division Is Slow
« Reply #1 on: August 27, 2015, 04:30:17 PM »
The one exception to this, I believe, is when dividing by powers of 2, which most compilers optimize down to a right shift (moving all the bits to the right and dropping the least significant bit; so 1001 [9b10] becomes 0100 [4b10] becomes 0010 [2b10] and so on).
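
In C terms, something like this (a sketch; the function names are made up, but compilers genuinely do this for unsigned operands):
Code:
#include <cstdint>

// For unsigned values, most compilers turn a divide by a power of
// two into a single shift (on x86, something like "shr eax, 3").
uint32_t divideBy8(uint32_t x) {
    return x / 8;       // compiles to x >> 3
}

// By hand: 9 >> 1 == 4, and 4 >> 1 == 2, matching the
// 1001 -> 0100 -> 0010 example above.
uint32_t halve(uint32_t x) {
    return x >> 1;
}
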
"As usual, colonist opinion is split between those who think the plague is a good idea, and those who are dying from it." - Outpost Evening Star

Outpost 2 Coding 101 Tutorials

Offline Hooman

  • Administrator
  • Hero Member
  • *****
  • Posts: 4955
Re: Division Is Slow
« Reply #2 on: August 27, 2015, 11:23:37 PM »
Correct. Bit shifting is fast, so multiplication and division by constants that are powers of two are often optimized to bit shifts.

The optimization differs, though, depending on whether you are dealing with signed or unsigned integers. For unsigned values the case is simple: a logical shift right (zero-filling the vacated bits) divides by 2, rounding odd numbers down. For signed values, an arithmetic shift right (sign-extending the vacated bits) will also divide by 2 and round odd numbers down. However, signed division is generally defined to round towards 0, meaning positive results round down and negative results round up. To get that behaviour, a correction must be applied when the number is negative. Without the correction you'd get results like -1 / 2 = -1, or -1 / 4 = -1. With the correction, which the compiler inserts automatically for signed values, -1 / 2 = 0. The trick is to add (divisor - 1) to the value before shifting, but only when the value is negative; equivalently, the sign bit is propagated into the low-order bits that are about to be shifted out, with the number of affected bits depending on the power of 2 being divided by. This correction costs an extra 2-3 instructions ahead of the shift, or more depending on the compiler and optimization settings. All of this is still much faster than an actual divide instruction, but the optimization pays off less for signed values than for unsigned ones.
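
Here's roughly what that correction looks like for division by 4, written out by hand in C++ (a sketch of the idea, not an exact compiler sequence; note that right-shifting a negative value is technically implementation-defined in older C++ standards, though x86 compilers treat it as an arithmetic shift):
Code:
#include <cstdint>

// Signed division by 4, strength-reduced the way a compiler might do it.
int32_t divideBy4(int32_t x) {
    // Replicate the sign bit across the word: 0 if x >= 0, -1 if x < 0.
    int32_t sign = x >> 31;         // arithmetic shift
    // Add (divisor - 1) = 3 only when x is negative, so the shift
    // below rounds toward zero instead of toward negative infinity.
    int32_t corrected = x + (sign & 3);
    return corrected >> 2;          // now equals x / 4
}
// Examples: x = -1 -> corrected = 2 -> result 0 (not -1)
//           x = -5 -> corrected = -2 -> result -1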

There are a number of other optimizations that also work better for unsigned values. A bit unfortunate, since many programmers are lazy and just type "int" when what they're really dealing with should be "unsigned int". I prefer environments that shorten this to things like "uint" or "sint" for signed or unsigned numbers. It makes it much more likely the programmer will declare what they actually need, rather than what's artificially convenient.
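
For example (hypothetical names, not from any particular environment):
Code:
typedef unsigned int uint;   // the kind of shorthand meant above;
typedef signed   int sint;   // these exact names are just an example

uint itemCount = 0;          // clearly can't go negative
sint temperature = -40;      // clearly can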


Interestingly, since multiplication is so well behaved, multiplications by constants that are not powers of two were often implemented with shifts and adds too, since this used to be faster than a plain multiply instruction. To compute a * 9 = a * (8 + 1) = a * 8 + a, you can shift left by 3 (multiply by 8), and add in the source (the 1) to obtain the result of multiplying by 9. This was aided by instructions such as LEA (Load Effective Address). LEA can be used to calculate memory addresses, using the memory addressing mechanism available to most instructions. Despite its appearance, LEA is an arithmetic instruction, not a memory access instruction. It makes use of the SIB (Scale/Index/Base) byte used by memory instructions to perform (scale * index + base). The SIB byte encodes two registers (the base and the index) in 3 bits each (there are 8 general purpose registers), and uses the remaining 2 bits to specify a scale of either 1, 2, 4, or 8. This was to aid in array address calculations, since element sizes were typically 1, 2, 4, or 8 bytes. However, by setting the base and the index to the same register, the CPU can multiply by 2, 3, 5, or 9 using only 1 fast, simple instruction. The encoding of the shift instruction is smaller than LEA's, so SHL makes a better choice than LEA when multiplying by 2.
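
In C++ terms, a sketch (function names mine):
Code:
#include <cstdint>

// a * 9 = a * 8 + a, done with a shift and an add. On x86 a compiler
// can express this as a single instruction: lea eax, [eax + eax*8].
uint32_t timesNine(uint32_t a) {
    return (a << 3) + a;    // a*8 + a
}

// For a * 2 a plain shift wins, since SHL encodes smaller than LEA.
uint32_t timesTwo(uint32_t a) {
    return a << 1;
}
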
« Last Edit: August 27, 2015, 11:34:40 PM by Hooman »

Offline lordpalandus

  • Banned
  • Hero Member
  • *****
  • Posts: 825
Re: Division Is Slow
« Reply #3 on: August 28, 2015, 12:43:55 PM »
Wow, that is quite the difference in speed for sure. Though I'm sure many these days would say don't worry about it, since CPUs have tons of cycles. Which is true, but those tons of cycles aren't just doing your division, and if you have multiple functions running divisions, or even more complex mathematics, those cycles disappear quickly. And then you get a bottleneck on your CPU limiting your framerate (assuming something else, such as RAM, isn't the bottleneck first).

@Sirbomber; I'm assuming 9b10 means 9 base 10, so a decimal number (as bases commonly come up for logarithms, I'm assuming this is what b means in this context)

@Hooman; My C++ textbook is a bit old (I got it back in 2007), and it doesn't mention signed or unsigned integers, just short integers and long integers. Was signed/unsigned added to C++ after 2007, or was the textbook leaving out some details?



Currently working on Cataclysm of Chaos, Remade.
Link to OPU page = http://forum.outpost2.net/index.php/topic,6073.0.html

Offline Hooman

  • Administrator
  • Hero Member
  • *****
  • Posts: 4955
Re: Division Is Slow
« Reply #4 on: August 29, 2015, 12:19:22 PM »
A single division you'll never notice, but if you use division in a loop, it can make a big difference if you can rewrite to avoid it. You'd have to measure to be sure though. There are so many factors that can affect performance.
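
For example, a hypothetical sketch of hoisting a divide out of a floating point loop (the function names are mine, the rounding can differ very slightly between the two versions, and you'd still want to measure):
Code:
#include <cstddef>

// One divide per iteration.
void scaleDown(float* data, size_t n, float divisor) {
    for (size_t i = 0; i < n; i++)
        data[i] = data[i] / divisor;    // divide in every pass
}

// Divide once, multiply inside the loop.
void scaleDownFast(float* data, size_t n, float divisor) {
    float inv = 1.0f / divisor;         // the only divide, hoisted out
    for (size_t i = 0; i < n; i++)
        data[i] *= inv;                 // multiply is much cheaper
}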

Signed and unsigned integers have been a part of C and C++ for as long as I can remember; I can't imagine them ever not being part of the languages. There are in fact both "signed" and "unsigned" keywords, with signed being the default if not specified. The exception is "char", which technically defaults to neither; it will generally behave as one of the two, but which one is compiler dependent. If you want to be sure, you can explicitly declare a char as signed or unsigned.
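
For example:
Code:
int           a = -5;    // "int" is signed by default
signed int    b = -5;    // exactly the same type as plain "int"
unsigned int  c = 5u;    // can't be negative; wraps modulo 2^32

char          d = 'x';   // signed or unsigned? compiler dependent
signed char   e = -1;    // explicitly signed
unsigned char f = 255;   // explicitly unsigned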

To make full use of the x86 flags register, and all the conditional jumps offered, you need a way of specifying signed or unsigned. The flags are always set as if doing both signed and unsigned arithmetic simultaneously, but the conditional jumps must specify which flags they're checking. If you do a one-byte compare of 0x80 and 0x00, which one is greater depends on whether you interpret the bits as a signed number (0x80 = -128) or an unsigned number (0x80 = 128). Hence the compiler needs to know which flags to check for a comparison such as "if (a > 0)".
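
Here's a small example of the same bit pattern compared both ways (scaled up to 32 bits; just a sketch):
Code:
#include <cstdint>
#include <cstdio>

int main() {
    // The same 32-bit pattern, interpreted two ways.
    int32_t  s = INT32_MIN;    // 0x80000000 read as -2147483648
    uint32_t u = 0x80000000u;  // same bits read as 2147483648

    // The compiler emits a signed check (Sign/Overflow flags) for s
    // and an unsigned check (Carry flag) for u.
    printf("%d\n", s > 0);     // prints 0
    printf("%d\n", u > 0u);    // prints 1
    return 0;
}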

Note that the CMP (compare) instruction functions like SUB (subtract), except that it throws away the result of the subtraction after setting the flags. The main flags of interest for arithmetic comparisons are the Zero Flag, the Carry Flag, the Sign Flag, and the Overflow Flag. The Zero Flag is used to test for equality (a - b == 0). The Carry Flag is used for unsigned comparisons (a < b, or a > b). The Sign Flag is a copy of the high-order bit of the result, and so determines whether a signed value is less than 0 (a < 0). The Overflow Flag is used for testing whether two signed operands produce a result with an unexpected sign (two positives adding to a negative, or two negatives adding to a positive). The Overflow Flag is used in conjunction with the Sign Flag to do signed comparisons (a < b, or a > b). Some conditional jump instructions test multiple flags at the same time, so you can check for <= or >= easily, whether the data represents signed or unsigned integers.