Author Topic: A Note On Floating Point  (Read 2580 times)

Offline Hooman

  • Administrator
  • Hero Member
  • *****
  • Posts: 4955
A Note On Floating Point
« on: August 27, 2015, 09:28:38 AM »
Since the topic came up recently, I thought I'd say a few words about floating point.

People generally think to use floating point when they need non-integer numerical values. Indeed, most languages only offer a choice between integer types and floating point types. The choice to use floating point implies much more than expected though.

The binary representation of float values is divided into 3 components, the sign, the exponent, and the significand (sometimes called the mantissa or the coefficient). The value of the float is (sign)significand*2^exponent. For a 32-bit float 1 bit represents the sign, 8 bits represent the exponent, and 23 bits represent the significand. Note that the significand is said to have 24 bits of precision, but 1 bit is implied, and 23 are explicitly stored. If you think back to scientific notation in normalized form, the exponent is set so there is one leading digit in front of the decimal place that is from 1-9. In a binary system the digits are either 0 or 1, so the leading digit would be set from 1-1, so don't bother storing it, you know what it is. For a 64-bit float, 1 bit represents the sign, 11 bits represent the exponent, and 52 bits represent the significand. Note that most of the increase in space was given to the significand, which allows for greater precision, while only a few bits were given to the exponent, which allows for a greater overall range of magnitude.

Floating point is basically scientific notation. Fix the number of significant digits, fix a base, and fix the range of the exponent, and it now fits in a fixed amount of memory. The number of digits after the decimal point doesn't change. The magnitude of the digits after the decimal point changes based on the exponent. If you deal with very small numbers (a negative exponent) then the spacing between numbers (the unit in the last place) is a value of small magnitude. If you deal with large numbers (a positive exponent), like say astronomical distances, the differences been numbers (the unit in the last place) is large in magnitude. In both cases the change is the same relative to the magnitude of the number. It's the same number of decimal places. Floating point numbers are said to have dynamic range. This has some implications for rounding, particularly when adding very large and very small numbers together since the value of the unit in the last place can differ dramatically, but the number of significant digits is fixed.

Of course floating point numbers on a computer use a base 2 representation, not a base 10 representation. This has implications for what values can be exactly represented, and what values can be approximated closely, based on the number of bits of precision. The positive integers are represented in binary as sums of powers of 2. The same is true for non-integers as well, except now using negative powers of 2. 2^-1 = 0.5, 2^-2 = 0.25, 2^-3 = 0.125, .... Storing a value such as 0.75 can be done since it's the sum of 0.5 and 0.25, which are both (negative) powers of 2. Storing a value such as 0.1 however requires an infinite sum of powers of 2, but due to limited precision, this infinite sequence must be rounded or truncated. With enough precision, the approximate value should be very close. This is similar to how dividing by 3 or 7 often produces non-terminating sequences of digits, which must be rounded or truncated. Neither 3 nor 7 is a divisor of 10, the base of the usual number system, and so dividing by these numbers can produce values that are not exactly representable in a finite amount of space. For a binary system, the base is 2, so dividing by 5 also produces this issue. Note that 0.1 = 1/10 = 1/(2*5). It's that factor of 5 that throws things off in a base 2 system. If the denominator was a power of 2, there could be an exact representation, and so could be represented exactly in a finite amount of space. Also worth mentioning, is values that can be represented exactly can be converted from a textual representation in base 10, to a binary representation in base 2 without error.

Similarly, if an integer variable holds a value, it must be converted to become floating point. Now here 0.1 is not an integer, so you don't have to worry about that particular value. But consider this, if a float is 32 bits, and an integer is 32 bits, then they have the same number of bit combinations (2^32), and so have the same number of internally represented values. But of course a float can store the value 0.5 while an int can't, so these values do not all coincide. In particular, there must exist some integers within the range of a 32-bit int that can't be represented by a 32-bit float, just like there are 32-bit float values that can't be represented by a 32-bit int. Now if your int holds a value like 1, or 2, there is no problem. These both have exact representations for a 32-bit float and so the conversion can be done without rounding errors. Similarly 2^24 = 16777216 is representable by both 32-bit ints and 32-bit floats, so there is again no problem. But 2^24+1 = 16777217 is not representable by a 32-bit float (due to the 24-bit precision of the significand), and so must be rounded to something close that is representable. The choices are either 16777216 or 16777218.

One point I should make about rounding errors, is the errors are not random. Rounding occurs in well defined cases, and is done in well defined ways. It's also not something that happens without need. You should only encounter rounding issues when you're dealing with a value that has no exact representation, and so something close needs  to be used instead.

I should mention there is also an alternative to floating point, called fixed point. Fixed point operations would be done with the integer unit of a CPU, and a software convention on where the decimal place is. There is no ability for the decimal place to move about. This means the unit in the last place is a fixed value between all numbers. There is no possibility for rounding away of small values when adding or subtracting (but overflow is still possible). You also retain usual integer rules, such as associativity, which is lost with floating point.

Offline lordpalandus

  • Banned
  • Hero Member
  • *****
  • Posts: 825
Re: A Note On Floating Point
« Reply #1 on: August 28, 2015, 12:58:47 PM »
Interesting stuff, Hooman!

I'm assuming information is stored in base 2, because the computer operates on the 0 or 1. Or alternatively it stores data in 2 possible outcomes.

I didn't know that floating point was stored in scientific notation. That is quite useful to know actually. Especially as it isn't base 10 either. I had figured that 0.125 would be stored at 0.125 in the RAM exactly as written.

If base 2 is used, then there is technically some numbers that it couldn't store in memory, and thus could result in problems if a coder wasn't careful; ie they use a specific value to determine a switch choice or a specific value to determine the result of an Else-If condition, and the value chosen couldn't be stored in memory precisely as they wanted it. Unless I'm understanding that wrong?

Non-Random rounding. Got it.

Is base 2 used also for integers for storing values?



Currently working on Cataclysm of Chaos, Remade.
Link to OPU page = http://forum.outpost2.net/index.php/topic,6073.0.html

Offline Hooman

  • Administrator
  • Hero Member
  • *****
  • Posts: 4955
Re: A Note On Floating Point
« Reply #2 on: August 30, 2015, 08:48:56 PM »
Of course base 2 is used for integers. It's all place value, instead of the 1's place, the 10's place, the 100's place, and the 1000's place, it's the 1's place, the 2's place, the 4's place, and the 8's place. For floating point you can also have the 1/2's place, the 1/4's place, and the 1/8's place, depending on the exponent.

0001 = 1
0010 = 2
0100 = 4
1000 = 8 (or -8 for a signed 4-bit int)
0110 = 6

What is:
00110101
10001001 (signed?/unsigned?)
0.1
0.11
11.001001000011111101101010

What is 0.1 (decimal) expressed in binary?