What Every Computer Scientist Should Know About Floating-Point Arithmetic

Copyright Association for Computing Machinery, Inc. Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising because floating-point is ubiquitous in computer systems. Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on those aspects of floating-point that have a direct impact on designers of computer systems.

It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating-point standard, and concludes with numerous examples of how computer builders can better support floating-point.

There are, however, remarkably few sources of detailed information about it. One of the few books on the subject, Floating-Point Computation by Pat Sterbenz, is long out of print. This paper is a tutorial on those aspects of floating-point arithmetic (floating-point hereafter) that have a direct connection to systems building. It consists of three loosely connected parts. The first section, Rounding Error, discusses the implications of using different rounding strategies for the basic operations of addition, subtraction, multiplication and division.

It also contains background information on the two methods of measuring rounding error, ulps and relative error. The second part discusses the IEEE floating-point standard, which is becoming rapidly accepted by commercial hardware manufacturers.

Included in the IEEE standard is the rounding method for basic operations. The discussion of the standard draws on the material in the section Rounding Error. The third part discusses the connections between floating-point and the design of various aspects of computer systems. Topics include instruction set design, optimizing compilers and exception handling. I have tried to avoid making statements about floating-point without also giving reasons why the statements are true, especially since the justifications involve nothing more complicated than elementary calculus.

Those explanations that are not central to the main argument have been grouped into a section called "The Details," so that they can be skipped if desired.

In particular, the proofs of many of the theorems appear in this section. The end of each proof is marked with the ∎ symbol; when a proof is not included, the ∎ appears immediately following the statement of the theorem. Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits.

Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation. The section Relative Error and Ulps describes how it is measured. Since most floating-point calculations have rounding error anyway, does it matter if the basic arithmetic operations introduce a little bit more rounding error than necessary?

That question is a main theme throughout this section. The section Guard Digits discusses guard digits, a means of reducing the error when subtracting two nearby numbers. Guard digits were considered sufficiently important by IBM that in 1968 it added a guard digit to the double precision format in the System/360 architecture (single precision already had a guard digit), and retrofitted all existing machines in the field.

Two examples are given to illustrate the utility of guard digits. The IEEE standard goes further than just requiring the use of a guard digit. It gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm.

Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard. This greatly simplifies the porting of programs. Other uses of this precise specification are given in Exactly Rounded Operations. Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation. Floating-point representations have a base β (which is always assumed to be even) and a precision p.

Two other parameters associated with floating-point representations are the largest and smallest allowable exponents, e_max and e_min. The precise encoding is not important for now. There are two reasons why a real number might not be exactly representable as a floating-point number. The most common situation is illustrated by the decimal number 0.1: although it has a finite decimal representation, in binary it has an infinite repeating representation.

Numbers that are out of range will be discussed in the sections Infinity and Denormalized Numbers. Floating-point representations are not necessarily unique; requiring that representations be normalized (leading digit nonzero) makes them unique. Unfortunately, this restriction makes it impossible to represent zero! In general, a floating-point number will be written as ±d.dd…d × β^e, where d.dd…d is called the significand and has p digits.

Another way to measure the difference between a floating-point number and the real number it is approximating is relative error, which is simply the difference between the two numbers divided by the real number. In general, when the base is β, a fixed error expressed in ulps can wobble by a factor of up to β when expressed as a relative error, and conversely a fixed relative error can wobble by a factor of β when expressed in ulps. For example, rounding to the nearest floating-point number corresponds to an error of at most half an ulp.

However, when analyzing the rounding error caused by various formulas, relative error is a better measure. A good illustration of this is the analysis in the section Theorem 9. Since ε can overestimate the effect of rounding to the nearest floating-point number by the wobble factor of β, error estimates of formulas will be tighter on machines with a small β. When only the order of magnitude of rounding error is of interest, ulps and ε may be used interchangeably, since they differ by at most a factor of β. For example, when a floating-point number is in error by n ulps, that means that the number of contaminated digits is about log_β n.

If the relative error in a computation is nε, the number of contaminated digits is again about log_β n. One method of computing the difference between two floating-point numbers is to compute the difference exactly and then round it to the nearest floating-point number. This is very expensive if the operands differ greatly in size. Suppose that the number of digits kept is p, and that when the smaller operand is shifted right, digits are simply discarded (as opposed to rounding). Take the example 10.1 − 9.93 with β = 10, p = 3: after shifting, this becomes x = 1.01 × 10¹, y = 0.99 × 10¹, x ⊖ y = .02 × 10¹. The correct answer is .17, so the computed difference is off by 30 ulps and is wrong in every digit.

How bad can the error be? Without a guard digit, the relative error of the result can be as large as β − 1, which for β = 2 is as large as the result itself. That is, all of the p digits in the result are wrong! Suppose that one extra digit is added to guard against this situation (a guard digit). That is, the smaller number is truncated to p + 1 digits, and then the result of the subtraction is rounded to p digits. With a guard digit, the previous example becomes x = 1.010 × 10¹, y = 0.993 × 10¹, x ⊖ y = .017 × 10¹, and the answer is exact. Addition is included in the above theorem since x and y can be positive or negative. The last section can be summarized by saying that without a guard digit, the relative error committed when subtracting two nearby quantities can be very large.

In other words, the evaluation of any expression containing a subtraction (or an addition of quantities with opposite signs) could result in a relative error so large that all the digits are meaningless (Theorem 1). When subtracting nearby quantities, the most significant digits in the operands match and cancel each other.

There are two kinds of cancellation: catastrophic and benign. Catastrophic cancellation occurs when the operands are subject to rounding errors. For example, in the quadratic formula, the expression b² − 4ac occurs. The quantities b² and 4ac are subject to rounding errors since they are the results of floating-point multiplications. Suppose that they are rounded to the nearest floating-point number, and so are accurate to within half an ulp.

When they are subtracted, cancellation can cause many of the accurate digits to disappear, leaving behind mainly digits contaminated by rounding error. Hence the difference might have an error of many ulps.

If x and y have no rounding error, then by Theorem 2, if the subtraction is done with a guard digit, the difference x ⊖ y has a very small relative error (less than 2ε). A formula that exhibits catastrophic cancellation can sometimes be rearranged to eliminate the problem. Again consider the quadratic formula, whose roots are r₁ = (−b + √(b² − 4ac))/(2a) and r₂ = (−b − √(b² − 4ac))/(2a). When b² ≫ ac, then b² − 4ac does not involve a cancellation and √(b² − 4ac) ≈ |b|. But the other addition (subtraction) in one of the formulas will have a catastrophic cancellation.

To avoid this, multiply the numerator and denominator of r₁ by −b − √(b² − 4ac) (and similarly for r₂) to obtain r₁ = 2c/(−b − √(b² − 4ac)) and r₂ = 2c/(−b + √(b² − 4ac)). If b² ≫ ac and b > 0, then computing r₁ using the original formula will involve a cancellation, so the rearranged form should be used for r₁ and the original form for r₂ (and vice versa when b < 0). The related expression x² − y² is more accurate when evaluated as (x − y)(x + y). Unlike the quadratic formula, this improved form still has a subtraction, but it is a benign cancellation of quantities without rounding error, not a catastrophic one.
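
A minimal C sketch of this rearrangement, assuming a ≠ 0 and a nonnegative discriminant:

    #include <math.h>

    /* Sketch of the rearranged quadratic formula: the root whose numerator
       would subtract nearby quantities is computed as 2c/(-b -/+ sqrt(D))
       instead, so the catastrophic cancellation never occurs. */
    void quadratic_roots(double a, double b, double c, double *r1, double *r2)
    {
        double d = sqrt(b * b - 4.0 * a * c);
        /* choose the sign so b and the square root are added, never subtracted */
        double q = (b >= 0.0) ? -(b + d) / 2.0 : -(b - d) / 2.0;
        *r1 = q / a;   /* conventional form: no cancellation for this root      */
        *r2 = c / q;   /* rearranged form replaces the cancelling subtraction   */
    }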

By Theorem 2, the relative error in x ⊖ y is at most 2ε; the same is true of x ⊕ y. Multiplying two quantities with a small relative error results in a product with a small relative error (see the section Rounding Error). In order to avoid confusion between exact and computed values, the following notation is used.

Whereas x − y denotes the exact difference of x and y, x ⊖ y denotes the computed difference (i.e., with rounding error). Similarly ⊕, ⊗ and ⊘ denote computed addition, multiplication, and division, respectively. All caps indicate the computed value of a function, as in LN(x) or SQRT(x); lowercase functions and traditional mathematical notation denote their exact values, as in ln(x) and √x. Although (x ⊖ y) ⊗ (x ⊕ y) is an excellent approximation to x² − y², the floating-point numbers x and y might themselves be approximations to some true quantities x̂ and ŷ. For example, x̂ and ŷ might be exactly known decimal numbers that cannot be expressed exactly in binary.

In this case, even though x ⊖ y is a good approximation to x − y, it can have a huge relative error compared to the true expression x̂ − ŷ, and so the advantage of (x − y)(x + y) over x² − y² is not as dramatic. Since computing (x − y)(x + y) is about the same amount of work as computing x² − y², it is clearly the preferred form in this case.

In general, however, replacing a catastrophic cancellation by a benign one is not worthwhile if the expense is large, because the input is often but not always an approximation.

But eliminating a cancellation entirely, as in the quadratic formula, is worthwhile even if the data are not exact. Throughout this paper, it will be assumed that the floating-point inputs to an algorithm are exact and that the results are computed as accurately as possible. The expression x² − y² is more accurate when rewritten as (x − y)(x + y) because a catastrophic cancellation is replaced with a benign one.

We next present more interesting examples of formulas exhibiting catastrophic cancellation that can be rewritten to exhibit only benign cancellation. The area of a triangle can be expressed directly in terms of the lengths of its sides a, b, and c as A = √(s(s − a)(s − b)(s − c)), where s = (a + b + c)/2. Suppose the triangle is very flat; that is, a ≈ b + c. Then s ≈ a, and the term (s − a) subtracts two nearby numbers, one of which may have rounding error. The rewritten formula is A = √((a + (b + c))(c − (a − b))(c + (a − b))(a + (b − c)))/4, where a ≥ b ≥ c. If a, b, and c do not satisfy a ≥ b ≥ c, rename them before applying it. It is straightforward to check that the right-hand sides of the two formulas are algebraically identical.
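
A minimal C sketch of the rewritten area formula, assuming the caller has already sorted the sides so that a ≥ b ≥ c; the parenthesization is the whole point and must not be "simplified" away:

    #include <math.h>

    /* Sketch of the rearranged triangle-area formula above.  Assumes
       a >= b >= c and that the sides satisfy the triangle inequality. */
    double tri_area(double a, double b, double c)
    {
        return sqrt((a + (b + c)) * (c - (a - b)) *
                    (c + (a - b)) * (a + (b - c))) / 4.0;
    }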

In particular, the relative error of the rewritten expression can be bounded explicitly (Theorem 3). Because of the cumbersome nature of the circle notation, in the statement of theorems we will usually say the computed value of E rather than writing out E with circle notation. Error bounds are usually too pessimistic: in the numerical example given above, the computed value is far closer to the true value than the bound requires. The main reason for computing error bounds is not to get precise bounds but rather to verify that the formula does not contain numerical problems. A final example of an expression that can be rewritten to use benign cancellation is (1 + i/n)ⁿ, which arises in financial calculations.

The expression 1 + i/n involves adding 1 to a tiny quantity, so the low order bits of i/n are lost. This rounding error is amplified when 1 + i/n is raised to the nth power. The troublesome expression (1 + i/n)ⁿ can be rewritten as e^(n ln(1 + i/n)), where now the problem is to compute ln(1 + x) for small x. Theorem 4 assumes that LN(x) approximates ln(x) to within half an ulp. The problem it solves is that when x is small, LN(1 ⊕ x) is not close to ln(1 + x) because 1 ⊕ x has lost the information in the low order bits of x. That is, the computed value of ln(1 + x) is not close to its actual value when x ≪ 1. The formula of Theorem 4 (return x when 1 ⊕ x = 1, and x · ln(1 + x)/((1 + x) − 1) otherwise, evaluated in floating-point) will work for any value of x but is only interesting for x ≪ 1, which is where catastrophic cancellation occurs in the naive formula for ln(1 + x). Although the formula may seem mysterious, there is a simple explanation for why it works.

So changing x slightly will not introduce much error. Sometimes a formula that gives inaccurate results can be rewritten to have much higher numerical accuracy by using benign cancellation; however, the procedure only works if subtraction is performed using a guard digit.
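
A minimal C sketch of the trick, shown for illustration only (in practice the library function log1p should be preferred):

    #include <math.h>

    /* Sketch of ln(1+x) for small x: when 1+x rounds to 1, return x itself;
       otherwise the factor x/((1+x)-1) compensates for the low-order bits of
       x that were lost in forming 1+x (a benign cancellation). */
    double log1p_sketch(double x)
    {
        double w = 1.0 + x;
        if (w == 1.0)
            return x;                    /* ln(1+x) is approximately x here */
        return x * log(w) / (w - 1.0);
    }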

The price of a guard digit is not high, because it merely requires making the adder one bit wider. For this price, you gain the ability to run many algorithms, such as the formula for computing the area of a triangle and the expression for ln(1 + x) above. Although most modern computers have a guard digit, there are a few (such as Cray systems) that do not. When floating-point operations are done with a guard digit, they are not as accurate as if they were computed exactly and then rounded to the nearest floating-point number.

Operations performed in this manner will be called exactly rounded. The example immediately preceding Theorem 2 shows that a single guard digit will not always give exactly rounded results.

The previous section gave several examples of algorithms that require a guard digit in order to work properly. This section gives examples of algorithms that require exact rounding. So far, the definition of rounding has not been given. Rounding is straightforward, with the exception of how to round halfway cases; for example, should 12.5 round to 12 or 13? One school of thought divides the ten digits in half, letting 0 through 4 round down and 5 through 9 round up, so that 12.5 rounds to 13. Another school of thought says that since numbers ending in 5 are halfway between two possible roundings, they should round down half the time and round up the other half.

One way of obtaining this 50% behavior is round to even: the digit before the discarded 5 is forced to be even. Thus 12.5 rounds to 12 rather than 13, because 2 is even. Which of these methods is best, round up or round to even? Throughout the rest of this paper, round to even will be used. One application of exact rounding occurs in multiple precision arithmetic. There are two basic approaches to higher precision. One approach represents floating-point numbers using a very large significand, which is stored in an array of words, and codes the routines for manipulating these numbers in assembly language.

The second approach represents higher precision floating-point numbers as an array of ordinary floating-point numbers, where adding the elements of the array in infinite precision recovers the high precision floating-point number. It is this second approach that will be discussed here. The advantage of using an array of floating-point numbers is that it can be coded portably in a high level language, but it requires exactly rounded arithmetic. The key to multiplication in this system is representing a product x·y as a sum, where each summand has the same precision as x and y.

This can be done by splitting x and y. When p is even, it is easy to find a splitting: the number x₀.x₁…x_(p−1) can be written as the sum of its high-order p/2 digits and its low-order p/2 digits. When p is odd, this simple splitting method will not work. An extra bit can, however, be gained by using negative numbers. Actually, a more general fact (due to Kahan) is true. The proof is ingenious, but readers not interested in such details can skip ahead to the section The IEEE Standard. The theorem holds true for any base β, as long as 2^i + 2^j is replaced by β^i + β^j.

As β gets larger, however, denominators of the form β^i + β^j are farther and farther apart. We are now in a position to answer the question, does it matter if the basic arithmetic operations introduce a little more rounding error than necessary? The answer is that it does matter, because accurate basic operations enable us to prove that formulas are "correct" in the sense that they have a small relative error. The section Cancellation discussed several algorithms that require guard digits to produce correct results in this sense.

If the input to those formulas are numbers representing imprecise measurements, however, the bounds of Theorems 3 and 4 become less interesting.

The reason is that the benign cancellation x y can become catastrophic if x and y are only approximations to some measured quantity. But accurate operations are useful even in the face of inexact data, because they enable us to establish exact relationships like those discussed in Theorems 6 and 7.

These are useful even if every floating-point variable is only an approximation to some actual value. There are two different IEEE standards for floating-point computation. IEEE 754 is a binary standard that requires β = 2, p = 24 for single precision and p = 53 for double precision. It also specifies the precise layout of bits in a single and double precision. IEEE 854 allows either β = 2 or β = 10 and, unlike 754, does not specify how floating-point numbers are encoded into bits. It does not require a particular value for p, but instead it specifies constraints on the allowable values of p for single and double precision.

The term IEEE Standard will be used when discussing properties common to both standards. This section provides a tour of the IEEE standard. Each subsection discusses one aspect of the standard and why it was included. It is not the purpose of this paper to argue that the IEEE standard is the best possible floating-point standard but rather to accept the standard as given and provide an introduction to its use.

Why does IEEE 754 use β = 2? The section Relative Error and Ulps mentioned one reason: the results of error analyses are much tighter when β is 2, because a rounding error of half an ulp wobbles by a factor of β when expressed as a relative error. A related reason has to do with the effective precision for large bases. Consider β = 16, p = 1 compared with β = 2, p = 4: both systems have 4 bits of significand. In general, base 16 can lose up to 3 bits, so that a precision of p hexadecimal digits can have an effective precision as low as 4p − 3 rather than 4p binary bits.

Since large values of β have these problems, why did IBM choose β = 16 for its System/360? Only IBM knows for sure, but there are two possible reasons. The first is increased exponent range: single precision on the System/370 has β = 16, p = 6, hence the significand requires 24 bits. Since this must fit into 32 bits, this leaves 7 bits for the exponent and one for the sign bit, yet because the exponent counts powers of 16 rather than powers of 2, those 7 bits cover a far wider range than a 7-bit binary exponent would.

The second possible reason concerns shifting: when adding two floating-point numbers, if their exponents are different, one of the significands will have to be shifted to make the radix points line up, slowing down the operation, and a larger base makes such shifts less frequent. A further point concerns the leading bit: since a normalized binary significand always starts with 1, that bit need not be stored, and formats that use this trick are said to have a hidden bit. It was already pointed out in Floating-point Formats that this requires a special convention for 0. The method given there was that an exponent of e_min − 1 and a significand of all zeros represents 0, not 1.0 × 2^(e_min − 1). IEEE single precision is encoded in 32 bits using 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand.

In IEEE 754, single and double precision correspond roughly to what most floating-point hardware provides. Single precision occupies a single 32 bit word, double precision two consecutive 32 bit words. The minimum allowable double-extended format is sometimes referred to as the 80-bit format, even though the table shows it using 79 bits.

The reason is that hardware implementations of extended precision normally do not use a hidden bit, and so would use 80 rather than 79 bits. The standard puts the most emphasis on extended precision, making no recommendation concerning double precision, but strongly recommending that implementations should support the extended format corresponding to the widest basic format supported. One motivation for extended precision comes from calculators, which will often display 10 digits, but use 13 digits internally.

By displaying only 10 of the 13 digits, the calculator appears to the user as a "black box" that computes exponentials, cosines, etc. For the calculator to compute functions like exp, log and cos to within 10 digits with reasonable efficiency, it needs a few extra digits to work with.

It is not hard to find a simple rational expression that approximates log with an error of 500 units in the last place. Thus computing with 13 digits gives an answer correct to 10 digits. By keeping these extra 3 digits hidden, the calculator presents a simple model to the operator. Extended precision in the IEEE standard serves a similar function: it enables libraries to efficiently compute quantities to within about half an ulp in single (or double) precision.

However, when using extended precision, it is important to make sure that its use is transparent to the user. For example, on a calculator, if the internal representation of a displayed value is not rounded to the same precision as the display, then the result of further operations will depend on the hidden digits and appear unpredictable to the user. To illustrate extended precision further, consider the problem of converting between IEEE single precision and decimal.

Ideally, single precision numbers will be printed with enough digits so that when the decimal number is read back in, the single precision number can be recovered. It turns out that 9 decimal digits are enough to recover a single precision binary number (see the section Binary to Decimal Conversion). When converting a decimal number back to its unique binary representation, a rounding error as small as 1 ulp is fatal, because it will give the wrong answer.

Here is a situation where extended precision is vital for an efficient algorithm. When single-extended is available, a very straightforward method exists for converting a decimal number to a single precision binary one. First read in the 9 decimal digits as an integer N, ignoring the decimal point. Next find the appropriate power 10^|P| necessary to scale N. This will be a combination of the exponent of the decimal number, together with the position of the (up until now) ignored decimal point.

If this last operation is done exactly, then the closest binary number is recovered. The section Binary to Decimal Conversion shows how to do the last multiply or divide exactly. Thus for |P| ≤ 13, the use of the single-extended format enables 9-digit decimal numbers to be converted to the closest binary number (i.e., exactly rounded).
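
A C sketch of the shape of this conversion, using long double as a stand-in for single-extended (an assumption about the platform; on x86 it is the 80-bit extended format):

    #include <math.h>

    /* Sketch of decimal-to-single conversion via a wider format: N holds the
       9 significant decimal digits as an exact integer, P is the power of ten,
       so the value is N * 10^P.  The scaling is one operation in the wide
       format, followed by one final rounding to single precision. */
    float decimal9_to_float(long long N, int P)
    {
        long double n = (long double)N;           /* exact: at most 9 digits   */
        long double p = powl(10.0L, (long double)(P < 0 ? -P : P));
        long double r = (P < 0) ? n / p : n * p;  /* one rounding, wide format */
        return (float)r;                          /* final rounding to single  */
    }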

Although it is true that the reciprocal of the largest number will underflow, underflow is usually less serious than overflow. The IEEE standard requires that the results of addition, subtraction, multiplication and division be exactly rounded; that is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even). The section Guard Digits pointed out that computing the exact difference or sum of two floating-point numbers can be very expensive when their exponents are substantially different.

That section introduced guard digits, which provide a practical way of computing differences while guaranteeing that the relative error is small.

However, computing with a single guard digit will not always give the same answer as computing the exact result and then rounding. By using a second guard digit and a sticky bit, differences can be computed at only a little more cost than with a single guard digit, with the result equal to the exact difference rounded; thus the standard can be implemented efficiently. One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic.

Another advantage of precise specification is that it makes it easier to reason about floating-point. Proofs about floating-point are hard enough, without having to deal with multiple cases arising from multiple kinds of arithmetic.

Just as integer programs can be proven to be correct, so can floating-point programs, although what is proven in that case is that the rounding error of the result satisfies certain bounds.

Theorem 4 is an example of such a proof. These proofs are made much easier when the operations being reasoned about are precisely specified. However, proofs in this system cannot verify the algorithms of the sections Cancellation and Exactly Rounded Operations, which require features not present on all hardware. The IEEE standard also requires that conversion between internal formats and decimal be correctly rounded (except for very large numbers). Kulisch and Miranker have proposed adding the inner product to the list of operations that are precisely specified; they note that when inner products are computed in IEEE arithmetic, the final answer can be quite wrong.

The reason is that efficient algorithms for exactly rounding all the operations are known, except conversion. For transcendental functions the situation is different, because of the table maker's dilemma. To illustrate, suppose you are making a table of the exponential function to 4 places. Then exp(1.626) = 5.0835. Should this be rounded to 5.083 or 5.084? If exp(1.626) is computed more carefully, it becomes 5.08350, and then 5.083500, and then 5.0835000. Since exp is transcendental, this could go on arbitrarily long before distinguishing whether exp(1.626) is 5.083500…0ddd or 5.0834999…9ddd.

Thus it is not practical to specify that the precision of transcendental functions be the same as if they were computed to infinite precision and then rounded. Another approach would be to specify transcendental functions algorithmically. But there does not appear to be a single algorithm that works well across all hardware architectures.

Rational approximation, CORDIC, and large tables are three different techniques that are used for computing transcendentals on contemporary machines.

Each is appropriate for a different class of hardware, and at present no single algorithm works acceptably over the wide range of current hardware. On some floating-point hardware every bit pattern represents a valid floating-point number; the IBM System/370 is an example of this. On the other hand, the VAX reserves some bit patterns to represent special numbers called reserved operands.

This idea goes back to the CDC 6600, which had bit patterns for the special quantities INDEFINITE and INFINITY. The IEEE standard continues in this tradition and has NaNs (Not a Number) and infinities. Without any special quantities, there is no good way to handle exceptional situations like taking the square root of a negative number, other than aborting computation.

Under IBM System/370 FORTRAN, the default action in response to computing the square root of a negative number like −4 results in the printing of an error message.

Since every bit pattern represents a valid number, the return value of square root must be some floating-point number. In the case of System/370 FORTRAN, √−4 = 2 is returned. However, there are examples where it makes sense for a computation to continue in such a situation.

Consider a subroutine that finds the zeros of a function f, say zero(f). Traditionally, zero finders require the user to supply an interval [a, b] on which the function is defined and over which the zero finder will search; that is, the subroutine is called as zero(f, a, b). A more useful zero finder would not require the user to input this extra information. This more general zero finder is especially appropriate for calculators, where it is natural to simply key in a function, and awkward to then have to specify the domain.

However, it is easy to see why most zero finders require a domain. The zero finder does its work by probing the function f at various values. If it probed for a value outside the domain of f, the code for f might well compute 0/0 or √−1, and the computation would halt, unnecessarily aborting the zero finding process. This problem can be avoided by introducing a special value called NaN, and specifying that the computation of expressions like 0/0 and √−1 produce NaN, rather than halting.

That is, zero(f) is not "punished" for making an incorrect guess. With this example in mind, it is easy to see what the result of combining a NaN with an ordinary floating-point number should be: another NaN. Similarly, if one operand of a division operation is a NaN, the quotient should be a NaN.

The zero-finder could install a signal handler for floating-point exceptions. Then if f was evaluated outside its domain and raised an exception, control would be returned to the zero solver. The problem with this approach is that every language has a different method of handling signals (if it has a method at all), and so it has no hope of portability. In IEEE 754, NaNs are represented as floating-point numbers with the exponent e_max + 1 and nonzero significands.

Implementations are free to put system-dependent information into the significand. Thus there is not a unique NaN, but rather a whole family of NaNs.

When a NaN and an ordinary floating-point number are combined, the result should be the same as the NaN operand. Thus if the result of a long computation is a NaN, the system-dependent information in the significand will be the information that was generated when the first NaN in the computation was generated. Actually, there is a caveat to the last statement: if both operands are NaNs, then the result will be one of those NaNs, but it might not be the NaN that was generated first. Just as NaNs provide a way to continue a computation when expressions like 0/0 or √−1 are encountered, infinities provide a way to continue when an overflow occurs.

This is much safer than simply returning the largest representable number. The rule for determining the result of an operation that has infinity as an operand is to replace infinity with a finite number x and take the limit as x → ∞; when the limit does not exist, the result is a NaN. Thus in the IEEE standard, ∞/∞ results in a NaN. This agrees with the reasoning used to conclude that 0/0 should be a NaN. When a subexpression evaluates to a NaN, the value of the entire expression is also a NaN.

Here is a practical example that makes use of the rules for infinity arithmetic. Consider computing the function x/(x² + 1). This is a bad formula, because not only will it overflow when x is larger than about the square root of the largest representable number, but infinity arithmetic will then give the wrong answer, because it will yield 0 rather than a number near 1/x. However, x/(x² + 1) can be rewritten as 1/(x + x⁻¹). This improved expression will not overflow prematurely, and because of infinity arithmetic it has the correct value when x = 0: 1/(0 + 0⁻¹) = 1/(0 + ∞) = 1/∞ = 0.
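
A small C sketch contrasting the two forms:

    #include <stdio.h>

    /* The naive form x/(x*x + 1) overflows in the denominator for huge x and
       returns 0; the rewritten form 1/(x + 1/x) relies on infinity arithmetic
       and stays accurate, with no special-case test even at x == 0. */
    static double f_naive(double x)  { return x / (x * x + 1.0); }
    static double f_better(double x) { return 1.0 / (x + 1.0 / x); }

    int main(void)
    {
        double big = 1e200;
        printf("naive:  %g\n", f_naive(big));   /* denominator overflows: 0  */
        printf("better: %g\n", f_better(big));  /* about 1e-200, the answer  */
        printf("at 0:   %g\n", f_better(0.0));  /* 1/(0 + inf) = 0           */
        return 0;
    }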

This example illustrates a general fact, namely that infinity arithmetic often avoids the need for special case checking; however, formulas need to be carefully inspected to make sure they do not have spurious behavior at infinity, as x/(x² + 1) did. Zero is represented by the exponent e_min − 1 and a zero significand. Since the sign bit can take on two different values, there are two zeros, +0 and −0. When a multiplication or division involves a signed zero, the usual sign rules apply in computing the sign of the answer.

One benefit of signed zero is that it preserves the relation 1/(1/x) = x when x = ±∞: without it, 1/(+∞) and 1/(−∞) would both result in the same 0, and 1/0 would then have to result in +∞, the sign information having been lost. Signed zero also helps with functions that have a discontinuity at zero, such as log. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return −∞. Another example of a function with a discontinuity at zero is the signum function, which returns the sign of a number. Probably the most interesting use of signed zero occurs in complex arithmetic.

The problem can be traced to the fact that square root is multi-valued, and there is no way to select the values so that it is continuous in the entire complex plane. However, square root is continuous if a branch cut consisting of all negative real numbers is excluded from consideration. Signed zero provides a perfect way to resolve this problem. Tracking down bugs like this is frustrating and time consuming.

On a more philosophical level, computer science textbooks often point out that even though it is currently impractical to prove large programs correct, designing programs with the idea of proving them often results in better code. Floating-point code is just like any other code: it helps to have provable facts on which to depend. Similarly, knowing that x ⊖ y = 0 exactly when x = y makes writing reliable floating-point code easier; if it is only true for most numbers, it cannot be used to prove anything. The IEEE standard uses denormalized numbers, which guarantee this relation as well as other useful ones.

They are the most controversial part of the standard and probably accounted for the long delay in getting the standard approved. The top number line in the figure shows normalized floating-point numbers. Notice the gap between 0 and the smallest normalized number: if the result of a floating-point calculation falls into this gulf, it is flushed to zero. The bottom number line shows what happens when denormals are added to the set of floating-point numbers.

The "gulf" is filled in, and when the result of a calculation is less thanit is represented by the nearest denormal. Consider dividing two complex numbers, a ib and c id. The obvious formula suffers from the problem that if either component of the denominator c id is larger thanthe formula will overflow, even though the final result may be well within range. It yields with flush to zero, an error of ulps. It is typical for denormalized numbers to guarantee error bounds for arguments all the way down to x When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue.

Typical of the default results are NaN for 0/0 and √−1, and ∞ for 1/0 and overflow. The preceding sections gave examples where proceeding from an exception with these default values was the reasonable thing to do. When any exception occurs, a status flag is also set. Implementations of the IEEE standard are required to provide users with a way to read and write the status flags.

The flags are "sticky" in that once set, they remain set until explicitly cleared. Although for this formula the problem can be solved by rewriting it as x xrewriting may not always solve the problem. The IEEE standard strongly recommends that decimal allow trap handlers to be installed.

Then when an exception occurs, the trap handler is called instead of setting the flag. The value returned by the trap handler will be used as the result of the operation.

It is the responsibility of the trap handler to either clear or set the status flag; otherwise, the value of the flag is allowed to be undefined. The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact. There is a separate status flag for each class of exception. The meaning of the first three exceptions is self-evident.

The default result of an operation that causes an invalid exception is to return a NaN, but the converse is not true. Binary to Decimal Conversion discusses an algorithm that uses the inexact exception. If floating-point hardware does not have flags of its own, but instead interrupts the operating system to signal a floating-point exception, the cost of inexact exceptions could be prohibitive.

This cost can be avoided by having the status flags maintained by software. The first time an exception is raised, set the software flag for the appropriate class, and tell the floating-point hardware to mask off that class of exceptions. Then all further exceptions will run without interrupting the operating system. When a user resets that status flag, the hardware mask is re-enabled. One obvious use for trap handlers is for backward compatibility.

Old codes that expect to be aborted when exceptions occur can install a trap handler that aborts the process. Another use of trap handlers comes up when computing a product x₁·x₂·…·x_n that could potentially overflow. One solution is to use logarithms, and compute exp(Σ log xᵢ) instead. The problem with this approach is that it is less accurate, and that it costs more than the simple product, even if there is no overflow. A solution using trap handlers, called over/underflow counting, avoids both of these problems.

There is a global counter initialized to zero. Whenever the partial product p_k = x₁·x₂·…·x_k overflows for some k, the trap handler increments the counter by one and returns the overflowed quantity with the exponent wrapped around. Similarly, if p_k underflows, the counter would be decremented, and a negative exponent would get wrapped around into a positive one.

When all the multiplications are done, if the counter is zero then the final product is p_n. If the counter is positive, the product overflowed; if the counter is negative, it underflowed. If none of the partial products are out of range, the trap handler is never called and the computation incurs no extra cost. The definition of wrapped-around for overflow is that the result is computed as if to infinite precision, then divided by 2^α, and then rounded to the relevant precision.

For underflow, the result is multiplied by 2^α. The exponent α is 192 for single precision and 1536 for double precision. By default, rounding means round toward nearest. The standard requires that three other rounding modes be provided, namely round toward 0, round toward +∞, and round toward −∞. When used with the convert to integer operation, round toward −∞ causes the convert to become the floor function, while round toward +∞ is ceiling. The rounding mode affects overflow, because when round toward 0 or round toward −∞ is in effect, an overflow of positive magnitude causes the default result to be the largest representable number, not +∞. Similarly, overflows of negative magnitude will produce the largest negative number when round toward +∞ or round toward 0 is in effect. One application of rounding modes occurs in interval arithmetic (another is mentioned in the section Binary to Decimal Conversion).

When using interval arithmetic, the sum of two numbers x and y is an interval [z_lo, z_hi], where z_lo is x ⊕ y rounded toward −∞ and z_hi is x ⊕ y rounded toward +∞. The exact result of the addition is contained within the interval [z_lo, z_hi]. Without rounding modes, interval arithmetic is usually implemented by computing (x ⊕ y)(1 − ε) and (x ⊕ y)(1 + ε), where ε is machine epsilon; this results in overestimates for the size of the intervals. Since the result of an operation in interval arithmetic is an interval, in general the input to an operation will also be an interval.

If the intervals [x_lo, x_hi] and [y_lo, y_hi] are added, the result is [z_lo, z_hi], where z_lo is x_lo ⊕ y_lo with the rounding mode set to round toward −∞, and z_hi is x_hi ⊕ y_hi with the rounding mode set to round toward +∞. When a floating-point calculation is performed using interval arithmetic, the final answer is an interval that contains the exact result of the calculation. This is not very helpful if the interval turns out to be large (as it often does), since the correct answer could be anywhere in that interval.
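
A minimal C99 sketch of interval addition with directed rounding, assuming a compiler that honors the FENV_ACCESS pragma so the additions are not folded or reordered across the rounding-mode changes:

    #include <fenv.h>
    #pragma STDC FENV_ACCESS ON

    /* Interval addition: the lower endpoint is computed in round-toward-minus-
       infinity and the upper endpoint in round-toward-plus-infinity, so the
       exact sum is guaranteed to lie in [lo, hi]. */
    typedef struct { double lo, hi; } interval;

    interval interval_add(interval a, interval b)
    {
        interval r;
        int old = fegetround();
        fesetround(FE_DOWNWARD);  r.lo = a.lo + b.lo;
        fesetround(FE_UPWARD);    r.hi = a.hi + b.hi;
        fesetround(old);          /* restore the caller's rounding mode */
        return r;
    }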

Interval arithmetic makes more sense when used in conjunction with a multiple precision floating-point package. The calculation is first performed with some precision p. If interval arithmetic suggests that the final answer may be inaccurate, the computation is redone with higher and higher precisions until the final interval is a reasonable size. The IEEE standard has a number of flags and modes. As discussed above, there is one status flag for each of the five exceptions: underflow, overflow, division by zero, invalid operation and inexact.

There are four rounding modes: round toward nearest, round toward +∞, round toward 0, and round toward −∞. It is strongly recommended that there be an enable mode bit for each of the five exceptions. This section gives some simple examples of how these modes and flags can be put to good use. A more sophisticated example is discussed in the section Binary to Decimal Conversion. Consider writing a subroutine to compute xⁿ, where n is an integer.

When n < 0, one strategy is to compute x⁻ⁿ with a routine PositivePower(x, −n) that handles only positive exponents, and then take the reciprocal. Unfortunately, there is a slight snag in this strategy. If PositivePower(x, −n) underflows, then either the underflow trap handler will be called, or else the underflow status flag will be set. This is incorrect, because if x⁻ⁿ underflows, then xⁿ will either overflow or be in range. But since the IEEE standard gives the user access to all the flags, the subroutine can easily correct for this. It simply turns off the overflow and underflow trap enable bits and saves the overflow and underflow status bits.

It then computes 1/PositivePower(x, −n). If neither the overflow nor underflow status bit is set, it restores them together with the trap enable bits; if one of the status bits is set, it restores the flags and redoes the calculation in a form that raises the correct exception. A similar issue arises when computing arccos via the formula arccos(x) = 2 arctan(√((1 − x)/(1 + x))). Because of infinity arithmetic, this formula gives the right answer even at x = −1, but there is a small snag: the computation of (1 − x)/(1 + x) will cause the divide by zero exception flag to be set, even though arccos(−1) is not exceptional.

The solution to this problem is straightforward: simply save the value of the divide by zero flag before computing arccos, and then restore its old value after the computation. The design of almost every aspect of a computer system requires knowledge about floating-point.
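
A C99 sketch of this save/restore pattern, using the arccos identity quoted above and assuming a compiler that honors the FENV_ACCESS pragma:

    #include <fenv.h>
    #include <math.h>
    #pragma STDC FENV_ACCESS ON

    /* arccos(x) computed as 2*atan(sqrt((1-x)/(1+x))).  At x = -1 the division
       raises the sticky divide-by-zero flag even though arccos(-1) = pi is not
       exceptional, so the flag is saved beforehand and restored afterwards. */
    double arccos_via_atan(double x)
    {
        fexcept_t saved;
        fegetexceptflag(&saved, FE_DIVBYZERO);   /* save the sticky flag  */
        double r = 2.0 * atan(sqrt((1.0 - x) / (1.0 + x)));
        fesetexceptflag(&saved, FE_DIVBYZERO);   /* restore its old value */
        return r;
    }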

Computer architectures usually have floating-point instructions, compilers must generate those floating-point instructions, and the operating system must decide what to do when exception conditions are raised for those floating-point instructions.

Computer system designers rarely get guidance from numerical analysis texts, which are typically aimed at users and writers of software, not at computer designers.

This example will be analyzed in the next section. Incidentally, some people think that the solution to such anomalies is never to compare floating-point numbers for equality, but instead to consider them equal if they are within some error bound E. This is hardly a cure-all, because it raises as many questions as it answers: what should the value of E be? It is quite common for an algorithm to require a short burst of higher precision in order to produce accurate results; one example occurs in the quadratic formula (−b ± √(b² − 4ac))/2a.

As discussed in the section Proof of Theorem 4, when b² ≈ 4ac, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula. By performing the subcalculation of b² − 4ac in double precision, half the double precision bits of the root are lost, which means that all the single precision bits are preserved. The computation of b² − 4ac in double precision when each of the quantities a, b, and c are in single precision is easy if there is a multiplication instruction that takes two single precision numbers and produces a double precision result.

In order to produce the exactly rounded product of two p-digit numbers, a multiplier needs to generate the entire 2p bits of product, although it may throw bits away as it proceeds. Thus, hardware to compute a double precision product from single precision operands will normally be only a little more expensive than a single precision multiplier, and much cheaper than a double precision multiplier.
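
In C the same effect can be sketched by widening the operands before multiplying; this relies on the fact that the product of two IEEE single precision values is exact in double precision:

    /* b*b and 4*a*c are exact because each product of two 24-bit significands
       fits in a 53-bit significand; only the final subtraction is rounded. */
    double discriminant(float a, float b, float c)
    {
        return (double)b * (double)b - 4.0 * (double)a * (double)c;
    }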

However, this instruction has many other uses. Consider the problem of solving a system of linear equations Ax = b, for which an approximate solution x1 has been computed, perhaps by Gaussian elimination. There is a simple way to improve the accuracy of the result called iterative improvement. First compute the residual ξ = A·x1 − b and then solve the system Ay = ξ. Note that if x1 is an exact solution, then ξ is the zero vector, as is y. Then y ≈ x1 − x, where x is the true solution, so an improved estimate for the solution is x2 = x1 − y. The three steps can be repeated, replacing x1 with x2, and x2 with x3. This argument that each iterate is more accurate than the last is only informal; the key requirement is that the residual be computed to higher precision than the data. Once again, this is a case of computing the product of two single precision numbers (A and x) where the full double precision result is needed. To summarize, instructions that multiply two floating-point numbers and return a product with twice the precision of the operands make a useful addition to a floating-point instruction set.
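
A C sketch of the key step, with the matrix, right-hand side and current solution in single precision and the residual accumulated in double precision:

    #include <stddef.h>

    /* Residual r = b - A*x accumulated in double precision from single
       precision data (A is n-by-n, row major).  Each float*float product is
       exact when formed in double. */
    void residual(size_t n, const float *A, const float *b,
                  const float *x, double *r)
    {
        for (size_t i = 0; i < n; i++) {
            double acc = (double)b[i];
            for (size_t j = 0; j < n; j++)
                acc -= (double)A[i * n + j] * (double)x[j];
            r[i] = acc;
        }
    }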

Ideally, a language definition should define the semantics of the language precisely enough to prove statements about programs. While this is usually true for the integer part of a language, language definitions often have a large grey area when it comes to floating-point. Perhaps this is due to the fact that many language designers believe that nothing can be proven about floating-point, since it entails rounding error.

If so, the previous sections have demonstrated the fallacy in this reasoning. Thinking about floating-point in this fuzzy way stands in sharp contrast to the IEEE model, where the result of each floating-point operation is precisely defined.

The IEEE standard precisely specifies the behavior of exceptions, and so languages that use the standard as a model can avoid any ambiguity on this point. Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. The importance of preserving parentheses cannot be overemphasized.

The algorithms presented in Theorems 3, 4 and 6 all depend on it. A language definition that does not require parentheses to be honored is useless for floating-point calculations. Subexpression evaluation is imprecisely defined in many languages. Suppose that ds is double precision, but x and y are single precision. Then in the expression ds + x*y, is the product performed in single or double precision?

There are two ways to deal with this problem, neither of which is completely satisfactory. The first is to require that all variables in an expression have the same type. This is the simplest solution, but has some drawbacks. First of all, languages like Pascal that have subrange types allow mixing subrange variables with integer variables, so it is somewhat bizarre to prohibit mixing single and double precision variables.

Another problem concerns constants. In the expression 0.1*x, most languages interpret 0.1 to be a single precision constant. Now suppose the programmer decides to change the declaration of all the floating-point variables from single to double precision.

If 0.1 is still treated as a single precision constant, then there will be a compile time error. The programmer will have to hunt down and change every floating-point constant. The second approach is to allow mixed expressions, in which case rules for subexpression evaluation must be provided.

There are a number of guiding examples. The simplest rule is to compute every expression in the highest precision available, but this leads to anomalies like the example at the beginning of this section: the quotient 3.0/7.0 is computed in double precision, but if q is a single-precision variable, the quotient is rounded to single precision for storage.

Since 3/7 is a repeating binary fraction, its computed value in double precision is different from its stored value in single precision, so a subsequent comparison of q with 3.0/7.0 fails. This suggests that computing every expression in the highest precision available is not a good rule. Another guiding example is inner products.

If the inner product has thousands of terms, the rounding error in the sum can become substantial. One way to reduce this rounding error is to accumulate the sums in double precision (this will be discussed in more detail in the section Optimizers). If the multiplication is done in single precision, then much of the advantage of double precision accumulation is lost, because the product is truncated to single precision just before being added to a double precision variable. A rule that covers both of the previous two examples is to compute an expression in the highest precision of any variable that occurs in that expression.

However, this rule is too simplistic to cover all cases cleanly. A better approach uses two passes over the expression tree. First assign each operation a tentative precision, which is the maximum of the precisions of its operands; this assignment has to be carried out from the leaves to the root of the expression tree.

Then perform a second pass from the root to the leaves, in which each operation is assigned the maximum of its tentative precision and the precision expected by its parent. This can have some annoying consequences. For example, suppose you are debugging a program and want to know the value of a subexpression.

You cannot simply type the subexpression to the debugger and ask it to be evaluated, because the value of the subexpression in the program depends on the expression it is embedded in.

A final comment on subexpressions: since converting decimal constants to binary is an operation, the evaluation rule also affects the interpretation of decimal constants. This is especially important for constants like 0.1 which are not exactly representable in binary. Another potential grey area occurs when a language includes exponentiation as one of its built-in operations. One definition might be to use the method shown in the section Infinity. For example, to determine the value of a^b, consider non-constant analytic functions f and g with the property that f(x) → a and g(x) → b as x → 0. If f(x)^g(x) always approaches the same limit, then this should be the value of a^b.

However, the IEEE standard says nothing about how these features are to be accessed from a programming language. Thus, there is usually a mismatch between floating-point hardware that supports the standard and programming languages like C, Pascal or FORTRAN. Some of the IEEE capabilities can be accessed through a library of subroutine calls. For example, the IEEE standard requires that square root be exactly rounded, and the square root function is often implemented directly in hardware.

This functionality is easily accessed via a library square root routine. However, other aspects of the standard are not so easily implemented as subroutines.

For example, most computer languages specify at most two floating-point types, while the IEEE standard has four different precisions (although the recommended configurations are single plus single-extended or single, double, and double-extended). Infinity provides another example: the constants ∞ and NaN could be supplied by a subroutine, but that might make them unusable in places that require constant expressions, such as the initializer of a constant variable. A more subtle situation is manipulating the state associated with a computation, where the state consists of the rounding modes, trap enable bits, trap handlers and exception flags.

One approach is to provide subroutines for reading and writing the state. In addition, a single call that can atomically set a new value and return the old value is often useful.

As the examples in the section Flags show, a very common pattern of modifying IEEE state is to change it only within the scope of a block or subroutine. Thus the burden is on the programmer to find each exit from the block, and make sure the state is restored.

Language support for setting the state precisely in the scope of a block would be very useful here. Another set of questions concerns how NaNs interact with language primitives; in fact, the expression x ≠ x is the simplest way to test for a NaN if the IEEE recommended function isnan is not provided. For example, when computing the appropriate scale factor to use in plotting a graph, the maximum of a set of values must be computed.
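
A minimal C sketch of such a NaN-ignoring maximum, using the x ≠ x test rather than isnan and returning −∞ for an empty or all-NaN input (an arbitrary choice, not anything the standard prescribes):

    #include <math.h>
    #include <stddef.h>

    /* Maximum of an array that simply skips NaN entries; returns -infinity
       if the array is empty or contains only NaNs. */
    double max_ignoring_nan(const double *v, size_t n)
    {
        double best = -HUGE_VAL;
        for (size_t i = 0; i < n; i++) {
            if (v[i] != v[i])     /* true only for NaN */
                continue;
            if (v[i] > best)
                best = v[i];
        }
        return best;
    }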

In this situation it makes sense for the max operation to simply ignore NaNs. Finally, rounding can be a problem. The IEEE standard defines rounding very precisely, and it depends on the current value of the rounding modes. This sometimes conflicts with the definition of implicit rounding in type conversions or the explicit round function in languages.

For example, the compiler text of Aho et al. suggests replacing the expression x/2.0 with x*0.5, which might lead a reader to assume that x/10.0 may likewise be replaced by 0.1*x. However, these two expressions do not have the same semantics on a binary machine, because 0.1 cannot be represented exactly in binary. Although the text does qualify the statement that any algebraic identity can be used when optimizing code by noting that optimizers should not violate the language definition, it leaves the impression that floating-point semantics are not very important. Consider also the problem of summing a long series of numbers: because each addition can potentially introduce an error as large as half an ulp, a sum with thousands of terms can accumulate substantial rounding error.

A simple way to correct for this is to store each partial sum in a double precision variable and to perform each addition using double precision. If the calculation is being done in single precision, performing the sum in double precision is easy on most computer systems.

However, if the calculation is already being done in double precision, doubling the precision is not so simple.

One method that is sometimes advocated is to sort the numbers and add them from smallest to largest. Comparing this with the error bound of the Kahan summation formula shows a dramatic improvement: each summand is perturbed by only 2ε, instead of perturbations as large as nε in the simple formula. These examples can be summarized by saying that optimizers should be extremely cautious when applying algebraic identities that hold for the mathematical real numbers to expressions involving floating-point variables. Another way that optimizers can change the semantics of floating-point code involves constants.

In an expression like 1.0E-40*x, there is an implicit decimal to binary conversion that turns the decimal constant into a binary one. Because this constant cannot be represented exactly in binary, the inexact exception should be raised. In addition, the underflow flag should be set if the expression is evaluated in single precision.

Since the constant is inexact, its exact conversion to binary depends on the current value of the IEEE rounding modes. Thus an optimizer that converts 1.0E-40 to binary at compile time would be changing the semantics of the program. However, constants that are exactly representable in the smallest available precision can be safely converted at compile time, since they are always exact, cannot raise any exception, and are unaffected by the rounding modes.

First of all, there are algebraic identities that are valid for floating-point numbers; for example, x + y = y + x holds. Perhaps language designers have in mind that floating-point numbers should model real numbers and obey the same laws that real numbers do. The problem with real number semantics is that they are extremely expensive to implement. Every time two n bit numbers are multiplied, the product will have 2n bits.

An algorithm that involves thousands of operations such as solving a linear system will soon be operating on numbers with many significant bits, and be hopelessly slow. Exact integer arithmetic is often provided by lisp systems and is handy for some problems. Trap handlers also raise some interesting systems issues.

The IEEE standard strongly recommends that users be able to specify a trap handler for each of the five classes of exceptions, and the section Trap Handlers gave some applications of user defined trap handlers. In the case of invalid operation and division by zero exceptions, the handler should be provided with the operands; otherwise, with the exactly rounded result. Depending on the programming language being used, the trap handler might be able to access other variables in the program as well.

For all exceptions, the trap handler must be able to identify what operation was being performed and the precision of its destination. The IEEE standard assumes that operations are conceptually serial and that when an interrupt occurs, it is possible to identify the operation and its operands. On machines which have pipelining or multiple arithmetic units, when an exception occurs, it may not be enough to simply have the trap handler examine the program counter.

On hardware that can do an add and multiply in parallel, an optimizer would probably move the addition operation ahead of the second multiply, so that the add can proceed in parallel with the first multiply. It would not be reasonable for a compiler to avoid this kind of optimization, because every floating-point operation can potentially trap, and thus virtually all instruction scheduling optimizations would be eliminated. This problem can be avoided by prohibiting trap handlers from accessing any variables of the program directly.

Instead, the handler can be given the operands or result as an argument. But there are still problems. If the multiply traps, its argument z could already have been overwritten by the addition, especially since addition is usually faster than multiply.

Computer systems that support the IEEE standard must provide some way to save the value of z, either in hardware or by having the compiler avoid such a situation in the first place. W. Kahan has proposed using presubstitution instead of trap handlers to avoid these problems. In this method, the user specifies an exception and the value he wants to be used as the result when the exception occurs.

For example, to make sin(x)/x evaluate to 1 at x = 0, a user of IEEE trap handlers would write a handler that returns a value of 1 and install it before computing sin(x)/x. Using presubstitution, the user would instead specify that when an invalid operation occurs, the value 1 should be used. Kahan calls this presubstitution, because the value to be used must be specified before the exception occurs. When using trap handlers, the value to be returned can be computed when the trap occurs. The advantage of presubstitution is that it has a straightforward hardware implementation: as soon as the type of exception has been determined, it can be used to index a table which contains the desired result of the operation.

Although presubstitution has some attractive attributes, the widespread acceptance of the IEEE standard makes it unlikely to be widely implemented by hardware manufacturers. A number of claims have been made in this paper concerning properties of floating-point arithmetic. We now proceed to show that floating-point is not black magic, but rather is a straightforward subject whose claims can be verified mathematically.

This section is divided into three parts. The first part presents an introduction to error analysis, and provides the details for the section Rounding Error. The second part explores binary to decimal conversion, filling in some gaps from the section The IEEE Standard. The third part discusses the Kahan summation formula, which was used as an example in the section Systems Aspects. In the discussion of rounding error, it was stated that a single guard digit is enough to guarantee that addition and subtraction will always be accurate (Theorem 2). We now proceed to verify this fact.

Theorem 2 has two parts, one for subtraction and one for addition. The part for subtraction states that if x and y are positive floating-point numbers in a format with parameters β and p, and if subtraction is done with p + 1 digits (i.e., one guard digit), then the relative rounding error in the result is less than 2ε.

Theorem 2 gives the relative error for performing one operation. Comparing the rounding error of x² − y² and (x ⊖ y) ⊗ (x ⊕ y) requires knowing the relative error of multiple operations. Assuming that multiplication is performed by computing the exact product and then rounding, the relative error of each multiplication is at most half an ulp. Another way to see this is to try to duplicate, for x ⊗ x ⊖ y ⊗ y, the analysis that worked on (x ⊖ y) ⊗ (x ⊕ y): when x and y are nearby, the error term can be as large as the result x − y. These computations formally justify our claim that (x ⊖ y) ⊗ (x ⊕ y) is more accurate than x² − y². We next turn to an analysis of the formula for the area of a triangle.

The analysis of the error in (x + y)*(x - y), immediately following the proof of Theorem 10, used the fact that the relative error in the basic operations of addition and subtraction is small (namely, the bounds established earlier in this section). This is the most common kind of error analysis.

This one cannot be eliminated by a simple rearrangement of the formula. Roughly speaking, when b^2 ≈ ac, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula. Since b^2 ≈ ac, the rounding errors committed in forming b^2 and ac are each on the order of ε times b^2, so the computed value of b^2 - ac carries an absolute error of about that size, and the error this induces in the computed roots destroys the bottom half of the bits of the roots r1 ≈ r2. In other words, since the calculation of the roots involves computing with sqrt(b^2 - ac), and this expression does not have meaningful bits in the position corresponding to the lower order half of ri, the lower order bits of ri cannot be meaningful.

It is based on the following fact, which is proven in the section containing Theorem 14. Theorem 6 gives a way to express the product of two working precision numbers exactly as a sum, and there is a companion formula for expressing a sum exactly. Nine decimal digits are enough to recover a single precision binary number; it might appear that fewer would do, but this is not the case. The same argument applied to double precision shows that 17 decimal digits are required to recover a double precision number. Binary-decimal conversion also provides another example of the use of flags.
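The companion formula is presumably of the familiar "two-sum" kind; whether the following matches the paper's exact formula is not shown in this extract, but a minimal C sketch of one well-known version (Dekker's Fast2Sum, valid in IEEE round-to-nearest binary arithmetic when |a| >= |b|) is:

    #include <stdio.h>

    /* Fast2Sum: for |a| >= |b|, s = fl(a + b) and err satisfy a + b = s + err
       exactly (binary arithmetic, round to nearest). */
    static void fast_two_sum(double a, double b, double *s, double *err) {
        *s = a + b;
        *err = b - (*s - a);   /* the roundoff lost in forming s */
    }

    int main(void) {
        double s, err;
        fast_two_sum(1.0e16, 3.14159, &s, &err);
        printf("s = %.17g  err = %.17g\n", s, err);  /* s + err is the exact sum */
        return 0;
    }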

Recall from the section Precision that to recover a binary number from its decimal expansion, the decimal to binary conversion must be computed exactly. The danger is double rounding: if the exact product is first rounded as part of the single-extended multiply operation and the result is then rounded to single precision, the answer can differ from the one obtained by rounding the exact product directly to single precision. The error is due to double rounding. By using the IEEE flags, double rounding can be avoided as follows.

Save the current value of the inexact flag, and then reset it. Set the rounding mode to round-to-zero. Then perform the multiplication. Store the new value of the inexact flag in ixflag, and restore the rounding mode and inexact flag.
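In C99 terms, the recipe might look like the following sketch; the double multiply stands in for the single-extended multiplication in the conversion algorithm, and the function and variable names are illustrative:

    #include <fenv.h>

    #pragma STDC FENV_ACCESS ON

    /* Flag-saving recipe described above, expressed with <fenv.h>. */
    int multiply_and_test_inexact(double a, double b, double *product) {
        fexcept_t saved_inexact;
        int saved_round, ixflag;

        fegetexceptflag(&saved_inexact, FE_INEXACT); /* save inexact flag     */
        feclearexcept(FE_INEXACT);                   /* ...and reset it       */
        saved_round = fegetround();
        fesetround(FE_TOWARDZERO);                   /* round-to-zero mode    */

        *product = a * b;

        ixflag = fetestexcept(FE_INEXACT) != 0;      /* 1 => digits truncated */

        fesetround(saved_round);                     /* restore rounding mode */
        fesetexceptflag(&saved_inexact, FE_INEXACT); /* restore inexact flag  */
        return ixflag;
    }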

If ixflag is 1, then some digits were truncated, since round-to-zero always truncates.

Turning to the accuracy of summation: the simplest approach to improving the accuracy of a sum x1 + x2 + ... + xn is to double the precision. In the error bound for a sum computed in working precision, the first term x1 is perturbed by nε, the last term xn by only ε, and the total error is bounded by roughly nε times the sum of the |xi|. Doubling the precision has the effect of squaring ε. If the sum is being done in an IEEE double precision format, 1/ε is about 10^16, so nε is far below 1 for any reasonable value of n.

Thus, doubling the precision takes the maximum perturbation of nε and changes it to nε^2. Thus the error bound for the Kahan summation formula (Theorem 8) is not as good as using double precision, even though it is much better than summing in single precision. For an intuitive explanation of why the Kahan summation formula works, consider the following outline of the procedure: each time a summand is added, there is a correction factor C which will be applied on the next loop.
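Written as code, the procedure is the short loop below (a minimal C sketch; the names S, C, Y, and T match the step-by-step description that follows, and the sketch assumes the compiler does not reassociate the floating-point expressions, e.g. no aggressive fast-math optimization):

    #include <stdio.h>

    /* Kahan (compensated) summation in single precision.  C carries the
       low order bits lost at each addition and is applied on the next loop. */
    float kahan_sum(const float *x, int n) {
        float S = 0.0f, C = 0.0f;
        for (int j = 0; j < n; j++) {
            float Y = x[j] - C;   /* corrected summand                         */
            float T = S + Y;      /* new running sum; low order bits of Y lost */
            C = (T - S) - Y;      /* (T - S) recovers what was actually added; */
                                  /* subtracting Y exposes the lost bits       */
            S = T;
        }
        return S;
    }

    int main(void) {
        float x[10000];
        for (int i = 0; i < 10000; i++) x[i] = 0.1f;
        /* Typically much closer to the true sum than a plain accumulation loop. */
        printf("%.7f\n", kahan_sum(x, 10000));
        return 0;
    }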

So first subtract the correction C computed in the previous loop from Xj, giving the corrected summand Y. Then add this summand to the running sum S. The low order bits of Y (namely Yl) are lost in the sum. Next compute the high order bits of Y by computing T - S. When Y is subtracted from this, the low order bits of Y will be recovered. These are the bits that were lost in the first sum. They become the correction factor for the next loop.

It is not uncommon for designers of computer systems to neglect the parts of a system related to floating-point. This is probably due to the fact that floating-point is given very little (if any) attention in the computer science curriculum.

This in turn has caused the apparently widespread belief that floating-point is not a quantifiable subject, and so there is little point in fussing over the details of hardware and software that deal with it. This paper has demonstrated that it is possible to reason rigorously about floating-point. For example, floating-point algorithms involving cancellation can be proven to have small relative errors if the underlying hardware has a guard digit, and there is an efficient algorithm for binary-decimal conversion that can be proven to be invertible, provided that extended precision is supported.

The task of constructing reliable floating-point software is made much easier when the underlying computer system is supportive of floating-point. In addition to the two examples just mentioned (guard digits and extended precision), the section Systems Aspects of this paper has examples ranging from instruction set design to compiler optimization illustrating how to better support floating-point. The increasing acceptance of the IEEE floating-point standard means that codes that utilize features of the standard are becoming ever more portable.

The section The IEEE Standard gave numerous examples illustrating how the features of the IEEE standard can be used in writing practical floating-point codes. This article was inspired by a course given by W.

Kahan at Sun Microsystems from May through July, which was very ably organized by David Hough of Sun. My hope is to enable others to learn about the interaction of floating-point and computer systems without having to get up in time to attend 8:00 a.m. lectures. Thanks are due to Kahan and many of my colleagues at Xerox PARC (especially John Gilbert) for reading drafts of this paper and providing many useful comments.

Reviews from Paul Hilfinger and an anonymous referee also helped improve the presentation.

Aho, Alfred V., et al. Compilers: Principles, Techniques and Tools, Addison-Wesley, Reading, MA.
ANSI. American National Standard Programming Language FORTRAN, ANSI Standard X3.9, American National Standards Institute, New York, NY.
Barnett, David. A Portable Floating-Point Environment, unpublished manuscript.
Brown, W. S. A Simple but Realistic Model of Floating-Point Computation, ACM Trans. Math. Software.
Cody, W. J. Floating-Point Standards -- Theory and Practice, in "Reliability in Computing: The Role of Interval Methods in Scientific Computing", ed. R. E. Moore, Academic Press, Boston, MA.
Coonen, Jerome. Contributions to a Proposed Standard for Binary Floating-Point Arithmetic, PhD Thesis, Univ. of California, Berkeley.
Dekker, T. J. A Floating-Point Technique for Extending the Available Precision, Numer. Math.
Demmel, James. Underflow and the Reliability of Numerical Software, SIAM J. Sci. Stat. Comput.
Farnum, Charles. Compiler Support for Floating-point Computation, Software-Practice and Experience.
Forsythe, G. E. and Moler, C. B. Computer Solution of Linear Algebraic Systems, Prentice-Hall, Englewood Cliffs, NJ.
Goldberg, I. Bennett. 27 Bits Are Not Enough for 8-Digit Accuracy, Comm. ACM.
Goldberg, David. Computer Arithmetic, in Computer Architecture: A Quantitative Approach, by David Patterson and John L. Hennessy, Appendix A, Morgan Kaufmann, Los Altos, CA.
Golub, Gene H. and Van Loan, Charles F. Matrix Computations, The Johns Hopkins University Press, Baltimore, MD.
Kahan, W. Branch Cuts for Complex Elementary Functions, in The State of the Art in Numerical Analysis, ed. by M. J. D. Powell and A. Iserles (Univ of Birmingham, England), Chapter 7, Oxford University Press, New York.
Kahan, W. Unpublished lectures given at Sun Microsystems, Mountain View, CA.
Kahan, W. and Coonen, Jerome T. The Near Orthogonality of Syntax, Semantics, and Diagnostics in Numerical Programming Environments, in The Relationship Between Numerical Computation and Programming Languages, ed. by J. K. Reid, North-Holland, Amsterdam.
Kulisch, U. W. and Miranker, W. L. The Arithmetic of the Digital Computer: A New Approach, SIAM Review.
Matula, D. W. and Kornerup, P. Finite Precision Rational Arithmetic: Slash Number Systems, IEEE Trans. Comput.
Nelson, G. Systems Programming With Modula-3, Prentice-Hall, Englewood Cliffs, NJ.
Reiser, John F. and Knuth, Donald E. Evading the Drift in Floating-point Addition, Inf. Process. Lett.
Walther, J. S. A Unified Algorithm for Elementary Functions, Proc. AFIPS Spring Joint Computer Conference.

The material that follows is not part of the published paper. It has been added to clarify certain points and correct possible misconceptions about the IEEE standard that the reader might infer from the paper.

This material was not written by David Goldberg, but it appears here with his permission.

The preceding paper has shown that floating-point arithmetic must be implemented carefully, since programmers may depend on its properties for the correctness and accuracy of their programs. In particular, the IEEE standard requires a careful implementation, and it is possible to write useful programs that work correctly and deliver accurate results only on systems that conform to the standard.

The reader might be tempted to conclude that such programs should be portable to all IEEE systems. Indeed, portable software would be easier to write if the remark "When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic" were true. Unfortunately, the IEEE standard does not guarantee that the same program will deliver identical results on all conforming systems.

Most programs will actually produce different results on different systems for a variety of reasons. For one, most programs involve the conversion of numbers between decimal and binary formats, and the IEEE standard does not completely specify the accuracy with which such conversions must be performed. Of course, most programmers know that these features lie beyond the scope of the IEEE standard. Many programmers may not realize that even a program that uses only the numeric formats and operations prescribed by the IEEE standard can compute different results on different systems.

In fact, the authors of the standard intended to allow different implementations to obtain different results. Their intent is evident in the definition of the term destination in the IEEE standard: "A destination may be either explicitly designated by the user or implicitly supplied by the system (for example, intermediate results in subexpressions or arguments for procedures)."

Thus, different systems may deliver their results to destinations with different precisions, causing the same program to produce different results (sometimes dramatically so), even though those systems all conform to the standard. Several of the examples in the preceding paper depend on some knowledge of the way floating-point arithmetic is rounded. In order to rely on examples such as these, a programmer must be able to predict how a program will be interpreted, and in particular, on an IEEE system, what the precision of the destination of each arithmetic operation may be.

Consequently, several of the examples given above, when implemented as apparently portable programs in a high-level language, may not work correctly on IEEE systems that normally deliver results to destinations with a different precision than the programmer expects. Below, we review some examples from the paper to show that delivering results in a wider precision than a program expects can cause it to compute wrong results even though it is provably correct when the expected precision is used.

These examples show that despite all that the IEEE standard prescribes, the differences it allows among different implementations can prevent us from writing portable, efficient numerical software whose behavior we can accurately predict. To develop such software, then, we must first create programming languages and environments that limit the variability the IEEE standard permits and allow programmers to express the floating-point semantics upon which their programs depend.

Current implementations of IEEE 754 arithmetic can be divided into two groups distinguished by the degree to which they support different floating-point formats in hardware.

Extended-based systems, exemplified by the Intel x86 family of processors, provide full support for an extended double precision format but only partial support for single and double precision: they provide instructions to load or store data in single and double precision, converting it on-the-fly to or from the extended double format, and they provide special modes (not the default) in which the results of arithmetic operations are rounded to single or double precision even though they are kept in registers in extended double format.

Motorola 68000 series processors round results to both the precision and range of the single or double formats in these modes. Intel x86 and compatible processors round results to the precision of the single or double formats but retain the same range as the extended double format. To see how this can affect a program, consider assigning the value of a double expression such as 3.0/7.0 (a quotient that is not exactly representable in binary) to a double variable q and then comparing q with the same expression; a sketch of such a program follows. On a system whose arithmetic is rounded correctly to double precision, q will be assigned the value of the expression rounded correctly to double precision. In the next line, the expression will again be evaluated in double precision, and of course the result will be equal to the value just assigned to q, so the program will print "Equal" as expected. On an extended-based system, even though the expression has type double, the quotient will be computed in a register in extended double format, and thus in the default mode, it will be rounded to extended double precision.
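A minimal program of the kind being discussed might look like this (the particular quotient is only illustrative):

    #include <stdio.h>

    int main(void) {
        double q;

        q = 3.0 / 7.0;              /* 3/7 is not exactly representable in binary */
        if (q == 3.0 / 7.0)
            printf("Equal\n");
        else
            printf("Not Equal\n");
        return 0;
    }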

When the resulting value is assigned to the variable q, however, it may then be stored in memory, and since q is declared double, the value will be rounded to double precision. In the next line, the expression may again be evaluated in extended precision, yielding a result that differs from the double precision value stored in q, causing the program to print "Not equal".

Of course, other outcomes are possible, too: the compiler could decide to store and thus round the value of the expression in the second line before comparing it with q, or it could keep q in a register in extended precision without storing it. An optimizing compiler might evaluate the expression at compile time, perhaps in double precision or perhaps in extended double precision. With one x86 compiler, the program prints "Equal" when compiled with optimization and "Not Equal" when compiled for debugging.

Finally, some compilers for extended-based systems automatically change the rounding precision mode to cause operations producing results in registers to round those results to single or double precision, albeit possibly with a wider range.

Some languages, such as Ada, were influenced in this respect by variations among different arithmetics prior to the IEEE standard.

More recently, languages like ANSI C have been influenced by standard-conforming extended-based systems. In fact, the ANSI C standard explicitly allows a compiler to evaluate a floating-point expression to a precision wider than that normally associated with its type. Extended-based systems run most efficiently when expressions are evaluated in extended precision registers whenever possible, yet values that must be stored are stored in the narrowest precision required.

Recall the algorithm presented in Theorem 4 for computing ln(1 + x), written here in Fortran (the listing is restored to match the references below to its sixth line):

      real function log1p(x)
      real x
      if (1.0 + x .eq. 1.0) then
         log1p = x
      else
         log1p = log(1.0 + x) * x / ((1.0 + x) - 1.0)
      endif
      return
      end

On an extended-based system, the subexpressions in this code may be evaluated in a mixture of single and extended precision. Thus, if x is not so small that 1.0 + x rounds to 1.0 in extended precision, but is small enough that 1.0 + x rounds to 1.0 in single precision, then the value returned by log1p(x) will be zero instead of x, and the relative error will be one--rather larger than the bound guaranteed by Theorem 4. Similarly, suppose the rest of the expression in the sixth line, including the reoccurrence of the subexpression 1.0 + x, is evaluated in extended precision.

In that case, if x is small but not quite small enough that 1.0 + x rounds to 1.0 in single precision, then the value returned by log1p(x) can exceed the correct value by nearly as much as x, and again the relative error can approach one.

For a concrete example, take x to be the smallest single precision number such that 1.0 + x rounds up to the next larger representable number, 1 + 2^-23; such an x is only slightly larger than 2^-24, and the correct value of log1p(x) is approximately x. The log in the numerator then sees an argument of 1 + 2^-23 and returns approximately 2^-23. Because the denominator in the expression in the sixth line is evaluated in extended precision, it is computed exactly and delivers x, so log1p(x) returns approximately 2^-23, which is nearly twice as large as the exact value.

This actually happens with at least one compiler. When the preceding code is compiled by the Sun WorkShop Compilers Fortran 77 compiler for x86 systems using the -O optimization flag, the generated code evaluates 1.0 + x exactly as described above.

As a result, the function delivers inaccurate values for small arguments x of the kind just described. Of course, since log is a generic intrinsic function in Fortran, a compiler could evaluate the expression 1.0 + x in extended precision throughout, computing its logarithm in the same precision, but evidently we cannot assume that the compiler will do so.

One can also imagine a similar example involving a user-defined function. In that case, a compiler could still keep the argument in extended precision even though the function returns a single precision result, but few if any existing Fortran compilers do this, either.

We might therefore attempt to ensure that 1.0 + x is evaluated consistently by assigning it to a variable. Unfortunately, if we declare that variable real, we may still be foiled by a compiler that substitutes a value kept in a register in extended precision for one appearance of the variable and a value stored in memory in single precision for another. Instead, we would need to declare the variable with a type that corresponds to the extended precision format.

In short, there is no portable way to write this program in standard Fortran that is guaranteed to prevent the expression 1.0 + x from being evaluated in a way that invalidates our proof.

There are other examples that can malfunction on extended-based systems even when each subexpression is stored and thus rounded to the same precision.

The cause is double-rounding. In the default precision mode, an extended-based system will initially round each result to extended double precision. If that result is then stored to double precision, it is rounded again.

The combination of these two roundings can yield a value that is different than what would have been obtained by rounding the first result correctly to double precision. This can happen when the result as rounded to extended double precision is a "halfway case", i.e., it lies exactly halfway between two double precision numbers, so that the second rounding is determined by the round-ties-to-even rule.

If this second rounding rounds in the same direction as the first, the net rounding error will exceed half a unit in the last place. Note, though, that double-rounding only affects double precision computations.
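As a toy illustration (with made-up, reduced precisions rather than the IEEE formats): suppose the final format keeps two fraction bits and the intermediate format keeps four, and consider the exact binary value x = 1.00100001. Rounding x directly to two fraction bits gives 1.01, since x lies just above the halfway point 1.001. Rounding x first to four fraction bits gives 1.0010 (x lies below that format's halfway point 1.00101), and 1.0010 is exactly the halfway case for the two-bit format, so the second rounding goes to the even candidate 1.00. The double-rounded result 1.00 is in error by 0.12890625, more than half a unit in the last place (0.125) of the two-bit format, whereas the directly rounded result 1.01 is in error by only 0.12109375.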

Among the algorithms that depend on double precision results being rounded correctly, perhaps the most useful are the portable algorithms for performing simulated multiple precision arithmetic mentioned in the section Exactly Rounded Operations; for example, double-rounding can cause the splitting of a double precision number into high and low parts to go wrong. Also, later steps in the multiple precision multiplication algorithm assume that all partial products have been computed in double precision.

Handling a mixture of double and extended double variables correctly would make the implementation significantly more expensive. Likewise, portable algorithms for adding multiple precision numbers represented as arrays of double precision numbers can fail in double-rounding arithmetic. Here again, it would be possible to recover the roundoff error by computing the sum in extended double precision, but then a program would have to do extra work to reduce the final outputs back to double precision, and double-rounding could afflict this process, too.

For this reason, although portable programs for simulating multiple precision arithmetic by these methods work correctly and efficiently on a wide variety of machines, they do not work as advertised on extended-based systems.

Finally, some algorithms that at first sight appear to depend on correct rounding may in fact work correctly with double-rounding.

In these cases, the cost of coping with double-rounding lies not in the implementation but in the verification that the algorithm works as advertised. Consider first the case in which each floating-point operation is rounded correctly to double precision; there, the rounded quotient, when multiplied by n, again rounds to m.

In this case, we can appeal to the arguments of the previous paragraph provided we consider the fact that the quotient will be rounded twice. The proof also shows that extending our reasoning to include the possibility of double-rounding can be challenging even for a program with only two floating-point operations.

For a more complicated program, it may be impossible to systematically account for the effects of double-rounding, not to mention more general combinations of double and extended double precision computations.

The preceding examples should not be taken to suggest that extended precision per se is harmful.

Many programs can benefit from extended precision when the programmer is able to use it selectively. Unfortunately, current programming languages do not provide sufficient means for a programmer to specify when and how extended precision should be used.

To indicate what support is needed, we consider the ways in which we might want to manage the use of extended precision. In a portable program that uses double precision as its nominal working precision, there are five ways we might want to control the use of a wider precision. The first is simply to compile to produce the fastest code, using extended precision where possible on extended-based systems.

Clearly most numerical software does not require more of the arithmetic than that the relative error in each operation is bounded by the "machine epsilon".

When data in memory are stored in double precision, the machine epsilon is usually taken to be the largest relative roundoff error in that precision, since the input data are rightly or wrongly assumed to have been rounded when they were entered and the results will likewise be rounded when they are stored. Thus, while computing some of the intermediate results in extended precision may yield a more accurate result, extended precision is not essential.

In this case, we might prefer that the compiler use extended precision only when it will not appreciably slow the program and use double precision otherwise. A second approach is to use a format wider than double if it is reasonably fast and wide enough, and otherwise to resort to something else. Some computations can be performed more easily when extended precision is available, but they can also be carried out in double precision with only somewhat greater effort.

Consider computing the Euclidean norm of a vector of double precision numbers. By computing the squares of the elements and accumulating their sum in an IEEE extended double format with its wider exponent range, we can trivially avoid premature underflow or overflow for vectors of practical lengths.
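A minimal C sketch of this approach (it assumes long double maps onto an extended format with a wider exponent range than double, which is typical on x86 but not guaranteed by the C language):

    #include <math.h>
    #include <stddef.h>

    /* Euclidean norm of a vector of doubles, accumulating the sum of squares
       in long double.  If long double has a wider exponent range than double,
       the squares cannot prematurely overflow or underflow for vectors of
       practical length. */
    double norm(const double *x, size_t n) {
        long double sum = 0.0L;
        for (size_t i = 0; i < n; i++)
            sum += (long double)x[i] * x[i];
        return (double)sqrtl(sum);
    }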

On extended-based systems, this is the fastest way to compute the norm. Note that to support this use of extended precision, a language must provide both an indication of the widest available format that is reasonably fast, so that a program can choose which method to use, and environmental parameters that indicate the precision and range of each format, so that the program can verify that the widest fast format is wide enough (e.g., that it has a wider exponent range than double).

For more complicated programs than the Euclidean norm example, the programmer may simply wish to avoid the need to write two versions of the program and instead rely on extended precision even if it is slow. For programs that are most easily written to depend on correctly rounded double precision arithmetic, including some of the examples mentioned above, a language must provide a way for the programmer to indicate that extended precision must not be used, even though intermediate results may be computed in registers with a wider exponent range than double.

Intermediate results computed in this way can still incur double-rounding if they underflow when stored to memory: if the result of an arithmetic operation is rounded first to 53 significant bits, then rounded again to fewer significant bits when it must be denormalized, the final result may differ from what would have been obtained by rounding just once to a denormalized number.

Of course, this form of double-rounding is highly unlikely to affect any practical program adversely. Finally, we might want to round results correctly to both the precision and range of the double format. This strict enforcement of double precision would be most useful for programs that test either numerical software or the arithmetic itself near the limits of both the range and precision of the double format.

Such careful test programs tend to be difficult to write in a portable way; they become even more difficult and error prone when they must employ dummy subroutines and other tricks to force results to be rounded to a particular format.

In fact, few languages have attempted to give the programmer the ability to control the use of extended precision at all. One exception is the C99 standard, which requires an implementation that evaluates expressions in a format wider than their type to do so consistently: in particular, the same implementation must keep anonymous expressions (intermediate results not assigned to declared variables) in extended precision even when they are stored in memory, e.g., when a register must be spilled.

A C99 standard version of the log1p function is guaranteed to work correctly if the expression 1.0 + x is assigned to a variable (of any type) and that variable is used throughout. A portable, efficient C99 standard program for splitting a double precision number into high and low parts, however, is more difficult: how can we split at the correct position and avoid double-rounding if we cannot guarantee that double expressions are rounded correctly to double precision?
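A sketch of such a version (the function name is chosen to avoid colliding with the log1p already declared in the C99 math library):

    #include <math.h>

    /* ln(1 + x) via the algorithm of Theorem 4.  Assigning 1.0 + x to the
       variable w and using only w afterwards guarantees, under C99 semantics,
       that every occurrence of the subexpression refers to one and the same
       double precision value, which is what the correctness proof requires. */
    double log1p_theorem4(double x) {
        double w = 1.0 + x;
        if (w == 1.0)
            return x;                   /* 1 + x rounded to 1: ln(1 + x) ~ x */
        return log(w) * x / (w - 1.0);  /* rounding errors in w cancel       */
    }

The splitting problem raised above is harder; it is what the rounding precision modes discussed next address.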

On extended-based systems, this merely requires changing the rounding precision mode, but unfortunately, the C99 standard does not provide a portable way to do this. Early drafts of the Floating-Point C Edits, the working document that specified the changes to be made to the C90 standard to support floating-point, recommended that implementations on systems with rounding precision modes provide fegetprec and fesetprec functions to get and set the rounding precision, analogous to the fegetround and fesetround functions that get and set the rounding direction.

The fast types could allow compilers on extended-based systems to generate the fastest possible code subject only to the constraint that the values of named variables must not appear to change as a result of register spilling. The exact width types would cause compilers on extended-based systems to set the rounding precision mode to round to the specified precision, allowing wider range subject to the same constraint.

Together with environmental parameter macros named accordingly, such a scheme would readily support all five options described above and allow programmers to indicate easily and unambiguously the floating-point semantics their programs require.

Must language support for extended precision be so complicated?

Extended-based systems, however, pose difficult choices: they support neither pure double precision nor pure extended precision computation as efficiently as a mixture of the two, and different programs call for different mixtures. Moreover, the choice of when to use extended precision should not be left to compiler writers, who are often tempted by benchmarks and sometimes told outright by numerical analysts to regard floating-point arithmetic as "inherently inexact" and therefore neither deserving nor capable of the predictability of integer arithmetic.

Instead, the choice must be presented to programmers, and they will require languages capable of expressing their selection.

The foregoing remarks are not intended to disparage extended-based systems but to expose several fallacies, the first being that all IEEE systems must deliver identical results for the same program. Systems that supply a fused multiply-add introduce yet another source of variation: a fused multiply-add can also foil the splitting process of Theorem 6, although it can be used in a non-portable way to perform multiple precision multiplication without the need for splitting.

Many programmers like to believe that they can understand the behavior of a program and prove that it will work correctly without reference to the compiler that compiles it or the computer that runs it. In many ways, supporting this belief is a worthwhile goal for the designers of computer systems and programming languages. Unfortunately, when it comes to floating-point arithmetic, the goal is virtually impossible to achieve. As a result, despite nearly universal conformance to most of the IEEE standard throughout the computer industry, programmers of portable software must continue to cope with unpredictable floating-point arithmetic. If programmers are to exploit the features of IEEE 754, they will need programming languages that make floating-point arithmetic predictable.

Whether future languages will choose instead to allow programmers to write a single program with syntax that unambiguously expresses the extent to which it depends on IEEE semantics remains to be seen.

Existing extended-based systems threaten that prospect by tempting us to assume that the compiler and the hardware can know better than the programmer how a computation should be performed on a given system.
