5814b7b279
Implement double precision log10 function correctly rounded for all rounding modes. This implementation currently needs FMA instructions for correctness. Use 2 passes: Fast pass: - 1 step range reduction with a lookup table of `2^7 = 128` elements to reduce the ranges to `[-2^-7, 2^-7]`. - Use a degree-7 minimax polynomial generated by Sollya, evaluated using a mixed of double-double and double precisions. - Apply Ziv's test for accuracy. Accurate pass: - Apply 5 more range reduction steps to reduce the ranges further to [-2^-27, 2^-27]. - Use a degree-4 minimax polynomial generated by Sollya, evaluated using 192-bit precisions. - By the result of Lefevre (add quote), this is more than enough for correct rounding to all rounding modes. In progress: Adding detail documentations about the algorithm. Depend on: https://reviews.llvm.org/D136799 Reviewed By: zimmermann6 Differential Revision: https://reviews.llvm.org/D139846
558 lines
20 KiB
ReStructuredText
558 lines
20 KiB
ReStructuredText
.. _log_algorithm:
|
|
|
|
========================
|
|
Log/Log10/Log2 Algorithm
|
|
========================
|
|
|
|
.. default-role:: math
|
|
|
|
In this short note, we will discuss in detail about the computation of
|
|
:math:`\log(x)` function, with double precision inputs, in particular, the range
|
|
reduction steps and error analysis. The algorithm is broken down into 2 main
|
|
phases as follow:
|
|
|
|
1. Fast phase:
|
|
|
|
a. Range reduction
|
|
b. Polynomial approximation
|
|
c. Ziv's test
|
|
|
|
2. Accurate phase (if Ziv's test failed):
|
|
|
|
a. Further range reduction
|
|
b. Polynomial approximation
|
|
|
|
|
|
Fast phase
|
|
==========
|
|
|
|
Range reduction
|
|
---------------
|
|
|
|
Let `x = 2^{e_x} (1 + m_x)` be a normalized double precision number, in which
|
|
`-1074 \leq e_x \leq 1022` and `0 \leq m_x < 1` such that
|
|
`2^{52} m_x \in \mathbb{Z}`.
|
|
|
|
Then from the properties of logarithm:
|
|
|
|
.. math::
|
|
\log(x) &= \log\left( 2^{e_x} (1 + m_x) \right) \\
|
|
&= \log\left( 2^{e_x} \right) + \log(1 + m_x) \\
|
|
&= e_x \log(2) + \log(1 + m_x)
|
|
|
|
the computation of `\log(x)` can be reduced to:
|
|
|
|
1. compute the product of `e_x` and `\log(2)`,
|
|
2. compute `\log(1 + m_x)` for `0 \leq m_x < 1`,
|
|
3. add step 1 and 2.
|
|
|
|
To compute `\log(1 + m_x)` in step 2, we can reduce the range further by finding
|
|
`r > 0` such that:
|
|
|
|
.. math::
|
|
| r(1 + m_x) - 1 | < C \quad \quad \text{(R1)}
|
|
|
|
for small `0 < C < 1`. Then if we let `u = r(1 + m_x) - 1`, `|u| < C`:
|
|
|
|
.. math::
|
|
\log(1 + m_x) &= \log \left( \frac{r (1 + m_x)}{r} \right) \\
|
|
&= \log(r (1 + m_x) ) - \log(r) \\
|
|
&= \log(1 + u) - \log(r)
|
|
|
|
and step 2 can be computed with:
|
|
|
|
a. extract `r` and `-\log(r)` from look-up tables,
|
|
b. compute the reduced argument `u = r(1 + m_x) - 1`,
|
|
c. compute `\log(1 + u)` by polynomial approximation or further range reduction,
|
|
d. add step a and step c results.
|
|
|
|
|
|
How to derive `r`
|
|
-----------------
|
|
|
|
For an efficient implementation, we would like to use the first `M` signficicant
|
|
bits of `m_x` to look up for `r`. In particular, we would like to find a value
|
|
of `r` that works for all `m_x` satisfying:
|
|
|
|
.. math::
|
|
k 2^{-M} \leq m_x < (k + 1) 2^{-M} \quad \text{for some} \quad
|
|
k = 0..2^{M} - 1. \quad\quad \text{(M1)}
|
|
|
|
Let `r = 1 + s`, then `u` can be expressed in terms of `s` as:
|
|
|
|
.. math::
|
|
u &= r(1 + m_x) - 1 \\
|
|
&= (1 + s)(1 + m_x) - 1 \\
|
|
&= s m_x + s + m_x &\quad\quad \text{(U1)} \\
|
|
&= s (1 + m_x) + m_x \\
|
|
&= m_x (1 + s) + s.
|
|
|
|
From the condition `\text{(R1)}`, `s` is bounded by:
|
|
|
|
.. math::
|
|
\frac{-C - m_x}{1 + m_x} < s < \frac{C - m_x}{1 + m_x} \quad\quad \text{(S1)}.
|
|
|
|
Since our reduction constant `s` must work for all `m_x` in the interval
|
|
`I = \{ v: k 2^{-M} \leq v < (k + 1) 2^{-M} \}`, `s` is bounded by:
|
|
|
|
.. math::
|
|
\sup_{v \in I} \frac{-C - v}{1 + v} < s < \inf_{v \in I} \frac{C - v}{1 + v}
|
|
|
|
For a fixed constant `|c| < 1`, let `f(v) = \frac{c - v}{1 + v}`, then its
|
|
derivative is:
|
|
|
|
.. math::
|
|
f'(v) = \frac{(-1)(1 + v) - (1)(c - v)}{(1 + v)^2} = \frac{-1 - c}{(1 + v)^2}.
|
|
|
|
Since `|c| < 1`, `f'(v) < 0` for all `v \neq -1`, so:
|
|
|
|
.. math::
|
|
\sup_{v \in I} f(v) &= f \left( \inf\{ v: v \in I \} \right)
|
|
= f \left( k 2^{-M} \right) \\
|
|
\inf_{v \in I} f(v) &= f \left( \sup\{ v: v \in I \} \right)
|
|
= f \left( (k + 1) 2^{-M} \right)
|
|
|
|
Hence we have the following bound on `s`:
|
|
|
|
.. math::
|
|
\frac{-C - k 2^{-M}}{1 + k 2^{-M}} < s \leq
|
|
\frac{C - (k + 1) 2^{-M}}{1 + (k + 1) 2^{-M}}. \quad\quad \text{(S2)}
|
|
|
|
In order for `s` to exist, we need that:
|
|
|
|
.. math::
|
|
\frac{C - (k + 1) 2^{-M}}{1 + (k + 1) 2^{-M}} >
|
|
\frac{-C - k 2^{-M}}{1 + k 2^{-M}}
|
|
|
|
which is equivalent to:
|
|
|
|
.. math::
|
|
\quad\quad 2C - 2^{-M} + (2k + 1) 2^{-M} C > 0
|
|
\iff C > \frac{2^{-M - 1}}{1 + (2k + 1) 2^{-M - 1}} \quad\quad \text{(C1)}.
|
|
|
|
Consider the case `C = 2^{-N}`. Since `0 \leq k \leq 2^M - 1,` the right hand
|
|
side of `\text{(C1)}` is bounded by:
|
|
|
|
.. math::
|
|
2^{-M - 1} > \frac{2^{-M - 1}}{1 + (2k + 1) 2^{-M - 1}} \geq
|
|
\frac{2^{-M - 1}}{1 + (2^{M + 1} - 1) 2^{-M - 1}} > 2^{-M - 2}.
|
|
|
|
Hence, from `\text{(C1)}`, being an exact power of 2, `C = 2^{-N}` is bounded below
|
|
by:
|
|
|
|
.. math::
|
|
C = 2^{-N} \geq 2^{-M - 1}.
|
|
|
|
To make the range reduction efficient, we will want to minimize `C` (maximize
|
|
`N`) while keeping the required precision of `s`(`r`) as low as possible. And
|
|
for that, we will consider the following two cases: `N = M + 1` and `N = M`.
|
|
|
|
Case 1 - `N = M + 1`
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
When `N = M + 1`, `\text{(S2)}` becomes:
|
|
|
|
.. math::
|
|
\frac{-2^{-M - 1} - k 2^{-M}}{1 + k 2^{-M}} < s <
|
|
\frac{2^{-M - 1} - (k + 1) 2^{-M}}{1 + (k + 1) 2^{-M}}.
|
|
\quad\quad \text{(S2')}
|
|
|
|
This is an interval of length:
|
|
|
|
.. math::
|
|
l &= \frac{2^{-M - 1} - (k + 1) 2^{-M}}{1 + (k + 1) 2^{-M}} -
|
|
\frac{-2^{-M - 1} - k 2^{-M}}{1 + k 2^{-M}} \\
|
|
&= \frac{(2k + 1)2^{-2M - 1}}{(1 + k 2^{-M})(1 + (k + 1)2^{-M})}
|
|
\quad\quad \text{(L1)}
|
|
|
|
As a function of `k`, the length `l` has its derivative with respect to `k`:
|
|
|
|
.. math::
|
|
\frac{dl}{dk} =
|
|
\frac{2^{2M + 1} - 2k(k + 1) - 1}
|
|
{2^{4M}(1 + k 2^{-M})^2 (1 + (k + 1) 2^{-M})^2}
|
|
|
|
which is always positive for `0 \leq k \leq 2^M - 1`. So for all
|
|
`0 < k < 2^{-M}` (`k = 0` will be treated differently in edge cases), and for
|
|
`M > 2`, `l` is bounded below by:
|
|
|
|
.. math::
|
|
l > 2^{-2M}.
|
|
|
|
It implies that we can always find `s` with `\operatorname{ulp}(s) = 2^{-2M}`.
|
|
And from `\text{(U1)}`, `u = s(1 + m_x) + m_x`, its `ulp` is:
|
|
|
|
.. math::
|
|
\operatorname{ulp}(u) &= \operatorname{ulp}(s) \cdot \operatorname{ulp}(m_x) \\
|
|
&= 2^{-2M} \operatorname{ulp}(m_x).
|
|
|
|
Since:
|
|
|
|
.. math::
|
|
|u| < C = 2^{-N} = 2^{-M - 1},
|
|
|
|
Its required precision is:
|
|
|
|
.. math::
|
|
\operatorname{prec}(u) &= \log_2(2^{-M-1} / \operatorname{ulp}(u)) \\
|
|
&= \log_2(2^{M - 1} / \operatorname{ulp}(m_x)) \\
|
|
&= M - 1 - \log_2(\operatorname{ulp}(m_x)).
|
|
|
|
This means that in this case, we cannot restrict `u` to be exactly representable
|
|
in double precision for double precision input `x` with `M > 2`. Nonetheless,
|
|
for a reasonable value of `M`, we can have `u` exactly representable in double
|
|
precision for single precision input `x` (`\operatorname{ulp}(m_x) = 2^{-23}`)
|
|
such that `|u| < 2^{-M - 1}` using a look-up table of size `2^M`.
|
|
|
|
A particular formula for `s` can be derived from `\text{(S2')}` by the midpoint
|
|
formula:
|
|
|
|
.. math::
|
|
s &= 2^{-2M} \operatorname{round}\left( 2^{2M} \cdot \operatorname{midpoint}
|
|
\left(-\frac{-2^{-M - 1} - k2^{-M}}{1 + k 2^{-M}},
|
|
\frac{2^{-M-1} - (k + 1)2^{-M}}{1 + (k + 1) 2^{-M}}\right) \right) \\
|
|
&= 2^{-2M} \operatorname{round}\left( 2^{2M} \cdot \frac{1}{2} \left(
|
|
\frac{-2^{-M - 1} - k2^{-M}}{1 + k 2^{-M}} +
|
|
\frac{2^{-M - 1} + (k + 1)2^{-M}}{1 + (k + 1) 2^{-M}}
|
|
\right) \right) \\
|
|
&= 2^{-2M} \operatorname{round}\left( \frac{
|
|
- \left(k + \frac{1}{2} \right) \left(2^M - k - \frac{1}{2} \right) }
|
|
{(1 + k 2^{-N})(1 + (k + 1) 2^{-N})} \right) \\
|
|
&= - 2^{-2M} \operatorname{round}\left( \frac{
|
|
\left(k + \frac{1}{2} \right) \left(2^M - k - \frac{1}{2} \right) }
|
|
{(1 + k 2^{-N})(1 + (k + 1) 2^{-N})} \right) \quad\quad \text{(S3)}
|
|
|
|
The corresponding range and formula for `r = 1 + s` are:
|
|
|
|
.. math::
|
|
\frac{1 - 2^{-M - 1}}{1 + k 2^{-M}} < r \leq
|
|
\frac{1 + 2^{-M - 1}}{1 + (k + 1) 2^{-M}}
|
|
|
|
.. math::
|
|
r &= 2^{-2M} \operatorname{round}\left( 2^{2M} \cdot
|
|
\operatorname{midpoint}\left( \frac{1 - 2^{-M - 1}}{1 + k 2^{-M}},
|
|
\frac{1 + 2^{-M - 1}}{1 + (k + 1) 2^{-M}}\right) \right) \\
|
|
&= 2^{-2M} \operatorname{round}\left( 2^{2M} \cdot \frac{1}{2} \left(
|
|
\frac{1 + 2^{-M-1}}{1 + (k + 1) 2^{-M}} + \frac{1 - 2^{-M-1}}{1 + k 2^{-M}}
|
|
\right) \right) \\
|
|
&= 2^{-2M} \operatorname{round}\left( 2^{2M} \cdot \frac{
|
|
1 + \left(k + \frac{1}{2} \right) 2^{-M} - 2^{-2M-2} }{(1 + k 2^{-M})
|
|
(1 + (k + 1) 2^{-M})} \right)
|
|
|
|
Case 1 - `N = M`
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
When `N = M`, `\text{(S2)}` becomes:
|
|
|
|
.. math::
|
|
\frac{-(k + 1)2^{-M}}{1 + k 2^{-M}} < s < \frac{-k 2^{-M}}{1 + (k + 1) 2^{-M}}
|
|
\quad\quad \text{(S2")}
|
|
|
|
This is an interval of length:
|
|
|
|
.. math::
|
|
l &= \frac{- k 2^{-M}}{1 + (k + 1) 2^{-M}} -
|
|
\frac{- (k + 1) 2^{-M}}{1 + k 2^{-M}} \\
|
|
&= \frac{2^{-M} (1 + (2k + 1) 2^{-M})}{(1 + k 2^{-M})(1 + (k + 1)2^{-M})}
|
|
\quad\quad \text{(L1')}
|
|
|
|
As a function of `k`, its derivative with respect to `k`:
|
|
|
|
.. math::
|
|
\frac{dl}{dk} =
|
|
-\frac{2^{-2M}(k(k + 1)2^{-M + 1} + 2^{-M} + 2k + 1)}
|
|
{(1 + k 2^{-M})^2 (1 + (k + 1) 2^{-M})^2}
|
|
|
|
which is always negative for `0 \leq k \leq 2^M - 1`. So for `M > 1`, `l` is
|
|
bounded below by:
|
|
|
|
.. math::
|
|
l > \frac{2^{-M - 1} (3 - 2^{-M})}{2 - 2^{-M}} > 2^{-M - 1}.
|
|
|
|
It implies that we can always find `s` with `\operatorname{ulp}(s) = 2^{-M-1}`.
|
|
And from `\text{(U1)}`, `u = s(1 + m_x) + m_x`, its `ulp` is:
|
|
|
|
.. math::
|
|
\operatorname{ulp}(u) &= \operatorname{ulp}(s) \cdot \operatorname{ulp}(m_x) \\
|
|
&= 2^{-M - 1} \operatorname{ulp}(m_x).
|
|
|
|
Since:
|
|
|
|
.. math::
|
|
|u| < C = 2^{-N} = 2^{-M},
|
|
|
|
Its required precision is:
|
|
|
|
.. math::
|
|
\operatorname{prec}(u) &= \log_2(2^{-M} / \operatorname{ulp}(u)) \\
|
|
&= \log_2(2 / \operatorname{ulp}(m_x)) \\
|
|
&= 1 - \log_2(\operatorname{ulp}(m_x)).
|
|
|
|
Hence, for double precision `x`, `\operatorname{ulp}(m_x) = 2^{-52}`, and the
|
|
precision needed for `u` is `\operatorname{prec}(u) = 53`, i.e., `u` can be
|
|
exactly representable in double precision. And in this case, `s` can be
|
|
derived from `\text{(S2")}` by the midpoint formula:
|
|
|
|
.. math::
|
|
s &= 2^{-M - 1} \operatorname{round}\left( 2^{M + 1} \cdot
|
|
\operatorname{midpoint} \left(-\frac{-(k + 1)2^{-M}}{1 + k 2^{-M}},
|
|
\frac{-k2^{-M}}{1 + (k + 1) 2^{-M}}\right) \right) \\
|
|
&= 2^{-M - 1} \operatorname{round}\left( 2^{M + 1} \cdot \frac{1}{2} \left(
|
|
\frac{-(k + 1)2^{-M}}{1 + k 2^{-M}} + \frac{-k2^{-M}}{1 + (k + 1) 2^{-M}}
|
|
\right) \right) \\
|
|
&= -2^{-M - 1} \operatorname{round}\left( \frac{
|
|
(2k + 1) + (2k^2 + 2k + 1) 2^{-M} }
|
|
{(1 + k 2^{-N})(1 + (k + 1) 2^{-N})} \right) \quad\quad \text{(S3')}
|
|
|
|
The corresponding range and formula for `r = 1 + s` are:
|
|
|
|
.. math::
|
|
\frac{1 - 2^{-M}}{1 + k 2^{-M}} < r \leq \frac{1 + 2^{-M}}{1 + (k + 1) 2^{-M}}
|
|
|
|
.. math::
|
|
r &= 2^{-M-1} \operatorname{round}\left( 2^{M + 1} \cdot
|
|
\operatorname{midpoint}\left( \frac{1 - 2^{-M}}{1 + k 2^{-M}},
|
|
\frac{1 + 2^{-M}}{1 + (k + 1) 2^{-M}}\right) \right) \\
|
|
&= 2^{-M-1} \operatorname{round}\left( 2^{M + 1} \cdot \frac{1}{2} \left(
|
|
\frac{1 + 2^{-M}}{1 + (k + 1) 2^{-M}} + \frac{1 - 2^{-M}}{1 + k 2^{-M}}
|
|
\right) \right) \\
|
|
&= 2^{-M - 1} \operatorname{round}\left( 2^{M + 1} \cdot \frac{
|
|
1 + \left(k + \frac{1}{2} \right) 2^{-M} - 2^{-2M-1} }{(1 + k 2^{-M})
|
|
(1 + (k + 1) 2^{-M})} \right)
|
|
|
|
Edge cases
|
|
----------
|
|
|
|
1. When `k = 0`, notice that:
|
|
|
|
.. math::
|
|
0 = k 2^{-N} \leq m_x < (k + 1) 2^{-N} = 2^{-N} = C,
|
|
|
|
so we can simply choose `r = 1` so that `\log(r) = 0` is exact, then `u = m_x`.
|
|
This will help reduce the accumulated errors when `m_x` is close to 0 while
|
|
maintaining the range reduction output's requirements.
|
|
|
|
2. When `k = 2^{N} - 1`, `\text{(S2)}` becomes:
|
|
|
|
.. math::
|
|
-\frac{1}{2} - \frac{C - 2^{-M-1}}{2 - 2^{-M}} <> s \leq
|
|
-\frac{1}{2} + \frac{C}{2}.
|
|
|
|
so when `C > 2^{-M - 1}` is a power of 2, we can always choose:
|
|
|
|
.. math::
|
|
s = -\frac{1}{2}, \quad \text{i.e.} \quad r = \frac{1}{2}.
|
|
|
|
This reduction works well to avoid catastropic cancellation happening when
|
|
`e_x = -1`.
|
|
|
|
This also works when `C = 2^{-M - 1}` if we relax the condition on `u` to
|
|
`|u| \leq C = 2^{-M-1}`.
|
|
|
|
Intermediate precision, and Ziv's test
|
|
--------------------------------------
|
|
|
|
In the fast phase, we want extra precision while performant, so we use
|
|
double-double precision for most intermediate computation steps, and employ Ziv
|
|
test to see if the result is accurate or not. In our case, the Ziv's test can
|
|
be described as follow:
|
|
|
|
1. Let `re = re.hi + re.lo` be the double-double output of the fast phase
|
|
computation.
|
|
2. Let `err` be an estimated upper bound of the errors of `re`.
|
|
3. If `\circ(re.hi + (re.lo - err)) == \circ(re.hi + (r.lo + err))` then the
|
|
result is correctly rounded to double precision for the current rounding mode
|
|
`\circ`. Otherwise, the accurate phase with extra precision is needed.
|
|
|
|
For an easy and cheap estimation of the error bound `err`, since the range
|
|
reduction step described above is accurate, the errors of the result:
|
|
|
|
.. math::
|
|
\log(x) &= e_x \log(2) - \log(r) + \log(1 + u) \\
|
|
&\approx e_x \log(2) - \log(r) + u P(u)
|
|
|
|
come from 2 parts:
|
|
|
|
1. the look-up part: `e_x \log(2) - \log(r)`
|
|
2. the polynomial approximation part: `u P(u)`
|
|
|
|
The errors of the first part can be computed with a single `\operatorname{fma}`
|
|
operation:
|
|
|
|
.. math::
|
|
err_1 = \operatorname{fma}(e_x, err(\log(2)), err(\log(r))),
|
|
|
|
and then combining with the errors of the second part for another
|
|
`\operatorname{fma}` operation:
|
|
|
|
.. math::
|
|
err = \operatorname{fma}(u, err(P), err_1)
|
|
|
|
|
|
Accurate phase
|
|
==============
|
|
|
|
Extending range reduction
|
|
-------------------------
|
|
|
|
Since the output `u = r(1 + m_x) - 1` of the fast phase's range reduction
|
|
is computed exactly, we can apply further range reduction steps by
|
|
using the following formula:
|
|
|
|
.. math::
|
|
u_{i + 1} = r_i(1 + u_i) - 1 = u_i \cdot r_i + (r_i - 1),
|
|
|
|
where `|u_i| < 2^{-N_i}` and `u_0 = u` is representable in double precision.
|
|
|
|
Let `s_i = r_i - 1`, then we can rewrite it as:
|
|
|
|
.. math::
|
|
u_{i + 1} &= (1 + s_i)(1 + u_i) - 1 \\
|
|
&= s_i u_i + u_i + s_i \\
|
|
&= u_i (1 + s_i) + s_i
|
|
&= s_i (1 + u_i) + u_i.
|
|
|
|
Then the bound on `u_{i + 1}` is translated to `s_i` as:
|
|
|
|
.. math::
|
|
\frac{-2^{-N_{i + 1}} - u_i}{1 + u_i} < s_i < \frac{2^{-N_{i + 1}} - u_i}{1 + u_i}.
|
|
|
|
Let say we divide the interval `[0, 2^-{N_i})` into `2^{M_i}` subintervals
|
|
evenly and use the index `k` such that:
|
|
|
|
.. math::
|
|
k 2^{-N_i - M_i} \leq u_i < (k + 1) 2^{-N_i - M_i},
|
|
|
|
to look-up for the reduction constant `s_{i, k}`. In other word, `k` is given
|
|
by the formula:
|
|
|
|
.. math::
|
|
k = \left\lfloor 2^{N_i + M_i} u_i \right\rfloor
|
|
|
|
Notice that our reduction constant `s_{i, k}` must work for all `u_i` in the
|
|
interval `I = \{ v: k 2^{-N_i - M_i} \leq v < (k + 1) 2^{-N_i - M_i} \}`,
|
|
so it is bounded by:
|
|
|
|
.. math::
|
|
\sup_{v \in I} \frac{-2^{-N_{i + 1}} - v}{1 + v} < s_{i, k} < \inf_{v \in I} \frac{2^{-N_{i + 1}} - v}{1 + v}
|
|
|
|
For a fixed constant `|C| < 1`, let `f(v) = \frac{C - v}{1 + v}`, then its derivative
|
|
is:
|
|
|
|
.. math::
|
|
f'(v) = \frac{(-1)(1 + v) - (1)(C - v)}{(1 + v)^2} = \frac{-1 - C}{(1 + v)^2}.
|
|
|
|
Since `|C| < 1`, `f'(v) < 0` for all `v \neq -1`, so:
|
|
|
|
.. math::
|
|
\sup_{v \in I} f(v) &= f \left( \inf\{ v: v \in I \} \right)
|
|
= f \left( k 2^{-N_i - M_i} \right) \\
|
|
\inf_{v \in I} f(v) &= f \left( \sup\{ v: v \in I \} \right)
|
|
= f \left( (k + 1) 2^{-N_i - M_i} \right)
|
|
|
|
Hence we have the following bound on `s_{i, k}`:
|
|
|
|
.. math::
|
|
\frac{-2^{-N_{i + 1}} - k 2^{-N_i - M_i}}{1 + k 2^{-N_i - M_i}} < s_{i, k}
|
|
\leq \frac{2^{-N_{i + 1}} - (k + 1) 2^{-N_i - M_i}}{1 + (k + 1) 2^{-N_i - M_i}}
|
|
|
|
This interval is of length:
|
|
|
|
.. math::
|
|
l &= \frac{2^{-N_{i + 1}} - (k + 1) 2^{-N_i - M_i}}{1 + (k + 1) 2^{-N_i - M_i}} -
|
|
\frac{-2^{-N_{i + 1}} - k 2^{-N_i - M_i}}{1 + k 2^{-N_i - M_i}} \\
|
|
&= \frac{2^{-N_{i + 1} + 1} - 2^{-N_i - M_i} + (2k + 1) 2^{-N_{i + 1} - N_i - M_i}}
|
|
{(1 + k 2^{-N_i - M_i})(1 + (k + 1) 2^{-N_i -M_i})}
|
|
|
|
So in order to be able to find `s_{i, k}`, we need that:
|
|
|
|
.. math::
|
|
2^{-N_{i + 1} + 1} - 2^{-N_i - M_i} + (2k + 1) 2^{-N_{i + 1} - N_i - M_i} > 0
|
|
|
|
This give us the following bound on `N_{i + 1}`:
|
|
|
|
.. math::
|
|
N_{i + 1} \leq N_i + M_i + 1.
|
|
|
|
To make the range reduction effective, we will want to maximize `N_{i + 1}`, so
|
|
let consider the two cases: `N_{i + 1} = N_i + M_i + 1` and
|
|
`N_{i + 1} = N_i + M_i`.
|
|
|
|
|
|
|
|
The optimal choice to balance between maximizing `N_{i + 1}` and minimizing the
|
|
precision needed for `s_{i, k}` is:
|
|
|
|
.. math::
|
|
N_{i + 1} = N_i + M_i,
|
|
|
|
and in this case, the optimal `\operatorname{ulp}(s_{i, k})` is:
|
|
|
|
.. math::
|
|
\operatorname{ulp}(s_{i, k}) = 2^{-N_i - M_i}
|
|
|
|
and the corresponding `\operatorname{ulp}(u_{i + 1})` is:
|
|
|
|
.. math::
|
|
\operatorname{ulp}(u_{i + 1}) &= \operatorname{ulp}(u_i) \operatorname{ulp}(s_{i, k}) \\
|
|
&= \operatorname{ulp}(u_i) \cdot 2^{-N_i - M_i} \\
|
|
&= \operatorname{ulp}(u_0) \cdot 2^{-N_0 - M_0} \cdot 2^{-N_0 - M_0 - M_1} \cdots 2^{-N_0 - M_0 - M_1 - \cdots - M_i} \\
|
|
&= 2^{-N_0 - 53} \cdot 2^{-N_0 - M_0} \cdot 2^{-N_0 - M_0 - M_1} \cdots 2^{-N_0 - M_0 - M_1 - \cdots - M_i}
|
|
|
|
Since `|u_{i + 1}| < 2^{-N_{i + 1}} = 2^{-N_0 - M_1 - ... -M_i}`, the precision
|
|
of `u_{i + 1}` is:
|
|
|
|
.. math::
|
|
\operatorname{prec}(u_{i + 1}) &= (N_0 + 53) + (N_0 + M_0) + \cdots +
|
|
(N_0 + M_0 + \cdots + M_i) - (N_0 + M_0 + \cdots + M_i) \\
|
|
&= (i + 1) N_0 + i M_0 + (i - 1) M_1 + \cdots + M_{i - 1} + 53
|
|
|
|
If we choose to have the same `M_0 = M_1 = \cdots = M_i = M`, this can be
|
|
simplified to:
|
|
|
|
.. math::
|
|
\operatorname{prec}(u_{i + 1}) = (i + 1) N_0 + \frac{i(i + 1)}{2} \cdot M + 53.
|
|
|
|
We summarize the precision analysis for extending the range reduction in the
|
|
table below:
|
|
|
|
+-------+-----+-----------+------------+--------------+-----------------+-------------------+
|
|
| `N_0` | `M` | No. steps | Table size | Output bound | ulp(`s_{i, k}`) | prec(`u_{i + 1}`) |
|
|
+-------+-----+-----------+------------+--------------+-----------------+-------------------+
|
|
| 7 | 4 | 1 | 32 | `2^{-11}` | `2^{-12}` | 60 |
|
|
| | +-----------+------------+--------------+-----------------+-------------------+
|
|
| | | 2 | 64 | `2^{-15}` | `2^{-16}` | 71 |
|
|
| | +-----------+------------+--------------+-----------------+-------------------+
|
|
| | | 3 | 96 | `2^{-19}` | `2^{-20}` | 86 |
|
|
| | +-----------+------------+--------------+-----------------+-------------------+
|
|
| | | 4 | 128 | `2^{-23}` | `2^{-24}` | 105 |
|
|
| | +-----------+------------+--------------+-----------------+-------------------+
|
|
| | | 5 | 160 | `2^{-27}` | `2^{-28}` | 128 |
|
|
| | +-----------+------------+--------------+-----------------+-------------------+
|
|
| | | 6 | 192 | `2^{-31}` | `2^{-32}` | 155 |
|
|
| +-----+-----------+------------+--------------+-----------------+-------------------+
|
|
| | 5 | 3 | 192 | `2^{-22}` | `2^{-23}` | 89 |
|
|
| | +-----------+------------+--------------+-----------------+-------------------+
|
|
| | | 4 | 256 | `2^{-27}` | `2^{-28}` | 111 |
|
|
| | +-----------+------------+--------------+-----------------+-------------------+
|
|
| | | 5 | 320 | `2^{-32}` | `2^{-33}` | 138 |
|
|
| | +-----------+------------+--------------+-----------------+-------------------+
|
|
| | | 6 | 384 | `2^{-37}` | `2^{-38}` | 170 |
|
|
| +-----+-----------+------------+--------------+-----------------+-------------------+
|
|
| | 6 | 3 | 384 | `2^{-25}` | `2^{-26}` | 92 |
|
|
| | +-----------+------------+--------------+-----------------+-------------------+
|
|
| | | 4 | 512 | `2^{-31}` | `2^{-32}` | 117 |
|
|
| +-----+-----------+------------+--------------+-----------------+-------------------+
|
|
| | 7 | 1 | 256 | `2^{-24}` | `2^{-15}` | 60 |
|
|
| | +-----------+------------+--------------+-----------------+-------------------+
|
|
| | | 2 | 512 | `2^{-21}` | `2^{-22}` | 74 |
|
|
+-------+-----+-----------+------------+--------------+-----------------+-------------------+
|
|
|
|
where:
|
|
|
|
- Number of steps = `i + 1`
|
|
- Table size = `(i + 1) 2^{M + 1}`
|
|
- Output bound = `2^{-N_{i + 1}} = 2^{-N_0 - (i + 1) M}`
|
|
- `\operatorname{ulp}(s_{i, k}) = 2^{-N_{i + 1} - 1}`
|
|
- `\operatorname{prec}(u_{i + 1}) = (i + 1) N_0 + \frac{i(i + 1)}{2} \cdot M + 53`
|