FLOAT
Floating-point arithmetic
The floating-point library functions emulate a math coprocessor using single-precision calculations. To keep the code size down, only the most basic mathematical operations are provided; trigonometric functions, exponentials and logarithms are missing. Another reason for omitting them is that the internal single-precision calculations would make their results too inaccurate. If a math coprocessor is present, it is used for the calculations. This can cause a small deviation in the last significant places compared to the emulated calculation.
Floating-point numbers in single-precision conform to the IEEE 754-1985 format. Operand size is 4 bytes, of which:
- bit 31: sign bit, 1=negative number, 0=positive number
- bit 23..30: exponent (8 bits) biased by 127 (i.e. a zero exponent, as in the number "1", has biased value 127)
- bit 0..22: mantissa (23 bits) without the most significant bit "1"
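The bit layout above can be illustrated with a short Python sketch. It is not part of the library; the helper names decode_single and float_to_bits are hypothetical and exist only for this example.

```python
import struct

def float_to_bits(value):
    """Return the IEEE 754 single-precision bit pattern of a Python float."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def decode_single(bits):
    """Split a 32-bit pattern into (sign, biased exponent, mantissa)."""
    sign = (bits >> 31) & 0x1          # bit 31
    biased_exp = (bits >> 23) & 0xFF   # bits 23..30, biased by 127
    mantissa = bits & 0x7FFFFF         # bits 0..22, leading "1" is not stored
    return sign, biased_exp, mantissa

sign, biased_exp, mantissa = decode_single(float_to_bits(1.0))
print(sign, biased_exp - 127, hex(mantissa))   # prints: 0 0 0x0
```

The number 1.0 is stored as 3F800000h: sign 0, biased exponent 127 (true exponent 0), and an all-zero mantissa because the leading "1" is implied.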
The exponent is in the range -126 .. +127 (biased value 1 .. 254). Exponent -127 (biased value 0) represents zero. Unlike a physical coprocessor, the library does not calculate with subnormal values (i.e. with exponent -127 and a non-zero mantissa). Exponent +128 (biased value 255) represents NaN or infinity (result overflow). Negative infinity is not used.
Numbers are in range 1.1754944e-38 (=00800000h) to 1.7014118e+38 (=7F000000h), positive or negative.
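As a rough check of the exponent rules and the two range endpoints quoted above, the following Python sketch classifies a bit pattern by its biased exponent. The function names are hypothetical, and the classification only mirrors the description in this section, not the library's actual code.

```python
import struct

def bits_to_float(bits):
    """Interpret a 32-bit pattern as an IEEE 754 single-precision number."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def classify(bits):
    """Classify a pattern by biased exponent, as described in the text."""
    biased_exp = (bits >> 23) & 0xFF
    if biased_exp == 0:
        return "zero"        # subnormals are not used, so this means zero
    if biased_exp == 255:
        return "infinity"    # reserved value, signals result overflow
    return "normal"

for pattern in (0x00800000, 0x7F000000, 0x7F800000):
    print(f"{pattern:08X}h", classify(pattern), bits_to_float(pattern))
```

The patterns 00800000h and 7F000000h decode to 2^-126 and 2^127, which are the decimal values 1.1754944e-38 and 1.7014118e+38 given above.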
FloatZero - floating-point zero constant
OUTPUT:
DX:AX = 0.0
FloatOne - floating-point one constant
OUTPUT:
DX:AX = 1.0
FloatInf - floating-point infinity constant
OUTPUT:
DX:AX = 1.#INF000
FloatPi - floating-point PI constant
OUTPUT:
DX:AX = PI constant (3.14159265)
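To show how a single-precision value is split across the DX:AX register pair (DX holds the high word, AX the low word), here is a small Python sketch. The helper to_dx_ax is hypothetical. The pattern printed for PI is the single-precision value nearest to pi; the constant stored in the library may differ in the last bit.

```python
import math
import struct

def to_dx_ax(value):
    """Return (DX, AX): high and low 16-bit words of the single-precision pattern."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return bits >> 16, bits & 0xFFFF

constants = [("FloatZero", 0.0), ("FloatOne", 1.0),
             ("FloatInf", math.inf), ("FloatPi", math.pi)]
for name, value in constants:
    dx, ax = to_dx_ax(value)
    print(f"{name}: DX={dx:04X}h AX={ax:04X}h")
```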
FloatNeg - floating-point negate number
INPUT:
DX:AX = number
OUTPUT:
DX:AX = result
NOTES: Only highest byte of number (DH) is needed.
FloatAbs - floating-point absolute value
INPUT:
DX:AX = number
OUTPUT:
DX:AX = result
NOTES: Only highest byte of number (DH) is needed.
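The note that only the highest byte (DH) is needed follows from the format: the sign is bit 31, which is the top bit of DH. A minimal Python sketch of negation and absolute value on the 32-bit pattern, assuming both operations touch nothing but the sign bit (how the library treats zero is not stated here):

```python
SIGN_BIT = 0x80000000   # bit 31, i.e. bit 7 of DH

def float_neg(bits):
    """Negate by flipping the sign bit (hypothetical helper)."""
    return bits ^ SIGN_BIT

def float_abs(bits):
    """Absolute value by clearing the sign bit (hypothetical helper)."""
    return bits & (SIGN_BIT - 1)

print(f"{float_neg(0x3F800000):08X}h")   # prints: BF800000h, i.e. -1.0
print(f"{float_abs(0xBF800000):08X}h")   # prints: 3F800000h, i.e. 1.0
```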
FloatCmp - floating-point comparison
INPUT:
DX:AX = first operand
CX:BX = second operand
OUTPUT:
AL = 1 if first > second, 0 if first = second, -1 if first < second
SF, ZF = as for "signed CMP first,second", use JL, JG, JLE,...
FloatCmpZ - floating-point comparison with zero
INPUT:
DX:AX = operand
OUTPUT:
AL = 1 if operand > 0, 0 if operand = 0, -1 if operand < 0
SF, ZF = as for "signed CMP operand,0", use JL, JG, JLE,...
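The AL result of FloatCmp and FloatCmpZ can be modelled in Python as a three-way comparison. This sketch only reproduces the documented return values (1, 0, -1); it does not model the SF and ZF flags, and the function names are hypothetical.

```python
import struct

def bits_to_float(bits):
    """Interpret a 32-bit pattern as an IEEE 754 single-precision number."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def float_cmp(first_bits, second_bits):
    """Return 1 if first > second, 0 if equal, -1 if first < second."""
    first = bits_to_float(first_bits)
    second = bits_to_float(second_bits)
    return (first > second) - (first < second)

def float_cmp_zero(bits):
    """Compare an operand with zero, same return convention."""
    return float_cmp(bits, 0x00000000)

print(float_cmp(0x3F800000, 0x40000000))   # 1.0 vs 2.0, prints: -1
print(float_cmp_zero(0xBF800000))          # -1.0 vs 0, prints: -1
```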
FloatFact - floating-point factorial
INPUT:
AL = integer number 0..34
OUTPUT:
DX:AX = floating-point number n!
FloatInvFact - floating-point inverse factorial
INPUT:
AL = integer number 0..34
OUTPUT:
DX:AX = floating-point number 1/n!
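The limit of 34 on the factorial index matches the single-precision format: 34! still fits, while 35! does not. The Python sketch below illustrates this under the assumption that results are rounded to the nearest single-precision value; the library's own rounding is not stated here, and the function names are hypothetical. Note that 1/34! (about 3.4e-39) lies below the smallest normal value, so the library's result for that case may differ from what Python produces.

```python
import math
import struct

FACT_MAX = 34   # max. valid factorial index, as documented

def to_single(value):
    """Round a Python float to single precision (OverflowError if too large)."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def float_fact(n):
    """Single-precision n! for 0 <= n <= FACT_MAX."""
    if not 0 <= n <= FACT_MAX:
        raise ValueError("factorial index out of range")
    return to_single(float(math.factorial(n)))

def float_inv_fact(n):
    """Single-precision 1/n! for 0 <= n <= FACT_MAX."""
    if not 0 <= n <= FACT_MAX:
        raise ValueError("factorial index out of range")
    return to_single(1.0 / math.factorial(n))

print(float_fact(5), float_inv_fact(5))
print(float_fact(FACT_MAX))
```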
FloatFromWord - import floating-point from unsigned word
INPUT:
AX = integer unsigned number
OUTPUT:
DX:AX = floating-point number
FloatFromDWord - import floating-point from unsigned dword
INPUT:
DX:AX = integer unsigned number
OUTPUT:
DX:AX = floating-point number
FloatFromSWord - import floating-point from signed word
INPUT:
AX = integer signed number
OUTPUT:
DX:AX = floating-point number
FloatFromSDWord - import floating-point from signed dword
INPUT:
DX:AX = integer signed number
OUTPUT:
DX:AX = floating-point number
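Because the mantissa holds 24 significant bits (23 stored plus the implied "1"), every 16-bit word converts exactly, but dwords above 2^24 may lose their lowest bits. The Python sketch below illustrates this with round-to-nearest conversion; the rounding used by the library is an assumption here, and the helper names are hypothetical.

```python
import struct

def float_from_dword(n):
    """Single-precision bit pattern nearest to an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", float(n)))[0]

def bits_to_float(bits):
    """Interpret a 32-bit pattern as an IEEE 754 single-precision number."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(f"{float_from_dword(65535):08X}h")            # exact: 477FFF00h
print(bits_to_float(float_from_dword(16777217)))    # 2^24 + 1 becomes 16777216.0
```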
FloatToWord - export floating-point to unsigned word
INPUT:
DX:AX = floating-point number
OUTPUT:
AX = integer unsigned number
FloatToDWord - export floating-point to unsigned dword
INPUT:
DX:AX = floating-point number
OUTPUT:
DX:AX = integer unsigned number
FloatToSWord - export floating-point to signed word
INPUT:
DX:AX = floating-point number
OUTPUT:
AX = integer signed number
FloatToSDWord - export floating-point to signed dword
INPUT:
DX:AX = floating-point number
OUTPUT:
DX:AX = integer signed number
FloatAdd - floating-point addition
INPUT:
DX:AX = first operand
CX:BX = second operand
OUTPUT:
DX:AX = result
FloatSub - floating-point subtraction
INPUT:
DX:AX = first operand
CX:BX = second operand
OUTPUT:
DX:AX = result (= first - second)
FloatMul - floating-point multiplication
INPUT:
DX:AX = first operand
CX:BX = second operand
OUTPUT:
DX:AX = result
FloatDiv - floating-point division
INPUT:
DX:AX = first operand (dividend)
CX:BX = second operand (divisor)
OUTPUT:
DX:AX = result (quotient)
FloatInv - floating-point inverse (reciprocal) value
INPUT:
DX:AX = operand
OUTPUT:
DX:AX = result (1/operand)
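For experimenting with expected results of the arithmetic functions, single-precision behaviour can be approximated in Python by rounding each result to single precision. This is only a sketch under the assumption of round-to-nearest; as noted at the start of this section, the library's results may deviate in the last significant places. Overflow and division by zero are not modelled, and the function names are hypothetical.

```python
import struct

def to_single(value):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def float_add(a, b):
    return to_single(to_single(a) + to_single(b))

def float_sub(a, b):
    return to_single(to_single(a) - to_single(b))

def float_mul(a, b):
    return to_single(to_single(a) * to_single(b))

def float_div(a, b):
    return to_single(to_single(a) / to_single(b))

def float_inv(a):
    """Reciprocal, 1/operand."""
    return float_div(1.0, a)

bits = struct.unpack("<I", struct.pack("<f", float_inv(3.0)))[0]
print(f"{bits:08X}h")   # prints: 3EAAAAABh, the single-precision value of 1/3
```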
Constants:
FLOAT_INF: (=7F800000h) floating-point infinity value
FLOAT_ZERO: (=0) floating-point zero value
FLOAT_ONE: (=3F800000h) floating-point one value
FACT_MAX: (=34) max. valid factorial index