FLOAT

Floating-point arithmetic

The floating-point library functions emulate a math coprocessor using single-precision calculations. To limit code size, only the most basic mathematical operations are provided; trigonometric functions, exponentials, and logarithms are missing. Another reason for omitting them is that the internal single-precision calculations would introduce too much error into their results. If a math coprocessor is present, it is used for the calculations. Its results can deviate slightly in the last significant digits from those of the emulated calculation.

Single-precision floating-point numbers conform to the IEEE 754-1985 format. An operand occupies 4 bytes (32 bits), laid out as follows:

- bit 31: sign bit, 1=negative number, 0=positive number
- bits 23..30: exponent (8 bits) biased by 127 (i.e. a zero exponent, as in the number "1", has biased value 127)
- bits 0..22: mantissa (23 bits), stored without the implicit most significant bit "1"
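The bit layout above can be illustrated with a short C sketch that splits a 32-bit pattern into its three fields. The names FloatFields and float_fields are hypothetical helpers for illustration only; they are not part of FLOAT.ASM.

```c
#include <stdint.h>

/* Fields of a single-precision value, split as in the bit list above. */
typedef struct {
    unsigned sign;      /* bit 31: 1 = negative, 0 = positive */
    unsigned biased;    /* bits 23..30: exponent plus 127 */
    int      exponent;  /* biased value minus 127 */
    uint32_t mantissa;  /* bits 0..22, implicit leading "1" not included */
} FloatFields;

/* Hypothetical helper, not a library function. */
static FloatFields float_fields(uint32_t bits)
{
    FloatFields f;
    f.sign     = (unsigned)(bits >> 31);
    f.biased   = (unsigned)((bits >> 23) & 0xFFu);
    f.exponent = (int)f.biased - 127;
    f.mantissa = bits & 0x007FFFFFuL;
    return f;
}
```

For example, float_fields(0x3F800000) describes the number 1.0: sign 0, biased exponent 127, mantissa 0.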

The exponent is in the range -126 .. +127 (biased value 1 .. 254). Exponent -127 (biased value 0) represents zero. Unlike a physical coprocessor, the library does not calculate with subnormal values (i.e. values with exponent -127 and a non-zero mantissa). Exponent +128 (biased value 255) represents NaN or infinity (result overflow). Negative infinity is not used.

Numbers are in the range 1.1754944e-38 (=00800000h) to 3.4028235e+38 (=7F7FFFFFh), positive or negative.
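The exponent rules can be summarised as a small classifier over the biased exponent. This is only a sketch in C: the names FloatClass and float_class are hypothetical, and treating every pattern with biased exponent 0 as zero is an assumption drawn from the statement that subnormal values are not calculated with.

```c
#include <stdint.h>

/* Hypothetical classification, not part of FLOAT.ASM. */
typedef enum {
    FLOAT_CLASS_ZERO,    /* biased exponent 0 */
    FLOAT_CLASS_NORMAL,  /* biased exponent 1..254 */
    FLOAT_CLASS_INF      /* biased exponent 255 (overflow) */
} FloatClass;

static FloatClass float_class(uint32_t bits)
{
    unsigned biased = (unsigned)((bits >> 23) & 0xFFu);

    if (biased == 0)
        return FLOAT_CLASS_ZERO;   /* assumption: subnormals count as zero */
    if (biased == 255)
        return FLOAT_CLASS_INF;
    return FLOAT_CLASS_NORMAL;
}
```

The smallest normal value 00800000h has biased exponent 1 and is therefore classified as a normal number.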

FloatZero - floating-point zero constant

OUTPUT:
DX:AX = 0.0

FloatOne - floating-point one constant

OUTPUT:
DX:AX = 1.0

FloatInf - floating-point infinity constant

OUTPUT:
DX:AX = 1.#INF000

FloatPi - floating-point PI constant

OUTPUT:
DX:AX = PI constant (3.14159265)
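The DX:AX pair holds the 32-bit pattern with the high word in DX and the low word in AX. The C sketch below shows that split for an arbitrary value. RegPair and float_to_dxax are hypothetical names, and the code assumes the C float type is a 32-bit IEEE 754 value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the DX:AX register pair. */
typedef struct {
    uint16_t dx;  /* high word: sign, exponent, top of mantissa */
    uint16_t ax;  /* low word: rest of the mantissa */
} RegPair;

static RegPair float_to_dxax(float value)
{
    uint32_t bits;
    RegPair r;

    /* Assumes float is a 32-bit IEEE 754 single-precision value. */
    memcpy(&bits, &value, sizeof bits);
    r.dx = (uint16_t)(bits >> 16);
    r.ax = (uint16_t)(bits & 0xFFFFu);
    return r;
}
```

With this split, 1.0 becomes DX = 3F80h, AX = 0000h, and the single-precision value nearest to 3.14159265 becomes DX = 4049h, AX = 0FDBh.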

FloatNeg - floating-point negation

INPUT:
DX:AX = number

OUTPUT:
DX:AX = result

NOTES: Only the highest byte of the number (DH) is needed.

FloatAbs - floating-point absolute value

INPUT:
DX:AX = number

OUTPUT:
DX:AX = result

NOTES: Only the highest byte of the number (DH) is needed.
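FloatNeg and FloatAbs need only DH because the sign is bit 31, which is bit 7 of the highest byte. The C sketch below models that idea on a single byte. dh_negate and dh_abs are hypothetical names, and the library's handling of zero and infinity may differ from this plain bit manipulation.

```c
#include <stdint.h>

/* dh stands for the DH register, i.e. bits 24..31 of the number. */

/* Negation: toggle the sign bit. */
static uint8_t dh_negate(uint8_t dh)
{
    return (uint8_t)(dh ^ 0x80u);
}

/* Absolute value: clear the sign bit. */
static uint8_t dh_abs(uint8_t dh)
{
    return (uint8_t)(dh & 0x7Fu);
}
```

For 1.0 (3F800000h) the highest byte is 3Fh; toggling the sign bit gives BFh, the highest byte of -1.0 (BF800000h).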

FloatCmp - floating-point comparison

INPUT:
DX:AX = first operand
CX:BX = second operand

OUTPUT:
AL = 1 if first > second, 0 if first = second, -1 if first < second
SF, ZF = as for "signed CMP first,second", use JL, JG, JLE,...
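The -1/0/1 result can be modelled in C by ordering the bit patterns as sign and magnitude. This is a sketch of the comparison semantics only, not the routine used in FLOAT.ASM. float_order_key and float_cmp_model are hypothetical names, and treating +0 and -0 as equal is an assumption.

```c
#include <stdint.h>

/* Map a bit pattern to an integer that sorts in the same order as the value. */
static int64_t float_order_key(uint32_t bits)
{
    int64_t magnitude = (int64_t)(bits & 0x7FFFFFFFuL);
    return (bits & 0x80000000uL) ? -magnitude : magnitude;
}

/* Returns 1 if first > second, 0 if equal, -1 if first < second. */
static int float_cmp_model(uint32_t first, uint32_t second)
{
    int64_t a = float_order_key(first);
    int64_t b = float_order_key(second);
    return (a > b) - (a < b);
}
```

Comparing 1.0 (3F800000h) with 2.0 (40000000h) gives -1, while comparing -1.0 (BF800000h) with -2.0 (C0000000h) gives 1.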

FloatCmpZ - floating-point comparison with zero

INPUT:
DX:AX = operand

OUTPUT:
AL = 1 if operand > 0, 0 if operand = 0, -1 if operand < 0
SF, ZF = as for "signed CMP operand,0", use JL, JG, JLE,...

FloatFact - floating-point factorial

INPUT:
AL = integer number 0..34

OUTPUT:
DX:AX = floating-point number n!
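The upper limit of 34 follows from the single-precision range: 34! is about 2.95e+38, while 35! is about 1.03e+40 and no longer fits. The C sketch below checks this against the C constant FLT_MAX. factorial and factorial_fits_single are hypothetical helpers and say nothing about how the library computes or stores its results.

```c
#include <float.h>

/* n! computed in double precision, enough to compare against FLT_MAX. */
static double factorial(unsigned n)
{
    double result = 1.0;
    unsigned i;

    for (i = 2; i <= n; i++)
        result *= (double)i;
    return result;
}

/* Returns 1 if n! does not exceed the largest finite single-precision value. */
static int factorial_fits_single(unsigned n)
{
    return factorial(n) <= (double)FLT_MAX;
}
```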

FloatInvFact - floating-point inverse factorial

INPUT:
AL = integer number 0..34

OUTPUT:
DX:AX = floating-point number 1/n!

FloatFromWord - import floating-point from unsigned word

INPUT:
AX = integer unsigned number

OUTPUT:
DX:AX = floating-point number

FloatFromDWord - import floating-point from unsigned dword

INPUT:
DX:AX = integer unsigned number

OUTPUT:
DX:AX = floating-point number
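A single-precision number carries 24 significant bits (23 stored plus the implicit "1"), so not every 32-bit integer has an exact floating-point counterpart. The C sketch below tests this with the C compiler's own conversion. dword_is_exact_in_single is a hypothetical helper, and the rounding applied by FloatFromDWord itself is not specified here.

```c
#include <stdint.h>

/* Returns 1 if value is unchanged by conversion to single precision. */
static int dword_is_exact_in_single(uint32_t value)
{
    /* volatile forces the value to be rounded to float precision */
    volatile float f = (float)value;
    return (double)f == (double)value;
}
```

All word values convert exactly. Among dword values, 16777216 (2^24) is still exact, but 16777217 is not.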

FloatFromSWord - import floating-point from signed word

INPUT:
AX = integer signed number

OUTPUT:
DX:AX = floating-point number

FloatFromSDWord - import floating-point from signed dword

INPUT:
DX:AX = integer signed number

OUTPUT:
DX:AX = floating-point number

FloatToWord - export floating-point to unsigned word

INPUT:
DX:AX = floating-point number

OUTPUT:
AX = integer unsigned number

FloatToDWord - export floating-point to unsigned dword

INPUT:
DX:AX = floating-point number

OUTPUT:
DX:AX = integer unsigned number

FloatToSWord - export floating-point to signed word

INPUT:
DX:AX = floating-point number

OUTPUT:
AX = integer signed number

FloatToSDWord - export floating-point to signed dword

INPUT:
DX:AX = floating-point number

OUTPUT:
DX:AX = integer signed number

INPUT:
DX:AX = first operand
CX:BX = second operand

OUTPUT:
DX:AX = result

FloatSub - floating-point subtraction

INPUT:
DX:AX = first operand
CX:BX = second operand

OUTPUT:
DX:AX = result (= first - second)

FloatMul - floating-point multiplication

INPUT:
DX:AX = first operand
CX:BX = second operand

OUTPUT:
DX:AX = result

FloatDiv - floating-point division

INPUT:
DX:AX = first operand (dividend)
CX:BX = second operand (divisor)

OUTPUT:
DX:AX = result (quotient)

FloatInv - floating-point inverse (reciprocal) value

INPUT:
DX:AX = operand

OUTPUT:
DX:AX = result (1/operand)

Constants:

FLOAT_INF: (=7F800000h) floating-point infinity value
FLOAT_ZERO: (=0) floating-point zero value
FLOAT_ONE: (=3F800000h) floating-point one value
FACT_MAX: (=34) max. valid factorial index
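The constant bit patterns can be checked against the IEEE 754 layout with a short C sketch. The #define lines mirror the values listed above. bits_to_float is a hypothetical helper, and the code assumes the C float type is a 32-bit IEEE 754 value.

```c
#include <stdint.h>
#include <string.h>

/* Values copied from the constant list above. */
#define FLOAT_INF  0x7F800000uL
#define FLOAT_ZERO 0x00000000uL
#define FLOAT_ONE  0x3F800000uL
#define FACT_MAX   34

/* Reinterpret a 32-bit pattern as a C float (assumes IEEE 754 single). */
static float bits_to_float(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Here bits_to_float(FLOAT_ONE) yields 1.0 and bits_to_float(FLOAT_INF) yields positive infinity.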

Source code FLOAT.ASM