I wrote this up a while ago; these are some notes on fixed point math to accompany a library I wrote.
Overview of Using Fixed Point Math
Key ideas that come up when implementing quantized or DSP-style math include how to represent fractional units and what to do when math operations cause integer registers to overflow.
Floating point (the float or double types in C/C++) is commonly used to represent fractional units, but it is not always available (no hardware support, or no software library support on a low-end microcontroller). Where hardware support is absent, floating point can be provided by a software library, but such libraries are generally much slower than a hardware implementation. If the programmer is careful, integer math can be used instead of software floating point, resulting in much faster code. In these cases, fractional units can be held in integer registers (short, int, long) by adopting a scaling factor.
Scaling and Wrap Around
Two common issues that come up in creating scaled-integer representations of numbers are
Scaling - the programmer must keep track of the scaling factor by hand, whereas with floating point numbers the compiler and the floating point library do this for the programmer.
Overflow and wrap around - if two large numbers are added, the result can be larger than the integer representation can hold, causing wrap-around or overflow errors. Worse, these errors may pass silently depending on the compiler's warning settings or the type of operation. For example, take two 8-bit numbers (typically a char in C/C++):
char a = 34, b = 3, c;
// and compute
c = a * b;
// Multiply them together and the result is 102, which still fits in an 8 bit result. But
// what happens if b = 5?
c = a * b; // the true answer is 170, but with a typical signed 8 bit char the stored result wraps around to -86
This type of error is called an overflow or wrap-around error, and it can be handled in several ways. We can use a larger integer type such as a short or an int, or we can test the numbers beforehand to see whether the result will overflow. Which is better? It depends on the situation. If we are already using the largest natural type (e.g. a 32-bit integer on a 32-bit CPU), then we may have to test the values before carrying out the operation, even though this incurs some performance penalty.
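For instance, here is a minimal sketch (the helper name is hypothetical, not from any particular library) of testing a 32-bit signed addition before performing it:

#include <stdint.h>

/* Hypothetical helper: returns 1 if a + b would overflow a signed 32 bit integer. */
int add32_would_overflow (int32_t a, int32_t b)
{
    if (b > 0 && a > INT32_MAX - b) return 1;   /* would wrap past the top    */
    if (b < 0 && a < INT32_MIN - b) return 1;   /* would wrap past the bottom */
    return 0;
}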
Loss of Precision
True floating point representations can maintain relatively arbitrary precision across a range of operations. With fixed point (scaled) math, however, the process of scaling means we dedicate a portion of the available bits to the precision we desire, so any mathematical information smaller than the scaling factor is lost, causing quantization error.
A simple decimal (base-10) example: let's try to represent the number 10.25 using an integer. We can do something simple, such as multiplying the value by 100 and storing the result in an integer variable. So we have:
10.25 * 100 ==> 1025
If we want to add another number to it, say 0.55, then we take 0.55 and scale it by 100 as well:
0.55 * 100 ==> 55
Now, to add these numbers, we simply add the integerized values:
1025 + 55 ==> 1080
Now let's look at this in some code:
#include <stdio.h>

int main (void)
{
    int myIntegerizedNumber1 = 1025;   /* 10.25 scaled by 100 */
    int myIntegerizedNumber2 = 55;     /*  0.55 scaled by 100 */
    int myIntegerizedNumber3;

    myIntegerizedNumber3 = myIntegerizedNumber1 + myIntegerizedNumber2;
    printf("%d + %d = %d\n", myIntegerizedNumber1, myIntegerizedNumber2, myIntegerizedNumber3);
    return 0;
}
But here we hit the first of several challenges. How do we handle the integer and fractional parts? How do we display the result? Without compiler support, we as programmers must keep track of the integer and fractional results ourselves. The program above prints the result as 1080 rather than 10.80, because the compiler only knows we declared integer variables; it has no idea we intended them to be scaled.
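To display such a value we have to undo the scaling ourselves. A minimal sketch for the base-10 scale factor of 100 used above (handling positive values only; negatives would need extra care):

#include <stdio.h>

int main (void)
{
    int scaled = 1080;                                 /* represents 10.80, scaled by 100 */
    printf("%d.%02d\n", scaled / 100, scaled % 100);   /* prints 10.80                    */
    return 0;
}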
Thinking in powers of 2 (radixes)
In the previous example we used base-10 math, which, while convenient for humans, is not an optimal use of bits because the machine's arithmetic is binary. If we use powers of two we can specify the precision of the fractional and integer parts in terms of bits instead of base-10 digits, and we get several other advantages:
Ease of notation - for example with a 16 bit signed integer (typically a short in C/C++), we can say the number is "s11.4", which means it is a signed number with 11 bits of integer and 4 bits of fractional precision. Strictly speaking no single bit is set aside for the sign, since the number is stored in two's complement format, but from the standpoint of precision one bit is effectively spent on sign representation. If the number is unsigned we can call it u12.4 - yes, the same 16 bits now provide 12 integer bits of precision and 4 bits of fractional representation.
If we were to use base-10 math no such simple mapping would be possible (I won't go into all the base-10 issues that would come up). Worse yet, many divide-by-10 operations would need to be performed, which are slow and can result in precision loss.
Ease of changing radix precision. Using base-2 radixes allows us to use simple shifts (<< and >>) to convert from integer to fixed point, or between different fixed point representations. Many programmers assume that radixes should be a byte multiple like 16 bits or 8 bits, but using just enough precision and no more (say 4 or 5 fractional bits for small graphics apps) allows much larger headroom, because the integer portion gets the remaining bits. Suppose we need a range of +/-800 and a precision of 0.05 (or 1/20 in decimal). We can fit this into a 16-bit integer as follows. First, one bit is allocated to the sign, leaving 15 bits of resolution. Next we need 800 counts of range: log2(800) = 9.64..., so we need 10 bits for the integer dynamic range. Now for the fractional precision: we need log2(1/0.05) = 4.32 bits, which rounds up to 5 bits. So in this application we could use a fixed radix of s10.5, a signed 10-bit integer part with 5 bits of fractional resolution, and it still fits in a 16-bit integer (see the sketch below). There are some caveats: the fractional resolution is 5 bits (1/32, or about 0.03125), which is finer than the required 0.05 but not identical to it, so accumulated operations will produce quantization error. This can be greatly reduced by moving to a larger integer with more fractional bits (e.g. a 32-bit integer), but often that is not necessary, and manipulating 16-bit integers is much more efficient in both compute and memory on smaller processors. The choice of these integer sizes should be made with care.
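A minimal sketch of the s10.5 format described above (macro and variable names are illustrative, not the library's API); values are scaled by 2^5 = 32 and kept in a 16-bit signed integer:

#include <stdio.h>
#include <stdint.h>

#define FRAC_BITS 5                                        /* 5 fractional bits ==> resolution of 1/32 */
#define TO_FIXED(x)  ((int16_t)((x) * (1 << FRAC_BITS)))   /* real number -> s10.5 (truncates)         */
#define TO_REAL(x)   ((double)(x) / (1 << FRAC_BITS))      /* s10.5 -> real number                     */

int main (void)
{
    int16_t a = TO_FIXED(612.40);   /* stored as 19596 (fits the +/-800 range comfortably) */
    int16_t b = TO_FIXED(-0.05);    /* stored as -1, i.e. quantized to -1/32 = -0.03125    */
    printf("a = %f  b = %f\n", TO_REAL(a), TO_REAL(b));
    return 0;
}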
A few notes on fixed point precision
When adding fixed point numbers together it's important to align their radix points (e.g. you must add x.4 numbers to x.4 numbers: 12.4 + 12.4 works, and even 18.4 + 12.4 + 24.4 works, where the integer portion refers to the number of bits in use, not the physical size of the integer register declared). Now, and here begins some of the trickiness, the result of adding two 12.4 numbers is, strictly speaking, a 13.4 number, which takes 17 bits, so you must watch for overflow. One way is to put the result in a larger 32-bit-wide register with 28.4 precision. Another way is to test whether the high-order bits of either operand are actually set; if not, the numbers can be safely added despite the register-width limitation. Another solution is to use saturating math: if the result is larger than the register can hold, we set it to the maximum possible value. Also be sure to watch the sign; adding two negative numbers can cause wrap-around just as easily as two positive numbers.
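For example, a quick pre-test of the operands before adding two 16-bit fixed point values (a hypothetical helper, not the library's API):

#include <stdint.h>

/* Hypothetical helper: returns 1 if two 16 bit values can be added without
   overflowing 16 bits, by checking that neither operand uses its top value bits. */
int add16_is_safe (int16_t a, int16_t b)
{
    /* if |a| and |b| both fit in 14 magnitude bits, their sum fits in 15 bits plus sign */
    return (a > -16384 && a < 16384) && (b > -16384 && b < 16384);
}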
An interesting problem in designing fixed-radix pipelines is keeping track of how much of the available precision is actually in use. This is especially true with transform-style operations, which can leave some array cells holding relatively large values while others hold values with near-zero mathematical energy.
A few rules: adding two M-bit numbers yields an (M+1)-bit result (if the values are not pre-tested for overflow).
Multiplying an N-bit number by an M-bit number yields an (N+M)-bit result (if the values are not pre-tested for overflow).
Saturation can be useful in some circumstances but may result in performance hits or loss of data.
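As an illustration, here is a sketch of a saturating 16-bit add done with a wider intermediate register (a hypothetical helper, not the fr_math API):

#include <stdint.h>

/* Hypothetical helper: add two 16 bit fixed point values, clamping the result
   to the 16 bit range instead of letting it wrap. */
int16_t sat_add16 (int16_t a, int16_t b)
{
    int32_t r = (int32_t)a + (int32_t)b;   /* do the math in a wider register */
    if (r > INT16_MAX) r = INT16_MAX;      /* clamp positive overflow         */
    if (r < INT16_MIN) r = INT16_MIN;      /* clamp negative overflow         */
    return (int16_t)r;
}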
Adding...
When adding or subtracting fixed radix numbers the radix points must be aligned beforehand. For example, suppose we want to add A, an s11.4 number, and B, an s9.6 number. We need to make some choices. We could move them to larger registers first, say 32-bit registers, so that A2 is an s27.4 number and B2 is an s25.6 number. Now we can safely shift A2 up by two bits, A2 = A2 << 2. A2 is now an s25.6 number, but we haven't lost any data because the upper bits of the former A are held in the larger register without precision loss.
Now we can add them and get the result C = A2 + B2. The result is an s25.6 number, but the precision actually in use is s12.6 (we take the larger integer part, the 11 bits from A, and the larger fractional part, the 6 bits from B, plus 1 bit for the add operation). So this s12.6 number carries 18 + 1 bits of precision with no loss of accuracy. However, if we need to move it back into a 16-bit number, we have to choose how much fractional precision to keep. The simplest approach is to preserve all the integer bits, fitting the sign and integer bits into the 16-bit register: C = C >> 3 gives an s12.3 number. As long as we keep track of where the radix point is, we have kept full accuracy apart from the 3 discarded fractional bits. Had we tested A and B beforehand, we might have found that we could keep more fractional bits. A minimal sketch of this sequence follows.
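This sketch uses illustrative values (123.5 and 45.25); the variable names follow the text above, not the fr_math API:

#include <stdio.h>
#include <stdint.h>

int main (void)
{
    int16_t A = (123 << 4) | 8;              /* 123.5 as an s11.4 value (8/16 = 0.5)     */
    int16_t B = (45 << 6) | 16;              /* 45.25 as an s9.6  value (16/64 = 0.25)   */

    int32_t A2 = A;                          /* widen to 32 bits: now an s27.4 register  */
    int32_t B2 = B;                          /* widen to 32 bits: now an s25.6 register  */

    A2 = A2 << 2;                            /* align radix points: s27.4 -> s25.6       */
    int32_t C = A2 + B2;                     /* s25.6 register, s12.6 bits actually live */
    printf("C   = %f\n", (double)C / 64.0);  /* 64 = 2^6; prints 168.750000              */

    int16_t C16 = (int16_t)(C >> 3);         /* drop 3 fractional bits to fit 16 bits    */
    printf("C16 = %f\n", (double)C16 / 8.0); /* 8 = 2^3; still 168.750000 in this case   */
    return 0;
}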
Multiplying...
Multiplying does not require that the radix points be aligned before the operation is carried out. Let's assume we have the two numbers from the Adding example. A is an s11.4 precision number which we move to A2, now an s27.4 large register (but still only an s11.4 number of bits is in use). B is an s9.6 number which we move to B2, now an s25.6 large register (but still only an s9.6 number of bits is in use). C = A2 * B2 results in C being an s20.10 number; note that C uses essentially the entire 32-bit register for the result.

Now if we want the result squeezed back into a 16-bit register we must make some hard choices. We already have 20 bits of integer precision, so any attempt to fit the result into a 16-bit register (without looking at the number of bits actually used) must involve some form of truncation. If we take the top 15 bits of the result (plus 1 bit for the sign), we as programmers must remember that the stored value is scaled down by 20 - 15 = 5 bits. So even though the top 15 bits fit in a 16-bit register, we lose the bottom 15 bits of precision (including all 10 fractional bits) and must remember that the result is scaled by 5 bits (or integer 32).

Interestingly, if we test both A and B beforehand we may find that, although they have the stated precision by convention, they may not actually contain that many live (set) bits (e.g. if A is an s11.4 number by the programmer's convention but its actual value is the integer 33, i.e. fixed-radix 2 and 1/16, then we may not need to truncate as many bits of C).
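A minimal sketch of the full-width multiply (values and names are illustrative, not the fr_math API). Because the example values here are small, we can keep 4 fractional bits when packing the product back into 16 bits rather than shifting all the way down:

#include <stdio.h>
#include <stdint.h>

int main (void)
{
    int16_t A = (3 << 4) | 4;                 /* 3.25 as an s11.4 value (4/16 = 0.25)     */
    int16_t B = (2 << 6) | 32;                /* 2.5  as an s9.6  value (32/64 = 0.5)     */

    int32_t C = (int32_t)A * (int32_t)B;      /* s20.10 result: 4 + 6 fractional bits     */
    printf("C   = %f\n", (double)C / 1024.0); /* 1024 = 2^10; prints 8.125000             */

    /* squeezing back into 16 bits forces a choice; here we keep 4 fractional bits */
    int16_t C16 = (int16_t)(C >> 6);          /* drop 6 fractional bits -> an x.4 result  */
    printf("C16 = %f\n", (double)C16 / 16.0); /* 16 = 2^4                                 */
    return 0;
}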
(The library itself is here: https://github.com/deftio/fr_math)