C++ - Faster 16-bit multiplication algorithm for 8-bit MCU
I'm searching for an algorithm to multiply two integer numbers that is better than the one below. Do you have any idea how? (The MCU the code runs on, an ATtiny84/85 or similar, has no MUL/DIV instruction.)
uint16_t umul16_(uint16_t a, uint16_t b) {
    uint16_t res = 0;
    while (b) {
        if (b & 1) res += a;
        b >>= 1;
        a += a;
    }
    return res;
}
This algorithm, when compiled for the ATtiny85/84 using the avr-gcc compiler, is identical to the __mulhi3 algorithm that avr-gcc generates.

The avr-gcc algorithm:
00000106 <__mulhi3>:
 106: 00 24  eor  r0, r0
 108: 55 27  eor  r21, r21
 10a: 04 c0  rjmp .+8       ; 0x114 <__mulhi3+0xe>
 10c: 08 0e  add  r0, r24
 10e: 59 1f  adc  r21, r25
 110: 88 0f  add  r24, r24
 112: 99 1f  adc  r25, r25
 114: 00 97  sbiw r24, 0x00 ; 0
 116: 29 f0  breq .+10      ; 0x122 <__mulhi3+0x1c>
 118: 76 95  lsr  r23
 11a: 67 95  ror  r22
 11c: b8 f3  brcs .-18      ; 0x10c <__mulhi3+0x6>
 11e: 71 05  cpc  r23, r1
 120: b9 f7  brne .-18      ; 0x110 <__mulhi3+0xa>
 122: 80 2d  mov  r24, r0
 124: 95 2f  mov  r25, r21
 126: 08 95  ret
The umul16_ algorithm:
00000044 <umul16_>:
 44: 20 e0  ldi  r18, 0x00 ; 0
 46: 30 e0  ldi  r19, 0x00 ; 0
 48: 61 15  cp   r22, r1
 4a: 71 05  cpc  r23, r1
 4c: 49 f0  breq .+18      ; 0x60 <umul16_+0x1c>
 4e: 60 ff  sbrs r22, 0
 50: 02 c0  rjmp .+4       ; 0x56 <umul16_+0x12>
 52: 28 0f  add  r18, r24
 54: 39 1f  adc  r19, r25
 56: 76 95  lsr  r23
 58: 67 95  ror  r22
 5a: 88 0f  add  r24, r24
 5c: 99 1f  adc  r25, r25
 5e: f4 cf  rjmp .-24      ; 0x48 <umul16_+0x4>
 60: c9 01  movw r24, r18
 62: 08 95  ret
Edit: The instruction set is available here.
Summary

- Consider swapping a and b (original proposal)
- Trying to avoid conditional jumps (not a successful optimization)
- Reshaping the input formula (estimated 35% gain)
- Removing the duplicated shift
- Unrolling the loop: the "optimal" assembly
- Convincing the compiler to give the optimal assembly
1. Consider swapping a and b

One improvement: first compare a and b, and swap them if a < b. You should use as b the smaller of the two, so as to have the minimum number of loop iterations. Note that you can avoid the swap by duplicating the code (if (a < b) jump to a mirrored code section), but I doubt it's worth it.
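As a minimal sketch of this idea (the name umul16_swapped is mine, not from the original post):

```cpp
#include <stdint.h>

// Shift-and-add as in the original umul16_, but the operands are swapped
// so that b is the smaller one, minimizing the number of loop iterations.
// Hypothetical helper name; a sketch of the swap idea, not the original code.
uint16_t umul16_swapped(uint16_t a, uint16_t b) {
    if (a < b) {                // ensure b <= a
        uint16_t t = a; a = b; b = t;
    }
    uint16_t res = 0;
    while (b) {
        if (b & 1) res += a;    // add current multiple when the bit is set
        b >>= 1;
        a += a;                 // shift left, truncated to 16 bits
    }
    return res;
}
```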
2. Trying to avoid conditional jumps (not a successful optimization)

Try:
uint16_t umul16_(uint16_t a, uint16_t b) {
    ///here swap if necessary
    uint16_t accum = 0;
    while (b) {
        accum += ((b & 1) * uint16_t(0xffff)) & a; //hopefully the multiplication gets optimized away
        b >>= 1;
        a += a;
    }
    return accum;
}
From Sergio's feedback, this didn't bring improvements.
3. Reshaping the input formula

Considering that the target architecture has only 8-bit instructions, if we separate the upper and lower 8 bits of the input variables, we can write:
a = a1 * 0x100 + a0;
b = b1 * 0x100 + b0;

a * b = a1 * b1 * 0x10000 + a0 * b1 * 0x100 + a1 * b0 * 0x100 + a0 * b0
Now, the cool thing is that we can throw away the term a1 * b1 * 0x10000, because the 0x10000 sends it out of the 16-bit register.

(16 bit) a * b = a0 * b1 * 0x100 + a1 * b0 * 0x100 + a0 * b0
Furthermore, the a0*b1 and a1*b0 terms can be treated as 8-bit multiplications, because of the 0x100: any part exceeding 256 is sent out of the register.

So far so exciting! ... But, here comes reality striking: a0 * b0 has to be treated as a full 16-bit multiplication, as you'll have to keep all the resulting bits. Also, a0 will have to be kept on 16 bits to allow the left shifts. This multiplication has half the iterations of a * b, and it is in part 8-bit (because of b0), but we still have to take into account the two 8-bit multiplications mentioned before, and the final result composition. We need further reshaping!
So I collect b0.

(16 bit) a * b = a0 * b1 * 0x100 + b0 * (a0 + a1 * 0x100)
But (a0 + a1 * 0x100) = a, so we get:

(16 bit) a * b = a0 * b1 * 0x100 + b0 * a
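As a quick host-side sanity check of this identity (a throwaway snippet of mine, not part of the original answer), the reshaped expression must agree with the native product modulo 2^16:

```cpp
#include <stdint.h>

// Check (16 bit) a * b == a0 * b1 * 0x100 + b0 * a (mod 2^16),
// where a0, b0, b1 are the byte halves defined in the text.
bool reshaped_formula_holds(uint16_t a, uint16_t b) {
    uint8_t a0 = a & 0xff;
    uint8_t b0 = b & 0xff;
    uint8_t b1 = b >> 8;
    uint32_t reshaped = (uint32_t)a0 * b1 * 0x100 + (uint32_t)b0 * a;
    return (uint16_t)reshaped == (uint16_t)((uint32_t)a * b);
}
```

The discarded a1 * b1 * 0x10000 term is exactly the part that falls outside the low 16 bits, which is why the check holds for any inputs.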
If the original a * b takes N cycles, the first term is an 8-bit multiplication taking N/2 cycles, and the second a 16-bit * 8-bit multiplication taking N/2 cycles. Considering M the number of instructions per iteration in the original a * b, the 8-bit * 8-bit iteration has half the instructions, and the 16-bit * 8-bit about 80% of M (one shift instruction less for b0 compared to b). Putting it together we have complexity N/2*M/2 + N/2*M*0.8 = N*M*0.65, an expected saving of ~35% with respect to the original N*M. Sounds promising.

This is the code:
uint16_t umul16_(uint16_t a, uint16_t b) {
    uint8_t res1 = 0;

    uint8_t a0 = a & 0xff; //this needs to copy the data
    uint8_t b0 = b & 0xff; //this should be optimized away
    uint8_t b1 = b >> 8;   //this should be optimized away

    //here a0 and b1 could be swapped (to have b1 < a0)
    while (b1) {///maximum 8 cycles
        if (b1 & 1) res1 += a0;
        b1 >>= 1;
        a0 += a0;
    }

    uint16_t res = (uint16_t)res1 * 256; //should be optimized away, it's not a copy!

    //here swapping wouldn't make sense
    while (b0) {///maximum 8 cycles
        if (b0 & 1) res += a;
        b0 >>= 1;
        a += a;
    }
    return res;
}
Also, splitting into two loops should double, in theory, the chance of skipping some cycles: N/2 might be a slight overestimate.

A tiny further improvement consists in avoiding the last, unnecessary shift of the a variables. A small side note: if either b0 or b1 is zero it costs 2 extra instructions. But it also saves the first check of b0 and b1, which is the most expensive because it cannot use the zero flag status from the shift operation for the conditional jump of the loop.
uint16_t umul16_(uint16_t a, uint16_t b) {
    uint8_t res1 = 0;

    uint8_t a0 = a & 0xff; //this needs to copy the data
    uint8_t b0 = b & 0xff; //this should be optimized away
    uint8_t b1 = b >> 8;   //this should be optimized away

    //here a0 and b1 could be swapped (to have b1 < a0)
    if (b1 & 1) res1 += a0;
    b1 >>= 1;
    while (b1) {///maximum 7 cycles
        a0 += a0;
        if (b1 & 1) res1 += a0;
        b1 >>= 1;
    }

    uint16_t res = (uint16_t)res1 * 256; //should be optimized away, it's not a copy!

    //here swapping wouldn't make sense
    if (b0 & 1) res += a;
    b0 >>= 1;
    while (b0) {///maximum 7 cycles
        a += a;
        if (b0 & 1) res += a;
        b0 >>= 1;
    }
    return res;
}
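Since each rewrite gets trickier, a small host-side harness (my addition, not in the original answer) is handy to confirm a candidate still matches the truncated native product:

```cpp
#include <stdint.h>

// Reference: low 16 bits of the full product.
static uint16_t mul_ref(uint16_t a, uint16_t b) {
    return (uint16_t)((uint32_t)a * b);
}

// Host-side copy of the split 8-bit/16-bit version from the text.
static uint16_t umul16_split(uint16_t a, uint16_t b) {
    uint8_t res1 = 0;
    uint8_t a0 = a & 0xff;
    uint8_t b0 = b & 0xff;
    uint8_t b1 = b >> 8;
    while (b1) {            // a0 * b1, kept modulo 256
        if (b1 & 1) res1 += a0;
        b1 >>= 1;
        a0 += a0;
    }
    uint16_t res = (uint16_t)res1 * 256;
    while (b0) {            // plus b0 * a, modulo 2^16
        if (b0 & 1) res += a;
        b0 >>= 1;
        a += a;
    }
    return res;
}

// Compare the two over a strided sample of the 16-bit x 16-bit space.
static bool check_umul16_split(void) {
    for (uint32_t a = 0; a <= 0xFFFF; a += 251)
        for (uint32_t b = 0; b <= 0xFFFF; b += 241)
            if (umul16_split((uint16_t)a, (uint16_t)b) !=
                mul_ref((uint16_t)a, (uint16_t)b))
                return false;
    return true;
}
```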
4. Removing the duplicated shift

Is there still room for improvement? Yes, as the byte in a0 gets shifted twice. So there should be a benefit in combining the two loops. It might be a little bit tricky to convince the compiler to do exactly what we want, especially with the result register.
So, we process b0 and b1 in the same loop. The first thing to handle is: what is the loop exit condition? So far using the b0/b1 cleared status has been convenient because it avoids using a counter. Furthermore, after the shift right, a flag is already set if the operation result is zero, and this flag allows a conditional jump without further evaluations.

Now the loop exit condition could be the failure of (b0 || b1). However this would require expensive computation. One solution is to compare b0 and b1 and jump to 2 different code sections: if b1 > b0 I test the condition on b1, else I test the condition on b0. I prefer another solution, with 2 loops: the first exits when b0 is zero, the second when b1 is zero. There will be cases in which I do zero iterations on b1. The point is that in the second loop I know b0 is zero, so I can reduce the number of operations performed.
Now, let's forget the exit condition and try to join the 2 loops of the previous section.
uint16_t umul16_(uint16_t a, uint16_t b) {
    uint16_t res = 0;
    uint8_t b0 = b & 0xff; //this should be optimized away
    uint8_t b1 = b >> 8;   //this should be optimized away

    //swapping doesn't make sense anymore
    if (b1 & 1) res += (uint16_t)((uint8_t)(a & 0xff)) * 256;
    //hopefully the compiler understands it has to add the low 8-bit register of a to the high 8-bit register of res
    if (b0 & 1) res += a;

    b1 >>= 1;
    b0 >>= 1;
    while (b0) {///N cycles, maximum 7
        a += a;
        if (b1 & 1) res += (uint16_t)((uint8_t)(a & 0xff)) * 256;
        if (b0 & 1) res += a;
        b1 >>= 1;
        b0 >>= 1; //I try to put the last one here to leave the carry flag in the desired state
    }

    uint8_t a0 = a & 0xff; //again, not a real copy but a register selection
    while (b1) {///P cycles, maximum 7 - N cycles
        a0 += a0;
        if (b1 & 1) res += (uint16_t)a0 * 256;
        b1 >>= 1;
    }
    return res;
}
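The joined loop is the easiest place to introduce a bug, so here is a host-side check of it (my addition; note it uses a & 0xff, not a && 0xff, for the low byte):

```cpp
#include <stdint.h>

// Host-side copy of the combined-loop version from the text.
static uint16_t umul16_combined(uint16_t a, uint16_t b) {
    uint16_t res = 0;
    uint8_t b0 = b & 0xff;
    uint8_t b1 = b >> 8;
    if (b1 & 1) res += (uint16_t)((uint8_t)(a & 0xff)) * 256;
    if (b0 & 1) res += a;
    b1 >>= 1;
    b0 >>= 1;
    while (b0) {                       // b0 and b1 bits consumed together
        a += a;
        if (b1 & 1) res += (uint16_t)((uint8_t)(a & 0xff)) * 256;
        if (b0 & 1) res += a;
        b1 >>= 1;
        b0 >>= 1;
    }
    uint8_t a0 = a & 0xff;             // continue with the low byte only
    while (b1) {                       // remaining b1 bits, if any
        a0 += a0;
        if (b1 & 1) res += (uint16_t)a0 * 256;
        b1 >>= 1;
    }
    return res;
}

// Strided comparison against the truncated native product.
static bool check_umul16_combined(void) {
    for (uint32_t a = 0; a <= 0xFFFF; a += 239)
        for (uint32_t b = 0; b <= 0xFFFF; b += 257)
            if (umul16_combined((uint16_t)a, (uint16_t)b) != (uint16_t)(a * b))
                return false;
    return true;
}
```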
Thanks to Sergio for providing the generated assembly (-Ofast). At first glance, considering the outrageous amount of mov in the code, it seems the compiler did not interpret as I wanted the hints I gave it about the registers.

Inputs are: r22,r23 and r24,r25.
AVR instruction set: quick reference, detailed documentation.
sbrs //tests a single bit in a register and skips the next instruction if the bit is set. The skip takes 2 clocks.
ldi  //load immediate, 1 clock
sbiw //subtracts immediate from *word*, 2 clocks

00000010 <umul16_antonio5>:
 10: 70 ff  sbrs r23, 0
 12: 39 c0  rjmp .+114     ; 0x86 <__SREG__+0x47>
 14: 41 e0  ldi  r20, 0x01 ; 1
 16: 00 97  sbiw r24, 0x00 ; 0
 18: c9 f1  breq .+114     ; 0x8c <__SREG__+0x4d>
 1a: 34 2f  mov  r19, r20
 1c: 20 e0  ldi  r18, 0x00 ; 0
 1e: 60 ff  sbrs r22, 0
 20: 07 c0  rjmp .+14      ; 0x30 <umul16_antonio5+0x20>
 22: 28 0f  add  r18, r24
 24: 39 1f  adc  r19, r25
 26: 04 c0  rjmp .+8       ; 0x30 <umul16_antonio5+0x20>
 28: e4 2f  mov  r30, r20
 2a: 45 2f  mov  r20, r21
 2c: 2e 2f  mov  r18, r30
 2e: 34 2f  mov  r19, r20
 30: 76 95  lsr  r23
 32: 66 95  lsr  r22
 34: b9 f0  breq .+46      ; 0x64 <__SREG__+0x25>
 36: 88 0f  add  r24, r24
 38: 99 1f  adc  r25, r25
 3a: 58 2f  mov  r21, r24
 3c: 44 27  eor  r20, r20
 3e: 42 0f  add  r20, r18
 40: 53 1f  adc  r21, r19
 42: 70 ff  sbrs r23, 0
 44: 02 c0  rjmp .+4       ; 0x4a <__SREG__+0xb>
 46: 24 2f  mov  r18, r20
 48: 35 2f  mov  r19, r21
 4a: 42 2f  mov  r20, r18
 4c: 53 2f  mov  r21, r19
 4e: 48 0f  add  r20, r24
 50: 59 1f  adc  r21, r25
 52: 60 fd  sbrc r22, 0
 54: e9 cf  rjmp .-46      ; 0x28 <umul16_antonio5+0x18>
 56: e2 2f  mov  r30, r18
 58: 43 2f  mov  r20, r19
 5a: e8 cf  rjmp .-48      ; 0x2c <umul16_antonio5+0x1c>
 5c: 95 2f  mov  r25, r21
 5e: 24 2f  mov  r18, r20
 60: 39 2f  mov  r19, r25
 62: 76 95  lsr  r23
 64: 77 23  and  r23, r23
 66: 61 f0  breq .+24      ; 0x80 <__SREG__+0x41>
 68: 88 0f  add  r24, r24
 6a: 48 2f  mov  r20, r24
 6c: 50 e0  ldi  r21, 0x00 ; 0
 6e: 54 2f  mov  r21, r20
 70: 44 27  eor  r20, r20
 72: 42 0f  add  r20, r18
 74: 53 1f  adc  r21, r19
 76: 70 fd  sbrc r23, 0
 78: f1 cf  rjmp .-30      ; 0x5c <__SREG__+0x1d>
 7a: 42 2f  mov  r20, r18
 7c: 93 2f  mov  r25, r19
 7e: ef cf  rjmp .-34      ; 0x5e <__SREG__+0x1f>
 80: 82 2f  mov  r24, r18
 82: 93 2f  mov  r25, r19
 84: 08 95  ret
 86: 20 e0  ldi  r18, 0x00 ; 0
 88: 30 e0  ldi  r19, 0x00 ; 0
 8a: c9 cf  rjmp .-110     ; 0x1e <umul16_antonio5+0xe>
 8c: 40 e0  ldi  r20, 0x00 ; 0
 8e: c5 cf  rjmp .-118     ; 0x1a <umul16_antonio5+0xa>
5. Unrolling the loop: the "optimal" assembly

With this information, let's try to understand what the "optimal" solution would be given the architecture constraints. "Optimal" is quoted because what is "optimal" depends a lot on the input data and on what we want to optimize. Let's assume we want to optimize the number of cycles in the worst case. If we go for the worst case, loop unrolling is a reasonable choice: we know we have 8 iterations, and we can remove all the tests that check whether we have finished (i.e. whether b0 and b1 are zero). So far we used the trick "we shift, and we check the zero flag" to check if we had to exit the loop. With this requirement removed, we can use a different trick: we shift, and we check the carry bit (the bit we sent out of the register when shifting) to understand if we should update the result. Given the instruction set, in assembly "narrative" code the instructions become the following.
//Input: a = a1 * 256 + a0, b = b1 * 256 + b0
//Output: r = r1 * 256 + r0
Preliminary:
P0 r0 = 0 (CLR)
P1 r1 = 0 (CLR)
Main block:
0 shift right b0 (LSR)
1 if carry is not set skip 2 instructions = jump to 4 (BRCC)
2 r0 = r0 + a0 (ADD)
3 r1 = r1 + a1 + carry from prev. (ADC)
4 shift right b1 (LSR)
5 if carry is not set skip 1 instruction = jump to 7 (BRCC)
6 r1 = r1 + a0 (ADD)
7 a0 = a0 + a0 (ADD)
8 a1 = a1 + a1 + carry from prev. (ADC)
[repeat the same instructions 7 more times]
Branching takes 1 cycle if no jump is caused, 2 otherwise. All the other instructions are 1 cycle. So the b1 state has no influence on the number of cycles, while we have 9 cycles if b0 = 1, and 8 cycles if b0 = 0. Counting the initialization, 8 iterations and skipping the last update of a0 and a1, in the worst case (b0 = 11111111b) we have a total of 8 * 9 + 2 - 2 = 72 cycles. I wouldn't know which C++ implementation would convince the compiler to generate it. Maybe:
void iterate(uint8_t& b0, uint8_t& b1, uint16_t& a, uint16_t& r) {
    const uint8_t temp0 = b0;
    b0 >>= 1;
    if (temp0 & 0x01) { //will this convince it to use the carry flag?
        r += a;
    }
    const uint8_t temp1 = b1;
    b1 >>= 1;
    if (temp1 & 0x01) {
        r += (uint16_t)((uint8_t)(a & 0xff)) * 256;
    }
    a += a;
}

uint16_t umul16_(uint16_t a, uint16_t b) {
    uint16_t r = 0;
    uint8_t b0 = b & 0xff;
    uint8_t b1 = b >> 8;
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r); //hopefully it understands it doesn't need the last update of a
    return r;
}
But, given the previous result, to obtain the desired code one should really switch to assembly!

Finally one can also consider a more extreme interpretation of the loop unrolling: the SBRC/SBRS instructions allow testing a specific bit of a register. We can therefore avoid shifting b0 and b1, and at each cycle check a different bit. The only problem is that these instructions only allow skipping the next instruction, not a custom jump. So, in "narrative code" it will look like this:
Main block:
0 test nth bit of b0 (SBRS). If set jump to 2 (+ 1 cycle) otherwise continue with 1
1 jump to 4 (RJMP)
2 r0 = r0 + a0 (ADD)
3 r1 = r1 + a1 + carry from prev. (ADC)
4 test nth bit of b1 (SBRC). If cleared jump to 6 (+ 1 cycle) otherwise continue with 5
5 r1 = r1 + a0 (ADD)
6 a0 = a0 + a0 (ADD)
7 a1 = a1 + a1 + carry from prev. (ADC)
While this substitution allows saving a cycle in some cases, there's no clear advantage over the previous version. However, I believe the C++ code might be easier for the compiler to interpret. Considering 8 iterations, the initialization and skipping the last update of a0 and a1, we now have 64 cycles.

The C++ code:
template<uint8_t mask>
void iterateWithMask(const uint8_t& b0, const uint8_t& b1, uint16_t& a, uint16_t& r) {
    if (b0 & mask) r += a;
    if (b1 & mask) r += (uint16_t)((uint8_t)(a & 0xff)) * 256;
    a += a;
}

uint16_t umul16_(uint16_t a, const uint16_t b) {
    uint16_t r = 0;
    const uint8_t b0 = b & 0xff;
    const uint8_t b1 = b >> 8;
    iterateWithMask<0x01>(b0, b1, a, r);
    iterateWithMask<0x02>(b0, b1, a, r);
    iterateWithMask<0x04>(b0, b1, a, r);
    iterateWithMask<0x08>(b0, b1, a, r);
    iterateWithMask<0x10>(b0, b1, a, r);
    iterateWithMask<0x20>(b0, b1, a, r);
    iterateWithMask<0x40>(b0, b1, a, r);
    iterateWithMask<0x80>(b0, b1, a, r); //hopefully it understands it doesn't need the last update of a
    return r;
}
Note that in this implementation 0x01, 0x02, ... are not real values, but just hints to the compiler about which bit to test. Therefore, the mask cannot be obtained by shifting right: differently from all the other functions seen so far, this one has no equivalent loop version.
One big problem is that the line

r += (uint16_t)((uint8_t)(a & 0xff)) * 256;

should be just a sum of the upper register of r with the lower register of a. But it does not get interpreted like that. Another option:

r += (uint16_t)256 * ((uint8_t)(a & 0xff));
6. Convincing the compiler to give the optimal assembly

We can also keep a constant, and shift instead the result r. In this case we process b starting from the most significant bit. The complexity is equivalent, but it might be easier for the compiler to digest. Also, this time we have to be careful to write explicitly the last loop iteration, which must not perform a further shift of r.
template<uint8_t mask>
void inverseIterateWithMask(const uint8_t& b0, const uint8_t& b1, const uint16_t& a, const uint8_t& a0, uint16_t& r) {
    if (b0 & mask) r += a;
    if (b1 & mask) r += (uint16_t)256 * a0; //hopefully easier to understand for the compiler?
    r += r;
}

uint16_t umul16_(const uint16_t a, const uint16_t b) {
    uint16_t r = 0;
    const uint8_t b0 = b & 0xff;
    const uint8_t b1 = b >> 8;
    const uint8_t a0 = a & 0xff;
    inverseIterateWithMask<0x80>(b0, b1, a, a0, r);
    inverseIterateWithMask<0x40>(b0, b1, a, a0, r);
    inverseIterateWithMask<0x20>(b0, b1, a, a0, r);
    inverseIterateWithMask<0x10>(b0, b1, a, a0, r);
    inverseIterateWithMask<0x08>(b0, b1, a, a0, r);
    inverseIterateWithMask<0x04>(b0, b1, a, a0, r);
    inverseIterateWithMask<0x02>(b0, b1, a, a0, r);
    //Last iteration:
    if (b0 & 0x01) r += a;
    if (b1 & 0x01) r += (uint16_t)256 * a0;
    return r;
}
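The same host-side verification applies to this MSB-first variant; the sketch below (my test wrapper, passing the scalars by value for simplicity) compares it against the truncated native product:

```cpp
#include <stdint.h>

// MSB-first variant from the text: a stays constant, r is shifted instead.
template<uint8_t mask>
static void inverseIterateWithMask(uint8_t b0, uint8_t b1, uint16_t a,
                                   uint8_t a0, uint16_t& r) {
    if (b0 & mask) r += a;
    if (b1 & mask) r += (uint16_t)256 * a0;
    r += r;                            // shift the result left
}

static uint16_t umul16_msb_first(uint16_t a, uint16_t b) {
    uint16_t r = 0;
    const uint8_t b0 = b & 0xff;
    const uint8_t b1 = b >> 8;
    const uint8_t a0 = a & 0xff;
    inverseIterateWithMask<0x80>(b0, b1, a, a0, r);
    inverseIterateWithMask<0x40>(b0, b1, a, a0, r);
    inverseIterateWithMask<0x20>(b0, b1, a, a0, r);
    inverseIterateWithMask<0x10>(b0, b1, a, a0, r);
    inverseIterateWithMask<0x08>(b0, b1, a, a0, r);
    inverseIterateWithMask<0x04>(b0, b1, a, a0, r);
    inverseIterateWithMask<0x02>(b0, b1, a, a0, r);
    if (b0 & 0x01) r += a;             // last iteration: no final shift
    if (b1 & 0x01) r += (uint16_t)256 * a0;
    return r;
}

// Strided comparison against the truncated native product.
static bool check_msb_first(void) {
    for (uint32_t a = 0; a <= 0xFFFF; a += 997)
        for (uint32_t b = 0; b <= 0xFFFF; b += 769)
            if (umul16_msb_first((uint16_t)a, (uint16_t)b) != (uint16_t)(a * b))
                return false;
    return true;
}
```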