c++ - Faster 16bit multiplication algorithm for 8-bit MCU


I'm searching for an algorithm to multiply two integer numbers that is better than the one below. Do you have any idea how to do that? (The MCU the code runs on, an ATtiny 84/85 or similar, has no MUL/DIV instruction.)

uint16_t umul16_(uint16_t a, uint16_t b) {
    uint16_t res = 0;

    while (b) {
        if (b & 1)
            res += a;
        b >>= 1;
        a += a;
    }

    return res;
}
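As a side note (my addition, not part of the question): the routine is easy to check on a host machine against the native product truncated to 16 bits, which also documents the intended wrap-around semantics:

```cpp
#include <cstdint>

// Reference shift-and-add multiplication, as in the question (host-buildable copy).
uint16_t umul16_ref(uint16_t a, uint16_t b) {
    uint16_t res = 0;
    while (b) {
        if (b & 1)
            res += a;   // accumulate the shifted multiplicand
        b >>= 1;
        a += a;         // a <<= 1; bits shifted out of 16 bits are irrelevant mod 2^16
    }
    return res;
}

// Compare against the native product (truncated to 16 bits)
// on a spread of values, including edge cases.
bool umul16_ref_matches_native() {
    const uint16_t vals[] = {0, 1, 2, 255, 256, 257, 0x7fff, 0x8000, 0xffff, 12345};
    for (uint16_t a : vals)
        for (uint16_t b : vals)
            if (umul16_ref(a, b) != (uint16_t)((uint32_t)a * b))
                return false;
    return true;
}
```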

This algorithm, when compiled for the ATtiny 85/84 with the avr-gcc compiler, is identical to the __mulhi3 algorithm that avr-gcc itself generates.

The avr-gcc algorithm:

00000106 <__mulhi3>:
 106:   00 24       eor  r0, r0
 108:   55 27       eor  r21, r21
 10a:   04 c0       rjmp .+8          ; 0x114 <__mulhi3+0xe>
 10c:   08 0e       add  r0, r24
 10e:   59 1f       adc  r21, r25
 110:   88 0f       add  r24, r24
 112:   99 1f       adc  r25, r25
 114:   00 97       sbiw r24, 0x00    ; 0
 116:   29 f0       breq .+10         ; 0x122 <__mulhi3+0x1c>
 118:   76 95       lsr  r23
 11a:   67 95       ror  r22
 11c:   b8 f3       brcs .-18         ; 0x10c <__mulhi3+0x6>
 11e:   71 05       cpc  r23, r1
 120:   b9 f7       brne .-18         ; 0x110 <__mulhi3+0xa>
 122:   80 2d       mov  r24, r0
 124:   95 2f       mov  r25, r21
 126:   08 95       ret

The umul16_ algorithm:

00000044 <umul16_>:
  44:   20 e0       ldi  r18, 0x00    ; 0
  46:   30 e0       ldi  r19, 0x00    ; 0
  48:   61 15       cp   r22, r1
  4a:   71 05       cpc  r23, r1
  4c:   49 f0       breq .+18         ; 0x60 <umul16_+0x1c>
  4e:   60 ff       sbrs r22, 0
  50:   02 c0       rjmp .+4          ; 0x56 <umul16_+0x12>
  52:   28 0f       add  r18, r24
  54:   39 1f       adc  r19, r25
  56:   76 95       lsr  r23
  58:   67 95       ror  r22
  5a:   88 0f       add  r24, r24
  5c:   99 1f       adc  r25, r25
  5e:   f4 cf       rjmp .-24         ; 0x48 <umul16_+0x4>
  60:   c9 01       movw r24, r18
  62:   08 95       ret


Edit: The instruction set is available here.

Summary

  1. Consider swapping a and b (original proposal)
  2. Trying to avoid conditional jumps (not a successful optimization)
  3. Reshaping the input formula (estimated 35% gain)
  4. Removing the duplicated shift
  5. Unrolling the loop: "optimal" assembly
  6. Convincing the compiler to give optimal assembly


1. Consider swapping a and b

One improvement: first compare a and b, and swap them if a < b: you should use b as the smaller of the two, so that the loop runs the minimum number of cycles. Note that you can avoid the swap by duplicating the code (if (a < b), jump to a mirrored code section), but I doubt it's worth it.
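A sketch of that proposal (my code; the name `umul16_swap` is mine): swap so the loop iterates only up to the highest set bit of the smaller operand.

```cpp
#include <cstdint>
#include <utility>

// Variant of the shift-and-add loop that first makes b the smaller factor,
// so the loop exits after the highest set bit of the smaller operand.
uint16_t umul16_swap(uint16_t a, uint16_t b) {
    if (a < b)
        std::swap(a, b);  // b is now <= a: fewer loop iterations
    uint16_t res = 0;
    while (b) {
        if (b & 1)
            res += a;
        b >>= 1;
        a += a;
    }
    return res;
}
```

Since multiplication is commutative, the result is unchanged; only the iteration count depends on which operand drives the loop.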


2. Trying to avoid conditional jumps (not a successful optimization)

The attempt:

uint16_t umul16_(uint16_t a, uint16_t b) {
    ///here swap if necessary
    uint16_t accum = 0;

    while (b) {
        accum += ((b & 1) * uint16_t(0xffff)) & a; //hopefully the multiplication is optimized away
        b >>= 1;
        a += a;
    }

    return accum;
}

From Sergio's feedback, this didn't bring improvements.
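For reference, the idiomatic branchless formulation builds the all-ones/all-zeros mask with two's-complement negation instead of a multiplication. This variant is my own illustration and, per the feedback above, there is no reason to expect it to beat the branching loop on AVR:

```cpp
#include <cstdint>

// Branchless shift-and-add: (0 - (b & 1)) is 0xffff when the low bit is set
// and 0x0000 otherwise, so the AND conditionally keeps a without a jump.
uint16_t umul16_branchless(uint16_t a, uint16_t b) {
    uint16_t res = 0;
    while (b) {
        res += (uint16_t)(0u - (b & 1u)) & a;
        b >>= 1;
        a += a;
    }
    return res;
}
```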


3. Reshaping the input formula

Considering that the target architecture has essentially only 8-bit instructions, if we separate the upper and lower 8 bits of the input variables, we can write:

a = a1 * 256 + a0
b = b1 * 256 + b0

a * b = a1 * b1 * 65536 + a0 * b1 * 256 + a1 * b0 * 256 + a0 * b0

Now, the cool thing is that we can throw away the term a1 * b1 * 65536, because the factor 65536 sends it out of the 16-bit register.

(16bit) a * b = a0 * b1 * 256 + a1 * b0 * 256 + a0 * b0

Furthermore, the a0*b1 and a1*b0 terms can be treated as 8-bit multiplications because of the 256 factor: any part of the product above 8 bits is sent out of the 16-bit register.

So far so exciting! ... But here comes reality striking: a0 * b0 has to be treated as a full 16-bit multiplication, as you'll have to keep all the resulting bits. Also, a0 has to be kept in a 16-bit variable to allow the left shifts. This multiplication has half the iterations of a * b and is in part 8-bit (because of b0), but we still have to take into account the two 8-bit multiplications mentioned before, and the final result composition. We need further reshaping!

So we collect b0:

(16bit) a * b = a0 * b1 * 256 + b0 * (a0 + a1 * 256)

But

(a0 + a1 * 256) = a

so we get:

(16bit) a * b = a0 * b1 * 256 + b0 * a
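As a host-side sanity check (my addition, not part of the original answer), the truncated identity above can be verified directly: modulo 2^16, a * b equals a0 * b1 * 256 + b0 * a.

```cpp
#include <cstdint>

// Verify (a * b) mod 2^16 == (a0 * b1 * 256 + b0 * a) mod 2^16,
// where a = a1 * 256 + a0 and b = b1 * 256 + b0.
bool identity_holds(uint16_t a, uint16_t b) {
    const uint8_t a0 = a & 0xff;
    const uint8_t b0 = b & 0xff;
    const uint8_t b1 = b >> 8;
    const uint16_t lhs = (uint16_t)((uint32_t)a * b);
    const uint16_t rhs = (uint16_t)((uint32_t)a0 * b1 * 256 + (uint32_t)b0 * a);
    return lhs == rhs;
}

// Spot-check the identity over a spread of values, including edge cases.
bool identity_holds_everywhere() {
    const uint16_t vals[] = {0, 1, 255, 256, 4660, 0x8000, 0xffff};
    for (uint16_t a : vals)
        for (uint16_t b : vals)
            if (!identity_holds(a, b))
                return false;
    return true;
}
```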

If n is the number of cycles of the original a * b, the first term is an 8-bit multiplication taking n/2 cycles, and the second a 16bit * 8bit multiplication taking n/2 cycles. Considering m the number of instructions per iteration in the original a * b, the 8bit*8bit iteration has half the instructions, and the 16bit*8bit about 80% of m (one shift instruction less, since b0 is 8-bit compared to b). Putting it together we have a complexity of n/2*m/2 + n/2*m*0.8 = n*m*0.65, i.e. an expected saving of ~35% with respect to the original n*m. Sounds promising.

The code:

uint16_t umul16_(uint16_t a, uint16_t b) {
    uint8_t res1 = 0;

    uint8_t a0 = a & 0xff; //this needs to copy the data
    uint8_t b0 = b & 0xff; //this should be optimized away
    uint8_t b1 = b >> 8;   //this should be optimized away

    //here a0 and b1 can be swapped (to have b1 < a0)
    while (b1) {///maximum 8 cycles
        if (b1 & 1)
            res1 += a0;
        b1 >>= 1;
        a0 += a0;
    }

    uint16_t res = (uint16_t)res1 * 256; //should be optimized away, it's not a copy!

    //here swapping wouldn't make sense
    while (b0) {///maximum 8 cycles
        if (b0 & 1)
            res += a;
        b0 >>= 1;
        a += a;
    }

    return res;
}

Also, splitting the work into two loops should, in theory, double the chance of skipping some iterations: n/2 might be a slight overestimate.

A tiny further improvement consists in avoiding the last, unnecessary shift of the a variables. Small side note: if either b0 or b1 is zero, this causes two extra instructions. But it also removes the first check of b0 and b1, which was the most expensive, because it could not reuse the zero-flag status of a shift operation for the conditional jump of the loop.

uint16_t umul16_(uint16_t a, uint16_t b) {
    uint8_t res1 = 0;

    uint8_t a0 = a & 0xff; //this needs to copy the data
    uint8_t b0 = b & 0xff; //this should be optimized away
    uint8_t b1 = b >> 8;   //this should be optimized away

    //here a0 and b1 can be swapped (to have b1 < a0)
    if (b1 & 1)
        res1 += a0;
    b1 >>= 1;
    while (b1) {///maximum 7 cycles
        a0 += a0;
        if (b1 & 1)
            res1 += a0;
        b1 >>= 1;
    }

    uint16_t res = (uint16_t)res1 * 256; //should be optimized away, it's not a copy!

    //here swapping wouldn't make sense
    if (b0 & 1)
        res += a;
    b0 >>= 1;
    while (b0) {///maximum 7 cycles
        a += a;
        if (b0 & 1)
            res += a;
        b0 >>= 1;
    }

    return res;
}


4. Removing the duplicated shift

Is there still space for improvement? Yes, as the bytes in a0 get shifted two times. So there should be a benefit in combining the two loops. It might be a little bit tricky to convince the compiler to do exactly what we want, especially with the result register.

So, we process b0 and b1 in the same cycle. The first thing to handle is: what is the loop exit condition? So far, using the b0/b1-cleared status has been convenient because it avoids using a counter. Furthermore, after a shift right, the zero flag is set if the result of the operation is zero, and that flag allows a conditional jump without further evaluations.

Now the loop exit condition would be the failure of (b0 || b1). However, this would require expensive computation. One solution is to compare b0 and b1 and jump to two different code sections: if b1 > b0, test the condition on b1, else test the condition on b0. I prefer another solution, with two loops: the first exits when b0 is zero, the second when b1 is zero. There will be cases with zero iterations on b1. The point is that in the second loop we know b0 is zero, so we can reduce the number of operations performed.

Now, let's forget about the exit condition and try to join the two loops of the previous section.

uint16_t umul16_(uint16_t a, uint16_t b) {
    uint16_t res = 0;

    uint8_t b0 = b & 0xff; //this should be optimized away
    uint8_t b1 = b >> 8;   //this should be optimized away

    //swapping doesn't make sense anymore
    if (b1 & 1)
        res += (uint16_t)((uint8_t)(a & 0xff)) * 256;
    //hopefully the compiler understands it has to add the low 8-bit register of a to the high 8-bit register of res

    if (b0 & 1)
        res += a;

    b1 >>= 1;
    b0 >>= 1;
    while (b0) {///n cycles, maximum 7
        a += a;
        if (b1 & 1)
            res += (uint16_t)((uint8_t)(a & 0xff)) * 256;
        if (b0 & 1)
            res += a;
        b1 >>= 1;
        b0 >>= 1; //I try to put the last one to leave the carry flag in the desired state
    }

    uint8_t a0 = a & 0xff; //again, not a real copy but a register selection

    while (b1) {///p cycles, maximum 7 - n cycles
        a0 += a0;
        if (b1 & 1)
            res += (uint16_t)a0 * 256;
        b1 >>= 1;
    }
    return res;
}

Thanks to Sergio for providing the generated assembly (-Ofast). At first glance, considering the outrageous amount of mov instructions in the code, it seems the compiler did not interpret the hints I gave it about register usage.

The inputs are in r22,r23 and r24,r25.
AVR instruction set: quick reference, detailed documentation.

sbrs // tests a single bit in a register and skips the next instruction if the bit is set. The skip takes 2 clocks.
ldi  // load immediate, 1 clock
sbiw // subtracts immediate from *word*, 2 clocks

00000010 <umul16_antonio5>:
  10:   70 ff       sbrs r23, 0
  12:   39 c0       rjmp .+114       ; 0x86 <__SREG__+0x47>
  14:   41 e0       ldi  r20, 0x01   ; 1
  16:   00 97       sbiw r24, 0x00   ; 0
  18:   c9 f1       breq .+114       ; 0x8c <__SREG__+0x4d>
  1a:   34 2f       mov  r19, r20
  1c:   20 e0       ldi  r18, 0x00   ; 0
  1e:   60 ff       sbrs r22, 0
  20:   07 c0       rjmp .+14        ; 0x30 <umul16_antonio5+0x20>
  22:   28 0f       add  r18, r24
  24:   39 1f       adc  r19, r25
  26:   04 c0       rjmp .+8         ; 0x30 <umul16_antonio5+0x20>
  28:   e4 2f       mov  r30, r20
  2a:   45 2f       mov  r20, r21
  2c:   2e 2f       mov  r18, r30
  2e:   34 2f       mov  r19, r20
  30:   76 95       lsr  r23
  32:   66 95       lsr  r22
  34:   b9 f0       breq .+46        ; 0x64 <__SREG__+0x25>
  36:   88 0f       add  r24, r24
  38:   99 1f       adc  r25, r25
  3a:   58 2f       mov  r21, r24
  3c:   44 27       eor  r20, r20
  3e:   42 0f       add  r20, r18
  40:   53 1f       adc  r21, r19
  42:   70 ff       sbrs r23, 0
  44:   02 c0       rjmp .+4         ; 0x4a <__SREG__+0xb>
  46:   24 2f       mov  r18, r20
  48:   35 2f       mov  r19, r21
  4a:   42 2f       mov  r20, r18
  4c:   53 2f       mov  r21, r19
  4e:   48 0f       add  r20, r24
  50:   59 1f       adc  r21, r25
  52:   60 fd       sbrc r22, 0
  54:   e9 cf       rjmp .-46        ; 0x28 <umul16_antonio5+0x18>
  56:   e2 2f       mov  r30, r18
  58:   43 2f       mov  r20, r19
  5a:   e8 cf       rjmp .-48        ; 0x2c <umul16_antonio5+0x1c>
  5c:   95 2f       mov  r25, r21
  5e:   24 2f       mov  r18, r20
  60:   39 2f       mov  r19, r25
  62:   76 95       lsr  r23
  64:   77 23       and  r23, r23
  66:   61 f0       breq .+24        ; 0x80 <__SREG__+0x41>
  68:   88 0f       add  r24, r24
  6a:   48 2f       mov  r20, r24
  6c:   50 e0       ldi  r21, 0x00   ; 0
  6e:   54 2f       mov  r21, r20
  70:   44 27       eor  r20, r20
  72:   42 0f       add  r20, r18
  74:   53 1f       adc  r21, r19
  76:   70 fd       sbrc r23, 0
  78:   f1 cf       rjmp .-30        ; 0x5c <__SREG__+0x1d>
  7a:   42 2f       mov  r20, r18
  7c:   93 2f       mov  r25, r19
  7e:   ef cf       rjmp .-34        ; 0x5e <__SREG__+0x1f>
  80:   82 2f       mov  r24, r18
  82:   93 2f       mov  r25, r19
  84:   08 95       ret
  86:   20 e0       ldi  r18, 0x00   ; 0
  88:   30 e0       ldi  r19, 0x00   ; 0
  8a:   c9 cf       rjmp .-110       ; 0x1e <umul16_antonio5+0xe>
  8c:   40 e0       ldi  r20, 0x00   ; 0
  8e:   c5 cf       rjmp .-118       ; 0x1a <umul16_antonio5+0xa>


5. Unrolling the loop: "optimal" assembly

With this information, let's try to understand what the "optimal" solution would be, given the architecture constraints. "Optimal" is quoted because what is optimal depends a lot on the input data and on what we want to optimize. Let's assume we want to optimize the number of cycles in the worst case. If we go for the worst case, loop unrolling is a reasonable choice: we know we have 8 iterations, and we can remove all the tests used to understand if we have finished (i.e. if b0 and b1 are zero). So far we used the trick "we shift, and we check the zero flag" to check if we had to exit the loop. With that requirement removed, we can use a different trick: we shift, and we check the carry bit (the bit sent out of the register when shifting) to understand if we should update the result. Given the instruction set, in assembly "narrative" code the instructions become the following.

//Input: a = a1 * 256 + a0, b = b1 * 256 + b0
//Output: r = r1 * 256 + r0

Preliminary:
P0 r0 = 0 (CLR)
P1 r1 = 0 (CLR)

Main block:
0 shift right b0 (LSR)
1 if carry is not set skip 2 instructions = jump to 4 (BRCC)
2 r0 = r0 + a0 (ADD)
3 r1 = r1 + a1 + carry from prev. (ADC)
4 shift right b1 (LSR)
5 if carry is not set skip 1 instruction = jump to 7 (BRCC)
6 r1 = r1 + a0 (ADD)
7 a0 = a0 + a0 (ADD)
8 a1 = a1 + a1 + carry from prev. (ADC)

[repeat the same instructions 7 times]

Branching takes 1 cycle if no jump is taken, 2 otherwise. All other instructions take 1 cycle. The state of b1 has no influence on the number of cycles, while we have 9 cycles if b0 = 1, and 8 cycles if b0 = 0. Counting initialization, 8 iterations and skipping the last update of a0 and a1, in the worst case (b0 = 11111111b) we have a total of 8 * 9 + 2 - 2 = 72 cycles. I wouldn't know which C++ implementation would convince the compiler to generate it. Maybe:

void iterate(uint8_t& b0, uint8_t& b1, uint16_t& a, uint16_t& r) {
    const uint8_t temp0 = b0;
    b0 >>= 1;
    if (temp0 & 0x01) { //will this convince it to use the carry flag?
        r += a;
    }
    const uint8_t temp1 = b1;
    b1 >>= 1;
    if (temp1 & 0x01) {
        r += (uint16_t)((uint8_t)(a & 0xff)) * 256;
    }
    a += a;
}

uint16_t umul16_(uint16_t a, uint16_t b) {
    uint16_t r = 0;
    uint8_t b0 = b & 0xff;
    uint8_t b1 = b >> 8;

    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r);
    iterate(b0, b1, a, r); //hopefully it understands it doesn't need the last update of a
    return r;
}

But, given the previous result, to obtain the desired code one should really switch to assembly!


Finally, one can also consider a more extreme interpretation of the loop unrolling: the sbrc/sbrs instructions allow testing a specific bit of a register. We can therefore avoid shifting b0 and b1, and at each cycle check a different bit. The only problem is that these instructions only allow skipping the next instruction, not a custom jump. So, in "narrative code", it looks like this:

Main block:
0 test the Nth bit of b0 (SBRS). If set jump to 2 (+1 cycle), otherwise continue with 1
1 jump to 4 (RJMP)
2 r0 = r0 + a0 (ADD)
3 r1 = r1 + a1 + carry from prev. (ADC)
4 test the Nth bit of b1 (SBRC). If cleared jump to 6 (+1 cycle), otherwise continue with 5
5 r1 = r1 + a0 (ADD)
6 a0 = a0 + a0 (ADD)
7 a1 = a1 + a1 + carry from prev. (ADC)

While the first substitution allows saving one cycle, there's no clear advantage in the second substitution. However, I believe the C++ code might be easier for the compiler to interpret. Considering 8 iterations, initialization and skipping the last update of a0 and a1, we now have 64 cycles.

The C++ code:

template<uint8_t mask>
void iteratewithmask(const uint8_t& b0, const uint8_t& b1, uint16_t& a, uint16_t& r) {
    if (b0 & mask)
        r += a;
    if (b1 & mask)
        r += (uint16_t)((uint8_t)(a & 0xff)) * 256;
    a += a;
}

uint16_t umul16_(uint16_t a, const uint16_t b) {
    uint16_t r = 0;
    const uint8_t b0 = b & 0xff;
    const uint8_t b1 = b >> 8;

    iteratewithmask<0x01>(b0, b1, a, r);
    iteratewithmask<0x02>(b0, b1, a, r);
    iteratewithmask<0x04>(b0, b1, a, r);
    iteratewithmask<0x08>(b0, b1, a, r);
    iteratewithmask<0x10>(b0, b1, a, r);
    iteratewithmask<0x20>(b0, b1, a, r);
    iteratewithmask<0x40>(b0, b1, a, r);
    iteratewithmask<0x80>(b0, b1, a, r);

    //hopefully it understands it doesn't need the last update of a
    return r;
}

Note that in this implementation 0x01, 0x02, etc. are not real values but hints, so the compiler knows which bit to test. Therefore, the mask cannot be obtained by shifting right: differently from the other functions seen so far, this one really has no equivalent loop version.

One big problem is that

r+=(uint16_t)((uint8_t)(a & 0xff))*256; 

It should be compiled as a sum of the lower register of a into the upper register of r, but it is not interpreted like that. The other option:

r+=(uint16_t) 256 *((uint8_t)(a & 0xff)); 
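One speculative workaround (my own idea, not tested against avr-gcc) is to spell out the high-byte add manually, discarding the carry exactly as a single 8-bit ADD on the upper register would:

```cpp
#include <cstdint>

// Equivalent to r += (uint16_t)x * 256, written as an explicit 8-bit add
// on the high byte of r; the carry out of 8 bits is intentionally lost,
// mirroring a single ADD on the upper register.
inline void add_to_high_byte(uint16_t& r, uint8_t x) {
    const uint8_t hi = (uint8_t)((r >> 8) + x); // 8-bit wrap-around add
    r = (uint16_t)(((uint16_t)hi << 8) | (r & 0xff));
}
```

Whether a given compiler version actually maps this to one ADD is, of course, something to verify on the generated assembly.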


6. Convincing the compiler to give optimal assembly

We can also keep a constant, and shift the result r instead. In this case we process b starting from the most significant bit. The complexity is equivalent, but it might be easier for the compiler to digest. Also, this time we have to be careful to write the last iteration explicitly, as it must not perform a further shift of r.

template<uint8_t mask>
void inverseiteratewithmask(const uint8_t& b0, const uint8_t& b1, const uint16_t& a, const uint8_t& a0, uint16_t& r) {
    if (b0 & mask)
        r += a;
    if (b1 & mask)
        r += (uint16_t)256 * a0; //hopefully easier to understand for the compiler?
    r += r;
}

uint16_t umul16_(const uint16_t a, const uint16_t b) {
    uint16_t r = 0;
    const uint8_t b0 = b & 0xff;
    const uint8_t b1 = b >> 8;
    const uint8_t a0 = a & 0xff;

    inverseiteratewithmask<0x80>(b0, b1, a, a0, r);
    inverseiteratewithmask<0x40>(b0, b1, a, a0, r);
    inverseiteratewithmask<0x20>(b0, b1, a, a0, r);
    inverseiteratewithmask<0x10>(b0, b1, a, a0, r);
    inverseiteratewithmask<0x08>(b0, b1, a, a0, r);
    inverseiteratewithmask<0x04>(b0, b1, a, a0, r);
    inverseiteratewithmask<0x02>(b0, b1, a, a0, r);

    //last iteration:
    if (b0 & 0x01)
        r += a;
    if (b1 & 0x01)
        r += (uint16_t)256 * a0;

    return r;
}
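For checking this final scheme on a host, here is a compact loop restatement of the same MSB-first idea (my own; the unrolled template version is what one would actually feed the compiler, since a runtime mask defeats the bit-test hints):

```cpp
#include <cstdint>

// Loop restatement of the MSB-first scheme: a stays constant, the
// accumulator r is doubled after every bit except the last one.
uint16_t umul16_msb_first(uint16_t a, uint16_t b) {
    const uint8_t b0 = b & 0xff;
    const uint8_t b1 = b >> 8;
    const uint8_t a0 = a & 0xff;
    uint16_t r = 0;
    for (uint8_t mask = 0x80; mask; mask >>= 1) {
        if (b0 & mask)
            r += a;
        if (b1 & mask)
            r += (uint16_t)((uint16_t)a0 << 8);
        if (mask != 0x01)
            r += r;   // shift left; skipped on the last iteration
    }
    return r;
}
```

On the host this can be checked exhaustively against the native product truncated to 16 bits.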
