
I just noticed your response. The machine I timed it on was an older Lenovo laptop with some sort of dual core Intel processor - indeed not an "unusual" CPU.

In the end, the results are what matter even if you (or I) don't understand why. For all I know, the compiler parallelized some of the code with divisions and didn't for the shift-based code. The thing is, you can't just look at what "should" be fastest, and my experience was a good reminder of that.

In the old days, one optimization technique we used was to "unroll loops" i.e. instead of:

for (int i=0; i<8; i++) a[i] += b[i];

We'd do this:

    // Unrolled loop
    a[0] += b[0];
    a[1] += b[1];
    a[2] += b[2];
    ...
    a[7] += b[7];

The unrolled loop takes fewer clock cycles according to the book. Great - unless the larger "faster" code no longer fits in the CPU cache, in which case it's actually slower. The only way to tell is to measure it.
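To make that concrete, here's a minimal sketch of the kind of measurement I mean. The function names (`add_rolled`, `add_unrolled`, `time_reps`) are my own for illustration, and a crude `clock()` harness like this is only a starting point - the optimizer can dead-code-eliminate the work, and warm caches or frequency scaling can skew results:

```c
#include <time.h>

enum { N = 8 };

/* Rolled version: the compiler decides how (or whether) to unroll. */
void add_rolled(int* a, const int* b) {
    for (int i = 0; i < N; i++) a[i] += b[i];
}

/* Hand-unrolled version: fewer loop-control instructions, more code bytes. */
void add_unrolled(int* a, const int* b) {
    a[0] += b[0]; a[1] += b[1]; a[2] += b[2]; a[3] += b[3];
    a[4] += b[4]; a[5] += b[5]; a[6] += b[6]; a[7] += b[7];
}

/* Crude harness: run a variant many times and return elapsed seconds. */
double time_reps(void (*fn)(int*, const int*), int* a, const int* b, long reps) {
    clock_t t0 = clock();
    for (long r = 0; r < reps; r++) fn(a, b);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Timing both variants with the same harness, on the same machine, at the same optimization level is the only comparison that actually settles it.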

Here is the code. Substitute ">>1" for the "/2" portions and compare.

    // pBuffer contains a single horizontal line in a bitmap
    // destBuffer will contain the first derivative convolution, i.e. if you
    // graph it, it would look like peaks at the edges of things in the image

    void Convolution1D_FastFirstDerivative(int* pBuffer, int sizeOfBuffer, int* destBuffer)
    {
        int lastPos = sizeOfBuffer - 1;
        int isum = 0;

        // Special case start of buffer where filter "hangs off edge"
        isum = -pBuffer[0] / 2;        // * -0.5
        //isum += pBuffer[0] * 0;      // * 0
        isum += pBuffer[1] / 2;        // * 0.5
        destBuffer[0] = isum;

        for (int i = 1; i < sizeOfBuffer - 1; i++)
        {
            isum = -pBuffer[i - 1] / 2;   // * -0.5
            //isum += pBuffer[i] * 0;     // * 0
            isum += pBuffer[i + 1] / 2;   // * 0.5
            destBuffer[i] = isum;
        }

        // Special case end of buffer where filter "hangs off edge"
        isum = -pBuffer[lastPos - 1] / 2;   // * -0.5
        //isum += pBuffer[lastPos] * 0;     // * 0
        isum += pBuffer[lastPos] / 2;       // * 0.5
        destBuffer[lastPos] = isum;
    }
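For reference, here's what the ">>1" substitution looks like (my own hypothetical name for the variant). One caveat worth knowing: it's not bit-for-bit identical to the division version. For a negative left operand, "/ 2" truncates toward zero while an arithmetic ">> 1" (the usual two's-complement behavior, though implementation-defined in C for negative values) rounds toward negative infinity - e.g. -3 / 2 == -1 but -3 >> 1 == -2. On even-valued samples the two agree exactly.

```c
// Shift-based variant: ">> 1" substituted for "/ 2".
// Caution: for negative operands this rounds toward negative infinity,
// while "/ 2" truncates toward zero, so outputs can differ by one
// on odd-valued samples.
void Convolution1D_ShiftFirstDerivative(int* pBuffer, int sizeOfBuffer, int* destBuffer)
{
    int lastPos = sizeOfBuffer - 1;
    int isum;

    // Special case start of buffer where filter "hangs off edge"
    isum = -pBuffer[0] >> 1;        // * -0.5
    isum += pBuffer[1] >> 1;        // * 0.5
    destBuffer[0] = isum;

    for (int i = 1; i < sizeOfBuffer - 1; i++)
    {
        isum = -pBuffer[i - 1] >> 1;   // * -0.5
        isum += pBuffer[i + 1] >> 1;   // * 0.5
        destBuffer[i] = isum;
    }

    // Special case end of buffer where filter "hangs off edge"
    isum = -pBuffer[lastPos - 1] >> 1;   // * -0.5
    isum += pBuffer[lastPos] >> 1;       // * 0.5
    destBuffer[lastPos] = isum;
}
```

On an input of all even values, e.g. {4, 8, 12, 16}, both variants produce the same output, so comparing their timings is an apples-to-apples test.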

Sorry, I wish I could fix the formatting.




