
I agree that it is a pleasure to use. I also suspect that generics will be forthcoming after everyone understands the 'theory of Go' a little bit better.

Unfortunately, the standard libraries that I've sampled have felt inconsistent. Two examples I happen to remember:

1. The strconv package has an Atoi (a wrapper around ParseInt) but no Atof -- instead you use ParseFloat directly, passing one of the integer values '32' or '64' as its second argument to set the precision it parses at (why not an enum, or two funcs?).

2. The bytes package has versions of Index that search for single bytes (IndexByte) as well as byte slices (Index) -- a nice performance-friendly touch. However, Split has only the byte-slice version. SplitByte would probably be twice the speed.

If you are going to write a package to stand the test of time, be consistent.

(edit: I got Atoi the wrong way round)




For what it's worth, I've looked into the 'Split' case. The performance difference when specialized to the single-byte case is about 2%, mostly because Split already has a built-in fast path for single-byte separators, which amounts to a couple of extra instructions in a function whose running time is dominated by allocation.

I think they made the right choice there. The Go team seems very good about optimizing only where it matters: there's lots of low-hanging fruit, but most of it isn't very useful fruit.


Just for fun, I looked into it too. Which factor dominates depends on the kind of strings: for large strings, extra instructions in the loop matter very much, and I'm processing very large strings.

I did 128 runs on a byte array of length 2^24. It has delimiters placed at positions {2^i, i < 24}.
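In case it helps, here's a hypothetical reconstruction of that input -- makeInput and the filler byte are my own invention, not the original harness:

```go
package main

import "fmt"

// makeInput builds a 2^24-byte array with a delimiter byte placed at
// each power-of-two position {2^i, i < 24}, as described above.
func makeInput() []byte {
	const size = 1 << 24
	data := make([]byte, size)
	for i := range data {
		data[i] = 'x' // arbitrary filler byte
	}
	for i := uint(0); i < 24; i++ {
		data[1<<i] = ','
	}
	return data
}

func main() {
	data := makeInput()
	fmt.Println(len(data))         // 16777216
	fmt.Println(data[1<<10] == ',') // true
}
```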

I tested my implementation against both the "bytes" package implementation, and a copy of the relevant portions of the "bytes" package (to account for any odd effects of inlining and separate compilation). I did the set of timings twice in case there was any GC.

Here's the wall time in milliseconds for the three implementations, on a 2010 Macbook Air.

  mine   3313    copy   4709    bytes  5689
  mine   3327    copy   4660    bytes  5660

My single-byte implementation is about 40% faster than the local version, and 70% faster than the "bytes" version. Not quite twice, but I wasn't far off.

But aside from performance, there is just consistency of interface. Once you've established a 'Byte' variant of some functions, you should do it for all the common functions.


It doesn't sound like you're using the benchmarking tools that Go provides; I'd recommend using them if you're not.

Ah, yeah, I was testing a much smaller byte array with multiple split points. I'm not terribly surprised that in your case you've found the hand-coded byte version to be faster (though the difference is more than I would've guessed; care to post the code?). However, I'm still not sure it's merited in the standard library.

Split() could pretty easily be further specialized to call your single-byte implementation (or equivalent) at the cost of a handful of instructions per call. Alternately, if you know you're dealing with very large byte slices with only a few split points, it is only a couple of lines of code to write a specialized version tuned for that.

The same argument could be made for IndexByte, but I'd claim that IndexByte is a much more fundamental operation, one for which you more often need a hand-tuned assembly implementation. I wouldn't say the same for Split. There's a benefit to having fewer speed-specialized standard library calls, and I don't think splitting on a byte with performance as a primary concern happens often enough to merit another way to split on a byte in the standard library. But I'm certain that reasonable people who are smarter than me would disagree.
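For the curious, here's roughly what such a hand-rolled specialization might look like -- splitByte is a hypothetical name, and this is a sketch built on IndexByte, not anyone's actual benchmark code:

```go
package main

import (
	"bytes"
	"fmt"
)

// splitByte splits s on a single byte by scanning with IndexByte --
// the "couple lines of code" version you'd write by hand if Split on
// a one-byte separator ever showed up in a profile.
func splitByte(s []byte, c byte) [][]byte {
	var parts [][]byte
	for {
		i := bytes.IndexByte(s, c)
		if i < 0 {
			return append(parts, s)
		}
		parts = append(parts, s[:i])
		s = s[i+1:]
	}
}

func main() {
	for _, p := range splitByte([]byte("a,b,,c"), ',') {
		fmt.Printf("%q ", p)
	}
	fmt.Println() // "a" "b" "" "c"
}
```

Like bytes.Split, it keeps empty fields between adjacent separators.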


Sure.

Here's the implementation of the three versions: https://gist.github.com/2821937

Here's how I was originally benchmarking things: https://gist.github.com/2821943

Here's benchmarking using the Go testing package: https://gist.github.com/2821947

Some of the performance I was seeing on my crappy benchmarker evaporates using Go's benchmarker. But there is something else afoot. Try changing 'runs', which controls the size of the inner loop (needed to get enough digits of precision):

From "runs = 128":

  BenchmarkSplit1	       1	2478184000 ns/op
  BenchmarkSplit2	       1	2787795000 ns/op
  BenchmarkSplit3	       1	2747341000 ns/op

From "runs = 32":

  BenchmarkSplit1	1000000000	         0.62 ns/op
  BenchmarkSplit2	1000000000	         0.68 ns/op
  BenchmarkSplit3	1000000000	         0.56 ns/op
Why did it suddenly jump from 1 billion outer loops to just 1? I think there is a bug in the go benchmarker here, because if you take into account the factor-of-4 difference in work and then divide by 1 billion, it looks like the first set of ns/op are actually correct and aren't being scaled correctly.

Either way, the increase in performance is now only about 10%. Which I agree, isn't anything to write home about. More bizarre is that the bytes package one is faster for runs = 32 but not for runs = 128. I can't make head or tail of that, or why it should matter at all -- unless there is custom assembly in pkg/bytes that has odd properties inside that inner loop.

But this is only one half of my complaint: it's the interface that matters, and I see no good reason for having IndexByte but no CountByte and SplitByte, contrary to what you say about which is more fundamental. Having to construct a slice containing a single byte just to call Split and Count left me with a bad taste in my mouth.



