
Well, or compare to a GPU or a TPU



Those are largely simd but not vector.


That's fair. I haven't spent any time looking through the RISC-V vector extensions yet. I look forward to it, though.


They are said to be "inspired" by classic Cray vector instructions, although I have of course never used a Cray :-( so I can't comment on how true that is. I did use a Convex C2[0] for a while which also had real vector instructions, but it was all hidden behind a compiler option.

[0] https://en.wikipedia.org/wiki/Convex_Computer


How is vector different from simd?


The RISC-V Vector extension allows the vector length to vary at runtime, whereas with SIMD the vector length is fixed at compile time (128-bit, 256-bit, etc.). It means the code is more portable, basically.

With x86 SIMD the standard solution is to compile the same code multiple times for different SIMD widths (using different instructions) and then detect the CPU at runtime. Though that is such a pain that it's only really done in explicitly numerical libraries (Numpy, Eigen, etc.). In theory with Vector you can compile once, run anywhere.
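
A rough sketch of that dispatch pattern, assuming GCC/Clang (the kernel names here are made up, and real libraries dispatch over more feature levels): the same kernel gets compiled for two widths via the target attribute, and `__builtin_cpu_supports` picks one at runtime.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    __attribute__((target("avx2")))
    static void add_bytes_avx2(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
    {
        size_t i = 0;
        for (; i + 32 <= n; i += 32) {   /* 32 bytes per iteration */
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi8(va, vb));
        }
        for (; i < n; i++) dst[i] = a[i] + b[i];   /* scalar tail */
    }

    __attribute__((target("sse2")))
    static void add_bytes_sse2(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
    {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {   /* 16 bytes per iteration */
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(va, vb));
        }
        for (; i < n; i++) dst[i] = a[i] + b[i];
    }

    void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
    {
        if (__builtin_cpu_supports("avx2"))   /* runtime CPU detection */
            add_bytes_avx2(dst, a, b, n);
        else
            add_bytes_sse2(dst, a, b, n);
    }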


That design seems like a reasonable thing for a high level language that could then be converted to different architectures’ simd widths. But I’m kinda surprised it’s good at the ISA level. Eg for something like a vectorized strlen, mightn’t one worry that the cpu would choose vlen[1] too large, causing you to load from cache lines (or pages!) that turn out to be unnecessary for finding the length of the string? With the various simd extensions on x86 or arm, such a routine could be carefully written to align with cache lines and so avoid depending on reading the next line when the string ends before it. I also worry about various simd tricks that seem to rely on the width. Eg I think there’s some instruction to interpret each half-byte of one vector as an index into some vector of 16 things. How could these be ported to risc-v? Or maybe that’s not the sort of thing their vector extensions are meant for.

I guess part of my thinking here is that the ISA designers at intel, arm, aren’t stupid, but they ended up with fixed widths for sse, neon, knights landing, avx, avx-512. Presumably they had reasons to prefer that to the dynamic risc-v style thing. So I wonder: are there some risc-v constraints that push this design (eg maybe low-power environments pushed neon to have a small width and this made higher-power environments suffer; having a dynamic length might allow both to use the same machine code), or were there some reasons intel preferred to stick with fixed widths, eg making something that could only work on more expensive chips and thereby having something people can pay more for? Is there something reasonable written about why risc-v went with this design?

[1] what do you even pass to vsetvl in this case as you don’t know your string length.


I don't think it's difficult to see why variable-length SIMD makes sense. If you want to process 15 elements with a width of 8, you need the function twice: a SIMD version processing whole batches of 8 elements, and a scalar version of the same function to process the last 7 elements. That makes it inherently difficult to write SIMD code even in the simple and happy case of data parallelism. With RISC-V all you do is set vlen to 7 in the last iteration.
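
A minimal sketch with the RVV C intrinsics (my own toy example, adding two byte arrays, not anything from this thread): vsetvl returns how many elements the current iteration handles, so the 7-element tail doesn't need a separate scalar version.

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
    {
        while (n > 0) {
            /* vl = min(n, VLMAX); e.g. 8 then 7 for n = 15 if VLMAX is 8 */
            size_t vl = __riscv_vsetvl_e8m1(n);
            vuint8m1_t va = __riscv_vle8_v_u8m1(a, vl);
            vuint8m1_t vb = __riscv_vle8_v_u8m1(b, vl);
            __riscv_vse8_v_u8m1(dst, __riscv_vadd_vv_u8m1(va, vb, vl), vl);
            a += vl; b += vl; dst += vl; n -= vl;
        }
    }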

>what do you even pass to vsetvl in this case as you don’t know your string length.

I'm not sure what you are trying to say here. You must know the length of the buffer; if you don't, then processing the string is inherently sequential, just like reading from a linked list, since accessing even a single byte beyond the null terminator risks a buffer overflow. Why pick an example that can't be vectorized by definition?


I wonder if you’re using a different definition of ‘vectorized’ from the one I would use. For example glibc provides a vectorized strlen. Here is the sse version: https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/m...

It’s pretty simple to imagine how to write an unoptimized version: read a vector from the start of the string, compare it to 0, convert that to a bitmask, test whether the mask is zero, then either loop or count zeros to locate the terminator and finish.

I would call this vectorized because it operates on 16 bytes (sse) at a time.
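
Roughly like this, as a hedged sketch of the naive version just described (SSE2 intrinsics, not glibc's actual code):

    #include <emmintrin.h>
    #include <stddef.h>

    size_t strlen_sse2_naive(const char *s)
    {
        for (size_t i = 0;; i += 16) {
            /* unaligned 16-byte read, compared against zero */
            __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, _mm_setzero_si128()));
            if (mask != 0)
                return i + (size_t)__builtin_ctz(mask);   /* position of the first zero byte */
        }
    }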

There are a few issues:

1. You’re still spending a lot of time in the scalar code checking loop conditions.

2. You’re doing unaligned reads, which are slower on old processors.

3. You may read across a cache line forcing you to pull a second line into cache even if the string ends before then.

4. You may read across a page boundary, which could cause a segfault if the next page is not accessible.

So the fixes are to do 64-byte (ie cache line) aligned accesses, which also takes care of the page problem (you won’t read from a page until you know the string doesn’t end in the previous one). That deals with the alignment problems. You read four vector registers at a time, but this doesn’t really cost much more if the string is shorter, as it all comes from one cache line. Another trick in the linked code is that it first finds the cache line containing the zero by reading the first 16 bytes and then merging in the next 3 groups with unsigned-min, so it only requires one test against a zero vector instead of 4. Then it finds the zero within the cache line. You need to do a bit of work in the first iteration to become aligned. With AVX, you can use mask registers on reads to handle that first step instead.
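
The unsigned-min merge is the neat part; here is a hedged sketch of just that step as a standalone helper (my own code, assuming a 64-byte-aligned pointer, SSE2):

    #include <emmintrin.h>
    #include <stdbool.h>

    /* does this 64-byte (cache-line-aligned) block contain a zero byte? */
    static bool line_has_zero(const char *line)
    {
        const __m128i *p = (const __m128i *)line;
        __m128i m = _mm_min_epu8(_mm_min_epu8(_mm_load_si128(p + 0), _mm_load_si128(p + 1)),
                                 _mm_min_epu8(_mm_load_si128(p + 2), _mm_load_si128(p + 3)));
        /* the byte-wise minimum is zero somewhere iff any of the 64 bytes is zero,
           so a single compare + movemask covers the whole line */
        return _mm_movemask_epi8(_mm_cmpeq_epi8(m, _mm_setzero_si128())) != 0;
    }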


Another point is that the CPU can sequence the multiple passes through its internal SIMD unit itself, without that having to be done by user code. In the extreme case this degrades to a Cray-1-like vector unit, which still has a measurable performance impact and can be implemented even in very resource-constrained environments.


> what do you even pass to vsetvl in this case as you don’t know your string length.

To the maximum, of course, `vsetvl x<not 0>, x0, ...` will do that for you.

You might read over a page boundary, but there is an instruction for that: `vle8ff.v`, a fault-only-first unit-stride load. That is, it doesn't fault when one of the later elements goes outside our page, and it adjusts the vector length accordingly. [0]
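
Sketched with the RVV C intrinsics it looks roughly like the following (hedged: the spec's example [0] is the authoritative assembly version, this is just the same idea):

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    size_t rvv_strlen(const char *s)
    {
        const uint8_t *p = (const uint8_t *)s;
        size_t len = 0;
        for (;;) {
            size_t vl = __riscv_vsetvlmax_e8m8();                   /* ask for the maximum */
            /* fault-only-first load: vl shrinks at a page boundary instead of trapping */
            vuint8m8_t v = __riscv_vle8ff_v_u8m8(p + len, &vl, vl);
            long idx = __riscv_vfirst_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 0, vl), vl);
            if (idx >= 0)
                return len + (size_t)idx;                           /* found the NUL */
            len += vl;
        }
    }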

> With the various simd extensions on x86 or arm, such a routine could be carefully written to align with cache lines and so avoid depending on reading the next line when the string ends before it.

In practice, on current very early hardware, doing just that is faster, and definitely possible. [1]

> I also worry about various simd tricks that seem to rely on the width. Eg I think there’s some instruction to interpret each half-byte of one vector as an index into some vector of 16 things

I'm not aware of such an instruction, but rvv has vrgather.vv and vrgatherei16.vv to do something similar. Note that you can always return to "fixed-size" implementations if you really need to, by just setting the vl accordingly. But for the most part I don't think this will be necessary. Do you have any specific problem in mind that seems hard to do without fixed-size SIMD?

> I guess part of my thinking here is that the ISA designers at intel, arm, aren’t stupid, but they ended up with fixed widths for sse, neon, knights landing, avx, avx-512.

I mean, ARM now has something similar with SVE, and x86 has a metric ton of legacy to work with and the market size to get adoption of whatever new instruction prefix they add. Edit: Also, didn't AVX10/AVX512vl go in a similar direction to SVE?

> Is there something reasonable written about why risc-v went with this design.

I'm not quite sure, but I remember that it was there from very early in the design. I think it's to have a mostly unified ecosystem for the binary app market. It also makes working with mixed precision easier, because it allows for the LMUL model.

But RISC-V is extensible; the P (Packed SIMD) extension is currently in the works, aimed at the embedded market, for in-GPR SIMD operations in DSP-type applications. [2]

[0] https://github.com/riscv/riscv-v-spec/blob/master/example/st...

[1] https://camel-cdr.github.io/rvv-bench-results/canmv_k230/str...

[2] https://lists.riscv.org/g/tech-p-ext/topics


Thanks for all the detailed information! That answers a bunch of my questions and the implementation of strlen is nice.

The instruction I was thinking of is pshufb. An example ‘weird’ use can be found for detecting white space in simdjson: https://github.com/simdjson/simdjson/blob/24b44309fb52c3e2c5...

This works as follows:

1. Observe that each ascii whitespace character ends with a different nibble.

2. Make a vector of 16 bytes where each element is either the whitespace character whose final nibble equals that element’s index, or some other character with a different final nibble (eg the first element is space = 0x20; the next could be eg 0xff but not 0xf1, as that ends in the same nibble as its index)

3. For each block where you want to find white space, compute pcmpeqb(pshufb(whitespace, input), input). The rules of pshufb mean (a) non-ascii (ie bit 7 set) characters go to 0, so they compare false, and (b) other characters are replaced with an element of whitespace according to their last nibble, so they compare equal only if they are that whitespace character.
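
In SSE terms, a sketch of the trick (my own table values, not simdjson’s exact constants; 0x80 is a filler with the high bit set, so no ascii byte ever equals it):

    #include <emmintrin.h>
    #include <tmmintrin.h>   /* SSSE3, for _mm_shuffle_epi8 (pshufb) */
    #include <stdint.h>

    static uint16_t whitespace_mask(__m128i input)
    {
        const char F = (char)0x80;   /* filler slot, never matches */
        const __m128i table = _mm_setr_epi8(
            ' ', F, F, F, F, F, F, F, F, '\t', '\n', F, F, '\r', F, F);
        /* pshufb: bytes with bit 7 set become 0, the rest pick table[low nibble];
           comparing back against the input leaves 0xFF only at whitespace bytes */
        __m128i shuf = _mm_shuffle_epi8(table, input);
        return (uint16_t)_mm_movemask_epi8(_mm_cmpeq_epi8(shuf, input));
    }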

I’m not sure how easy it would be to do such tricks with vrgather.vv. In particular, the length of the input doesn’t matter (could be longer) but the length of the whitespace table must be 16 bytes. I’m not sure how the whole vlen stuff interacts with tricks like this where you (a) require certain fixed lengths and (b) may have different lengths for tables and input vectors. (And indeed there might just be better ways, eg you could imagine an operation with a 256-bit register where you permute some vector of bytes by sign-extending the nth bit of the 256-bit register into the result where the input byte is n.)


I'm actually doing something quite similar in my in-progress unicode conversion routines.

For utf8 validation there is a clever algorithm that uses three 4-bit look-ups to detect utf8 errors: https://github.com/simdutf/simdutf/blob/master/src/icelake/i...

An aside on LMUL, if you haven't encountered it yet: rvv allows you to group vector registers when setting the vector configuration with vsetvl, such that vector instructions operate on multiple vector registers at once. That is, with LMUL=1 you have v0,v1,...,v31. With LMUL=2 you effectively have v0,v2,...,v30, where each vector register is twice as large. With LMUL=4 you have v0,v4,...,v28, and with LMUL=8 v0,v8,...,v24.

In my code, I happen to read the data with LMUL=2. The trivial implementation would just call vrgather.vv with LMUL=2, but since we only need a lookup table with 128 bits, LMUL=1 would be enough to store the lookup table (V requires a minimum VLEN of 128 bits).

So I do six LMUL=1 vrgather.vv's instead of three LMUL=2 vrgather.vv's, because no lane crossing is required and this will run faster in hardware (see [0] for a relevant micro benchmark):

        # codegen for equivalent of that function
        vsetvli a1, zero, e16, m2, ta, ma
        vsrl.vi v16, v10, 4                  # shift right by 4: high-nibble indices
        vsrl.vi v12, v12, 4
        vsetvli zero, a0, e8, m2, ta, ma
        vand.vi v16, v16, 15                 # mask down to 4-bit lookup indices
        vand.vi v10, v10, 15                 # (v10: low nibbles, unshifted)
        vand.vi v12, v12, 15
        vsetvli a1, zero, e8, m1, ta, ma     # drop to LMUL=1 for the gathers
        vrgather.vv     v18, v8, v16         # six LMUL=1 lookups instead of
        vrgather.vv     v19, v8, v17         # three LMUL=2 ones: one gather per
        vrgather.vv     v16, v9, v10         # half of each LMUL=2 index register
        vrgather.vv     v17, v9, v11
        vrgather.vv     v8, v14, v12
        vrgather.vv     v9, v14, v13
        vsetvli zero, a0, e8, m2, ta, ma     # back to LMUL=2
        vand.vv v10, v18, v16                # AND the three lookup results together
        vand.vv v8, v10, v8



This works for every VLEN of 128 bits or more, but implementations with larger VLEN do have to do a theoretically more complex operation.

I don't think this will be much of a problem in practice though, as I predict most implementations with a smaller VLEN (128, 256, 512 bits) will have a fast LMUL=1 vrgather.vv. Implementations with very long VLENs (e.g. 4096 bits, like ara) could have special fast-path optimizations for smaller lookup ranges, although it remains to be seen what the hardware ecosystem will converge to.

I'm still contemplating whether or not to add a non-vrgather version and runtime dispatch based on large VLENs or quick performance measurements. In my case this would require upwards of 30 instructions when done trivially. Your example would require about 8 eq + 8 and vs 4 shuffle + 4 eq, which isn't that bad.

vrgather.vv is probably the most decisive instruction when it comes to scaling to larger vector lengths.

[0] https://camel-cdr.github.io/rvv-bench-results/canmv_k230/byt...

PS: I just looked over my optimized strlen implementation and realized it had a bug. That's fixed now; the hot path didn't change, just the setup didn't work correctly.


Perhaps another way to do gather with mixed sizes would be with a combination of the vector and packed simd extensions, using a simd register for the lookup table.


I think I'll propose a vrgatherei4, analogous to vrgatherei16 but with 4-bit indices, for LUTs, when I eventually finish my project with a blog post.


TPUs tend to be specialised for matrix multiplications, often at low precision.



