Work on transformer alternatives, especially parallelizable ones like this, is incredibly important. It would suck if we got stuck in a local optimum in architecture without actually looking at nearby viable alternatives.