A paper [1] we wrote in 2015 (cited by the authors) uses more sophisticated data structures (compressed suffix trees) and Kneser–Ney smoothing to get the same "unlimited" context. I imagine that with better smoothing and the larger corpus sizes the authors use, this approach could improve on some of the results they report.
Back then neural LMs were just beginning to emerge and we briefly experimented with using one of these unlimited N-gram models for pre-training but never got any results.
The formulation of a succinct full-text index as an "infinite n-gram model" is genius in connecting with the ML literature, but also blinkered, in that it ignores all the work on using exactly this data structure (well, the FM-index) over DNA. bwa mem is the OG ∞-gram aligner.
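To make the "any k" part concrete, here is a minimal sketch of the counting core of the idea, using a plain suffix array over tokens instead of a compressed suffix tree or FM-index, and maximum likelihood instead of Kneser–Ney smoothing (toy corpus, my own naming):

```python
# Toy "infinite n-gram" lookup: a suffix array over the token stream lets you
# count an arbitrarily long context in O(|context| * log n), so the model
# order never has to be fixed in advance. Real systems use an FM-index /
# compressed suffix tree plus proper smoothing; this is just the bare idea.
# (Requires Python 3.10+ for the `key` argument to bisect.)
from bisect import bisect_left, bisect_right

corpus = "the cat sat on the mat the cat ate the rat".split()
# Suffix array: positions of all suffixes, sorted lexicographically.
sa = sorted(range(len(corpus)), key=lambda i: corpus[i:])

def count(pattern):
    """Number of occurrences of `pattern` (a list of tokens) in the corpus."""
    key = lambda i: corpus[i:i + len(pattern)]
    return bisect_right(sa, pattern, key=key) - bisect_left(sa, pattern, key=key)

def prob(context, token):
    """Maximum-likelihood P(token | context) for a context of any length."""
    c = count(context)
    return count(context + [token]) / c if c else 0.0

print(prob(["the", "cat"], "sat"))  # 0.5 -- "the cat" is followed by "sat" or "ate"
```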
The idea is far older than BWA. The entire point of the Burrows-Wheeler transform is to create a compact order-k "language model" for every value of k simultaneously. And then use that for data compression.
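A toy sketch of what that means: each character of the BWT output is the symbol preceding its sorted context, so symbols with similar right-contexts of any length end up adjacent, which is exactly what a compressor (or later an FM-index) exploits. Real implementations build this via suffix arrays rather than materializing all rotations.

```python
def bwt(text, eos="$"):
    """Burrows-Wheeler transform via sorted rotations (toy version).

    Each output character is the one preceding its rotation's context,
    so characters sharing similar right-contexts cluster into runs.
    """
    text += eos  # sentinel assumed absent from the input and smaller than any character in it
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # 'annb$aa' -- note the run of 'n's, both preceding contexts that start with 'a'
```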
David Wheeler reportedly had the idea in the early 80s, but he rejected it as impractical. Then he tried publishing it with Michael Burrows in the early 90s, but it was rejected from the Data Compression Conference. Algorithms researchers found the BWT more interesting than data compression researchers, especially after the discovery of the FM-index in 2000. There was a lot of theoretical work done in the 2000s, and the paper mentioned by GP cites some of that.
The primary application imagined for the FM-index was usually information retrieval, mostly due to the people involved. But some people considered using it for DNA sequences and started focusing on efficient construction and approximate pattern matching instead of better compression. And when sequencing technology advanced to the point where BWT-based aligners became relevant, a few of them were published almost simultaneously.
I find it ridiculous to require working on a laptop, without being allowed an external monitor, to view potentially very long documents (half a screen!) for hours on a tiny display.
Funnily enough, "learning characters/words in the context of the vocabulary they are in" is exactly how NLP machine learning models learn "rich" word/text representations, based on the "distributional hypothesis", which states that words occurring in similar contexts tend to have similar meanings.
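A toy illustration of that hypothesis with a plain co-occurrence matrix and cosine similarity (made-up corpus; real models like word2vec or BERT learn dense vectors, but the signal they exploit is the same):

```python
import numpy as np

sentences = [
    "the cat chased the mouse",
    "the dog chased the ball",
    "the cat ate the fish",
    "the dog ate the bone",
    "the car needs new tires",
]
vocab = sorted({w for s in sentences for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))

window = 2  # symmetric context window
for s in sentences:
    words = s.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                counts[idx[w], idx[words[j]]] += 1

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "cat" and "dog" appear in near-identical contexts, "car" does not.
print(cosine(counts[idx["cat"]], counts[idx["dog"]]))  # ~1.0
print(cosine(counts[idx["cat"]], counts[idx["car"]]))  # ~0.54
```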
From the first link: "Comparing OFDM to LTE today we find a better scalability to a much lower latency (an order of magnitude lower round-trip time [RTT] than LTE today) in OFDM."
Doesn't LTE already have quite good latency properties?
As I understand it, not good enough for "vehicular communication, industrial control, factory automation, remote surgery, smart grids and public safety applications" [1]. A lot of the improvement is down to the frame structure (i.e., how user and training data are distributed in time and frequency). See "Frame structure" in [1].
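As a rough back-of-envelope (my own illustration, not taken from the article): the useful OFDM symbol duration is roughly the inverse of the subcarrier spacing, so widening the spacing shrinks the minimum scheduling unit the frame structure can offer, which is one ingredient of the lower air-interface latency.

```python
# Illustrative only: useful OFDM symbol length ~ 1 / subcarrier spacing
# (cyclic prefix ignored). LTE is fixed at 15 kHz; newer numerologies scale it up.
for spacing_khz in (15, 30, 60, 120):
    symbol_us = 1e3 / spacing_khz          # symbol duration in microseconds
    slot_ms = 14 * symbol_us / 1e3         # a 14-symbol slot, the basic scheduling unit
    print(f"{spacing_khz:>3} kHz spacing: symbol ~ {symbol_us:5.1f} us, slot ~ {slot_ms:.2f} ms")
```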
One data point: LTE gives me 65-90ms ping RTT in reasonable non-crowded indoor conditions. Subjectively, working interactively via LTE is totally fine for me. Compare that to UMTS (3G T-Mobile DE) where I think RTT was more like 350ms and working interactively is a pain.
I thought the NVIDIA drivers for the fancier cards (Titan etc.) were the same as for the GeForce cards. Wouldn't this restriction apply to those cards as well? Doesn't make much sense to me...
They can't price-differentiate on FP64 compute, since ML uses FP32 or even FP16. They tried discriminating on FP16 performance, but frameworks switched to using the FP32 units and downconverting to FP16 afterwards. They can't kill FP32 performance, since that's used for gaming. They tried killing virtualization, they tried differentiating based on clustering, they tried every reasonable technical measure. So now they're falling back on legal means to defend an artificial price distinction that has no reflection in card features anyone cares about.
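A rough sketch of that workaround as I understand it (not any framework's actual code): keep tensors in FP16 for memory and bandwidth, but route the arithmetic through the FP32 units and downconvert afterwards.

```python
import torch

# Half-precision storage (saves memory and bandwidth)...
w = torch.randn(1024, 1024).half()
x = torch.randn(64, 1024).half()

# ...but the actual multiply runs on the full-speed FP32 units,
# with the result cast back down to FP16 afterwards.
y = (x.float() @ w.float()).half()
```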
In the future, Teslas will have Tensor Cores and GeForces won't, so deep learning will be much faster (but also much more expensive, so it kind of cancels out) on Tesla cards.
Yeah, doesn't it seem a little weird that they seem to want to enforce the distinction between gaming and industrial cards, but they also went ahead and included tensor cores in that new Titan card?
Did they not think that through and have an 'oh shit' moment when they saw all the news articles or something? Or is this a 'the first hit is free' sort of deal where they want people to learn about using the product on a startup budget, without giving up the ability to squeeze if someone has a good idea and wants to scale?
Titan V is so expensive that it's not cannibalizing anything. Once the GeForce 1180 comes out... they'll make it just slow/expensive enough that it still doesn't cannibalize anything.
They have just released the new Titan V, which has Tensor Cores, I believe. That would indicate that they do want to include them in non-workstation/dedicated-ML cards, no?
The Titan V is a compute card, the same way the Titan Xp and Titan Xs were. The differentiating factors between them and the xx80 [Ti] of the corresponding generation were memory bandwidth and capacity; gaming performance was nearly identical to the GeForce. Titans are entry-level compute.
They want to sell two fungible products while enforcing a pricing hurdle so that certain types of customers have to pay much more for the same performance.
I was at the SIGIR'17 presentation of this paper (it won the best paper award, btw) and have some general comments:
- They mentioned (from what I remember) that they now use BitFunnel as the core of the complete Bing search engine, not just the fresh parts.
- When I read the paper and looked at the code, it looks like their index doesn't include frequency information, whereas your PEF code does. It is unclear what was counted in the experiments.
- If you look at the code, they are actually doing much more complicated stuff than regular Bloom filters, by "bin packing" the hash positions for each term to reduce false positive rates (see https://github.com/BitFunnel/BitFunnel/issues/278 ; a toy sketch of the underlying bit-sliced idea follows after this list). I'm not sure if it is "fair" to compare a system developed by 10+ engineers over many years to a "PhD student" code base developed over a short period of time. I think the PEF code is excellent; my point is more that engineering effort can have a large impact on performance.
- I'm fairly sure you are right regarding the lack of URL-sorting. However, there may be a reason for that choice. Consider Figure 4 in the paper, which shows how "higher ranking rows" group documents together to allow faster intersection. URL sorting causes clusters in document ids: in the example of Fig. 4, there might be a cluster for that specific term covering documents 0, 1, 2, 3. This would mean the "higher ranking" row approach becomes worse (more false positives) when clustering occurs in the collection. So while URL-sorting helps PEF, it will most likely make BitFunnel worse.
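For readers who haven't seen the underlying structure, here is that toy bit-sliced signature sketch (the textbook idea BitFunnel refines with row ranks and bin-packed hash positions, not their actual layout), mainly to show where false positives come from:

```python
# Toy bit-sliced signature index: each term hashes to K rows of a bit matrix
# with one column per document; a query ANDs its terms' rows, and every column
# left standing is a candidate. Some candidates are false positives, which is
# what the density / row-rank machinery in the paper is about controlling.
import random

NUM_DOCS, NUM_ROWS, K = 64, 128, 3
matrix = [[0] * NUM_DOCS for _ in range(NUM_ROWS)]

def rows_for(term):
    rng = random.Random(term)                    # stand-in for K hash functions
    return [rng.randrange(NUM_ROWS) for _ in range(K)]

def add(term, doc):
    for r in rows_for(term):
        matrix[r][doc] = 1

def candidates(terms):
    """Documents whose column has a 1 in every row of every query term."""
    docs = set(range(NUM_DOCS))
    for t in terms:
        for r in rows_for(t):
            docs = {d for d in docs if matrix[r][d]}
    return docs

postings = {"cat": {1, 5, 9}, "dog": {5, 17}, "fish": {9, 33}}
for term, docs in postings.items():
    for d in docs:
        add(term, d)

# True intersection is {5}; anything extra would be a false positive that a
# real engine has to filter out against the actual postings.
print(candidates(["cat", "dog"]))
```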
> They mentioned (from what I remember) that they now use BitFunnel as the core of the complete Bing search engine, not just the fresh parts.
I find it hard to believe this. Their main index is certainly not all-RAM (there must be some flash and maybe even disk), and the throughput would just not be enough for something like BitFunnel.
> When I read the paper and looked at the code, it looks like their index doesn't include frequency information, whereas your PEF code does. It is unclear what was counted in the experiments.
In PEF the frequencies are not interleaved with the postings, so if you don't read them you don't pay any computational overhead (they mention this in the paper). However, it's not clear whether they included them when measuring the space.
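To spell out the layout point (my own toy contrast, not PEF's actual encoding):

```python
# Interleaved layout: (docid, freq) pairs -- a docid-only traversal still has
# to step over the frequency values.
interleaved = [(3, 1), (7, 4), (19, 2)]

# Non-interleaved layout (conceptually, as described above): two parallel
# sequences, so intersecting on docids never touches the frequency data.
docids = [3, 7, 19]
freqs  = [1, 4, 2]

# Frequencies are only looked up for documents that survive the intersection:
result = [d for d in docids if d in {5, 7, 9}]        # pretend intersection
scores = {d: freqs[docids.index(d)] for d in result}  # touch freqs only now
```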
> I'm not sure if it is "fair" to compare a system developed by 10+ engineers over many years to a "PhD student" code base developed over a short period of time.
I'm not trying to compare the code :) On the contrary, I'm mostly concerned about the behavior as the collection size grows. Gov2, especially if split into 5 pieces, is relatively small.
> So while URL-sorting helps PEF, it will most likely make BitFunnel worse.
That's possible, but I don't see why they could not use different docid orderings for BitFunnel and PEF. If they use the one that is better for BitFunnel, that's not fair to PEF.
> I find it hard to believe this. Their main index is certainly not all-RAM (there must be some flash and maybe even disk), and the throughput would just not be enough for something like BitFunnel.
From looking at the GitHub repo, it does look like the system runs entirely in main memory.
[1] https://aclanthology.org/D15-1288.pdf