Inverted Transformers Are Effective for Time Series Forecasting (arxiv.org)
206 points by beefman 12 months ago | 38 comments



Gah, this paper is hard to read, but here's my understanding:

Let's say you have 100 intersections, and you want to predict the traffic on each in cars/sec. You sample every hour, and you keep 24 hours of context, and try to predict the next 4.

First, you'd make 100 "tokens" (really stretching the meaning of token here), one for each stoplight, loading 24 samples (the history of that stoplight) into each token, and normalizing.

Next, you run each token through a Multi-Layer Perceptron (vanilla, old-school neural network) to make a vector of dim D.

Next, for each layer of the transformer, you:
1. Perform attention across the 100 tokens, i.e. the query/key/value dance (mechanically it's ordinary self-attention, just over series rather than time steps). This is how the different time series (erm, tokens) get to share information.
2. Apply layer norm to each token.
3. Run another bog-standard MLP independently on each token. This is the opportunity to examine the history of each time series.
4. Apply layer norm again.

Then, you map each "token" (ugh) from being D-dimensional to 4-dimensional, so for each stoplight it predicts the traffic ahead for the next 4 hours. This is also a regular MLP.
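
For concreteness, here's a minimal PyTorch sketch of that flow as I understand it (module names, hidden size, and the single linear embedding/projection are my own choices, not the authors' official code):

    import torch
    import torch.nn as nn

    N, T, H, D = 100, 24, 4, 256   # series (stoplights), lookback, horizon, hidden dim

    class InvertedBlock(nn.Module):
        def __init__(self):
            super().__init__()
            self.attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
            self.norm1 = nn.LayerNorm(D)
            self.mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))
            self.norm2 = nn.LayerNorm(D)

        def forward(self, x):                   # x: (batch, N, D) -- one token per series
            a, _ = self.attn(x, x, x)           # step 1: tokens (series) share information
            x = self.norm1(x + a)               # step 2: layer norm
            x = self.norm2(x + self.mlp(x))     # steps 3-4: per-token MLP, then norm again
            return x

    embed = nn.Linear(T, D)                     # whole 24-step history -> one D-dim token
    layers = nn.Sequential(*[InvertedBlock() for _ in range(2)])
    head = nn.Linear(D, H)                      # D-dim token -> next 4 hours

    x = torch.randn(8, N, T)                    # (batch, series, lookback); input normalization omitted
    y = head(layers(embed(x)))                  # (batch, series, horizon)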

So specifically, if you're only predicting a single time series (one stoplight), this method is equivalent to running a regular neural network.

It also, interestingly enough, skips the cool sinusoidal position embedding that transformers use to embed token position. Fair enough, since here the time dimension is fixed and the index of the feed-forward neurons in each MLP layer corresponds (roughly) to the time index of the sample.

The architecture looks weird to me, but apparently it works so that's cool! But I'm not sure how well it works, and my unscientific gut feel is that there's a better and simpler architecture crying out to be found, because this looks a bit tortured. Like, nothing in it explicitly models the time dimension - that task is left to the MLPs - and that seems weird.


I had a startup a few years ago that was in the “eh we’ve got some money left from our BigTech days, let’s buy a lottery ticket that’s also a master's degree” category.

And in late 2018, attention/transformers was quite the risqué idea. We were trying to forecast price action in financial markets, and while it didn’t work (I mean really Ben), it smoked all the published stuff like DeepLOB.

It used learned embeddings of raw order books, passed through a little conv widget to smooth a bit, to build learned embeddings of order book states, before passing them through bog-standard positional encoding and multi-head masked self-attention.

This actually worked great!
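
Roughly that kind of pipeline, as a hedged PyTorch sketch (book depth, dimensions, and the conv "widget" are all my guesses, not their actual model):

    import math
    import torch
    import torch.nn as nn

    B, S, F, D = 32, 200, 40, 128   # batch, book-state sequence length, book features, model dim

    smooth = nn.Conv1d(F, D, kernel_size=5, padding=2)   # small conv to smooth raw book features
    encoder = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)

    def sinusoidal(S, D):
        pos = torch.arange(S).unsqueeze(1)
        div = torch.exp(torch.arange(0, D, 2) * (-math.log(10000.0) / D))
        pe = torch.zeros(S, D)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    books = torch.randn(B, S, F)                          # sequence of raw order book snapshots
    x = smooth(books.transpose(1, 2)).transpose(1, 2)     # (B, S, D) smoothed learned embeddings
    x = x + sinusoidal(S, D)                              # bog-standard positional encoding
    mask = torch.triu(torch.full((S, S), float("-inf")), diagonal=1)   # causal: only look back
    h = encoder(x, src_mask=mask)                         # multi-head masked self-attention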

The thing that kills you is trying to reward-shape on the policy side to avoid getting eaten by taker fees; with artificially lowered fees it's a broken ATM.


Interesting, I'm trying to understand (much less knowledgeable about finance than ML, heh.) But it sounds like you fed it the raw order books (no time dimension), a sequence of order states corresponding to each (a time series), mapped them into the embedding dimension of a decoder-only transformer (the masking), and trained it to predict logits for the next order state?

See, that makes way more sense to me, since it sounds like you used causal self-attention, and actual position embeddings.

I've been interested in some time series stuff, like position embeddings to model actual wall-clock time offsets rather than sequence index, but for textless NLP rather than trading.
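
For example, here's a toy sketch of sinusoidal embeddings keyed on wall-clock offsets in seconds rather than on integer positions (my own illustration, not a reference implementation):

    import math
    import torch

    def time_embedding(offsets_sec, dim=64, max_period=86400.0):
        # sinusoidal embedding of continuous time offsets (seconds), not token indices
        t = offsets_sec.unsqueeze(-1)                                  # (..., 1)
        freqs = torch.exp(torch.arange(0, dim, 2) * (-math.log(max_period) / dim))
        angles = t * freqs                                             # (..., dim/2)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

    # irregularly spaced events: the embedding depends on elapsed time, not on position index
    offsets = torch.tensor([0.0, 0.4, 0.9, 7.5, 3600.0])
    emb = time_embedding(offsets)                                      # (5, 64)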


XTX markets seems to be doing something in the same genre as this. As I understand it, they are mostly taker.


Why didn't you start a hedge fund?


They did say it didn't work. The overwhelming majority of finance stuff in published work doesn't work, because it's either too simplistic, poorly backtested, or exploited too quickly, so beating it doesn't imply you can run a hedge fund.

The main part here is that it's one thing to predict price action, it's another thing to trade profitably - and in particular they were not able to beat fees, which is a common hurdle if you're new to HFT.


Basically this. We were heavy infra pros and my cofounder was an HFT veteran so it wasn’t classic implementation shortfall so much as we didn’t solve the “do we enter” threshold on what would be a friction-free windfall.


What they describe looks like a single predictor. You can't create a strategy with a single predictor, unless it's incredibly predictive. 99% of the time, a predictor cannot beat its transaction costs alone.

You need to combine hundreds of such predictors to be able to beat costs and have a net profitable strategy.

We have a saying in French that you need a lot of rivers to create a sea.
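
As a toy illustration of that "many rivers" point, here's a hedged sketch that ridge-combines many weak, made-up signals and only trades when the combined edge clears a fee-sized threshold (all numbers invented):

    import numpy as np

    rng = np.random.default_rng(0)
    T, K = 5000, 200                        # time steps, number of weak predictors
    returns = rng.normal(0, 1e-3, T)        # the future returns we want to predict
    signals = 0.02 * returns[:, None] + rng.normal(0, 1e-3, (T, K))   # each only barely predictive

    # ridge-combine the weak signals into one forecast (in-sample here for brevity)
    lam = 1e-4
    w = np.linalg.solve(signals.T @ signals + lam * np.eye(K), signals.T @ returns)
    forecast = signals @ w

    fee = 2e-4                              # round-trip taker fee, expressed as a return
    trade = np.abs(forecast) > fee          # only trade when the combined edge clears the fee
    pnl = np.sign(forecast[trade]) * returns[trade] - fee
    print(trade.mean(), pnl.sum())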


So the group involved veterans from like Knight and DRW and stuff: we understood the model of combining lots of small signals with a low-latency regression.

We were trying to learn those signals as opposed to sweat-shop them.

But the broader point holds: signal isn’t alpha.


Wasn't the US housing crisis of the late 2000s caused by that 99% threshold?

Not in finance at all but I do use reverse Kalman filters, to which this seems similar in core concepts.

While reverse Kalman filters are incredibly helpful in reducing cloud spend by predicting when to autoscale, you still have to have metrics to quickly recover from mistakes.
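
For reference, a generic 1-D Kalman filter predict/update loop on a load metric looks like the sketch below (a plain forward filter; whatever "reverse" variant the parent means may differ):

    def kalman_1d(observations, q=1e-3, r=1e-1):
        # filter a noisy load metric; the one-step prediction can drive scale-up decisions
        x, p = observations[0], 1.0         # state estimate and its variance
        preds = []
        for z in observations[1:]:
            x_pred, p_pred = x, p + q       # predict: random walk, uncertainty grows by q
            preds.append(x_pred)
            k = p_pred / (p_pred + r)       # Kalman gain
            x = x_pred + k * (z - x_pred)   # update: blend prediction with new observation
            p = (1 - k) * p_pred
        return preds

    smoothed = kalman_1d([100, 120, 115, 180, 240, 230])   # e.g. requests/sec samples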

Based only on tech interviews with HFT companies, I would assume someone could predict your moves using these methods based on historical data.

But perhaps I am just too risk averse or am missing the core concept.


You might be referring to the Gaussian copula bullshit Dave Li did? [1]

[1] https://en.m.wikipedia.org/wiki/David_X._Li


Doesn't this presuppose that all the information you need to predict the future of your time series is embedded in the past of those time series?

Don't most time series we would be interested in predicting (weather, prices, traffic volumes) tend to respond to things outside the history of the time series in question?

Or is the thesis here that we throw every random time series we can think of - wave height series from buoys in the San Francisco Bay, ticket sales from Taylor Swift concerts, Teslas per hour in the Holland Tunnel, sales volume of MSFT... - at it and get this thing to find the cross-correlated leading indicators needed so it can predict them all?


> Doesn't this presuppose that all the information you need to predict the future of your time series is embedded in the past of those time series?

Yes. But usually this is a workable assumption: the causes might not be present in your data, but the model should at least learn not to be overconfident.

> Don't most time series we would be interested in predicting (weather, prices, traffic volumes) tend to respond to things outside the history of the time series in question?

Yes and no.

You really want the forecast to be a probability distribution: 95% of the time it will take you X minutes to get home from work if you leave at 17:30 but 5% of the time there will be disruptions.
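
One common way to get such a distribution out of a learned forecaster is to train several quantile heads with the pinball loss; a hedged sketch (quantile levels and numbers are made up):

    import torch

    def pinball_loss(pred, target, q):
        # quantile (pinball) loss: penalizes over- and under-prediction asymmetrically
        err = target - pred
        return torch.maximum(q * err, (q - 1) * err).mean()

    # e.g. three heads predicting the 5th, 50th and 95th percentile commute time (minutes)
    quantiles = [0.05, 0.5, 0.95]
    preds = torch.tensor([22.0, 31.0, 55.0])    # model outputs
    target = torch.tensor(34.0)                 # what actually happened
    loss = sum(pinball_loss(p, target, q) for p, q in zip(preds, quantiles))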


A big part of it is historical dice tosses that create the mirage of data just waiting to be tamed.


"I have seen the future and it is very much like the present, only longer."


Thank you! Thank you for explaining it in simpler terms. I get about 5% out of these papers, but I got a lot more out of this breakdown.


I find crossformers easier to track:

https://openreview.net/forum?id=vSVLM2j9eie


It seems that Crossformer has a very large number of tokens (patches as tokens). The authors of this paper argue that one token per variable is sufficient, and that it is then natural to use attention to describe the overall relationships among these individual series.
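
A quick back-of-the-envelope count with made-up numbers shows why patch-as-token blows up the sequence length attention has to handle, compared with one token per variable:

    N, T, P = 100, 96, 16               # variables, lookback, patch length (made-up values)
    patch_tokens = N * (T // P)         # one token per patch per variable -> 600
    variate_tokens = N                  # one token per variable -> 100
    # if attention ran over all tokens at once, cost would scale with the square of this count
    print(patch_tokens, variate_tokens, (patch_tokens / variate_tokens) ** 2)   # 600 100 36.0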


I think this is a very similar concept to TiDE (https://arxiv.org/abs/2304.08424), which came before and is cited in the paper mentioned in this post. I didn't read through the paper, so I can't point out the differences in approach yet.

However, just looking at this post's paper's results, it seems that at least for TiDE they report results completely different from the original paper. It looks like cherry-picking of a particular configuration, as the delta is a bit too large to blame on reproducibility issues alone.


In the TiDE paper, the input sequence length is tuned, while this work uses a uniform input length.


Cool! I see it has been implemented in tslib (https://github.com/thuml/Time-Series-Library); the results seem promising when I reproduce the experiments.


I also agree that modeling the time dimension with an MLP can be more reasonable than self-attention: it learns weightings over time points that share the same physical meaning.


Really, really great write-up!

Thank you and yes this is very exciting

People are starting to really decompose the transformer architecture and I’m excited to see how far it can go


Hello everyone! The iTransformer authors have released their official code implementation here (https://github.com/thuml/iTransformer).


I’m not a ML person so forgive my ignorance here.

It looks interesting, but I’m slightly confused about the way this is presented. It feels like it’s coming from the wrong angle.

Specifically, reducing a time series to a sequence of patterns and trying to predict what happens next is something that has been done for decades in some form or another. To me the unique aspect of this is that it fits the approach into a transformer.

So I’d expect to see comparisons against other approaches that do the same thing, not against other transformer approaches.

I wouldn’t be confused if the title was “Inverted Transformers are MORE EFFECTIVE THAN NORMAL TRANSFORMERS For Time Series Forecasting”.

However, if the target audience are transformer folk then it makes sense, it just seems that I’m looking at it from the other direction.


The equivalence relationship between efficient AI and universal sequence prediction has been known for decades, so it would be surprising if AI algorithms were poor at sequence prediction. Of course, optimal universal sequence prediction is profoundly intractable and memory hard, which has implications for limits of AI efficiency and scalability.

There used to be a small hobbyist subculture on the Internet in the late 1990s that designed highly efficient approximate universal sequence predictor algorithms for the challenge of it. Now that AI is a thing, I've often wondered if there were some lost insights there on maximally efficient representations of learning systems on real computers. Most of those people would be deep into retirement by now.


There’s nothing more fun than dusting off fossilized proto-AI work and running it on modern hardware.

Why don’t you share some citations?

I always enjoyed tracking down outre typewritten connectionist manuscripts from an author who had more time than compute.


Is there anything left of their output in the Internet archive? It is an interesting subject to explore.


I think we're living in a world where deep learning is winning so consistently that comparison to other methods is often just a time suck. It would be nice to provide a non-DL approach as a baseline, but I would expect it to lag behind the DL methods.

Furthermore, often pre-DL methods can be recast as hand-tuned special cases of DL models - some sequence of linear operations with hand-picked discontinuities sprinkled around. If you can implement the pre-DL method using standard neural network components, then gradient descent training of a neural network "should" find an equivalent or better solution.
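
For instance, a plain AR(p) forecaster is literally a single linear layer over the last p values, so it already lives inside the hypothesis space a neural net searches; a toy sketch (coefficients invented):

    import torch
    import torch.nn as nn

    p = 24
    ar_as_net = nn.Linear(p, 1)                  # y_hat = w1*x[t-1] + ... + wp*x[t-p] + b

    # hand-picked AR coefficients can be loaded directly as the layer's weights
    with torch.no_grad():
        ar_as_net.weight.copy_(torch.full((1, p), 1.0 / p))   # a simple moving-average-style AR
        ar_as_net.bias.zero_()

    window = torch.randn(1, p)                   # last p observations
    forecast = ar_as_net(window)                 # same arithmetic, expressed as a DL component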


Deep learning models are not better for vast problem areas which have analytical design algorithms. Deep learning's succession of triumphs has been across areas where analytical design has proven difficult.

First, there are many optimal, or near optimal, direct design algorithms for systems that are well characterized. These solutions are more concise, easier to analyze, reveal important insights, and come with guarantees regarding reliability, accuracy, stability, resource requirements, and operating regimes. Clear advantages over inductively learned solutions.

Second, just assuming that new algorithms are better than older algorithms is completely irrational. An anathema to the purpose and benefits of science, math, and responsible research in general.

If you are going to propose new algorithms, you need to compare the new algorithm against the previous state of the art.

Otherwise practitioners and future researchers will be driven into dead ends, deploy pointlessly bad designs, forget important knowledge, and worst of all, lose out on what older algorithms can suggest for improving newer algorithms. With no excuse but gross carelessness.


This is something that DL researchers like to think, but it is definitely not true for time series forecasting. See https://forecastingdata.org/ for some examples where simple non-DL approaches beat state-of-the-art DL systems.


> I think we're living in a world where deep learning is winning so consistently that comparison to other methods is often just a time suck.

This is quite untrue. DL methods work well when there’s a lot of data in closed domains. DL works well by learning from corpuses of text and media where it can make reasonable interpolations.

When you don’t have enough data and you don’t have a known foundational model that you can do zero shot from, DL doesn’t work better than simpler conventional methods.


> It would be nice to provide a non-DL approach as a baseline, but I would expect it to lag behind the DL methods.

The M# competitions have usually shown very old forecasting algorithms work quite well, with frankly, way less training overhead and data. Ensemble models usually do best, but for a lot of use cases, DL is probably overkill versus ARIMA or triple exponential smoothing.
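
For example, a triple-exponential-smoothing baseline is only a few lines with statsmodels (synthetic hourly data with daily seasonality; parameter choices are purely illustrative):

    import numpy as np
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # synthetic hourly series with a trend and a 24-hour cycle, standing in for real data
    t = np.arange(24 * 60)
    series = 10 + 0.01 * t + 3 * np.sin(2 * np.pi * t / 24) + np.random.normal(0, 0.5, t.size)

    # triple exponential smoothing: level + trend + 24-hour seasonality
    fit = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=24).fit()
    forecast = fit.forecast(4)    # next 4 hours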


DL also doesn't win at medium-scale tabular data. This paper discusses why, and how DL might do better (if it indeed can, with limited-size data):

Why do tree-based models still outperform deep learning on tabular data?

https://arxiv.org/abs/2207.08815


Sounds like the basic idea is as follows:

Typical transformers apply self-attention between tokens that vary across time. So the entries of the resulting attention (correlation) matrix are basically dot products between pairs of moments in time.

The iTransformer authors seem to be saying that, for certain time series forecasting tasks, it's not correct to assume that embedding channels of tokens across moments in time represent data that was collected at precisely the same moment or with similar instruments. In reality, different varieties of data are sometimes not precisely aligned in a data set and also have very different distributions relating to how the data was collected.

So the iTransformer model proposes to apply self-attention across embedding channels instead of across time. Self-attention otherwise seems to work in the same way. Query and key matrices are calculated but they project each embedding channel separately instead of projecting a collection of channel values at a single moment. Then the query-key calculation finds the degree to which all the entirely independent time series (embedding channels) are correlated. Those correlations are used to weight the value vectors and obtain new embedding channels that are weighted averages.

Then the feed-forward layer projects each channel independently, instead of projecting across channels as it would do in a standard transformer model.

Also, since layer normalization acts within an embedding channel, they claim that this can reduce noise that would result from normalizing data across channels that were collected using different methods. The distribution characteristics of each channel stay within the channel instead of bleeding across channels and potentially deleting information.
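
In shape terms (a sketch with made-up sizes, not the authors' code): the vanilla encoder tokenizes along time, the inverted one along channels, so attention mixes whole series instead of whole timestamps.

    import torch

    B, T, C, D = 8, 96, 21, 256     # batch, time steps, variates/channels, model dim
    x = torch.randn(B, T, C)

    # standard: T tokens, each embedding the C channel values at one timestamp
    std_tokens = torch.nn.Linear(C, D)(x)                    # (B, T, D); attention is T x T over time
    # inverted: C tokens, each embedding one channel's whole T-step history
    inv_tokens = torch.nn.Linear(T, D)(x.transpose(1, 2))    # (B, C, D); attention is C x C over series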

They lay out more of their reasoning for taking this approach in the paper and I feel like I agree with their intuitions. But the paper needs some serious proofreading. It's very hard to parse the verbiage.


Impressive. What turns ratio are we talking about here?


I think they left out “flyback” from the “transformer” bit. So no turns ratio. It’s basically an “inductor”. Which might as well be the name for the next iteration of self-attention networks.


The real question, I think, is autobot or decepticon?



