A decoder-only foundation model for time-series forecasting (research.google)
176 points by jasondavies 11 months ago | 64 comments



The research in this space is very conflicting about what methods actually work. In the graph on the page, the ETS model (basically just a weighted moving average) outperforms multiple recent deep learning models. But the papers for those models claim they outperform ETS and other basic methods by quite a bit.

You can find recent papers from researchers about how their new transformer model is the best and SOTA, papers which claim transformers are garbage for time series and that their own MLP variant is SOTA, other papers which claim deep learning in general underperforms compared to xgboost/lightgbm, etc.

Realistically I think time series is incredibly diverse, and results are going to be highly dependent on which dataset was cherry-picked for benchmarking. IMO this is why the idea of a time series foundation model is fundamentally flawed - transfer learning is the reason why foundation models work in language models, but most time series are overwhelmingly noise and don't provide enough context to figure out what information is actually transferable between different time series.


> Realistically I think time series is incredibly diverse, and results are going to be highly dependent on which dataset was cherry-picked for benchmarking

That's exactly right. It is barely one step above playing "guess which number I'm thinking of" and acting amazed that, if you play long enough, you'll witness an occasional winning streak.

My god, the model has learned to read your mind! ;)

This smacks of when very serious Soviet scientists ran telekinesis experiments, and of all manner of cold reading and charlatans. https://en.wikipedia.org/wiki/Telekinesis

Somebody should come up with a decoder-only foundation model for bending-spoons.

Interesting reading: https://www.kaggle.com/competitions/m5-forecasting-accuracy/...


I feel the same way for the most part.

However, I can imagine a kind of meta-learning foundation model that basically has a huge internal library of micro-features, and when you put a sequence into it, it matches those features against the sequence and builds up a low-noise summary of the data that it can use to make predictions.

That's of course heavily anthropomorphized, but it seems potentially in-scope for a transformer model.

The real problem with time series data is that you can't predict the future. Images and text are relatively homogeneous and exist within a kind of restricted space. "Time series" in general however could be just about anything, and there's not as much reason to believe that something like a "grammar of time series" even exists beyond what we already can do with STL etc.


> incredibly diverse, and results are going to be highly dependent on which dataset was cherry-picked for benchmarking

This naturally leads to a multi-model solution under one umbrella: a sort of MoE, with a selector (router, classifier) and specialized experts. If something can't be handled by the existing experts, train another one.
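A minimal sketch of that router-plus-experts idea, assuming Python: the router is some already-fitted classifier (e.g. a scikit-learn-style model), the experts' forecast() interface and the summary features are hypothetical, just to make the shape concrete.

    import numpy as np

    def series_features(y: np.ndarray) -> np.ndarray:
        """Cheap summary features the router can classify on."""
        diffs = np.diff(y)
        spectrum = np.abs(np.fft.rfft(y - y.mean()))
        return np.array([y.mean(), y.std(), diffs.std(), float(spectrum[1:].argmax() + 1)])

    class RoutedForecaster:
        def __init__(self, router, experts):
            self.router = router    # classifier mapping features -> expert index
            self.experts = experts  # list of specialized models with .forecast(y, horizon)

        def predict(self, y: np.ndarray, horizon: int) -> np.ndarray:
            idx = int(self.router.predict(series_features(y)[None, :])[0])
            return self.experts[idx].forecast(y, horizon)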


The point is that it's a fundamentally flawed assumption that you can figure out which statistical model suits an arbitrary strip of time series data just because you've imbibed a bunch of relatively different ones.


As long as you can evaluate the models' output, you can select the best one. You probably have some idea of what you are looking for, so it's possible to check how well each model's output matches it.

The data is not a spherical horse in a vacuum. Usually there is a known source that produces the data, and it's likely the same model works well on all data from that source, or maybe a small number of models. That means that, knowing the source, you can select the model that worked well before. Even if the data comes from alien ships, they are likely to be from the same civilization.

I'm not saying that it's a 100% solution, just a practical approach.
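To make the "evaluate and select" idea concrete, a small sketch assuming each candidate exposes a fit/predict interface (a hypothetical interface, not any particular library): back-test every candidate on the tail of the series and keep the winner.

    import numpy as np

    def select_and_forecast(history, candidates, horizon, holdout=24):
        """Back-test each candidate model on the tail of the series, keep the winner."""
        train, valid = history[:-holdout], history[-holdout:]
        scores = {}
        for name, model in candidates.items():
            model.fit(train)
            scores[name] = np.mean(np.abs(model.predict(len(valid)) - valid))  # MAE on holdout
        best = min(scores, key=scores.get)
        candidates[best].fit(history)              # refit the winner on all the data
        return best, candidates[best].predict(horizon)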


It's a practical approach for serving normalized data, but monitoring systems are most valuable when they make abnormal conditions inspectable; proper modeling of a system has this power.

So while this seems persuasive, it's fundamentally about normal data, which yields little value in extrapolation.


I think the next jump will come from neurosymbolic approaches, merging timeseries with a description of what they are about as input.

You can use that description for system identification, i.e. build a model of how the "world" works. This can be translated into a two-part network architecture: one part is essentially a world-model-informed (physics-informed, as it's often called in the literature) component for the known unknowns; the other is a bounded error term for the unknown unknowns (e.g. dense layers, or maybe dense layers + non-linearities that capture the fundamental modes of the problem space, for reservoir computing). The world model is revised with another, external cycle of meta-learning, via symbolic regression.
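A bare-bones sketch of that two-part split, assuming PyTorch; the learnable linear world model and the tanh-bounded residual here are stand-ins for whatever the system identification actually gives you.

    import torch
    import torch.nn as nn

    class PhysicsInformedForecaster(nn.Module):
        def __init__(self, state_dim: int, hidden: int = 32, dt: float = 0.1):
            super().__init__()
            # Known unknowns: a world-model-informed step, here a learnable linear ODE x' = A x.
            self.A = nn.Parameter(torch.zeros(state_dim, state_dim))
            # Unknown unknowns: a small, bounded error term.
            self.residual = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, state_dim))
            self.dt = dt

        def forward(self, x):                           # x: (batch, state_dim)
            physics = x + (x @ self.A.T) * self.dt      # one Euler step of the world model
            correction = torch.tanh(self.residual(x))   # error term, bounded by tanh
            return physics + correction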

The unknown-unknowns bit that I chose is designed as a shallow network that can be trained online (traditional methods, but I'd like to see if the forward-forward algorithm from Hinton would work well for short-term online adjustments), or by well-known tools like particle filters / Kalman filters.

The non-linearities (and the overall approach in general) resemble physics-informed dynamic mode decomposition (piDMD), which show remarkable resistance to noise (e.g. salt & pepper).

If you have simple timeseries, and not very complex hierarchical systems that change over time and show novel modes that you haven't encountered before, then piDMD is likely enough for what you need.

---

Essentially, what I describe is a multi modal model for timeseries + a planning step. (AlphaGeometry to the rescue?)

---

Like you say, time series in general have an incredibly complicated domain with comparatively little data available.

For example, real-world complex physical systems (industrial plants, but also large-scale software systems) may have replicas of the same components with complex behavior, and no/few shared dependencies for reliability or other constraints (e.g. physically apart). These can be captured by transformers. Training will be much faster if you initialize weights like I describe above and share weights among replicas. The physical structure also creates particular conditions on the covariance matrices and on the domain of higher-level time series (ultrametric spaces, which change how measurement and frequency behave there and can lead to great simplifications, but also to errors if tools like the FFT are applied blindly without proper adjustments; much like in operations research / planning problems, symmetries are sought after to reduce complexity).

On the other hand, the next level that composes these building blocks often has graph structure, and sometimes scale-free networks (e.g. if they represent usage or behaviors rather than physical systems). I think we'll see graph neural networks shine on this front.

There are likely other kinds of behavior that I haven't encountered yet in my work.

I think overall we'll see planning/neurosymbolic approaches used at the highest layer, graph neural networks for scale-free networks and to optimize long-range connections (also when a dense model with dynamic covariance matrices would be too expensive to compute even in sparse form), and transformers or/with piDMD-like approaches for dense patches of complex behavior. I.e. graph models as a generalization of spatial locality to arbitrary spatial-like domains, and transformers/piDMD or similar for sequence-/time-locality in arbitrary complex systems. (I wonder what kind of weights they will implement when trained together on problems that are fundamentally in the middle, where traditionally one would use wavelets... if you look at the GraphCast model by DeepMind for weather forecasting, it looks quite similar.)


"real-world complex physical systems (industrial plants, but also large-scale software systems) may have replicas of the same components with complex behavior, and no/few shared dependencies for reliability or other constraints (e.g. physically apart). These can be captured by transformers."

Could you elaborate on this please? On why the transformer architecture lends itself well to this?


The general idea is to exploit the structure of the system. Use it to pre-initialize connections (e.g. covariates, information flow as layer connections) between different parts, rather than learning them from data: make your network resemble the physical system it is modeling.

You can try to pre-train a transformer to capture the behavior of a common part that is replicated, and then make replicas (sharing weights or not, depending on the problem) to train the whole ensemble. It works both for existing systems and for sufficiently high-fidelity simulations, or proxy systems that show the same range of behaviors (e.g. staging environments).

Even if the pre-trained network part doesn't converge fully or capture everything, it can pre-condition the network and help train the whole ensemble faster.

---

For simple components, you can even just write your own simulations as custom NN layers (e.g. recurrent layers that take the current state and input, and return the output and next state). This helps avoid the performance bottleneck of leaving the accelerator for simulations, or having to train too many small networks.

I'd generally just write my own recurrent layer, if the behavior is simple enough.

But you can also use existing code and tweak it cleverly: e.g. LSTM cells can be pre-initialized to implement continuous-time Markov chains, as a birth/death renewal process.

You can capture the behavior of a simple component in isolation, then use it in the whole: either freezing it and adding an error-correction layer (e.g. if the frozen part is quite big and replicated, you can share weights more efficiently), or not freezing it and letting it train further.

You can impose bounds on the complexity of the error correction: much like LoRA, you can design it as a low-rank matrix decomposition. Together with the right loss (e.g. L1 or Huber), it's another technique to ensure that the error correction doesn't drift too far from the behavior you can expect from the physics of the system (and when it no longer converges, that's a good indicator that you have model drift and new behavior is coming up... that's a way to implement robust anomaly detection).
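For what it's worth, a sketch of the frozen-component-plus-low-rank-correction idea in PyTorch; the class and parameter names are mine, not from any paper.

    import torch
    import torch.nn as nn

    class LowRankCorrection(nn.Module):
        def __init__(self, base: nn.Module, dim: int, rank: int = 4):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False               # freeze the pre-trained component
            # LoRA-style low-rank residual: correction(x) = x @ A^T @ B^T
            self.A = nn.Parameter(torch.randn(rank, dim) * 0.01)
            self.B = nn.Parameter(torch.zeros(dim, rank))  # zero-init: starts as a no-op

        def forward(self, x):                          # x: (batch, dim)
            return self.base(x) + x @ self.A.T @ self.B.T

    # Train only A and B with a robust loss, e.g. nn.HuberLoss(); if that loss stops
    # converging, it's a hint that the system has drifted to new behavior.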

---

PS: I do know about the bitter lesson... the problem with it is that it assumes you can throw more data and more training time at problems, and that the problems are stable or similar to what is in your data; this is not always the case.


There's also a big difference between applications, so anomaly detection has a wider set of working solutions than prediction, if such a thing is even possible.


A little bit of context: basically all of tabular deep learning has been stuck, and SOTA has been tree-based algorithms like CatBoost and XGBoost. This seems like a big step forward towards getting a generalizable deep learning model besides these tree models.


This isn't for tabular data though? Time series transformers have been around for a few years now, see Transformers in Time Series: A Survey https://arxiv.org/abs/2202.07125 and Are Transformers Effective for Time Series Forecasting? https://arxiv.org/abs/2205.13504


Yeah, and from this paper it looks like the tree model (CatBoost, supervised) is still beating their (zero-shot) performance, and they don't show a supervised version yet, but they write at the end that they plan to do so in future work. Will be interesting to see.


Catboost and XGBoost do not need to be replaced


I'm not sure a zero-shot foundational model for time series makes sense at all: if you only look at the numbers, it's the "continue the sequence" game, with nowhere to really apply your "foundational knowledge".

A one-shot model that takes a prompt like "this sequence is a voltage of a PV-cell every hour" will have a chance though.


> A one-shot model that takes a prompt like "this sequence is a voltage of a PV-cell every hour" will have a chance though.

Such information will be somewhat embedded in the network (a new voltage graph will look similar to the voltage graphs it's seen before), but it should be better to make it explicit.


you are right, but being right is not enough! For a lot of real-world cases, the automation is worth it, despite information problems.

to be clear, I share these concerns


Sounds like a cool model. I'd love to try it but it seems like they're not releasing it (yet?). I've really gotten spoiled with the recent language models where I can download and run any new model or fine-tune I hear about. It's gotten to the point where I don't feel a model is very relevant unless I can run it locally. Hopefully a local version of this becomes available because I have plenty of time series data I'd like to run through it!


What time series data do you have that you'd like to run through it?


Stock market data. But maybe the advantage disappears if other traders have the same model. Perhaps that's why they haven't released it.


I've tried using TimeGPT to predict the SPDR future price and it couldn't make any useful predictions

Maybe this one is better

I still think humans have better pattern recognition in the stock market than neural nets...at least for now


Humans haven’t been in the loop for about two decades now. High frequency quants just use faster, lower latency models than AI stuff.


Only for certain timeframes.


I highly doubt a human can beat the market consistently looking at the timeseries of some ticker alone.


You don't need to beat the market; you just need to beat pure chance to make money.


If you aren't beating the market you're losing money in opportunity cost, because you could've just bought an index fund and made more.

(Also, you don't just need to beat pure chance, because pure chance is not guaranteed — or I think likely — to result in net zero losses on average, since you are then in an information-asymmetric environment where your trading partner looks at data and you do not. But regardless, even if you "make" money but underperform the market, you are effectively losing money in opportunity cost!)


You need to beat fees and 1/2 spread and a few other costs.


Yes, this would be a more complete statement.


What input data did you feed into TimeGPT? Did you fine tune it?


Mine of your business. That's the point of open source.


FYI: lately there has been very interesting univariate time series forecasting work mostly by Albert Gu using state space models.

https://arxiv.org/abs/2111.00396 https://arxiv.org/pdf/2111.00396.pdf


I like tabular ML, but I'm very skeptical about this research direction.

Language models trained on a broad corpus make sense, as language is similar across different domains. Time series, however, are extremely different... Stock prices, heart rates, brain waves, digital signals... Patterns learned on a broad dataset should only introduce uninterpretable noise.


If you are interested in this also check out EarthPT, which is also a time series decoding transformer (and has the code and weights released under the MIT licence): https://arxiv.org/abs/2309.07207


This is cool, but I'm amused that CatBoost still beats it.


Fraction of the cost, way more interpretable.

Still, I suppose we gotta keep an eye on this kind of work in case there's a tipping point.


The fact that the mean absolute error doesn't even change all that much from ARIMA suggests to me that there's no magic here.

Though mean absolute error is a bit of a weird measure. They're basically testing how closely a model predicts the median.


I’m not sure that’s what MAE measures. What’s the median of a time series? What does that even mean?


The MAE is minimized by the median. I share your confusion as to why one would use it for a time series.
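A quick way to see this (a standard argument, nothing specific to this paper): for a constant prediction c over samples y_i,

    L(c) = \sum_i \lvert y_i - c \rvert, \qquad
    \frac{dL}{dc} = \#\{i : y_i < c\} - \#\{i : y_i > c\},

which is zero exactly when half the observations lie on either side of c, i.e. at the sample median; minimizing squared error instead picks out the mean.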

Clearly they had some reason to pick it over others, but I couldn't tell you why. Especially since ARIMA effectively minimizes RMS error.


Bet minute feature engineering + glm beats it comfortably


I'm surprised the transformer model scales here. In NLP the amount of self-attention context you need is maybe a few paragraphs, but with time series you'd need to capture millions of data points in the attention blocks in order to correlate with historical trends and anomalous events like Black Fridays or seizures, etc., right? Otherwise what's the point of using transformers?


Check the article. They have a learned preprocessing step that translates time slices containing multiple data points into tokens, so the transformer is actually predicting larger chunks of time rather than individual time points.
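Roughly (my paraphrase, not their code; the model's actual input tokenizer differs): the series is cut into fixed-length patches and each patch is projected to one token embedding, something like this PyTorch sketch with a plain linear projection standing in for the learned step.

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Turn a (batch, time) series into (batch, num_patches, d_model) tokens."""
        def __init__(self, patch_len: int = 32, d_model: int = 256):
            super().__init__()
            self.patch_len = patch_len
            self.proj = nn.Linear(patch_len, d_model)  # simple stand-in for the learned tokenizer

        def forward(self, series: torch.Tensor) -> torch.Tensor:
            b, t = series.shape                        # assume t is a multiple of patch_len
            patches = series.reshape(b, t // self.patch_len, self.patch_len)
            return self.proj(patches)                  # one token per patch, fed to the decoder

So the attention span is over patches, not individual time points, which stretches the effective context by the patch length.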


Oh, right. I missed the point of that step on first read.


Are there any multimodal models where you can annotate your time series with events in natural language, and have it identify other occurrences of similar events? Or where you can ask it to do a forecast assuming some new event, also in natural language?

I feel like this is what I'd want out of a DL time series NN. For just prediction it seems like overkill.


You could probably annotate a graph, export it as an image, and ask GPT-4V for its best guesses. I suspect it'll work decently well for simpler scenarios.


> "Synthetic data helps with the basics. Meaningful synthetic time-series data can be generated using statistical models or physical simulations. These basic temporal patterns can teach the model the grammar of time series forecasting."

Can someone elaborate on what a grammar means in the context of time series forecasting?


Probably drawing an analogy to how causal pretrained models go through stages of understanding language: words -> grammar -> meaning. Gwern mentions this experience when training character-level RNNs. https://gwern.net/scaling-hypothesis#why-does-pretraining-wo...


Totally. Also basic temporal scale or cyclic properties. It’s kind of mind blowing that the shape of most recorded human patterns is reducible in this way.


It's a bit of an anthropomorphization. I don't believe it has any formal meaning here. The idea is that there are certain kinds of underlying signals and patterns which are common to a wide range of time series data. So if a model is able to learn those signals and patterns, it can look at any time series and, with enough historical data, predict future data without actually updating any model weights. Those signals constitute a "grammar" of sorts.


They probably mean the manifold of the universe


ah yes, the manifold of the universe (god) :)


You could probably consider learning a sine wave to be a "grammar" related to periodic variations. Grammar in this context feels like "what are the core conceptual heuristics that help guide towards faster understanding".


They say they can optionally use some time-dependent covariates (derived from the datetime) -- did anyone figure out if/how they are actually used in this model from the paper?


Decoder-only models from a giant corp, which only they can change. Uh... then a read-only API is installed into the Android stack without options? Where is this going?


It's interesting how they say that their model has high "zero-shot" performance, even though it was trained on millions of examples.

To me, zero-shot would mean a model whose parameters were never tuned through training examples at all...


In this context, "zero-shot" capability usually means "how well it performs without fine-tuning or prompting with in-context examples" after the model is trained and parameters frozen.


Zero-shot means a model trained on one dataset, or for one task, performs well on another dataset, or on another task, without any fine-tuning.


Thanks for the clarification, because that is NOT what zero shot used to mean in AI.


What did it use to mean?


Basically the ability to perform a task on the first try, without any feedback. E.g. a robot that is trained to pick up blocks being told to shoot a basketball through a hoop, and making a reasonable attempt at doing so on the first try (it doesn't have to be successful, but it has to in some way resemble throwing a ball through a hoop rather than picking up blocks).

I can see how the machine learning definition is similar, but the whole idea of retraining and fine-tuning is foreign to a lot of subfields in GOFAI.


So a model trained on one task, performing reasonably well on a related task?


Yeah, sorry, I didn't mention the important difference, which had me surprised. One-shot learning in traditional AI is (to use modern terms) one-shot training. The AI doesn't just do the thing on the first try; it learns from that inference as well and does the thing even better on the second try.

I guess attentional context is almost this, but LLMs don’t update their base model after a one shot inference.


How do you train a robot in GOFAI?


Are they ever going to release the weights?




