Just added this to my reading list. It looks and smells like a promising approach for training and inference at scale directly from relational data, and modeling the relations, instead of from flat records/files with features and labels. It's refreshing to see original work that is not about incrementally tweaking some transformer!
I'm always amazed by how little attention tabular/relational data gets in DL research, despite being far more common in applications than images or unstructured text.
I actually do see DL research on tabular problems come out, but the proposed models generally aren't better than xgboost or similar boosted-tree models on the same dataset, and when they are better it's only marginal and not worth the interpretability tradeoff in the real world. Boosted trees are already really good for most tabular tasks, so it makes sense that most of the focus would be on images and natural language, where deep learning is a big improvement over traditional methods.
Additionally, it seems to me that most of the money spent on research in the DL space goes to places like OpenAI whose main goal is to develop AGI. They see large transformer models as the path to AGI, and once you've got AGI solved the AGI can solve all the other problems.
I started Approximate Labs to tackle this problem! We recently addressed one of the biggest gaps we believe exists in the DL research community for this modality: datasets. Last month we openly released the largest dataset of tables with annotations: https://www.approximatelabs.com/blog/tablib
Feel free to reach out to me / join our discord if you want to talk about tabular data / relational data and its relation to AI~
This is interesting, and smart: using the knowledge present in the normalized relationships / data connectivity as part of the signal for training.
A properly designed relational database is a kind of propositional knowledge base, with each tuple ("row") being a "fact"; it makes sense to mine this as part of learning, since that's how a human would "read" the data.
Nitpicking:
"The core idea is to view relational tables as a heterogeneous graph,
with a node for each row in each table, and edges specified by primary-foreign key
relations."
The word "relation" is used incorrectly here -- the "relation" is the table, the name for the foreign-key relationship is... relationship. The key distinction being that relations (big bags of facts) exist independent of those pre-declared relationships; the relationships can be restructured in any-which-way through queries, with foreign key constraints just being a convention for doing that. That's the key benefit of the relational data model over hierarchical/tree/network/graph databases where the relationships are hardcoded, and the key thing that Codd was getting at in his original paper: relational databases are the theoretically most flexible model of data (apart from just bags of text) because the relationships are derived, not hardcoded.
So what they're describing here is each tuple in the relation as a graph node, with edges defined by the attributes of the relation, which is precisely how one does the inverse (describes graphs as relations). See e.g. RelationalAI's product (https://docs.relational.ai/rel/concepts/graph-normal-form), or how graphs are modeled in e.g. Datalog or Prolog.
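To make that concrete, here's a rough Python sketch (the tables and columns are made up, not from the paper) of both directions: rows plus a foreign-key convention read as a heterogeneous graph, and the same graph stored back as a plain edge relation:

```python
# Hypothetical tables; the point is the shape of the mapping, not the data.
customers = [{"customer_id": 1, "name": "Ada"},
             {"customer_id": 2, "name": "Grace"}]
orders = [{"order_id": 10, "customer_id": 1, "total": 9.99},
          {"order_id": 11, "customer_id": 2, "total": 4.50}]

# Relation -> graph: each tuple becomes a node keyed by (table, primary key),
# and each foreign-key attribute becomes a typed edge.
nodes = {("customers", c["customer_id"]): c for c in customers}
nodes.update({("orders", o["order_id"]): o for o in orders})
edges = [(("orders", o["order_id"]), ("customers", o["customer_id"]))
         for o in orders]  # edge type: orders.customer_id -> customers.customer_id

# Graph -> relation (the inverse direction): the edge set is itself just a
# two-column bag of facts, which is essentially what graph normal form is.
edge_relation = [{"src": src, "dst": dst} for src, dst in edges]

print(len(nodes), "nodes,", len(edges), "edges")
print(edge_relation)
```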
Let me help you: SQL and the whole of predicate logic are Prolog from start to end, and also grammar, and also regex, and also recursion, and also generative.
The thing is to make something actually useful out of it. The stochastic guys figured out how to encode information with their stochastic approach. The relational algebra guys did something with discrete relations, but we've only been scratching the surface with grammars, even though LZW, Sequitur, and the like are grammar tasks.
Markov chains included, we've done very little in this regard. Much to expound on here. Patanjali was on it 3,000 years ago.
I feel like there is something really synergistic between SQL and what an LLM is capable of.
I've run some experiments where I simply feed back the last SQL error / result set + prior query + goal, and I've had iterations reach success, in one case after hitting the wall 12 times.
OpenAI's models will dive straight into schema definition and system tables if you are clever in your prompting. Many RDBMS implementations support this. Table names, column names, indexes, foreign keys, and matching descriptions for all in the most ideal case.
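Roughly, the loop I'm describing looks like the sketch below (sqlite3 and a made-up table stand in for the real RDBMS, and ask_llm is a placeholder for whatever completion API you call):

```python
import sqlite3

def ask_llm(prompt: str) -> str:
    """Placeholder: swap in your actual chat-completion call."""
    raise NotImplementedError

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL)")

# Schema context pulled from the system catalog (sqlite_master here;
# information_schema or similar in most other RDBMSs).
schema = "\n".join(row[0] for row in conn.execute(
    "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"))

goal = "total revenue per customer"
query, feedback = None, ""
for attempt in range(12):
    prompt = (f"Goal: {goal}\nSchema:\n{schema}\n"
              f"Previous query: {query}\nFeedback: {feedback}\n"
              "Return only a SQL query.")
    query = ask_llm(prompt)
    try:
        rows = conn.execute(query).fetchall()
        feedback = f"Result sample: {rows[:5]}"  # success (or let the model judge the result)
        break
    except sqlite3.Error as exc:
        feedback = f"SQL error: {exc}"           # feed the error back and try again
```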
The fact that you can call OAI inference APIs directly from a SQL Server stored procedure makes me a little bit nervous about the whole skynet thing.
So ... how does one evaluate the quality of a framework + benchmark without results like this? I feel like this would be a lot more compelling if they had at least some reasonable first models with result metrics and analysis showing that the tasks are both tractable and good at revealing differences in performance and behavior (i.e. the problems should be neither too easy nor too hard). For example, what if Amazon purchase/churn behavior is just largely independent from review behavior, and review behavior is mostly dictated by "did some company use this account to generate fake 5-star reviews?"
It is a great read, and I am glad to see deep relational learning getting some attention. Thanks for citing our paper (https://openreview.net/pdf?id=b4GEmjsHAB), and I'm looking forward to some collaboration or a way to contribute.
If the data is already represented in a knowledge graph - say, in a graph database - then is that data more conducive to deep learning because it's in a structured or more useful format?
It's about choosing what to embed. We still have a long way to go with embedding non-textual concepts. It results in smaller models, but ones that are even more difficult to reason about.
ok, someone help me out please. I get a lot of things wrong a lot of the time, and I feel like this is one of those times -- this looks like we are having a NN "memorize" tabular data and/or data tuples. I'm having trouble understanding how this is superior to training a language model to obtain results via SQL. In my mind, a database is kind of a busy thing. So, I suppose I can see the benefit of this for forensic work, but it seems like it wouldn't be very useful on an active database. However, I think it is more likely that I simply don't understand it than there is actually no benefit. So if anyone can help me out I'd appreciate it.