Sometimes I just absolutely love hacker news more than words can explain.
I am knee deep in a personal project exploring machine learning on tabular data and it’s been consuming my off-hours brain for a while. And I pop open HN on a holiday Monday to find…a package for machine learning on tabular data :)
Curious if anyone has any other suggestions of frameworks or packages to explore. It seems tabular ML got a lot of activity in 2019 and 2020 before the industry's focus moved on to image generation (DALL-E, Stable Diffusion), so I'm wondering whether there's been much advancement since.
> If your data doesn't require preprocessing ...

Very often you need to prepare data to use xgboost/lightgbm: fill in missing values, convert categoricals/dates/text to numeric, and do feature engineering.
What's more, MLJAR AutoML also checks much simpler algorithms for you - dummy models (average response or majority vote), linear models, simple decision trees - because very often you don't need heavy machine learning at all. Plain xgboost/lightgbm can't do this :)
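For anyone curious what that looks like in practice, here is a minimal sketch with mljar-supervised (dataset and column names are made up, so treat it as a sketch against the real API rather than a drop-in script):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from supervised.automl import AutoML  # pip install mljar-supervised

    # hypothetical housing-style dataset with NaNs and categorical columns
    df = pd.read_csv("train.csv")
    X = df.drop(columns=["SalePrice"])
    y = df["SalePrice"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # AutoML does the imputation/encoding and tries baseline, linear, and
    # decision-tree models before the heavier boosted ensembles
    automl = AutoML(mode="Explain")
    automl.fit(X_train, y_train)
    predictions = automl.predict(X_test)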
You do not need to fill missing values or encode categoricals yourself with LightGBM.
You can pass the names of the categorical columns when constructing the Booster or Dataset, and LightGBM will handle them natively under the hood (its categorical splits are roughly comparable to one-hot encoding, and often better).
LightGBM also has options to automatically treat missing values as either zeros, their own category, or the sample average (I might be mistaken on that last one).
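For example, a minimal sketch of that native handling (column names are hypothetical; with pandas 'category' dtype columns LightGBM picks up the categoricals automatically):

    import pandas as pd
    import lightgbm as lgb

    df = pd.read_csv("train.csv")  # hypothetical file; NaNs are left as-is

    # mark the categoricals via pandas dtype; LightGBM detects these on its own
    for col in ["neighborhood", "house_style"]:  # hypothetical column names
        df[col] = df[col].astype("category")

    X = df.drop(columns=["price"])
    y = df["price"]

    # no imputation or encoding step: missing values get routed to whichever
    # side of each split reduces the loss
    model = lgb.LGBMRegressor(n_estimators=500)
    model.fit(X, y)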
All in all, you still need to do feature engineering and the like, but LightGBM removes a lot of the hassle compared to XGBoost.
One newbie question maybe you can answer: can XGBoost/LightGBM handle out-of-band predictions, i.e. values outside the range seen in training? The specific regression problem I'm tackling involves price predictions on tabular data (similar to the Kaggle housing price problem), and I know classic random forest / decision trees have trouble with time series predictions. Not sure if those models handle that better.
> classic random forest / decision trees have trouble with time series predictions
This is not true. The trick is you have to convert your longitudinal data to be cross-sectional via feature engineering of lagged features. There are also related tricks like expanding datetimes into features like day of week, day of month, etc. This can be a lot of work, though there are software tools which help do this.
Some general time-oriented feature engineering plus a vanilla random forest is a great second baseline (after LOCF, i.e. last observation carried forward), and then if needed you can spend 10x the time tuning a GBM to beat that.
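Roughly like this, with a vanilla random forest on top of lag and calendar features (all file and column names are invented for illustration):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # hypothetical long-format price series: one row per (date, listing)
    df = pd.read_csv("prices.csv", parse_dates=["date"]).sort_values("date")

    # expand the datetime into simple calendar features
    df["day_of_week"] = df["date"].dt.dayofweek
    df["month"] = df["date"].dt.month

    # lagged targets turn the longitudinal problem into a cross-sectional one
    for lag in (1, 7, 28):
        df[f"price_lag_{lag}"] = df.groupby("listing_id")["price"].shift(lag)

    df = df.dropna()
    features = [c for c in df.columns if c not in ("date", "price", "listing_id")]

    # train on the past, predict the future (no shuffled split for time series)
    cutoff = df["date"].quantile(0.8)
    train, test = df[df["date"] <= cutoff], df[df["date"] > cutoff]

    model = RandomForestRegressor(n_estimators=300, n_jobs=-1)
    model.fit(train[features], train["price"])
    preds = model.predict(test[features])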
If you have a GPU, check out fastai's tabular learning stuff; easy feature engineering and neural nets with embeddings can do a lot with low effort.
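Something like this, if I remember the fastai tabular API correctly (target and column names are made up; double-check the exact signatures against the fastai docs):

    import pandas as pd
    from fastai.tabular.all import *

    df = pd.read_csv("train.csv")  # hypothetical price dataset

    # fastai's procs do the feature engineering: categorify, impute, normalize
    dls = TabularDataLoaders.from_df(
        df,
        y_names="price",                    # hypothetical target column
        cat_names=["neighborhood"],         # hypothetical categorical columns
        cont_names=["sqft", "year_built"],  # hypothetical continuous columns
        procs=[Categorify, FillMissing, Normalize],
        y_block=RegressionBlock(),
    )

    # tabular_learner builds a small net with entity embeddings for the categoricals
    learn = tabular_learner(dls, metrics=rmse)
    learn.fit_one_cycle(5)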
It incorporates a variety of cool base libraries like dirty_cat and transformers, adds some of our own, and is heavier on text feature column support than most packages here. Think improving upon TopicBERT or dirty_cat for real text & multicolumn data and interactively visualizing it, all in one line. We are currently adding end-to-end GPU support for that pipeline (the rest of our stack already supports that), primarily around cuml GPU data frame support.
It powers a lot of our new visual no-code AI layers and is driven by a bunch of enterprise projects in areas like cyber, fraud, social, misinfo, supply chain, & finance :) More niche, this is all feeding into our automatic graph ai (GNN) layers. Graph AI packages can't truly do graphs well until they can do node tables and edge tables well on their own!
This looks good. I mostly use deep learning for everything, while this project nicely automates non-deep learning ML.
When I managed a deep learning team at Capital One, my last technical project was automatic deep learning architecture search. I think that the fields of data engineering, data science, machine learning, and deep learning are all ripe for massive automation, reducing the number of jobs in these fields. I think there will be accelerated use of all of these fields, just with much less manual work required.
At least for tabular data, there is no clear evidence that a neural network is superior to a tree-based model [0].
Many data scientist colleagues go straight to neural networks, thinking they will get better results.
Considering the complexity and computational cost, I think it is worth evaluating which tool to use for each case.
Five years ago, I trained a GAN deep model to synthesize data based on a tabular training set. Colleagues extended this to also handle categorical and string data. Models trained on the synthetic data performed well on withheld real data. I mention this as some evidence that deep models subsume conventional ML.
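For anyone who wants to try something similar today without rolling their own GAN, the off-the-shelf CTGAN package from the SDV project covers roughly the same ground. A hedged sketch, not the parent's actual setup, with invented column names:

    import pandas as pd
    from ctgan import CTGAN  # pip install ctgan

    real_df = pd.read_csv("train.csv")  # hypothetical tabular training data

    # continuous columns are modeled directly; list the categorical/string ones
    synthesizer = CTGAN(epochs=100)
    synthesizer.fit(real_df, discrete_columns=["neighborhood", "house_style"])

    # sample synthetic rows to train a downstream model on
    synthetic_df = synthesizer.sample(10_000)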
This looks really great. I can imagine adding some features around the explanations from an educational standpoint. You could go from never having done this to understanding a lot more of the ins and outs of ML very quickly. Kudos
Neat! I'll test it tomorrow. My only 'complaint' from reading the docs is that it only tests one NN model (which is the same for all types of data), rather than at least a few of the top architectures.