Ludwig, a code-free deep learning toolbox (uber.com)
177 points by beefman on Feb 12, 2019 | 28 comments



Technically, it is not code-free: it is declarative programming in YAML. You still have to specify your input_features, output_features, and training architecture/specification. This is not a drag-and-drop UI (although you could probably layer one on top of this).
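For readers who haven't clicked through: the spec maps onto roughly this kind of structure (shown here as a Python dict for illustration; the feature names are made up and the exact keys may differ, so check the docs):

  model_definition = {
      'input_features': [
          # hypothetical text input encoded with one of Ludwig's built-in encoders
          {'name': 'review_text', 'type': 'text', 'encoder': 'parallel_cnn'},
      ],
      'output_features': [
          {'name': 'sentiment', 'type': 'category'},
      ],
  }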

This should work well with a Feature Store, where features are already pre-processed and ready for input to model training. With a feature store, this could be like the Tableau/Qlik/PowerBI tool for Data Science.


Aren't most popular ML libraries declarative anyway? With Keras, for example, you aren't exactly specifying how to transform the state (all the specific matrix multiplication is hidden); rather, you are declaring the logic of the computation (you list out the layers).
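For example, in Keras you mostly just list the layers and the framework takes care of the underlying tensor operations (a minimal sketch):

  from keras.layers import Dense
  from keras.models import Sequential

  # You declare what the network is, not how the matrix multiplications are executed.
  model = Sequential([
      Dense(128, activation='relu', input_shape=(20,)),
      Dense(1, activation='sigmoid'),
  ])
  model.compile(optimizer='adam', loss='binary_crossentropy')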


TensorFlow has mappers/reducers for feature engineering, and it's Python - so you will inevitably do what you want in there. But yes, you have a point there. How do you debug this thing, though?


You have a good point about debugging. So far my approach has been to write quite long and explanatory error messages (like this https://github.com/uber/ludwig/blob/9c9e5de56dcad89461c5d2ec... for instance) and documentation as detailed as I can, and, at least when people used it internally, that has been enough. But I'm definitely open to suggestions here.


We are considering a bunch of options for the feature store. I personally like EuclidDB, but I'm not sure it fits the notion of feature store that we need (though it's great for ANN inference). Anyway, we are looking into it.

Regarding code-free vs. declarative, I mean, it really depends where you draw the line :) Is calling a command line program with some parameters declarative programming? What if there are a lot of parameters? What if those parameters have a nested structure and are contained in a YAML file? And what if the parameters become so many and so detailed that you can specify almost every single operation the command line program will perform (like in Caffe configuration files, for instance)? Do you see what I mean? Anyway, if it's not code-free, you may agree that it's the open source tool that comes closest to being code-free :)


I attempted to do something similar in 2017 [1]. A couple of issues I noticed:

* The field is evolving quite rapidly, and so most options will require a lot of configuration, which IMO is not suitable for a declarative approach.

* It's hard to debug; eventually you will need to dive into the code.

* Extending the library would mean touching the code anyways.

One major difference here is that Ludwig tries to expose one interface that, for instance, users without Python knowledge can use, while inside a company machine learning engineers can extend the tool and provide support by debugging other users' use cases.

I think at some point such an approach can be useful for a very limited set of mature use cases.

[1]: https://github.com/polyaxon/polyaxon-lib/blob/master/example...


Your configuration file looks really good! Why did you stop pursuing it? To your points:

* That may be true, but I think it also depends on the level of abstraction. For instance, in Ludwig the encoders, although configurable, are pretty monolithic. It's a trade-off with flexibility: in Caffe or in your configuration file you are super flexible in specifying each single operation, while in Ludwig that is abstracted away from you, but the advantage is that, as long as you trust the encoder implementations, those encoders mimic papers / state-of-the-art models and require much less configuration to be written in order to run (in many cases, if you are happy with the default parameters, you don't have to configure them at all). So if the field moves and a new text encoder comes in, one can easily add a new encoder to Ludwig too. It's a dangerous game to play catch-up, but hopefully releasing it as open source and making it easy to extend may encourage spontaneous contributions from the community.

* Debugging is an issue, that is true; I answered another post about that, but again, it's a matter of trade-offs. Debugging is kind of a nightmare in SQL too, for instance, or in TensorFlow (even if tfdbg improved things a bit).

* Extending requires coding, that is also true, but if you have an idea for a new encoder, for instance, all you have to do is implement a function that takes a tensor of rank k as input and returns a tensor of rank z as output, and all the rest (preprocessing, training loop, distributed training, etc.) comes for free, which is kind of a nice value proposition IMHO and lets you focus on the model rather than everything else.
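Conceptually the contract is just tensor in, tensor out; a hypothetical sketch in plain TensorFlow 1.x (not Ludwig's actual extension API) would look something like:

  import tensorflow as tf

  def my_text_encoder(inputs, state_size=256):
      # inputs: rank-3 tensor [batch_size, sequence_length, embedding_size]
      # returns: rank-2 tensor [batch_size, state_size]
      cell = tf.nn.rnn_cell.GRUCell(state_size)
      _, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
      return final_state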

Thanks for the interesting discussion!


That looks pretty cool. The official website [0] has more examples.

[0] https://uber.github.io/ludwig/examples/


"identification of points of interest during conversations between driver-partners and riders" Wait...what?! I hope this was an opt-in study.


It looks like the Uber AI lab doesn't have anything to show management to prove its existence. That's why they come up with this kind of thing (you know what I mean). The code-free toolbox is a myth.


Personally I've been working on the same declarative approach over the past couple of years at my company. This year I changed things to make heavy use of scikit-learn's `make_column_transformer` and `ColumnTransformer` capabilities.
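For anyone unfamiliar, those scikit-learn pieces let you declare per-column preprocessing in one place instead of hand-coding it; a minimal sketch (column names are placeholders):

  from sklearn.compose import ColumnTransformer
  from sklearn.preprocessing import OneHotEncoder, StandardScaler

  # Declare what happens to each column; scikit-learn handles the plumbing.
  preprocess = ColumnTransformer([
      ('categorical', OneHotEncoder(handle_unknown='ignore'), ['column1']),
      ('continuous', StandardScaler(), ['column2']),
  ])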

It's nice to see other (much more reputable) engineering organizations taking a similar approach and treating the construction of different predictive models as an exercise in configuration.

In my solution I haven't looked at any DL models, though, and typically default to feeding everything through XGBoost and performing a grid search for the best hyperparameter config. My product is basically focused entirely on taking a raw dataset and a configuration file and producing an analytic dataset on which algorithms can be tested.
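The XGBoost part is just a plain scikit-learn grid search; a minimal sketch (the grid itself is only illustrative):

  from sklearn.model_selection import GridSearchCV
  from xgboost import XGBClassifier

  search = GridSearchCV(
      XGBClassifier(),
      param_grid={'max_depth': [3, 5, 7], 'n_estimators': [100, 300]},
      cv=5,
  )
  # search.fit(X, y) then picks the best hyperparameter config by cross-validation.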

I'd be really interested in hearing others' experiences with this type of stuff.


I make little DSLs using TOML to further hide data wrangling and estimator pipelining "hydraulics" in the context of a specific class of models.

So for example I'll have a section named "data" with variables like "binned = ['var1', 'var2']", and likewise for log/power-transformed variables, and some Python code turns that into column transformers. In other examples there's even more custom logic hidden (some stuff from my master's thesis that I'm not at liberty to discuss), so it's not just a matter of reusing scikit-learn.
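As an illustration only (the section and key names here are invented, and the real DSL has more custom logic than this), a spec like that can be turned into transformers with a few lines:

  import numpy as np
  import toml
  from sklearn.compose import ColumnTransformer
  from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer

  spec = toml.loads("""
  [data]
  binned = ["var1", "var2"]
  logged = ["var3"]
  """)

  preprocess = ColumnTransformer([
      ('binned', KBinsDiscretizer(), spec['data']['binned']),
      ('logged', FunctionTransformer(np.log1p), spec['data']['logged']),
  ])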

I use TOML for little languages that are really flexible configuration files, and YAML for situations where there may be multiple similarly-specified objects (because the hierarchical syntax in TOML is somewhat obscure).


I've never seen TOML, but I really like it. In my pretty simple pipeline, I've just been using a JSON-like structure that I can easily import into Python:

  strategies = [
      {
          'name': 'column1',
          'kind': 'categorical',
          'strategy': 'ohe'
      },
      {
          'name': 'column2',
          'kind': 'continuous',
          'strategy': 'center'
      },
      ...
  ]

It has worked very well thus far. I received a request from a stakeholder a few weeks back about building a new model using a slightly different target. I told him it'd take a couple of days at least (it was a very similar target), but I had it completed (at least from a re-engineering perspective) in about 5 minutes. I simply changed the target variable config and removed any leaky features after changing the target.

I'm convinced that this approach, coupled with configurations for tree-based models like minimum samples per leaf and max depth, is the most efficient way of building predictive models. Those configurations specific to tree-based model software help to skirt things like the rule of five, etc., IMO.


That's basically my approach. TOML is chosen for readability and easy maintenance, as well as for the ability to cut and paste some chunks of Python defining constants etc. directly, which makes the first steps much less bureaucratic.

I'm looking into more general parsing that would allow me to define semi-verbose little languages. In my heart Stata is still the gold standard for rapid fire usability, even at the cost of idiosyncrasy; I'd like to further make the engineering of, say, REST APIs more and more code/language independent and more logically specified, since a lot of the new "data science" crowd coming from stats and applied maths can't code their way out of a Tequila Sunrise.


That experience you talked about, being able to really quickly come up with solutions starting from an already established and working configuration, was one of the main motivating factors behind making Ludwig too! Basically I had a model, and they asked me first to add an input feature, then to add another one, then to add an output feature. Every time I did it with generality in mind, so that the next time they asked it would require less time to do. In the end, with the final solution, each request just cost me adding a new entry in a YAML file. :)


I think this is really cool. I shall try and see if I can make a Docker image out of it.


After scanning the documentation: it's an even higher-level Keras, which is good, but in order to make smart use of it you really need to know all the DL tricks, which makes the push toward nontechnical users misleading.


You may be right about that, but it also depends on the requirements you have. Ludwig gives you a lot of options for those tricks, like gradient clipping or regularizers, or learning rate and batch size scheduling, but those things are usually useful for squeezing out that extra 3% of performance, and even in those cases, having them already implemented is an advantage. My personal experience is that in many cases doing the first step, getting 80% of the final performance, is enough to convince someone of the value of what you are doing, and then you can spend time improving on it later; in that regard, Ludwig gets you from 0 to 80% really quickly.
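To give an idea of what those knobs save you from writing by hand, this is roughly what gradient clipping and learning rate scheduling look like in plain Keras (just as an illustration, not Ludwig code):

  from keras.callbacks import ReduceLROnPlateau
  from keras.optimizers import Adam

  # Clip gradients via the optimizer, schedule the learning rate via a callback.
  optimizer = Adam(lr=0.001, clipnorm=1.0)
  lr_schedule = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3)
  # then model.compile(optimizer=optimizer, ...) and model.fit(..., callbacks=[lr_schedule])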


Looks cool! Is there any particular Ludwig the name is referring to?


There are lots of us unofficially taking credit. RIP any googleability I might have had before...


Looks good. I manage a machine learning team and we write custom models. For data science teams with limited engineering support, Ludwig looks very good.


Looks good (non-developer product person here)... Could this be used for time series predictions?


Sorry for the layman's question, but is this language independent or only for English?


Text is just one of the possible types of features supported in Ludwig. For those features, no, you are not limited to English: you can train models on any language, with a couple of small caveats. You can train both character-based models and word-level ones. For character-based ones you don't really need anything, and you can also train on languages without explicit word separation, like Chinese. For word-based ones, you need a function to separate your text into words. By default a regular expression is used, which generically kind of works for most languages that separate words, but the tokenizer from the spaCy library can also be used. spaCy provides models for a few languages; at the moment we are wrapping just the English ones, but it would be extremely easy to add the other ones supported in spaCy as well.
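Concretely, the difference between the modes is roughly this (a sketch, not Ludwig's exact preprocessing code; the regex is just an example):

  import re
  import spacy

  text = "Where should we meet for the pickup?"

  char_tokens = list(text)                # character-based: no tokenizer needed
  word_tokens = re.findall(r'\w+', text)  # simple regex word splitting
  nlp = spacy.load('en_core_web_sm')      # spaCy English model (assumes it is downloaded)
  spacy_tokens = [token.text for token in nlp(text)]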


The polarity of the comments between this post and the driver post scares me.


How might one do speaker (voice) recognition?


Check out https://github.com/NVIDIA/OpenSeq2Seq - a similar framework, but with voice recognition and voice synthesis.


Speech is coming to Ludwig soon; we are currently working on that.



