I looked over the tutorials and am curious to know whether they are representative of how Netflix does ML.

Is data really being read in .csv format and processed in memory with pandas?

Because I see "petabytes of data" being thrown around everywhere, and I am just trying to understand how one can read gigabytes of .csv and do simple stats like group-bys in pandas. Shouldn't a plain SQL DWH do the same thing more efficiently with partitioned tables, clustered indexes, and the power of the SQL language?

I would love to take a look at one representative ML pipeline (even with masked names of datasets and features), just to see how "terabytes" of data get processed into a model.




Good question! A typical Metaflow workflow at Netflix starts by reading data from our data warehouse, either by executing a (Spark)SQL query or by fetching Parquet files directly from S3 using the built-in S3 client. We have some additional Python tooling to make this easy (see https://github.com/Netflix/metaflow/issues/4)
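To make that concrete, here is a minimal sketch of such a first step, assuming a hypothetical S3 prefix and Parquet layout (not a real Netflix table); it uses Metaflow's built-in S3 client to fetch the files and pyarrow/pandas to load them:

    # Sketch only: the bucket, prefix, and column layout are hypothetical.
    from metaflow import FlowSpec, step, S3
    import pyarrow.parquet as pq
    import pandas as pd

    class LoadDataFlow(FlowSpec):

        @step
        def start(self):
            # Download every Parquet partition under a (hypothetical) warehouse prefix.
            with S3(s3root='s3://my-warehouse/plays/date=2019-12-01/') as s3:
                objs = s3.get_all()
                # Each S3 object is downloaded to a local temp path; read and concatenate.
                frames = [pq.read_table(obj.path).to_pandas() for obj in objs]
            self.df = pd.concat(frames, ignore_index=True)
            self.next(self.end)

        @step
        def end(self):
            print(f"Loaded {len(self.df)} rows")

    if __name__ == '__main__':
        LoadDataFlow()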

After the data is loaded, there are a bunch of steps related to data transformations. Training happens with an off-the-shelf ML library like Scikit-Learn or TensorFlow. Many workflows train a suite of models using the foreach construct.
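A sketch of the foreach pattern, assuming a hypothetical hyperparameter grid and a synthetic Scikit-Learn dataset in place of real features:

    # Sketch only: each foreach branch trains one model for one hyperparameter value.
    from metaflow import FlowSpec, step

    class TrainSuiteFlow(FlowSpec):

        @step
        def start(self):
            # Hypothetical hyperparameter grid; each value fans out into its own task.
            self.alphas = [0.01, 0.1, 1.0]
            self.next(self.train, foreach='alphas')

        @step
        def train(self):
            from sklearn.datasets import make_regression
            from sklearn.linear_model import Ridge
            X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
            self.alpha = self.input               # the foreach value for this branch
            self.model = Ridge(alpha=self.alpha).fit(X, y)
            self.score = self.model.score(X, y)
            self.next(self.join)

        @step
        def join(self, inputs):
            # Collect the trained models from all branches and keep the best-scoring one.
            self.best = max(inputs, key=lambda i: i.score).model
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == '__main__':
        TrainSuiteFlow()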

The results can be pushed to various other systems. Typically they are either pushed back to another table or exposed via a microservice (see https://github.com/Netflix/metaflow/issues/3).
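For example, a downstream service could pull the latest trained model through the Metaflow Client API; the flow and artifact names below are the hypothetical ones from the sketch above:

    # Sketch only: fetch the newest model artifact produced by TrainSuiteFlow.
    from metaflow import Flow

    def load_latest_model():
        run = Flow('TrainSuiteFlow').latest_successful_run
        return run.data.best      # the model artifact stored in the join step

    model = load_latest_model()
    print(model.predict([[0.0] * 10]))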



