I looked over the tutorials and am curious to know whether they are representative of how Netflix does ML?
Is data really being read in .csv format and processed in memory with pandas?
I see "petabytes of data" being thrown around everywhere, and I'm just trying to understand how one can read gigabytes of .csv and do simple stats like group-bys in pandas. Shouldn't a plain SQL DWH do the same thing more efficiently with partitioned tables, clustered indexes, and the power of the SQL language?
I would love to take a look at one representative ML pipeline (even with masked dataset and feature names) just to see how "terabytes" of data get processed into a model.
Good question! A typical Metaflow workflow at Netflix starts by reading data from our data warehouse, either by executing a (Spark)SQL query or by fetching Parquet files directly from S3 using the built-in S3 client. We have some additional Python tooling to make this easy (see https://github.com/Netflix/metaflow/issues/4).
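To make that concrete, here is a minimal sketch of the "fetch Parquet from S3" path using Metaflow's built-in S3 client plus pandas. The bucket and prefix are placeholders, not a real Netflix location:

```python
import pandas as pd
from metaflow import FlowSpec, S3, step


class LoadDataFlow(FlowSpec):

    @step
    def start(self):
        # Fetch every Parquet file under a prefix with Metaflow's S3 client.
        # 's3://my-bucket/warehouse/table/' is a hypothetical path.
        with S3(s3root='s3://my-bucket/warehouse/table/') as s3:
            # get_all() downloads the objects to local temp files; obj.path
            # points at the downloaded file while the context is open.
            frames = [pd.read_parquet(obj.path) for obj in s3.get_all()]
        self.df = pd.concat(frames, ignore_index=True)
        self.next(self.end)

    @step
    def end(self):
        print(f"Loaded {len(self.df)} rows")


if __name__ == '__main__':
    LoadDataFlow()
```

In practice the heavy aggregation (the group-bys you mention) is usually pushed down to the warehouse via SQL, and only the already-reduced result lands in pandas inside the flow.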
After the data is loaded, there are a bunch of steps related to data transformations. Training happens with an off-the-shelf ML library like Scikit-Learn or TensorFlow. Many workflows train a suite of models using the foreach construct.
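Here is a small sketch of the foreach pattern, assuming a hypothetical hyperparameter grid and a toy Scikit-Learn model; real flows differ in the data and models but follow the same shape:

```python
from metaflow import FlowSpec, step


class TrainSuiteFlow(FlowSpec):
    """Train one model per hyperparameter using foreach."""

    @step
    def start(self):
        # Hypothetical grid: one foreach branch per value.
        self.alphas = [0.01, 0.1, 1.0]
        self.next(self.train, foreach='alphas')

    @step
    def train(self):
        from sklearn.datasets import make_regression
        from sklearn.linear_model import Ridge

        # self.input holds this branch's element of the foreach list.
        self.alpha = self.input
        X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
        self.model = Ridge(alpha=self.alpha).fit(X, y)
        self.score = self.model.score(X, y)
        self.next(self.join)

    @step
    def join(self, inputs):
        # Collect results from all parallel training branches.
        self.results = [(inp.alpha, inp.score) for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        print(self.results)


if __name__ == '__main__':
    TrainSuiteFlow()
```

Each foreach branch can be run on its own container via the batch/resources decorators, which is how a suite of models gets trained in parallel.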