It depends on your starting point; a baseline level of ML is needed. Beyond that, ML platforms account for three basic functions: features/data, model training, and model hosting.
So do an end-to-end project where you:
- start from a CSV dataset, with the goal of predicting some output column. A classic example is predicting whether a household's income exceeds $50K from census information (the UCI Adult dataset).
- transform/clean the data in a Jupyter notebook and engineer features for input into a model. Export the features to disk in a format suitable for training.
- train a simple linear model using a chosen framework: a regressor if you're predicting a numerical field, a classifier if it's categorical.
- iterate on model evaluation metrics through more feature engineering, scoring the model on held-out data to see its actual performance.
- export the model in a format that can be loaded for hosting. The format largely depends on the framework.
- construct a Docker container that exposes the model over HTTP, with a handler that receives prediction requests and transforms them into model inputs, plus a client that sends requests to that endpoint. (Sketches of these steps follow below.)
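To make the data-prep step concrete, here's a minimal sketch in pandas. The column names and file names are illustrative (assuming the Adult-style CSV with an `income` column), and `to_parquet` assumes pyarrow is installed:

```python
import pandas as pd

# Load the raw CSV; file and column names are illustrative.
df = pd.read_csv("adult.csv")

# Basic cleaning: drop rows with missing values.
df = df.dropna()

# Binary label: 1 if income is >50K, else 0.
df["label"] = (df["income"] == ">50K").astype(int)
df = df.drop(columns=["income"])

# One-hot encode categorical columns; numeric columns pass through.
features = pd.get_dummies(df, columns=["workclass", "education", "occupation"])

# Export to a columnar format suitable for training.
features.to_parquet("features.parquet", index=False)
```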
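Train/evaluate/export might look like this with scikit-learn (just one reasonable framework choice; nothing below is specific to it):

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

features = pd.read_parquet("features.parquet")
X = features.drop(columns=["label"])
y = features["label"]

# Hold out unseen data for honest evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Simple linear classifier, since the target is categorical.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score on the held-out set; iterate on features until these improve.
preds = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print("auc:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Export in a loadable format (joblib works for scikit-learn models).
joblib.dump(model, "model.joblib")
```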
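And a bare-bones HTTP handler for the serving step (FastAPI here is just one option; the Docker container would copy this in and run it under uvicorn). The feature-alignment line is a crude stand-in; a real handler has to reapply the same transformations used at training time:

```python
import joblib
import pandas as pd
from fastapi import FastAPI

app = FastAPI()
model = joblib.load("model.joblib")

@app.post("/predict")
def predict(record: dict):
    # Transform the raw request into the model's input format.
    # A real server must apply the same feature engineering as training;
    # reindexing against the fitted columns is a crude approximation.
    row = pd.DataFrame([record])
    row = row.reindex(columns=model.feature_names_in_, fill_value=0)
    return {"prediction": int(model.predict(row)[0])}
```

The client side is then just `requests.post("http://localhost:8000/predict", json={...})` against that endpoint.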
That'll give you an end-to-end run through the entire MLE lifecycle. Every other part of development is a series of concentric loops between these steps, scaled out to ridiculous degrees along several dimensions: number of features, size of dataset, steps in a data/feature processing pipeline to generate training datasets, model architecture and hyperparameters, latency/availability requirements for model servers...
For bonus points:
- track metrics and artifacts using a local MLflow deployment (sketch after this list).
- compare performance for different models.
- examine feature importance to remove unnecessary (or net-negative) features.
- use a NN model and train on GPU. Use profiling tools (framework-dependent; see the profiler sketch below) and NVIDIA Nsight to examine performance. Optimize.
- host a big model on GPU. Profile and optimize.
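For the MLflow bullet, local tracking is only a few lines. This assumes the `model.joblib` and metrics from the training sketch above; the logged values are placeholders:

```python
import mlflow

# Point at a local tracking store (browse it with `mlflow ui`).
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("census-income")

with mlflow.start_run():
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_param("max_iter", 1000)
    mlflow.log_metric("accuracy", 0.85)  # plug in your actual eval result
    mlflow.log_artifact("model.joblib")
```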
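For the GPU profiling bullet, PyTorch's built-in profiler is a reasonable first pass before reaching for Nsight. The model and batch here are placeholders:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and batch; substitute your NN and real data.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(256, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        y = model(x)
        torch.cuda.synchronize()

# Look for spans where the GPU sits idle waiting on the CPU.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```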
IMO: the biggest missing piece for ML systems/platform engineers is how to feed GPUs. If you can right-size workloads and keep a GPU fed with MLE workloads, you'll get hired. MLE workloads vary wildly (ratio of data volume in vs. compute; size of model; balancing CPU compute for feature processing with GPU compute for model training). We're all working under massive GPU scarcity.
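Concretely, "feeding the GPU" often comes down to input-pipeline knobs like these (PyTorch shown; the dataset is a placeholder and the numbers are things you tune per workload, not recommendations):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; substitute your real feature pipeline.
dataset = TensorDataset(torch.randn(100_000, 128),
                        torch.randint(0, 2, (100_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,           # right-size batches to fill the GPU
    num_workers=8,            # CPU workers doing feature processing
    pin_memory=True,          # faster host-to-device copies
    prefetch_factor=4,        # batches staged ahead per worker
    persistent_workers=True,  # avoid worker respawn each epoch
)

# If GPU utilization dips between steps, the input pipeline is the
# bottleneck: add workers, move transforms offline, or cache features.
for xb, yb in loader:
    xb = xb.cuda(non_blocking=True)  # overlap the copy with compute
    ...
```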
For the majority of use cases I have seen: solving a sufficiently large pain point, understanding/formulating the problem, having/getting the right data, and fitting well into the users' workflow.
All the technology challenges are actually on the "cost" side of the equation. Meaning that the aim, with respect to business value, should be to do as little of it as possible (but not less!). For some use cases this can still be quite a lot... but more often it's "all the pieces need to be in place for the whole to work at all" rather than "each piece needs to be super optimized".
This is really helpful, thanks. How much are third-party models (LLMs etc.) changing these workflows? Would you still spend as much time on feature engineering and evaluation? I'm wondering whether any saved time would be refocused on hosting, especially optimizing GPU utilization.