Arguably that’s a separate (though obviously critical) concern. I think it’s worth abstracting it away as just another step in the pipeline, one with its own set of concerns, challenges, and methods that really requires its own deeper study to do well.
For instance, my ML work is almost entirely in the context of engineering simulation regression/surrogate development, where data quality/cleaning is almost no issue at all - all of the work is on the dataset generation side and on the model selection/training/deployment side.
Every job is different!