Unpopular Opinion – Data Scientists Should Be More End-to-End

geebee · on Aug 25, 2020

People will agree, in theory. But here's what happens. The data science team will grill the generalist about finding a steepest descent vector and how logistic regression works, and conclude that the candidate just isn't going to be able to populate, run, and interpret output from a neural net.

Then, the data engineering team will ask the generalist to write code to print the path from one leaf node to another in a binary tree at the whiteboard in 45 minutes, and conclude that the candidate just isn't going to be able to figure out how to identify matching terms in different parts of a JSON tree.
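For reference, the whiteboard question mentioned above can be sketched in a few lines. This is only one possible approach, not what any particular interviewer expects: it assumes node values are unique, and the `Node` class and the function names are made up for illustration. The idea is to find the root-to-node path for each leaf, drop the shared prefix, and join the two halves at the lowest common ancestor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical minimal binary tree node; values are assumed unique.
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def path_from_root(root: Optional[Node], target: str) -> Optional[List[str]]:
    """Return the values from root down to target, or None if target is absent."""
    if root is None:
        return None
    if root.value == target:
        return [root.value]
    for child in (root.left, root.right):
        sub = path_from_root(child, target)
        if sub is not None:
            return [root.value] + sub
    return None


def leaf_to_leaf_path(root: Node, a: str, b: str) -> Optional[List[str]]:
    """Return the path from node a up to the common ancestor and down to node b."""
    pa = path_from_root(root, a)
    pb = path_from_root(root, b)
    if pa is None or pb is None:
        return None
    # Count the shared prefix; its last element is the lowest common ancestor.
    i = 0
    while i < len(pa) and i < len(pb) and pa[i] == pb[i]:
        i += 1
    # Walk up from a to the common ancestor, then down to b.
    return pa[i - 1:][::-1] + pb[i:]


tree = Node("A", Node("B", Node("D"), Node("E")), Node("C", None, Node("F")))
print(leaf_to_leaf_path(tree, "D", "F"))  # ['D', 'B', 'A', 'C', 'F']
```

Arguably the JSON-matching task is the same skill in different clothing: both come down to a recursive walk over a tree.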

Eventually, both groups withdraw into their own silos and confuse each other. They may decry the lack of generalists, but when it comes time to hire, they will resolutely not hire candidates with 80%ile skills in both areas. They will hire only people with 95%ile skills in one area. They may get some people who have well-rounded skill sets through sheer chance, but their process selects against this outcome.

alexfromapex · on Aug 25, 2020

Agreed, but first companies need to realize that you don’t need a math PhD to do 80% of the things a data scientist does. My company has data science and machine learning experts that aren’t as well-rounded on the software engineering side, but there’s no cross-training because they want SEs to have a CompSci or SE degree and DSs to have a math or DS degree.

aeternum · on Aug 26, 2020

The problem with this argument is it could be applied to almost every role. E.g., the sales team should be more end-to-end (if they spec features and maybe even write code they will understand the product better). For most companies, this would be a terrible idea.

Viliam1234 · on Aug 26, 2020

People should have way more skills than they have now, they should learn all those skills in their free time, and be available for the same salary. Then we could simply hire fewer of them to do the same work. Also, I deserve a pony!

mlthoughts2018 · on Aug 25, 2020

This is sadly misguided because it mistakes the behavior that’s good for the company (fast iteration loops, tight alignment between product and engineering) for the means to get there (make data scientists be more end to end).

The goal of tight iteration with good alignment is of course a good one, but verbal sleight of hand doesn’t mean the way to achieve it is with more end to end responsibilities for data science.

The huge cost of course is that data scientists and ML engineers have a hugely asymmetric comparative advantage when spending their time on model training and statistical solutions. You want them at full utilization for this set of tasks because nobody else you employ can do that same statistical work, and that work is often hugely valuable whereas most of the end to end work is frankly grunt work and fighting through errors that anyone can do.

If you hire Michael Jordan for your basketball team, why would you make him spend his (expensive) time cleaning up soda bottles or checking the elevators for maintenance issues? It utterly makes no sense and wastes the comparative advantage - all your Michael Jordans will be heading for the door.

ska · on Aug 25, 2020

This argument is a really good one, but only in a small fraction (< 10%, < 1%?) of cases. Far more often the "grunt work" is generating at least as much value as the modeling, and Michael Jordan doesn't want anything to do with your office pickup game.

Especially in the case where you can't deploy meaningfully because your "data scientists" are working in a silo with poor communication.

mlthoughts2018 · on Aug 25, 2020

Based on my ~8 years of experience managing ML teams in big companies, I’d say it’s closer to 80% of cases.

“You can’t deploy correctly because of data scientists” is a failure of SRE organizations to provide support, tooling and training.

In fact, most data science and ML engineers are quite skilled in systems engineering, because you have to do so much work with GPU hardware issues, underlying scientific package management, efficient data transportation, etc.

Editing some Kubernetes config, hardening a high-traffic web service or optimizing a query based on an index are trivial by comparison; they are just boilerplate timewasters that need to be a different team’s job to automate.

amznthrowaway5 · on Aug 26, 2020

"In fact, most data science and ML engineers are quite skilled in systems engineering, because you have to do so much work with GPU hardware issues, underlying scientific package management, efficient data transportation, etc."

This does not align with my experience working with many data/applied scientists at very large companies. A lot of them cannot even write basic code or use git commands. The engineers who are capable are often silo'd away from the scientists, and the organizations struggle to produce any real value.

mlthoughts2018 · on Aug 26, 2020

I’m sorry you’ve had such a rare and incredibly uncommon, unrepresentative experience. It sounds very idiosyncratic to your workplace and likely to the hiring processes.

amznthrowaway5 · on Aug 29, 2020

Why do you suspect this experience is so rare and unrepresentative, as opposed to yours? It is a company-wide problem in the cases I've seen; the scientist positions are not even expected to have junior software engineer level competence in things like programming.

ska · on Aug 25, 2020

There are companies who do this well, sure.

Are you seriously suggesting that this applies to 80% of the companies out there trying to apply ML methods?

If so, it really doesn't match my experience in this and related fields. Of course we'll both have sampling bias here, so I could be missing the big picture; it's not like I've studied it industry(ies) wide.

Failure or non-existence of effective SRE organizations is just one of many common failure modes ime.

mlthoughts2018 · on Aug 25, 2020

Not necessarily 80% of companies, rather 80% of devops <> machine learning workflows in a given company.

winchester6788 · on Aug 26, 2020

> In fact, most data science and ML engineers are quite skilled in systems engineering, because you have to do so much work with GPU hardware issues, underlying scientific package management, efficient data transportation, etc.

Unless your data scientists are expected to build the machines they use, they won't be dealing with any hardware issues at all.

Literally every data scientist at big companies uses pre-configured VMs/notebooks in the cloud.

mlthoughts2018 · on Aug 26, 2020

This is just deeply wrong. Most data scientists hate Jupyter notebooks and deeply recognize the flaws of the paradigm: poor modularity, poor testability, etc.

As an ML engineer you spend a lot of your time dealing with CUDA installations, custom compiler flags and then builds/compilations of things like TensorFlow, deep internals of Docker image builds to make these environments reproducible, image processing software with OpenCV and tons of cross-platform & software packaging headaches, writing efficient queries and understanding data structure implications for Spark, Arrow, HDFS, Presto, Postgres, etc., and standing up things like TensorBoard for telemetry of ML training systems, deploying MLflow or Kubeflow in Kubernetes, and so on.

The myth of data scientists as notebook jockeys is just one more symptom of the denial of SRE orgs to admit ML engineers are great system engineers, to try to control them with parochial devops requirements coming from outside specializations.

Izkata · on Aug 26, 2020

> The myth of data scientists as notebook jockeys is just one more symptom of the denial of SRE orgs to admit ML engineers are great system engineers

It's most obvious with this statement, but overall you seem to think "ML engineer == data scientist", which just isn't the case.

mlthoughts2018 · on Aug 26, 2020

That sounds like a No True Scotsman fallacy to me. You’re trying to define “data scientist” as someone who only knows how to use notebooks, fails to put testing as a first class consideration, etc., but that’s a severe minority of people with the job title of Data Scientist.

I manage teams of both ML engineers and Data Scientists and have designed hiring processes for both within large ecommerce companies for years.

Izkata · on Aug 26, 2020

> You’re trying to define “data scientist” as someone who only knows how to use notebooks

I'm not the other person, and that is definitely not what I'm saying. There can be overlap, but the distinction is important: Data scientists use tools for analysis, ML engineers are capable of building those tools.

For example, the tool I'm aware of that some of our data scientists use is SPSS - but they have no programming experience, and could not remotely be grouped in with "ML engineers".

mlthoughts2018 · on Aug 27, 2020

I understand, I’m just saying that the “SPSS only” type of data scientist (glorified business analyst) you describe is very rare in industry and it’s not a useful broad brush to paint the field of data science with - it’s a greatly exaggerated and overrepresented stereotype.

alexfromapex · on Aug 25, 2020

I think what OP is saying is that any ostensible gains of a dedicated data scientist are lost later because they didn’t think about the solution through the lens of the full tech stack. Maybe there was a much easier way to implement a model that could save a lot of time. Lots of scenarios where general knowledge leads to better decisions.

mlthoughts2018 · on Aug 25, 2020

That is premature abstraction 101 though. SRE folks will bring parochial philosophy and first principles constraints too early, not understanding the differences between ML system needs and other systems they’ve worked with.

Assuming you could have correctly anticipated those design needs earlier, rather than going back to iterate only after you get concrete evidence of the specific changes you need, is a big trap and time waster.

I don’t agree that you lose because the data scientist failed to consider production SRE patterns early enough. You lose when those SREs fail to take a YAGNI philosophy, let the unusual ML system’s chips fall where they may, and then intentionally go back and iterate on improving it.

martindbp · on Aug 25, 2020

I think you can frame it as specialization without making it sound elitist. It's just more efficient if everyone can spend time on what they're uniquely good at. The "grunt work" can be just as complex as the data science, just in a completely different way.

mlthoughts2018 · on Aug 25, 2020

Some of it, yes. Some of it truly is just rote work that should be automated or owned by junior non-specialists. The same comparative advantage argument could apply to highly skilled DBAs or quants or frontend specialists, just depending on your company’s needs.

If people interpret it as “elitist” to refer to basic comparative advantage economics, they need to get over it and stop being so sensitive.