Shopify's Data Science and Engineering Foundations (2020) (shopify.engineering)
205 points by mooreds on March 11, 2022 | hide | past | favorite | 25 comments


It sounds like they have data science and data engineering in one organization. Is that team structure something that others have seen work well?


I've been in orgs where it was on same team, and on different teams, both as a modeler and a data engineer. So far, I personally prefer when they're on the same team.

Pros of same-team: fewer ideas "lost in translation" between data scientists and data engineers, better understanding of which datasets/flows are top priority, can sometimes share some stack components and help datascientists improve their code, better chances of getting data scientists to contribute their own batch jobs (there's just more trust as opposed to dealing with some "engineering" team that is less connected to you)

Cons of same team: data engineers may not be as in-the-loop on what's happening with production datasets, may not be as tightly integrated with a devops team, may get overly caught up in "business logic" as opposed to "plumbing".


One of the most interesting bits of devops work I've done was when I was embedded with a data science team. Infrastructure for data science is just so different from traditional ops - but I feel like I was able both to help the team move more quickly and to prevent them from spending all of the company's money - so at least in that case, it worked quite well.

I've never understood why data science teams are typically so far removed from "normal" engineering teams. Maybe it's the DevOps Kool-Aid speaking, but in my opinion, teams should be more horizontal than vertical!


In my experience you tend to get better engineering staff when it's one organization, along with a better customer/product focus.

When it's two independent teams, you tend to get a more research focused Data Science organization and a team of engineers more focused on plumbing.

Which option is better will depend on the organization's goals. If you think you have a straight research problem, then a dedicated research team is useful. If you want to ship product, then one team is better.


I work with operations research teams in a blended model of engineering being embedded with the OR Scientists. I really prefer it. Code can get to prod a lot quicker and we don’t have the “throw it over the fence to engineering” issues that can arise.


Data scientists are embedded in product teams and data platform engineers are in a platform engineer org


  > We test every situation that we can think of: errors, edge cases, and so on.
Pretty bold claim. Either they can't imagine many weird data issues, or their data is somehow guaranteed to always be validated, or they came up with a magic secret sauce for data unit testing.

Could anybody explain what they mean by their version of unit testing?

Does it mean to test that a change in code will still successfully process the same static data sample as was tested previously?

If not using static data, what does it test when the input data changes? Does it validate the input data?


I don't know what they do at Shopify, but I have seen teams be successful using dbt's (https://getdbt.com) testing methodologies and tooling. It makes it very easy to do things like cardinality and null checking, and if you're using a data warehouse like Snowflake with snapshots, it is super easy to have test harnesses with representative data. To me the combination of tools there is the secret sauce, since data mocking is such a giant tedious pain in the ass.
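For anyone who hasn't used dbt: its built-in not_null and unique tests amount to simple assertions over a column. A rough plain-Python analogue of the two checks mentioned above (function names and sample data are mine for illustration, not dbt's actual implementation):

```python
def check_not_null(rows, column):
    """Analogue of dbt's not_null test: flag rows where the column is None."""
    bad = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"passed": not bad, "failing_rows": bad}

def check_unique(rows, column):
    """Analogue of dbt's unique test: flag values that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        v = r.get(column)
        if v in seen:
            dupes.add(v)
        seen.add(v)
    return {"passed": not dupes, "duplicates": sorted(dupes)}

rows = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": None}]
print(check_not_null(rows, "id"))  # fails: row 3 has a null id
print(check_unique(rows, "id"))    # fails: id=2 is duplicated
```

In dbt you'd declare these in a schema YAML file instead of writing them by hand; the point is just how cheap the checks are once the tooling exists.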


>We test every situation that we can think of

Doesn't say anything about how many situations they think of.


When I was at Shopify, it was Python unit tests for ETL code. There was a framework of reusable components that simplified things, but it was all homegrown. They were trying to move to dbt.
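A minimal sketch of what such a unit test can look like - the transform and fixture here are invented, not Shopify's actual framework. The idea is to pin a small static data sample and assert on the transform's output, so a code change that alters behaviour fails the test:

```python
def dedupe_orders(rows):
    """Hypothetical ETL transform: keep the latest record per order_id."""
    latest = {}
    for row in rows:
        key = row["order_id"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return sorted(latest.values(), key=lambda r: r["order_id"])

# Static fixture checked into the repo - a small representative sample.
SAMPLE = [
    {"order_id": 1, "updated_at": "2020-01-01", "status": "pending"},
    {"order_id": 1, "updated_at": "2020-01-02", "status": "paid"},
    {"order_id": 2, "updated_at": "2020-01-01", "status": "paid"},
]

def test_dedupe_keeps_latest_record():
    result = dedupe_orders(SAMPLE)
    assert [r["order_id"] for r in result] == [1, 2]
    assert result[0]["status"] == "paid"  # the 2020-01-02 update wins
```

Because the input is static, this catches regressions in the code, not problems in live data - the latter is what dbt-style data tests cover.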


Validating data on ingest solves many many problems down the line.
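As a sketch of what validating on ingest can mean in practice (the schema and field names are invented for illustration): type-check each incoming record against an expected schema and quarantine failures before they ever land in the warehouse.

```python
def validate_record(record, schema):
    """Return a list of validation errors for one incoming record."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors

def ingest(records, schema):
    """Split a batch into valid rows and quarantined rows with reasons."""
    valid, quarantined = [], []
    for rec in records:
        errs = validate_record(rec, schema)
        if errs:
            quarantined.append({"record": rec, "errors": errs})
        else:
            valid.append(rec)
    return valid, quarantined

# Hypothetical expected schema for an orders feed.
ORDER_SCHEMA = {"order_id": int, "total": float, "currency": str}
```

Everything downstream can then assume the schema holds, which is exactly why it solves so many problems later.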


I started this expecting to be disappointed, but I really like all of the principles they're describing. I've been pushing for more of this attitude at my own company.


I liked how they took some of the essences of software development (one set of tooling, DRY, re-use) and applied it to the data science arena.


These seem like a list of common sense practices for any team working with data.

It's surprising to see Shopify sticking with the Data Warehouse model in 2020… I expected something a little more cutting edge.

Data Warehouses are fine if you are working with a small number of data sources, but at some point they start to slow down development and new analytics, and make it cumbersome for intraday reporting.

This is why I'm seeing Data Warehouses being replaced by Data Lakehouses at my company and others in the industry. The Lakehouse enables faster development and near real-time analytics on lower cost storage, and works with structured and unstructured data. Similar team practices are still applicable, but the underlying data structures and governance are different.


Data Lakes are ridiculously expensive. The benefits are almost never worth it these days given how efficient data transformation workflows are now.


Data Lakes are designed as low cost storage and can be cloud based or on-prem. Not sure what solution you are referring to as being "ridiculously expensive". If the Data is valuable, a data lake should be cost effective.


Does any e-commerce retailer have an API that lets me, as a consumer, get my order history and its details?

Wish I could write a script to keep retrieving my Amazon order history with one click.


Swell merchants could technically make this available since it’s part of the storefront API


As a consumer? No.


Having recently worked on a data team at FAANG, all this is an ops nightmare for the team running the platform itself if you want to ensure data quality for everyone querying the data. I'm talking when you have hundreds of data sources and hundreds of query use cases.

Anyone have any solutions you've tried?


Check out Apache Iceberg. It does a great job of handling many readers and few writers, with data consistency and query consistency.

It’s a great approach for your data lake and data warehousing needs.


This timetravel/rollback feature is really interesting: https://iceberg.apache.org/docs/latest/spark-queries/#time-t...


FAANG seems to be an outlier, but it sounds a lot like the enterprise data mart strategy covered under a mix of stuff from principle #1.

If you want quality, you need structure and review. Accessible data is helpful and needed to develop some of the mature processes, but for most day to day analysis/reporting, no one wants to create their own data model from scratch.

A lot of what FAANG does doesn't apply to other companies, so it may just be a case of having a wholly unique use case. Though I'm surprised there isn't something already in place at this point (having, of course, very little knowledge of the situation). Dims/facts/marts tend to be focused on business use cases rather than on sources/data, which can reduce the targets significantly, since business use cases tend to repeat (or rhyme).


Don’t have one team querying the data - have everyone query it. Trying to line up definitions, DQ requirements, etc. ends up internalising the proliferation of interfaces that explicit producer-consumer relationships would otherwise make visible. Therefore, don’t be scared to go point-to-point. That’s the spirit of ELT after all, but we tend to always strive for an imaginary perfect data model.


I've read stuff before about Shopify's use of Nix. Since this post doesn't mention Nix, I take it they don't use it in this department of the company?



