I've been through the end stage of this (I worked on data validation for a good chunk of my career), and these are my thoughts on the article:
Determining blocking vs. non-blocking is a big issue. Deciding which checks should be stoppers and which shouldn’t is often a matter of extensive debate. In my experience, only a few data checks are absolute show stoppers under any circumstance; many more need to spawn tickets that get routed to the correct team and followed up on. Some type of tracking system is necessary for this.
Defining the logic of the checks themselves in YAML is a trap. We went down this DSL route first, and it falls apart once you want to add even moderately complex logic to a check. Airbnb will almost certainly discover this eventually. YAML does work well for specifying how the check should behave, though (e.g. the metadata of the data check). The solution we were eventually able to scale up with was coupling a specification, kept in a human-readable but parseable file, with code in a single unit known as the check. Checks could then be grouped according to various pipeline use cases.
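To make the "spec plus code in one unit" idea concrete, here is a rough Python sketch. It is not what we actually ran and has nothing to do with Wall's internals. Every name in it (`CheckSpec`, `Check`, `spec_from_mapping`, the `severity`/`owner`/`groups` fields, the refund rule) is made up for illustration, and the plain dict stands in for whatever you would get from parsing the YAML file.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CheckSpec:
    """Metadata only: the part that fits comfortably in a YAML file."""
    name: str
    severity: str   # hypothetical values: "blocking" or "non_blocking"
    owner: str      # hypothetical: team that tickets get routed to
    groups: tuple   # pipeline use cases this check belongs to


@dataclass(frozen=True)
class Check:
    """The unit: a specification coupled with ordinary code."""
    spec: CheckSpec
    logic: Callable[[list], bool]  # returns True when the data passes


def spec_from_mapping(raw: dict) -> CheckSpec:
    # `raw` stands in for the parsed contents of the spec file.
    return CheckSpec(
        name=raw["name"],
        severity=raw["severity"],
        owner=raw["owner"],
        groups=tuple(raw.get("groups", ())),
    )


def refunds_not_exceeding_payments(rows: list) -> bool:
    # Cross-row logic like this is awkward to express in a YAML DSL.
    paid, refunded = {}, {}
    for row in rows:
        bucket = refunded if row["kind"] == "refund" else paid
        bucket[row["order_id"]] = bucket.get(row["order_id"], 0) + row["amount"]
    return all(refunded[o] <= paid.get(o, 0) for o in refunded)


RAW_SPEC = {
    "name": "refunds_not_exceeding_payments",
    "severity": "non_blocking",
    "owner": "payments-data",
    "groups": ["nightly_finance"],
}

CHECK = Check(spec=spec_from_mapping(RAW_SPEC), logic=refunds_not_exceeding_payments)


def checks_for_group(checks: list, group: str) -> list:
    # Grouping by pipeline use case is just a filter over the metadata.
    return [c for c in checks if group in c.spec.groups]
```

The point of the split is that the metadata stays declarative and easy to review, while the logic is free to be as complicated as the data requires.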
A model that plugs into an Airflow DAG, as Airbnb has designed, seems like a good approach. When it was time to incorporate checks into a pipeline, we often had heterogeneous strategies for invoking our check engines. Having a standardized approach helps drive adoption across the organization. I’ve often found that people are reluctant to run non-critical checks if doing so costs significant time and effort, and will run only the critical ones to try to push data quality accountability either upstream or downstream. If checks are really easy to turn on and incorporate, that’s one less excuse for not running them.
You seem to know what you're talking about. Ignorant question: do you think Dagster would work better as an orchestration/validation tool than Airbnb's Wall?
I don’t know much about Dagster, but it does not look like they have a validation tool equivalent to Wall, which requires Airflow. So you would not get validation with Dagster unless you brought it yourself.
For blocking checks, I personally use a notion of errors and warnings: errors always go to quarantine and are not propagated to good data, while warnings go to both good data and quarantine. It’s a trade-off between not blocking all data and having visibility into what is potentially bad. Another approach is to send everything into quarantine, then give users a tool for rescuing their data and tune the checks further to keep this from happening.
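Roughly, the error/warning routing looks like the Python sketch below. This is a simplified illustration, not production code: `Rule`, `route`, and the two example rules are hypothetical names, and it assumes row-level checks where an error keeps the row out of good data entirely.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    severity: str                      # "error" or "warning"
    predicate: Callable[[dict], bool]  # True when the row passes


def route(rows: list, rules: list) -> tuple:
    """Split rows into (good, quarantine).

    Error   -> quarantine only.
    Warning -> good data and quarantine, so the row stays visible.
    Quarantine entries carry the names of the failed rules.
    """
    good, quarantine = [], []
    for row in rows:
        failed = [r for r in rules if not r.predicate(row)]
        if any(r.severity == "error" for r in failed):
            quarantine.append((row, [r.name for r in failed]))
        elif failed:
            good.append(row)
            quarantine.append((row, [r.name for r in failed]))
        else:
            good.append(row)
    return good, quarantine


RULES = [
    Rule("has_id", "error", lambda r: r.get("id") is not None),
    Rule("amount_non_negative", "warning", lambda r: r.get("amount", 0) >= 0),
]

ROWS = [
    {"id": 1, "amount": 10},     # clean
    {"id": None, "amount": 5},   # error: withheld from good data
    {"id": 3, "amount": -2},     # warning: passed through but flagged
]

GOOD, QUARANTINE = route(ROWS, RULES)
```

Keeping the failed rule names next to each quarantined row is what makes a later "rescue" step practical, since users can see why a row was held back.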