I feel like Groovy pushes you towards the worst of both worlds between an internal DSL and an external DSL. It's an internal DSL, so you get the language, but oh man, Groovy sucks.
I don't know. I started down this route and then quickly switched to only dipping into nf-core when they had actual prior art.
The interplay of NF and Groovy (how I wish they hadn't used Groovy!) can be mind-bending, but if you're writing your own thing you have a different optimization model than nf-core, which is trying to be one-size-fits-all.
The big difference when comparing bioinformatics workflow systems with non-bioinformatics ones is what the typical payload of a DAG node is and what optimizations that implies. Most other domains don't have DAG nodes that assume the payload is a crappy command-line call and expect inputs/outputs to magically be in specific places on a POSIX file system.
You can do this on other systems but it’s nice to have the headache abstracted away for you.
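To make that concrete, here's a minimal Python sketch of the bookkeeping you'd end up writing yourself on a generic orchestrator. The function name, file names, and layout are all made up; it's only meant to show the staging/collection headache that the bioinformatics engines hide from you.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def run_cli_step(tool_cmd, input_file, output_name):
    """Run one command-line 'node' by hand: stage the input into a scratch
    directory, call the tool there, and copy the expected output back out."""
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)
        staged = work / Path(input_file).name
        shutil.copy(input_file, staged)            # put the input where the tool expects it
        subprocess.run(tool_cmd + [staged.name], cwd=work, check=True)
        produced = work / output_name              # hope the tool actually wrote this file
        results = Path("results")
        results.mkdir(exist_ok=True)
        return shutil.copy(produced, results / output_name)
```

Multiply that by caching, retries, clusters, and containers, and you can see why having the engine do it for you is appealing.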
The other major difference is the assumption about lifecycle. In most business domains you don't have researchers iterating on these things the way you do in bioinformatics. The newer ML/DS systems do solve this problem better than, say, Airflow.
I for one have started to appreciate the fact that the shell/commandline interface means:
- We have an interface that very strongly imposes composability, something rarely seen in other parts of IT, and that actually makes people "follow the rules" :D
- Data is (mostly) treated as immutable, except perhaps inside tools
- Data is cached
- The CLI boundaries mean that you can at least inspect inputs/outputs as a way to debug.
- Etc...
Personally, the biggest frustration is all the inconsistencies in how people design the commandline interfaces. Primarily that output filenames are so often created based on non-obvious and sometimes arbitrary rules, rather than being specified by the user. If all filenames were specified (or at least possible to specify) via the CLI, pipeline managers would have such an enormously easier time.
What happens now is that you basically need a mechanism like Nextflow has, where all commands are executed in a temp directory and the pipeline tool just globs up all the generated files afterwards. This works, but opens a lot of possibilities for mistakes in how files are tracked (a file might be routed to the wrong downstream output if you do something funny with the naming, such that two output path patterns overlap).
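As a rough illustration (not Nextflow's actual code; the command and patterns are invented), the run-in-a-scratch-dir-then-glob approach and its failure mode look something like this:

```python
import glob
import os
import subprocess
import tempfile

def run_and_collect(cmd, output_patterns):
    """Run the command in a fresh scratch dir, then claim outputs by glob.
    The first pattern to match a file 'wins', so overlapping patterns
    (e.g. '*.txt' vs 'report*.txt') can route a file to the wrong place."""
    scratch = tempfile.mkdtemp()
    subprocess.run(cmd, cwd=scratch, check=True)
    claimed, collected = set(), {p: [] for p in output_patterns}
    for pattern in output_patterns:
        for path in sorted(glob.glob(os.path.join(scratch, pattern))):
            if path not in claimed:
                claimed.add(path)
                collected[pattern].append(path)
    return collected
```

If every tool just let you say where the output should go, none of this pattern guessing would be needed.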
Nextflow can't even get this right: base Nextflow uses some combination of `--paramName` and `--param-name` and treats them as interchangeable, while nf-core encourages `--param_name` (which Nextflow sees as different). All trivial differences, but they just add another layer to the CLI frustration train.
As the GP referenced CWL: while NF appeared first in the bioinformatics world, Nextflow, CWL, Snakemake, and WDL all erupted close enough to each other to be more or less on equal footing. The people involved were aware of each other, but the projects were all so nascent that it wasn't clear whether it was worth joining forces or not. At the end of the day these all came from groups trying to scratch particular itches, and not everyone agreed on the right way to scratch.
However, all of them were rejections of prior models, as well as of the workflow solutions prominent in the business space.
They try to address similar problems, but comparing Snakemake and Nextflow doesn't do either tool a favour. They use different computation models: Nextflow is based on dataflow programming and therefore schedules processes dynamically as new data comes in, while Snakemake is pull-based and schedules processes from the DAG defined by the dependencies. Anyhow, they are both great tools.
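To make the dataflow half of that contrast concrete, here's a toy Python sketch (nothing to do with Nextflow's real implementation): each stage fires once per item, as soon as the item shows up on its input channel, rather than being scheduled from a pre-computed DAG.

```python
import queue
import threading

def dataflow_stage(name, in_channel, out_channel, fn):
    """Push-based scheduling in miniature: the stage runs once per item,
    as soon as that item arrives, without knowing the whole DAG up front."""
    def worker():
        while True:
            item = in_channel.get()
            if item is None:                 # sentinel: upstream is done
                out_channel.put(None)
                return
            out_channel.put(fn(item))
    threading.Thread(target=worker, daemon=True).start()

# wire two stages together; samples flow through as they are produced
raw, trimmed, aligned = queue.Queue(), queue.Queue(), queue.Queue()
dataflow_stage("trim", raw, trimmed, lambda s: s + ".trimmed")
dataflow_stage("align", trimmed, aligned, lambda s: s + ".bam")

for sample in ["sampleA", "sampleB"]:
    raw.put(sample)
raw.put(None)

while (result := aligned.get()) is not None:
    print(result)                            # e.g. sampleA.trimmed.bam
```

A pull-based tool like Snakemake instead starts from the files you asked for and works backwards through the rules that can produce them.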
While these two are aimed at bioinformatics, they are general-purpose enough that you can apply them to any computational workflow. I can say they saved my PhD.
Yeah, the thing I find disappointing is that there is a lot of scientific value locked into the different systems for describing a workflow, pipeline or DAG. Like you said, they all had different itches to scratch, and even some barebones "standards" like CSV have flavors/extensions/etc.
> are arguably better for people who have a stronger software engineering basis
As someone who is a software developer in the bioinformatics space (as opposed to the other way around), and who has spent over 10 years deep in the weeds of both the bioinformatics workflow engines and more standard ones like Airflow, I would still reach for a bioinformatics engine for that domain.
But what I find most exciting is a newer class of workflow tools coming out that appear to bridge the gap, e.g. Dagster. From observation it seems like a case of parallel evolution coming out of the ML/DS world, where the research side of the house has similar needs. Either way, I could see this space pulling eyeballs away from the traditional bioinformatics workflow world.
A really hard aspect of this is that there's a massive impedance mismatch between the research and production sides of things. Working on the research side is pretty straightforward, although software development practices are going to be a lot looser and faster. Working in a production environment is also straightforward; it's like any other software job. But working at the confluence of those two worlds is incredibly difficult.
FWIW, I acknowledged the good people from the Cromwell team directly in a recent presentation, due to the incredible support/help somebody & their team provided my team. The WDL/Cromwell community has grown, and I've heard people mention it everywhere now (far away from the Broad), in no small part due to that team and its former leadership.
Hey, that's my project! (And geoffjentry is my former boss.)
Nice to hear the praise, thank you. The project has changed a lot over time and inevitably left some disappointed people filing GitHub issues (CWL, non-cloud backends, etc.).
It's really unique and enjoyable working on OSS that has a strong community, it is the highlight of my career.
> Good luck trying to use a functional-first language, aside from maybe Scala
While they've moved away from it in the last few years, the Broad Institute had a huge investment in Scala. It's been in use there since at least 2010 and I believe longer. The primary software department was almost entirely Scala based for several years. That same department had pockets of Clojure as well.
It's a mindset shift to a more declarative model. The idea has also popped up in other niche orchestrators.
This is an oversimplification, but IMO the easiest way of picturing it is this: instead of defining your graph as a forward-moving thing, with the orchestrator telling nodes when they can run, you shift to defining your graph nodes so they know their own dependencies and let the orchestrator know when they're runnable.
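A toy sketch of that pull-based idea in Python (the rule names are made up and this isn't tied to any particular orchestrator): each node only declares what it needs, and the scheduler works backwards from whatever target you ask for.

```python
# Each node declares its dependencies; the orchestrator resolves them on demand.
rules = {
    "aligned.bam":   {"deps": ["reads.trimmed"], "run": lambda: print("align")},
    "reads.trimmed": {"deps": ["reads.raw"],     "run": lambda: print("trim")},
    "reads.raw":     {"deps": [],                "run": lambda: print("fetch")},
}

def build(target, done=None):
    """Satisfy a target's dependencies first, then run the target itself."""
    done = set() if done is None else done
    if target in done:
        return
    for dep in rules[target]["deps"]:
        build(dep, done)
    rules[target]["run"]()
    done.add(target)

build("aligned.bam")   # prints: fetch, trim, align
```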
Not only that, but it wasn't even the largest issue.
People point at pimentoloaf.com or whatever and laugh. But when those companies went under, they took real dollars away from "real" B2B companies. And then when those companies went under, "real" companies that depended on them went under. And so on.