They definitely target large-scale companies, but you can use their SaaS offering and it can be relatively affordable. The best part is the flexibility and scaling, but the license model is awesome too: there's no usage-based billing, you just pay a flat license fee per user who writes code, plus the underlying cloud costs, and they'll deploy it on GCP, AWS, or Azure.
They're used by a lot of large companies, but also by academia to replace or augment on-prem HPC clusters. That's what we used them for as well.
It's a shame that they don't have you writing marketing copy! The docs are indeed a lot more reasonable looking (to me at least). I work for a small proprietary fund and not some Godzilla company these days so maybe I'm just not the audience, but whew, for purchasing decision makers with subject matter background, that home page would have been a back button real fast if it wasn't linked from your thoughtful comment.
I'm interested in your opinion as a user on a bit of a new conundrum for me: for as many jobs / contracts as I can remember, the data science was central enough that we were building it ourselves from like, the object store up.
But in my current role, I'm managing a whole different kind of infrastructure that pulls in very different directions and the people who need to interact with data range from full-time quants to people with very little programming experience and so I'm kinda peeking around for an all-in-one solution. Log the rows here, connect the notebook here, right this way to your comprehensive dashboards and graphs with great defaults.
Is this what I should be looking at? The code that needs to run on the data is your standard statistical and numerics Python type stuff (and if R was available it would probably get used but I don't need it): I need a dataframe of all the foo from date to date and I want to run a regression and maybe set up a little Monte Carlo thing. Hey that one is really useful, let's make it compute that every night and put it on the wall.
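To make that concrete, the shape of the thing I mean is roughly this (the `load_foo` loader is made up, it's a stand-in for whatever the platform gives you; the rest is the standard pandas/statsmodels pattern):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical loader -- stand-in for "give me a dataframe of all the foo
# between two dates" from whatever store the platform provides.
def load_foo(start: str, end: str) -> pd.DataFrame:
    dates = pd.date_range(start, end, freq="B")
    rng = np.random.default_rng(0)
    x = rng.normal(size=len(dates))
    y = 0.5 * x + rng.normal(scale=0.1, size=len(dates))
    return pd.DataFrame({"x": x, "y": y}, index=dates)

df = load_foo("2024-01-01", "2024-06-30")

# The regression bit.
model = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()
print(model.params)

# The "little Monte Carlo thing": resample the slope from its fitted
# distribution and look at the tails.
sims = np.random.default_rng(1).normal(model.params["x"], model.bse["x"], size=10_000)
print(np.percentile(sims, [5, 50, 95]))
```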
I think we'd pay a lot for an answer here and I really don't want to like, break out pyarrow and start setting up tables.
I'll just say Domino presents very much as a code-first solution. So, if you want staff to be able to make dashboards _without_ code, like with Looker Studio, then this isn't it.
The other big thing Domino isn't is a database or data warehouse. You pair it with something like BigQuery or Snowflake or just S3, and it takes a huge amount of the headache of using those things away for the staff you're describing. The best way to understand it is to just look at this page: https://docs.dominodatalab.com/en/cloud/user_guide/fa5f3a/us...
People at my work, myself included, absolutely love this feature. We have an incredibly strict and complex cloud environment, and this makes it so people can skip the setup nonsense and it will just work.
This isn't to say that you can't store data in Domino; it's just not a SQL engine. Another loved feature is their datasets. It's just EFS masquerading as NFS, but Domino handles permissions and mounting. It's great for non-SQL file storage. https://docs.dominodatalab.com/en/cloud/user_guide/6942ab/us...
So, with those constraints in mind, I'd say it's great for what you're describing. You can deploy apps or API endpoints. You can create on-demand, large-scale clusters. We have people using Spark, Ray, Dask, and MPI. You can schedule jobs, and you can interact with the whole platform programmatically.
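From a user's seat it ends up looking roughly like this (I'm writing from memory, so treat the exact package and method names as assumptions and check the linked docs; the point is that an admin configures the connection and credentials once, and users just name it):

```python
# Sketch only -- exact import paths / method names may differ by version;
# the linked Domino docs are authoritative.
from domino_data.data_sources import DataSourceClient

# "warehouse" was configured once by an admin, credentials and all;
# users never touch connection strings.
ds = DataSourceClient().get_datasource("warehouse")

# Query Snowflake/BigQuery/etc. and get a dataframe back.
df = ds.query("SELECT * FROM trades WHERE trade_date >= '2024-01-01'").to_pandas()
```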
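For the datasets, day to day it's just files at a mounted path, something like this (the path and dataset name are how it looks on our deployment, yours may differ):

```python
import pandas as pd

# Datasets appear as a plain mounted filesystem inside the workspace;
# Domino decides who gets the mount.
path = "/domino/datasets/local/research-share/prices.parquet"
df = pd.read_parquet(path)

# Writing back is just writing a file to the same mount.
df.tail(100).to_parquet("/domino/datasets/local/research-share/prices_recent.parquet")
```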
Looks like we're in a similar situation. What is your current go-to for setting up lean incremental data pipelines?
For me the core of the solution - parquet in object store at rest and arrow for IPC - hasn't changed in years, but I'm tired of re-building the whole metadata layer and job dependency graphs at every new place. Of course the building blocks get smarter with time (SlateDB, DuckDB, etc.) but it's all so tiresome.
Yeah, the last time I had to do this was about a year ago: parquet and arrow on S3-compatible object stores, a bunch of metadata in Postgres, the whole thing. At that time we used Prefect for orchestration, which was fine but IMHO not worth what it cost. I've also used Flyte seriously and dabbled with other things; nothing I can get really excited about recommending, it's all sort of fine but kinda meh. I used to work for a megacorp with extremely serious tooling around this, and everything I've tried in open source makes me miss it.
On the front end I've always had reasonable outcomes with `wandb` for tracking runs once you kinda get it all set up nicely, but it's a long tail of configuration and writing a bunch of glue code.
In this situation I'm dealing with a pretty medium amount of data and very modest model training needs (closer to `sklearn` than some mega-CUDA thing). It feels like I should be able to give someone the company card and just get one of those things with 7 programming languages at the top of the monospace text box for "here's how to log a row": we do Smart Things, now you have this awesome web dashboard, and you can give your quants this `curl foo | sh` snippet and their VSCode Jupyter will be awesome.
Just reading this as well: I neglected to mention that the Domino setup we use has Flyte built in (they call it Flows, but it's the same thing), and MLflow as well.
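The storage core really is a handful of lines, which is part of what makes re-building everything around it so grating (bucket name and endpoint below are placeholders, the S3 write is commented out for that reason):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

table = pa.table({"date": ["2024-01-02"], "foo": [1.23]})

# Parquet at rest: the same call works locally or against an S3-compatible store.
pq.write_table(table, "/tmp/part-0.parquet")
s3 = fs.S3FileSystem(endpoint_override="https://objects.example.com")
# pq.write_table(table, "my-bucket/foo/part-0.parquet", filesystem=s3)

# Arrow IPC for handing data between processes.
with pa.OSFile("/tmp/foo.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
```

It's the metadata layer, lineage, and job dependency graphs on top of this that I keep rebuilding.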
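The logging itself is never the hard part - it's basically two calls (the project name and metric names below are made up) - it's the auth, config, and glue around it that eat the time:

```python
import wandb

# Once login/config is sorted, tracking a run is roughly this.
run = wandb.init(project="nightly-foo-regression", config={"window_days": 120})
run.log({"r_squared": 0.42, "mc_p95": 1.7})
run.finish()
```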
We do discount heavily for academia: 50% off for research and 100% off (i.e. free) for teaching. But I do get that our pro products largely solve problems that folks encounter in larger enterprises, and you may not see the value inside an academic department. I'm also always happy to learn how we could do better, so please feel free to reach out to [email protected].
Thank you for the response! My key recommendation is to unbundle. At one point we were told "It got 35% more expensive, but it does so much more now: it supports Python" - we didn't ask for Python. To effectively use Connect with private packages, you need the other full-blown institutional Package Manager license, which 99% of your users will not need. Also, per-named-user pricing (rather than per active seat) is so aggressive: a user who uses a Shiny app once in a year still needs a full per-year license. I feel Posit is one of the most aggressive companies in terms of upselling, while positioning itself as this benevolent PBC / open-source institution - just go full Oracle; at least we'd know where things stand.
Positron looks like the next version of Rstudio, which is currently free. Do you think the plan is to phase out support for the free product and push users into the paid one?
Positron inherits many ideas from RStudio, but is a separate project with an intentionally different set of tradeoffs; it gains multi-language/multi-session support, better configuration/extensibility, etc. but at the expense of RStudio's simplicity and support for many R-only workflows.
We're still investing in RStudio and while the products have some overlap there's no attempt to convert people from one to the other.
I am talking about RStudio Server and Connect - these are really expensive. One of the sales reps claimed that it is so expensive because they are a PBC and support open-source development. As in, if they were just for-profit it would be cheaper, but we should feel good about paying more. I could not take it.
As an admin and advocate for Posit Teams, Connect and Server filled a niche where a single admin could spin up infra and allow for anything deployable by end users without having to worry about scaling.
It paid for itself in terms of scientists spinning up their own projects without having to provision server hardware, VMs, or anything else.
If you take advantage of all the features in Teams, perhaps. But we needed a tiny bit from Workbench, a little bit from Connect, and a sliver from Package Manager - and Teams ended up eating a huge portion of our IT bill, effectively stunting our efficiency as a research organization. Over the years, while our use case did not change, our Posit bill more than doubled.