Hacker News | bothra90's comments

Is this competing with Nessie (https://projectnessie.org/)?


That would be awesome!


Exciting work! Some questions:

1. Any plans to open-source this?

2. Why not have a tiered architecture that can provide lower latency? A p99 of 1s can be too high for some use cases.

3. Related to 2, how does WarpStream compare to tiered storage in Pulsar?

[Edit 1] Added (3)


Yeah, I'm really kind of surprised the free tier isn't the self-hosted kind. That will keep me looking elsewhere.


(WarpStream Cofounder)

FWIW, we're considering a version where you can host the metadata yourself for enterprise users. For the free tier, though, we didn't think it made sense: for a workload small enough to fit into our free tier, it didn't seem like anyone would want to be responsible for the metadata layer themselves. Would love your feedback on that.


I'm also quite surprised there's no option for self-hosting metadata on the free tier. To be fair, in my experience, running a managed metadata server for Prefect (which has a similar Bring Your Own Server model) is quite hard to get right. But in the end we decided to keep maintaining it because our company prefers having all the data on our servers (including metadata).

But I'm still excited to try this, since at least now I can play around with (and learn from) a partially Kafka-compatible system, without the burden (and cost) of maintaining all of Kafka's parts. Thanks!


Can you share some insight on how/if Kagi is different from Neeva[1] in its approach?

[1] https://neeva.com/


By an independent board, do you mean that the H-1B holder cannot themselves be on the company's board? Re: 50% ownership, is a 50-50 split OK, or does it have to be less than 50%?


Is this solving similar problems to Ray [1]?

[1] https://www.ray.io/


Hey, I am the author of Fugue.

Fugue is a higher-level abstraction than Ray. It provides unified, non-invasive interfaces for using Spark, Dask, and Pandas. Ray/Modin is also on our roadmap.

It provides both a Python interface (not pandas-like) and Fugue SQL (standard SQL plus extra features). Users can choose whichever they are most comfortable with as the semantic layer for distributed computing; the two are equivalent.

With Fugue, most of your logic lives in simple Python/SQL that is framework- and scale-agnostic. From the mindset to the code, Fugue minimizes your dependency on any specific computing framework, including Fugue itself.
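
To give a quick flavor of the Python interface (a minimal sketch; the column names and the "doubled" field are made up for illustration), the same plain function can run on Pandas, Spark, or Dask just by switching the engine argument of transform:

    import pandas as pd
    from fugue import transform

    # Plain, framework-agnostic Python logic
    def add_double(df: pd.DataFrame) -> pd.DataFrame:
        df["doubled"] = df["value"] * 2
        return df

    df = pd.DataFrame({"value": [1, 2, 3]})

    # Runs on Pandas here; pass engine="spark" or engine="dask"
    # to scale the same function without changing it
    result = transform(df, add_double, schema="*,doubled:int")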

Please let me know if you want to learn more; our Slack is in the README of the Fugue repo.

Fugue repo: https://github.com/fugue-project/fugue
Tutorials: https://fugue-project.github.io/tutorials/


What kind of parser does FugueSQL use? Does it use Apache Calcite?


No, we use ANTLR; we have no dependency on Java.


no


Well, sort of. Fugue overall is a scaling engine like Ray. The specific link, to yet another SQL access layer over a dataset, doesn't really have an analog in Ray, but it has some nice features.

I love these SQL layers, but they can obfuscate how they implement their transforms. So they can speed up writing filters and joins... until something breaks, and then you have to go atomic anyway.


Fugue is a translation layer from SQL to an underlying runtime: Pandas, Dask, or Spark.

Each of the runtimes supported by Fugue can be compared to Ray, but Fugue is a tool of a different kind.
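
For example, a FugueSQL query (a rough sketch; the "data.parquet" file and column names are hypothetical) is compiled down to whichever engine you pass to run():

    from fugue_sql import fsql

    query = """
    df = LOAD "data.parquet"
    SELECT col, COUNT(*) AS cnt FROM df GROUP BY col
    PRINT
    """

    # The same query can run on Pandas (the default), "spark", or "dask"
    fsql(query).run("spark")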


That is very true. Thank you.

Fugue SQL is one way, and it also has a functional API. Both can be translated into the underlying runtime; you can choose based on your preference and actual needs.


Much better worded than my post above. Yup to this.


How does this compare with https://github.com/linkedin/greykite?


I would say that, compared to Greykite, Darts really attempts to unify a wide variety of forecasting models under a common, simple, and user-friendly API. There are many differences, but for instance, AFAIK there's no deep learning model in Greykite (it focuses on two algorithms: its built-in algorithm and Prophet), whereas Darts tries to lower the barrier to using deep learning models for forecasting. Crucially for ML-based models, this also means being able to train on many (possibly thousands or more) possibly multi-dimensional time series.
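
As a rough sketch of what that API looks like (the "sales.csv" file and column names here are made up), every Darts model exposes the same fit/predict interface, so a classical model can be swapped for a deep learning one:

    import pandas as pd
    from darts import TimeSeries
    from darts.models import ExponentialSmoothing

    # Build a TimeSeries from a pandas DataFrame
    df = pd.read_csv("sales.csv")
    series = TimeSeries.from_dataframe(df, time_col="date", value_cols="y")

    # Any Darts model exposes the same interface; ExponentialSmoothing
    # could be swapped for a deep learning model such as NBEATSModel
    train, val = series[:-36], series[-36:]
    model = ExponentialSmoothing()
    model.fit(train)
    forecast = model.predict(len(val))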


I really wish more design docs / project reports looked like this article: talking about the whys, goals / non-goals, options considered, prototypes built, challenges and solutions, and learnings and next steps. Kudos!


Although GitHub supports the commit-based review the author is arguing for (which the author also discovered after writing the article), I don't think it completely solves the problem: we need to be able to review, test, and merge each commit individually without needing to merge the entire PR. Since GitHub doesn't have first-class support for dependent PRs, we are still "stuck" with a broken workflow. An example workaround: https://wchargin.github.io/posts/managing-dependent-pull-req.... Are there any better solutions?


