
Polar Sky | Founding AI Lead | Bay Area/Seattle | Hybrid/Onsite | Full-time

We're a well-funded, pre-seed cybersecurity startup focused on data security. I'm looking for a founding AI lead with experience in fine-tuning LLMs (expertise around RL + reasoning models a big plus). This person would own the full AI stack from data to training to eval to test-time compute.

Who's a good fit:

* If you've always thought about starting a company but for whatever reason (funding, life, idea) haven't, this is a great opportunity to be part of the founding team. We're 2 people right now.

* You enjoy understanding customer problems and their use cases, and then figuring out the best solution (sometimes technical, sometimes not).

* You want to help figure out what a company looks like in this AI era.

* You enjoy teaching and sharing knowledge.

Questions or interest? Just email [email protected].


Seems that OpenAI is acquiring io for $6.4B in an all-equity deal.


This is really interesting. For SOTA inference systems, I've seen two general approaches:

* The "stack-centric" approach such as vLLM production stack, AIBrix, etc. These set up an entire inference stack for you including KV cache, routing, etc.

* The "pipeline-centric" approach such as NVidia Dynamo, Ray, BentoML. These give you more of an SDK so you can define inference pipelines that you can then deploy on your specific hardware.

It seems like llm-d is the former. Is that right? What prompted you to go in that direction rather than Dynamo's?


It sounds like you might be confusing different parts of the stack. NVIDIA Dynamo, for example, supports vLLM as the inference engine. I think you should think of something like vLLM as more akin to Gunicorn, and llm-d as an application load balancer. And I guess something like NVIDIA Dynamo would be like Django.


llm-d is intended to be three clean layers:

1. Balance / schedule incoming requests to the right backend

2. Model server replicas that can run on multiple hardware topologies

3. Prefix caching hierarchy with well-tested variants for different use cases

So it's a 3-tier architecture. The biggest difference from Dynamo is that llm-d uses the inference gateway extension - https://github.com/kubernetes-sigs/gateway-api-inference-ext... - which brings Kubernetes-owned APIs for managing model routing, request priority and flow control, LoRA support, etc.
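
To make layer 1 a bit more concrete, here's a toy sketch (illustrative only, not actual llm-d code; the block size, scoring, and names are all made up) of what prefix-cache-aware scheduling can look like: prefer the replica that already has the longest matching prefix cached, then fall back to the least-loaded one.

    # Toy prefix-cache-aware scheduler in the spirit of layer 1 above.
    # Everything here is hypothetical, not llm-d's implementation.
    from dataclasses import dataclass, field

    BLOCK = 16  # tokens per cache block (made-up granularity)

    @dataclass
    class Replica:
        name: str
        load: int = 0                               # in-flight requests
        cached: set = field(default_factory=set)    # hashes of cached prefix blocks

    def prefix_blocks(tokens):
        """Hash each BLOCK-sized prefix so cache overlap can be compared cheaply."""
        return {hash(tuple(tokens[:i + BLOCK])) for i in range(0, len(tokens), BLOCK)}

    def schedule(tokens, replicas):
        """Pick the replica with the most reusable prefix cache, then least load."""
        blocks = prefix_blocks(tokens)
        best = max(replicas, key=lambda r: (len(blocks & r.cached), -r.load))
        best.load += 1
        best.cached |= blocks   # after serving, this prefix is warm on that replica
        return best

    replicas = [Replica("pod-a"), Replica("pod-b")]
    prompt = list(range(100))                # stand-in for token ids
    print(schedule(prompt, replicas).name)   # pod-a (cold caches, equal load)
    print(schedule(prompt, replicas).name)   # pod-a again: its prefix cache is now warm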


I would think that the NVIDIA Dynamo SDK (pipelines) is a big difference as well (https://github.com/ai-dynamo/dynamo/tree/main/deploy/sdk/doc...), or am I missing something?


That's a good example - I can at least explain why it's a difference: a different target user.

As I understand it, the Dynamo SDK is about simplifying things and helping someone get started with Dynamo on Kubernetes.

For the user set we work with (large inference deployers), that is not a high priority - they already have mature deployment opinions or a set of tools that would not compose well with something like the Dynamo SDK. Their comfort level with Kubernetes is moderate to high - either they use Kubernetes for high-scale training and batch, or they are deploying to many different providers in order to get enough capacity and need a standard orchestration solution.

llm-d focuses on achieving efficiency dynamically at runtime, based on changing traffic or workload on Kubernetes - some of the things the Dynamo SDK encodes are static and upfront and would conflict with that objective. Also, large serving deployers typically have significant batch and training workloads as well, and they are looking to maximize capacity use without impacting their prod serving. That requires the orchestrator to know about both workloads at some level - which the Dynamo SDK would make more difficult.


In this analogy, Dynamo is most definitely not like Django. It includes inference-aware routing, KV caching, etc. -- all the stuff you would need to run a modern SOTA inference stack.


You're right, I was confusing TensorRT with Dynamo. It looks like the relationship between Dynamo and vLLM is actually the opposite of what I was thinking -- Dynamo can use vLLM as a backend rather than vice versa.


The blog post was a little unclear, so here's my summary:

- They used QwQ to generate training data (with some cleanup using GPT-4o-mini)

- The training data was then used to fine-tune Qwen2.5-32B-Instruct (a non-reasoning model)

- The result was that Sky-T1 performs slightly worse than QwQ but much better than Qwen2.5 on reasoning tasks

There are a few dismissive comments here, but I actually think this is pretty interesting, as it shows how you can fine-tune a foundation model to get better at reasoning.
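
For anyone curious, here's roughly what that recipe looks like as code. This is my own sketch of the pipeline as I read the post -- the endpoints, model names, prompts, and the filtering step are placeholders, not their actual code.

    # Sketch of the distill-then-SFT recipe described above (placeholders throughout).
    import json
    from openai import OpenAI

    teacher = OpenAI(base_url="http://localhost:8000/v1", api_key="x")  # e.g. vLLM serving QwQ
    cleaner = OpenAI()                                                  # OpenAI API for GPT-4o-mini

    def generate_trace(problem: str) -> str:
        """Have the reasoning model (QwQ) produce a long chain-of-thought + answer."""
        r = teacher.chat.completions.create(
            model="Qwen/QwQ-32B-Preview",
            messages=[{"role": "user", "content": problem}],
        )
        return r.choices[0].message.content

    def clean_trace(trace: str) -> str:
        """Use GPT-4o-mini to rewrite the trace into a consistent, parseable format."""
        r = cleaner.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": "Rewrite this solution in a clean, well-structured format:\n" + trace}],
        )
        return r.choices[0].message.content

    problems = ["If x + 3 = 7, what is x^2?"]  # in practice: math/coding problem sets
    with open("sft_data.jsonl", "w") as f:
        for p in problems:
            trace = clean_trace(generate_trace(p))
            # (IIRC the post also rejects samples whose final answer is wrong)
            f.write(json.dumps({"messages": [
                {"role": "user", "content": p},
                {"role": "assistant", "content": trace},
            ]}) + "\n")
    # The resulting JSONL then feeds standard SFT of Qwen2.5-32B-Instruct.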


I wish they had compared it to the R1 distills of Qwen2.5.


I took a brief look (~5 minutes). My $0.02 is that it's not clear what problem you're trying to solve. I get what some of the features do (e.g., templated prompts), but it would be very helpful to have an example of how you actually use magentic versus the non-magentic way. It feels like a lot of syntactic sugar, if I'm being honest (not a bad thing, but something you might want to be clear about, if that's the case).


(author here) I didn't put this in my post, but one of my favorite moments was when I read some of the LlamaIndex source code which pointed to the GitHub commit where they copied the code verbatim from LangChain. (LangChain is MIT-licensed, so it's OK, but I still thought it was funny!)


Not a bad move by Red Hat. Red Hat lost the battle for the cloud to Azure, AWS, and Google, but AI is still a nascent space. vLLM's deployment model fits neatly into Red Hat's traditional on-premise, support-centric business model.


I'm working on something like this! It's simple in concept, but there are lots of fiddly bits. A big one is performance (at least, without spending $$$$$ on GPUs). I haven't found much on how to tune/deploy LLMs on commodity cloud hardware, which is what I'm trying this out on.


You can use ONNX versions of embedding models. Those run faster on CPU.

Also, don't discount plain old BM25 and fastText. For many queries, keyword or bag-of-words-based search works just as well as fancy 1536-dim vectors.

You can also do things like tokenize your text using the tokenizer that GPT-4 uses (via tiktoken for instance) and then index those tokens instead of words in BM25.
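
Something like this (a minimal sketch with a toy corpus; rank_bm25 is just one convenient BM25 implementation, not the only option):

    # Index GPT-4's BPE tokens with BM25 instead of whitespace-split words.
    import tiktoken
    from rank_bm25 import BM25Okapi  # pip install rank_bm25

    enc = tiktoken.encoding_for_model("gpt-4")

    docs = [
        "Postgres full text search tips",
        "Tuning BM25 parameters for short documents",
        "Running embedding models on CPU with ONNX",
    ]
    tokenized_docs = [enc.encode(d.lower()) for d in docs]  # lowercase since BPE is case-sensitive
    bm25 = BM25Okapi(tokenized_docs)

    query = enc.encode("bm25 tuning".lower())
    print(bm25.get_top_n(query, docs, n=2))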


Thanks! I should have been clearer -- embeddings are pretty fast (relatively) -- it's inference that's slow (I'm at 5 tokens/second on AKS).


Could you sidestep inference altogether? Just return the top N results by cosine similarity (or full text search) and let the user find what they need?

https://ollama.com models also work really well on most modern hardware
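
A sketch of the no-generation option: embed once, then just return the top-N results by cosine similarity. The model name here is just an example of a small CPU-friendly embedding model.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = [
        "Why prefix caching speeds up LLM inference",
        "A guide to BM25 ranking",
        "Deploying vLLM on Kubernetes",
    ]
    doc_vecs = model.encode(docs, normalize_embeddings=True)

    def top_n(query: str, n: int = 2):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q   # cosine similarity, since vectors are unit-normalized
        return [docs[i] for i in np.argsort(-scores)[:n]]

    print(top_n("how do I make inference faster?"))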


I'm running Ollama, but it's still slow there (it's actually quite fast on my M2). My working theory is that on standard cloud VMs, memory <-> CPU bandwidth is the issue. I'm looking into vLLM.

And as for sidestepping inference, I can totally do that. But I think it's so much better to be able to ask the LLM a question, run a vector similarity search to pull relevant content, and then have the LLM summarize it all in a way that answers my question.
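
Roughly this flow, in other words. This is just a sketch of what I mean; the model names and the tiny in-memory "index" are placeholders, and it assumes those models are already pulled in Ollama.

    import numpy as np
    import ollama

    docs = [
        "HN thread: memory bandwidth limits tokens/s on small cloud VMs.",
        "Blog: quantized models trade a little quality for much faster CPU inference.",
    ]
    doc_vecs = [ollama.embeddings(model="nomic-embed-text", prompt=d)["embedding"] for d in docs]

    def answer(question: str) -> str:
        q = np.array(ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"])
        scores = [np.dot(q, np.array(v)) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in doc_vecs]
        context = docs[int(np.argmax(scores))]  # top-1 for brevity; use top-k in practice
        r = ollama.chat(model="llama3.1", messages=[{
            "role": "user",
            "content": f"Using this context:\n{context}\n\nAnswer: {question}",
        }])
        return r["message"]["content"]

    print(answer("Why is my cloud VM so slow at inference?"))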


Oh yeah! What I meant was having Ollama run on the user's machine. Might not work for the use case you're trying to build for, though :)


This style of embeddings could be quite lightweight/cheap/efficient: https://github.com/cohere-ai/BinaryVectorDB


Embedding models are generally lightweight enough to run on CPU, and embedding can be done in the background while the user isn't using their device.


This is cool! I've been trying out bits & pieces of the RAG ecosystem, too, exploring this space.

Here's a question for this crowd: Do we see domain/personalized RAG as the future of search? In other words, instead of Google, you go to your own personal LLM, which has indexed all of the content you care about (whether it's everything from HN, or an especially informative blog post, or ...)? I personally think this would be great. I would still use Google for general-purpose search, but a lot of my searches are attempts to remember that really interesting article someone posted to HN a year ago that's germane to what I'm doing now.


I definitely think there are opportunities to provide more useful & personalized search than what Google offers for at least some queries.

Quality aside, I think the primary challenge is figuring out the right UX for delivering that at scale. One of the really great advantages of Google is that it is right there in your URL bar, and that for many of the searches you might do, it works just fine. Figuring out when it doesn't, and how to provide a better result then, seems like a big unsolved UX component of personalized search.

