Nice post! Can you share a bit more about the variety of tasks you've used agents for? "Agents" can mean so many different things depending on who you're talking to. A lot of the examples seem like read-only/analysis tasks. Did you also work on tasks where the agent took actions and changed state? If so, did you find any differences in the patterns that worked for those agents?
Sure! So there are both read-only and write agents that I'm working on. Basically there's a main agent (the main LLM) that is responsible for the overall flow (currently testing GPT-5 Mini for this), and then there are the sub-agents, like I mentioned, that are defined as tools.
Hopefully this isn't against the terms here, but I posted a screenshot of how I'm trying to build this into the changelog editor, to let users basically go:
1. What tickets did we recently close?
2. Nice, write a changelog entry for that.
3. Add me as author, tags, and title.
4. Schedule this changelog for Monday morning.
Of course, this sounds very trivial on the surface, but it gets more complex when you think about how to do find-and-replace in the text, how to fetch tickets and analyze them, how to write the changelog entry, etc. The rough shape is sketched below.
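As a sketch only (hypothetical names, not my actual code; call_llm stands in for the real GPT-5 Mini call, and the sub-agent bodies are elided):

    # Sketch of the main-agent/sub-agent split: each sub-agent is just a
    # function the main LLM can invoke as a tool.

    def ticket_agent(query: str) -> str:
        """Read-only sub-agent: fetch and summarize recently closed tickets."""
        ...

    def changelog_agent(instruction: str, draft: str) -> str:
        """Write agent: edit the draft (find/replace, author/tags/title, scheduling)."""
        ...

    TOOLS = {"fetch_tickets": ticket_agent, "edit_changelog": changelog_agent}

    def run_main_agent(user_message: str) -> str:
        history = [{"role": "user", "content": user_message}]
        while True:
            reply = call_llm(history, tools=TOOLS)  # hypothetical helper around the LLM API
            if reply.tool_call is None:
                return reply.content  # no tool requested: final answer
            result = TOOLS[reply.tool_call.name](**reply.tool_call.args)
            history.append({"role": "tool", "content": result})

The main agent only ever sees tool names and string results; each sub-agent can itself be an LLM call with its own prompt.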
They have to be competitive. TPUs are wildly ahead of the (non-Nvidia) pack, and even they aren't particularly competitive. Twelve years of ecosystem development by the most advanced AI company on the planet, and your (ex-Google!) researchers will still pelt you with tomatoes if you tell them you're swapping out their H100 cluster for TPUs. JAX remains niche (not saying bad) and extremely hard to use efficiently without Google's help (no CUDA equivalent for going off the beaten path).
I suspect the closed nature of the ecosystem will preclude them from winning as much as they could.
It's the de facto tool for our industry. In the vast majority of cases, users bear the burden of that complexity without gaining much benefit. And (at least for me) it doesn't guarantee the one thing I need it to do: make sure I can never lose progress.
This is just blatantly false. The 10% number is ridiculous, as anyone involved with foreign aid knows. And you can easily tell that the countries want the money from the cases where the US threatens to take away aid over some disagreement and the foreign country capitulates. You know these are sovereign nations that can say no to the aid if they don't want it, right? You don't just show up without a visa and hand out money without the approval of the foreign government.
You can ignore it; the commenter clearly has no idea what they are talking about. PTX is literally the instruction set that CUDA, Vulkan, and OpenGL compile to on Nvidia cards in the end. It's assembly for GPUs, and it's infinitely harder to work with. Go to an average technical university and you'll probably find quite a few people who can write CUDA (or OpenGL or Vulkan, for that matter). But it would be very surprising if you could find even a single person who can comfortably write PTX.
"Compile to" isn't exactly the correct phrase either.
PTX is not the IL used by Nvidia's drivers, but it does compile fairly directly to it, with less slop involved. If you had said "PTX's instructions are analogous to writing assembly for CPUs or any other GPU (a la Clang's AMDGPU target)", that would probably have been a better way to put it.
Arguably, PTX is closer to being the SPIR-V part of their stack (more than just an assembler, but similar in concept). None of Nvidia's tools ever really line up with clean analogies to the outside world; call it the curse of Nvidia's NIH syndrome.
Generally, you're not going to be writing all of your code in PTX, but I find it wild that you think people at "an average technical university" would be unable to use it for the parts where they need it. That says more about you than it does about them.
All of Nvidia's docs for this are online, it isn't that hard. Have you tried?
>PTX's instructions are analogous to writing assembly for CPUs
How else would you have understood it? At this level it's literally just pedantry. In the same way, you can say C doesn't technically compile to assembly for CPUs. The point is that it's the lower abstraction level that is still (more or less) human-readable. And just like in CUDA, you may want to write parts of your code in it to benefit from things the higher-level language doesn't expose. The terminology might be different, but in practice it's pretty analogous.
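You can even poke at it without leaving Python. A sketch, assuming a CUDA-enabled Numba install (Numba's compile_ptx shows you the PTX it generates for a kernel):

    # Inspect the PTX Numba generates for a trivial kernel.
    from numba import cuda, float32

    def axpy(r, a, x, y):
        i = cuda.grid(1)  # global thread index
        if i < r.shape[0]:
            r[i] = a * x[i] + y[i]

    ptx, _ = cuda.compile_ptx(axpy, (float32[:], float32, float32[:], float32[:]))
    print(ptx)  # human-readable PTX: ld.global.f32, fma.rn.f32, st.global.f32, ...

It's readable in the same way CPU assembly is readable; writing non-trivial kernels in it by hand is another matter.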
This is somewhat untrue as well. HFT firms, being similarly constrained, have to optimize at this level, akin to crypto HFT shops doing optimizations not in Solidity, nor in Yul, but at the opcode level in Huff. That's the issue with these big tech companies: endless budget, and throwing bad code at ever-larger distributed clusters to overcompensate.
Nice work! There is a gap when it comes to writing single-machine, concurrent, CPU-bound Python code. Ray is too big, pykka is threads-only, and the builtins are poorly abstracted. The syntax is also very nice!
But I'm not sure I can use this even though I have a specific use-case that feels like it would work well (high-performance pure Python downloading from cloud object storage). The examples are a bit too simple and I don't understand how I can do more complicated things.
I chunk up my work, run it in parallel and then I need to do a fan-in step to reduce my chunks - how do you do that in Pyper?
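For concreteness, the plain-stdlib version of what I mean (not Pyper):

    # Chunk, fan out across processes, then fan in with a reduce step.
    from concurrent.futures import ProcessPoolExecutor

    def process_chunk(chunk):
        return sum(x * x for x in chunk)  # stand-in for the real per-chunk work

    def main():
        work = list(range(1000))
        chunks = [work[i:i + 100] for i in range(0, len(work), 100)]
        with ProcessPoolExecutor() as pool:
            partials = list(pool.map(process_chunk, chunks))  # parallel fan-out
        print(sum(partials))  # the fan-in / reduce step

    if __name__ == "__main__":
        main()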
Can the processes have state? Pure functions are nice, but if I'm reaching for multiprocessing, I need performance, and if I need performance, I'll often want a cache of some sort (I don't want to pickle and re-instantiate a cloud client every time I download some bytes, for instance).
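In stdlib terms, the pattern I mean is the per-worker initializer (make_client here is hypothetical):

    from concurrent.futures import ProcessPoolExecutor

    _client = None  # per-process global, set once by the initializer

    def _init_worker():
        global _client
        _client = make_client()  # hypothetical expensive setup, once per worker

    def download(key):
        return _client.get(key)  # reused across tasks, never pickled per task

    def main():
        with ProcessPoolExecutor(initializer=_init_worker) as pool:
            return list(pool.map(download, ["key1", "key2", "key3"]))

    if __name__ == "__main__":
        main()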
How do exceptions work? Observability? Logs/prints?
Then there's stuff that is probably asking too much of this project, but that I get if I write my own Python pipeline, so it matters to me: rate limiting, cancellation of work in progress, progress bars.
But if some of these problems are/were solved and it offers an easy way to use multiprocessing in python, I would probably use it!
Great feedback, thank you. We'll certainly be working on adding more examples to illustrate more complex use cases.
One thing I'd mention is that we don't really imagine Pyper as a whole observability and orchestration platform. It's really a package for writing Python functions and executing them concurrently, in a flexible pattern that can be integrated with other tools.
For example, I'm personally a fan of Prefect as an observability platform: you could define pipelines in Pyper, then wrap them in a Prefect flow for orchestration logic.
Exception handling and logging can also be handled by orchestration tools (or in the business logic if appropriate, literally using try... except...)
For a simple progress bar, tqdm is probably the first thing to try. As it wraps anything iterable, applying it to a pipeline might look like:
    import time

    from pyper import task
    from tqdm import tqdm

    @task(branch=True)
    def func(limit: int):
        for i in range(limit):
            time.sleep(0.1)
            yield i

    def main():
        # tqdm wraps the pipeline's output generator like any other iterable
        for _ in tqdm(func(limit=20), total=20):
            pass

    if __name__ == "__main__":
        main()
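And for the Prefect idea above, a sketch of the wrapping (assuming Prefect 2's @flow decorator, with func being the pipeline from the snippet):

    from prefect import flow

    @flow
    def pipeline_flow():
        # Prefect handles orchestration/observability; Pyper handles the concurrency.
        return list(func(limit=20))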
I haven't played with that much! This isn't really a problem for my usual approach to writing this sort of code: when I use multiprocessing, I use a Process class, or a worker task function with a setup step followed by a while loop that pulls from a work/control queue. But in the Pyper functional-programming world, it would be a concern.
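i.e., roughly this shape (a sketch; make_client is a stand-in for whatever expensive setup):

    import multiprocessing as mp

    SENTINEL = None  # control message meaning "shut down"

    def worker(work_q, result_q):
        client = make_client()  # setup step, done once per process
        while True:
            item = work_q.get()
            if item is SENTINEL:
                break
            result_q.put(client.get(item))  # client reused across items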
IIRC multiprocessing.shared_memory is a much lower-level abstraction than most Python stuff, so I think I'd need to figure out how to make the client use the shared memory, and I'm not sure I could.
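For reference, this is about all shared_memory gives you, which is why it feels low-level: a raw byte buffer plus a name to attach to, with layout and lifetime entirely on you:

    from multiprocessing import shared_memory

    shm = shared_memory.SharedMemory(create=True, size=1024)
    shm.buf[:5] = b"hello"  # raw bytes, no structure

    # Another process attaches by name:
    other = shared_memory.SharedMemory(name=shm.name)
    print(bytes(other.buf[:5]))  # b'hello'

    other.close()
    shm.close()
    shm.unlink()  # the creator is responsible for freeing the segment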
GNU Parallel is really neat, software that's so good it's boring. It's closing in on a quarter century old by now, no? I remember first reading about it in 2003, maybe?
I've also used 'fork in PicoLisp a lot for this kind of thing, and also Elixir, which arguably has much nicer pipes.
But hey, it's good that Python, after thirty years or so, is trying to get decent concurrency. Eventually people who use it as a first language might learn about such things too.
I'm sorry. It's the trauma of preaching the virtues of crude-but-efficient concurrency and similar multiprocessing for many years, in many workplaces, and only rarely meeting anything but distrust, disinterest, or uninformed rebuttals like "OS processes are very heavy, you can't do concurrency that way".
However, it's a real problem that "beginner languages" like Python and JavaScript don't readily do multithreaded computation, something that has been the default on personal computers for quite a while now and available for at least twenty years.