Show HN: MonkeyPatch – Cheap, fast and predictable LLM functions in Python (github.com/monkeypatch)
95 points by JackHopkins on Nov 15, 2023 | 71 comments
Hi HN, Jack here! I'm one of the creators of MonkeyPatch, an easy tool that helps you build LLM-powered functions and apps that get cheaper and faster the more you use them.

For example, if you need to classify PDFs, extract product feedback from tweets, or auto-generate synthetic data, you can spin up an LLM-powered Python function in <5 minutes to power your application. Unlike existing LLM clients, these functions generate well-typed outputs with guardrails to mitigate unexpected behavior.
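To make that concrete, here is a rough sketch of what such a function can look like, based on the '@monkey.patch' decorator mentioned later in this thread; the import path and exact decorator name are assumptions, so check the README for the canonical form:

    from typing import Literal, Optional

    # Assumed import path -- the README has the canonical spelling.
    from monkey_patch.monkey import Monkey as monkey


    @monkey.patch
    def classify_feedback(tweet: str) -> Optional[Literal["positive", "negative", "neutral"]]:
        """Classify the product feedback expressed in the tweet,
        or return None if the tweet contains no product feedback."""


    # The body is just a docstring; at runtime the call is routed to an LLM and
    # the output is coerced to the annotated return type (or None).
    print(classify_feedback("The new export button saves me hours every week!"))

The type annotation doubles as the output contract, which is what the guardrails enforce.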

After about 200-300 calls, these functions will begin to get cheaper and faster. We've seen 8-10x reduction in cost and latency in some use-cases! This happens via progressive knowledge distillation - MonkeyPatch incrementally fine-tunes smaller, cheaper models in the background, tests them against the constraints defined by the developer, and retains the smallest model that meets accuracy requirements, which typically has significantly lower costs and latency.

As an LLM researcher, I kept getting asked by startups and friends to build specific LLM features that they could embed into their applications. I realized that most developers have to either 1) use existing low-level LLM clients (GPT4/Claude), which can be unreliable, untyped, and pricey, or 2) pore through LangChain documentation for days to build something.

We built MonkeyPatch to make it easy for developers to inject LLM-powered functions into their code and create tests to ensure they behave as intended. Our goal is to help developers easily build apps and functions without worrying about reliability, cost, and latency, while following best software engineering practices.

We're currently only available in Python, but we're actively working on a TypeScript version. The repo has all the instructions you need to get up and running in a few minutes.

The world of LLMs is changing by the day and so we're not 100% sure how MonkeyPatch will evolve. For now, I'm just excited to share what we've been working on with the HN community. Would love to know what you guys think!

Open-source repo: https://github.com/monkeypatch/monkeypatch.py

Sample use-cases: https://github.com/monkeypatch/monkeypatch.py/tree/master/ex...

Benchmarks: https://github.com/monkeypatch/monkeypatch.py#scaling-and-fi...



Nice! If I were to write a test for invariant aspects of the function (e.g., it produces valid JSON), will the system guarantee that those invariants are fulfilled? I suppose naively you could just do this by calling over and over and 'telling off' the model if it didn't get it right.


The type constraints are indeed enforced, but by the type hints you give to the patched functions rather than by the tests. The declared constraints and structure are followed, and there is also a repair feedback loop in place if the original LLM output is invalid for the types you've declared. Tests are more about aligning how the model should act for different inputs. Hope this makes it clearer!
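To illustrate the repair loop conceptually (this is not the library's actual internals; the helper names are hypothetical):

    import json
    from pydantic import TypeAdapter, ValidationError

    def coerce_with_repair(raw_output, target_type, llm_call, max_repairs=2):
        # Validate the raw LLM output against the declared type; if it is
        # invalid, feed the validation error back to the model and retry.
        adapter = TypeAdapter(target_type)
        attempt = raw_output
        for _ in range(max_repairs + 1):
            try:
                return adapter.validate_json(attempt)  # well-typed result
            except ValidationError as err:
                attempt = llm_call(
                    f"Your previous output was invalid: {err}. "
                    f"Return JSON matching this schema: {json.dumps(adapter.json_schema())}"
                )
        raise ValueError("LLM output could not be repaired to the declared type")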


Can I use open source LLMs with this? Would be great if everything was available self hosted with open source models.


Native support for open-source LLMs is on the roadmap - the main challenge is figuring out how to manage the knowledge distillation for local models. It's a top priority (along with TypeScript support), so check back in a few weeks?

Right now any ‘plug-in’ model needs to conform to the OpenAI API.

(P.S Carthago delenda est)


This is like calling a python package "ListComprehension", that loops through a list and calls OpenAI's API on each item. Confusing and unproductive.


The python package (and repo) is called ‘monkeypatch.py’ for the avoidance of confusion.


Calling my library "listcomprehension.py" doesn't really avoid confusion. In fact, `pip install monkey-patch.py` looks downright odd.


Yeah I definitely agree on the latter point, it does look odd. PyMonkeyPatch?


I don't think you grasp the point.


I understand the point. I would ideally like an association with monkey-patching something as that is relevant to the behaviour of the package. However, not so similar that it shadows the technique of monkey-patching!


LlamaPatch? Once open source model support is added of course :)


There seems to be a lot of (justified) concern about the name. Maybe call it LLMonkeyPatch?


LLMonkeyPatch is one of the best suggestions here: it only adds two letters, which nicely fence the scope of the monkey patching, and it looks playful.

Other suggestions, like PyMonkeyPatch, leave the reader to guess what is being monkey patched.


PyMonkeyPatch? MonkeyPatch.py?

I would quite like a short and distinctive name!


Hey Jack! Thanks for sharing this. The incremental fine-tuning of smaller and cheaper models for cost reduction is definitely a really interesting differentiator. I had a few questions regarding the reliability of the LLM-powered functions MonkeyPatch facilitates and the testing process. How does MonkeyPatch ensure the reliability of LLM-powered functions it helps developers create, and do the tests employed provide sufficient confidence in maintaining consistent output? If tests fall short of 100% guarantee, how does MonkeyPatch address concerns similar to historical challenges faced with testing traditional LLMs? Thanks.


Heya, no worries - I’m glad to share it.

MonkeyPatch ensures reliability through what we call 'test-driven alignment', in which the tests that reference the patched functions are guaranteed to pass. The more align 'tests' you create, the more rigorous the contract the functions have to fulfil.

The other way to increase consistency is to use more constrained type annotations (e.g. Pydantic field annotations), which is a similar concept to Marvin AI and Magentic.
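A sketch of what that looks like in practice (the decorator and import names follow the '@monkey.patch' pattern mentioned in this thread and are assumptions; the README has the canonical spelling):

    from pydantic import BaseModel, Field

    from monkey_patch.monkey import Monkey as monkey  # assumed import


    class Rating(BaseModel):
        score: int = Field(ge=1, le=5)        # constrained field annotation
        summary: str = Field(max_length=120)


    @monkey.patch
    def rate_review(review: str) -> Rating:
        """Rate the product review from 1 (worst) to 5 (best) and summarise it."""


    @monkey.align
    def test_rate_review():
        # Align statements act as the contract the patched function must fulfil.
        assert rate_review("Absolutely love it, flawless.").score == 5
        assert rate_review("Broke after one day.").score == 1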


Why 'Monkeypatch', when it's for Python, where that has an established and as far as I can tell (?) completely irrelevant meaning?


(As I understand it) monkey-patching means modifying code at runtime. I thought the naming was relevant because by adding '@monkey.patch' to an unimplemented function, this library gives it an implementation at runtime.
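For anyone skimming the naming debate, the established sense of the term looks like this:

    import math

    original_sqrt = math.sqrt
    math.sqrt = lambda x: round(original_sqrt(x), 2)  # replace an existing function at runtime

    print(math.sqrt(2))        # 1.41 -- behaviour changed without touching math's source
    math.sqrt = original_sqrt  # restore the original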


I feel like you're hijacking a term at least 15 to maybe 20 years old though. It's kind of a brilliant marketing idea, but you're just confounding the vocabulary.

It's not to say this isn't a fucking great idea (because it is!). But you know, just don't piss in the community well.

Further, PostgreSQL means one and only one thing. But "monkey patch" means two things now, apparently!

For reference: Here's the google ngram viewer link for "monkey patch" from 2000 to 2019.

https://books.google.com/ngrams/graph?content=monkey+patch&y...


The specific package name on GitHub and PyPi is ‘monkeypatch.py’ for the avoidance of doubt!


The problem isn't accidentally downloading the wrong package. How am I supposed to talk about this package? I regularly use monkeypatching (the existing meaning) to prototype out concepts. "I monkey patch dot pied this function"? Cause that doesn't seem to be the name used everywhere.

Why not name it something like Gorillapatch? Gorillas are stronger than monkeys as a slogan or whatever.

The core issue is how am I supposed to talk about regular monkeypatching and your library in the same sentence.


This is a solid point. At the end of the day, creating confusion with the name wasn't the goal in any way, as there are already hundreds of (often overlapping) terms floating around in this scene. I appreciate the critique and we'll have a think over this. If you have any other naming ideas, we'd love to hear them!


Maybe keep the “Monkey” and work around that?

MonkeySeeDo —> it sees what you’re doing and does it better

CutMonkey —> it’s a monkey patch that’s cut weight to lean fighting trim

TypeMonkey —> it uses types to intelligently monkey patch your code

MonkeyZipper —> monkey patches that compress your code

MonkeyModels —> monkey patches your models

Learning Monkey —> learns how to improve your Large Models

Branching out from there…

SimianStudy —> it’s a monkey patch that learns

PyStill —> it distills your Python functions

AutoSqueeze —> automatically squeezes your AI code into efficient implementations


That's a distinction without a difference...


I guess. I think of it more like overriding an existing thing with another for a given scope/time/test, whereas this is providing an implementation for a thing which exists only as a stub.

Maybe it 'counts', I'm not meaning to be picky about the term, my point is more like even if this works by monkeypatching, I wouldn't personally use the term for the product which does something else by those means, if that makes sense? MonkeyLLM or MonkAIpatch or something, sure.

The other comment with a 'list comprehension' example puts it well I think.

(Seeing as you asked for name suggestions elsewhere, I think I prefer the 'stub' theme: Stubby, StubAI, Stubmonkey if you like monkeys/wanted an easier logo, something like that.)


I do get the point and the difference from the classical monkey-patching. I like the stub ideas though!


MonkeyPatch is a specific programming term that people have been using for decades. What would possess someone to name a programming tool "MonkeyPatch" when the tool doesn't even have anything to do with patching?


Monkeypatch (as I understand it) means to modify code at runtime. This library modifies functions at runtime to use an LLM as an execution target - I thought it was an apt (if admittedly cheeky) name! Appreciate the critique regardless.


The package is called ‘monkeypatch.py’ on GitHub and PyPi.


Many people here have said something about the naming, and you keep repeating that it's "monkeypatch.py" in response as if it fixes it. Maybe take the advice and just rename it to something while it's still early. You'll have a tough-enough time convincing people to use this novel/odd/unique concept without having the name confusion and bad-will from the community stemming from you appropriating a common term.


Don't get me wrong, I do appreciate the criticism of the current naming! It does seem to create some unwanted friction in using or talking about the library. I was just trying to explain the thought process and ideate on top of it, but we will take a second look at the name and how to make using and talking about the library as unconfusing as possible.


Thoughts on something like PyMonkeyPatch? GorillaPatch?


Honestly dude, just drop the simian theme and pick a different, possibly-endangered animal instead.

Monkeys have nothing to do with your project and were a meme over a decade ago, which makes your brand-new project look dated.

Apes are associated with being heavyweight and freakishly strong and have been long-associated with racial slurs in America. You're only ever one degree removed from coming out with "chimpout.py" or "statutoryape.py" or something else that'll get you cancelled for unintended racism.

Your tool seems like it's meant to be reliable, used for work, and possibly elegant in its code. Consider the name of a work animal for their efficiency or birds for lightweight, graceful maneuverability.


Yeah this is fair. I’m not attached to a simian theme if we’re ditching specific association to monkey-patching something. Or indeed, a ‘patching’ theme for that matter.

A new name is definitely in order. I will think about it over the weekend.

Thanks for the feedback, I appreciate it.


Monkey patching isn't what your library is supposed to do, it's just a mechanic that it uses to get there. This would be like me making a scripting language and calling it "virtual machine". Then when people ask "why is a scripting language called virtual machine" I would say "it uses a virtual machine and the file is called virtualmachine.py".


Slightly tangential: is it unfair/unreasonable to judge a project by its name? It's hard not to interpret this project's name as the result of poor judgement. Is that sufficient cause to write off the project entirely? That may seem a tad dramatic but I feel that it's a fairly strong signal for how little effort I need to put into evaluating it.


Do any other names jump out at you as preferable?


Not including "pass" in a function definition in Python makes the code not compilable, and if we're using VSCode, PyCharm, etc. our IDEs will complain about this whenever the code is viewed. Is this an intentional design decision?


The IDEs shouldn't complain if the function has a docstring (which all the MonkeyPatch functions should have, as that's the instruction that gets executed) and the @patch decorator - at least the ones we have tried have been happy with the syntax so far. But adding a "pass" is also permissible if the IDE does complain.
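A quick plain-Python illustration (no library needed) of why the parser is happy without 'pass':

    def summarise(text: str) -> str:
        """Summarise the text in one sentence."""

    # The docstring alone is a valid function body, so the module parses and runs.
    # Called as-is it just returns None; under the patch decorator the call would
    # instead be routed to an LLM.
    print(summarise("some text"))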


Would love to try a TypeScript implementation. Any plans to do that?


Great to know! We're working on extending MonkeyPatch to TypeScript; the work-in-progress repository can be found here: https://github.com/monkeypatch/monkey-patch.ts

We will keep you posted on when it'll be ready for trying out!


I built a similar library for Typescript: https://github.com/jumploops/magic

Please note: it requires the use of ttypescript or ts-patch, as TypeScript transformers aren't supported by default!


Cool! Thanks for sharing. What do you mean that TS transformers aren't supported by default? Is this like a runtime modification of types?


tl;dr - TypeScript transformers are used to modify the AST before JavaScript is emitted.

The magic functions library uses a transformer to take the TypeScript types and port them to JSON schema, such that they're available during runtime. This JSON schema is then used to validate that the response from the LLM matches the expected type signature of the function (and err if it doesn't).

Because TypeScript doesn't support 3rd party "transformers" by default, you're forced to hack around it (via ttypescript or ts-patch). This is especially problematic when TypeScript has a major version change, as the workarounds need to be modified accordingly; this often takes significant time.

Here's the long-lived Github issue: https://github.com/microsoft/TypeScript/issues/14419

And here's the newest proposal to add official support: https://github.com/microsoft/TypeScript/issues/54276


Using tests to align your model seems neat. How reliable is it? Won't models still hallucinate from time to time? How do you think about performance monitoring/management?


Great questions! The tests act as few-shot examples for the LLM, which has been shown to guide the style and accuracy of model outputs and improve performance quite well - for instance, we've seen accuracy go from <70% to 93%+ compared to not including the tests. Hallucinations are still an inherent risk with LLMs, especially with long-form context, but adding more diverse and well-aligned examples as tests does reduce the hallucination risk and align the outputs with user intent. In terms of performance management and monitoring, QA for LLMs is a difficult process to get right, and we're looking into ways to a) make it easy for users to test out different function descriptions and tests on their own datasets to gauge performance, and b) seamlessly carry out continuous monitoring of function outputs with low effort. Still WIP, but we'll keep you posted!


Makes sense. Looking forward to testing it out.


This is really interesting! What would be a good example of when I would want to use monkeypatch vs langchain or OpenAI functions?


Thanks! A big part of MonkeyPatch that LangChain and OpenAI functions lack is the model distillation aspect, which has reduced costs by up to 10x and latency by up to 6x in some of the tests we've been running. This means the more you use MonkeyPatch, the cheaper the function calls get, which is beneficial for high-usage applications with lots of calls.


How does that work?


Currently we distill general GPT-4 down to a function-specific GPT-3.5 Turbo model using pseudo-labelling. The input-output pairs from the aligned few-shot GPT-4 calls are saved, and this dataset is used to fine-tune a function-specific GPT-3.5 model. That fine-tuned GPT-3.5 model is then switched in as the primary model used to carry out the function, which results in several-times-lower cost (as the need for few-shot examples is removed) and lower latency as well. If the fine-tuned model's output does not follow the enforced constraints, we employ GPT-4 to "repair" the output and include that datapoint in the dataset used for future fine-tuning, resulting in continuous improvement.
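A conceptual sketch of that flow (not the library's actual code; the teacher/student/validate/dataset names are illustrative):

    def run_function(inputs, teacher, student, dataset, validate):
        # Pseudo-labelling phase: no student yet, so the teacher (e.g. GPT-4)
        # handles the call and its input-output pair is logged for fine-tuning.
        if student is None:
            output = teacher(inputs)
            dataset.append((inputs, output))
            return output

        # Cheap, fast path: the fine-tuned student (e.g. GPT-3.5 Turbo) answers.
        output = student(inputs)
        if validate(output):
            return output

        # Repair: the teacher fixes the output, and the corrected datapoint is
        # fed back into the dataset for the next fine-tuning round.
        repaired = teacher(inputs)
        dataset.append((inputs, repaired))
        return repaired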


How much control do I have over this process? I might not want this to be abstracted.


Currently the distillation happens automatically in the background for all functions, but we're aiming to implement ways for the user to turn it off if they wish to keep using the teacher models. Good to know that this would be a wanted feature!


Does it ever just use the code that works and no longer makes calls to any LLM?


Great question! That is one of the ideas we have on the roadmap, and it seems quite exciting to us. The general feasibility of switching the function execution over from an LLM to synthesised code depends on the specific use-case and whether a deterministic program can solve it well enough (or at least as well as the SOTA LLMs can). But for all the cases where this could be done, the cost and latency of executing the program would become essentially zero.


The guardrails are cool!

I think more details of where the data goes and when it goes from few-shot to fine-tune will be helpful.


Good to know, we'll make it clearer in the docs! To answer regarding these two areas:

1) The data for fine-tuning is currently saved on disk for low-latency reading and writing. Both test statements and datapoints from function executions are saved to the dataset. We're aware that saving to disk is not the best option and limits many use-cases, so we're currently working on persistence layers that allow S3 / Redis / Cloudflare to be used as external data storage.

2) Currently the fine-tuning job starts after the dataset has at least 200 datapoints from GPT-4 executions and align statements. Once the fine-tuning is completed, the execution model for the function is automatically switched to the fine-tuned GPT-3.5 Turbo model. Whenever the fine-tuned model breaks the constraints, the teacher (GPT-4) is called upon to fix the datapoint, and that datapoint is saved back to the dataset for future iterative fine-tuning and improvement. We are also working on adding ways for the user to include a "test set" which could be used to evaluate whether the fine-tuned model achieves the required performance before switching it in as the primary executor of the function.
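Roughly, the trigger in (2) works like this (the threshold constant and helper names are illustrative, not the library's API):

    FINETUNE_THRESHOLD = 200  # datapoints from GPT-4 executions and align statements

    def maybe_finetune(dataset, start_finetune_job, switch_execution_model):
        # Once enough teacher datapoints exist, kick off a fine-tune and make
        # the resulting student model the primary executor for this function.
        if len(dataset) >= FINETUNE_THRESHOLD:
            student_model_id = start_finetune_job(dataset)
            switch_execution_model(student_model_id)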

Hope this makes it more clear, if you have any additional questions, let me know!


dope yea that's awesome!


Gave it a shot, quite impressed. I am implementing the Bedrock interface (OpenAI access is limited from my location). Looks promising. Will check out fine-tuning with Bedrock, though I'm not sure whether we can do that or not. Appreciate your work.


Could you explain the differences to Marvin AI? I see a large overlap.


Hey! There are two main similarities to Marvin, namely: (a) functions that act as APIs to the LLM backend, and (b) type coercion to ensure that the responses fit into the data model of your application.

However, there are a couple of big additions over Marvin as well:

Test-driven alignment - by using 'assert' statements that declare the behaviour of a patched function, we create a contract that makes invocations much more predictable, which makes it possible to use these functions in production settings.

Automatic distillation - a combination of the function contract defined in the type signature and the alignment tests means we can automatically swap out bigger models for smaller ones. This saves up to 80% of the latency and 90% of the cost of running these functions (check the benchmarks).

Check out the readme, as there is more detail on these points there!


Awesome stuff! What other potential integrations are on the roadmap?


The big one is a TypeScript implementation. Other than that, the plan is to support other models (e.g. Llama) that can be fine-tuned.

Finally, other persistence layers like S3 and Redis, to support running on execution targets (like AWS Lambda and CloudFlare workers) that don’t have persistent storage.

I think it could be really interesting to support Vercel more tightly too. We currently support Vercel with Python, but I think Typescript + Redis would really enable serverless AI functions - which is where I think this project should go!


Where in the codebase are you performing the distillation process?


Check out the ‘function_modeler’. Currently it’s OpenAI only, but local models are on the immediate roadmap.

https://github.com/monkeypatch/monkeypatch.py/blob/master/sr...


This is incredibly cool, I’m excited to try it out


Thanks a lot. I’d really appreciate any feedback you have on the design!


this is super cool! what's the use case you're most excited about?


Thanks! I find the enforced typed outputs and structured object creation from unstructured inputs very useful. For instance, we built a use-case around creating structured support-ticket objects that can be processed in downstream applications without worrying about anything breaking.
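As a sketch of that use-case (the field names and imports are illustrative; the decorator follows the '@monkey.patch' pattern from this thread):

    from typing import Literal

    from pydantic import BaseModel

    from monkey_patch.monkey import Monkey as monkey  # assumed import


    class SupportTicket(BaseModel):
        product_area: str
        severity: Literal["low", "medium", "high"]
        summary: str


    @monkey.patch
    def to_ticket(email_body: str) -> SupportTicket:
        """Extract a structured support ticket from the customer email."""


    ticket = to_ticket("The dashboard won't load since this morning and we have a demo at 3pm!")
    # Downstream code can rely on ticket.severity being one of the declared literals.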


Super cool Jack


Cheers!



