TinyLlama: An Open-Source Small Language Model (arxiv.org)
143 points by matt1 on Jan 5, 2024 | 44 comments



It was fun to follow the public TinyLlama loss curves in near real-time, although it could be frustrating, since the loss curves barely moved down even after an extra trillion tokens: https://wandb.ai/lance777/lightning_logs/reports/metric-trai... (note the log-scaled X-axis)

But they did move down and that's what's important.

There should probably be more aggressive learning rate annealing for models trying to be Chinchilla-optimal instead of just cosine-with-warmup like every other model nowadays.


More aggressive learning rate? Wouldn’t we need more information on why they decided to stop training at this point to conclude that?

My current understanding of the story is, to recap:

- First the game was increase model size massively

- For example, GPT-3 had 175B parameters but less than 0.5T tokens of training data

- Then Chinchilla showed that, for a given compute budget, we can scale better by increasing training data

- Now we have models like this one, and Phi, that are trained on over 1T tokens

For any model, the loss curve going down could mean it’s learning, or it could mean it’s overfitting; we don’t know which without looking at the validation loss, which is computed on a second, held-out set of data the model hasn’t seen before.
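
To make that concrete, here's a tiny PyTorch-style sketch (generic, not TinyLlama's actual code; the model, batches and loss function are placeholders):

    import torch

    @torch.no_grad()
    def validation_loss(model, val_batches, loss_fn):
        # val_batches is held-out data that never contributes gradient updates
        model.eval()
        losses = []
        for inputs, targets in val_batches:
            logits = model(inputs)
            losses.append(loss_fn(logits.view(-1, logits.size(-1)), targets.view(-1)))
        model.train()
        return torch.stack(losses).mean()

    # If training loss keeps falling while this number flattens or rises,
    # the model is memorizing rather than learning.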

So getting back to your comment, I thought there are actually a multitude of indicators that should be used, not just validation loss, to determine what would be gained from more training.


Overfitting is quite unlikely with a smaller model though. Model parsimony provides a kind of regularization "for free", in fact with the extra benefit of saving on compute costs.


The dirty secret behind modern self-supervised training is that no one cares about a test/validation dataset anymore.


does overfitting even matter if your dataset is large enough?


I think a lot of it depends on what you mean by “large enough”.

In principle, a dataset could be infinitely large yet still miss little edge cases here and there because so much of it is repetition. So you might be OK if you had both infinite size and infinite diversity.

Even if you had very large but finite data, let’s say all language ever conceived by mankind… the second you finish training, what your overfit model knows is locked in.

The world as we know it would continue to generate vast amounts of new data that you might not be able to generalize to.


> For any model, the loss curve going down could mean it’s learning, or it could mean it’s overfitting; we don’t know which without looking at the validation loss, which is computed on a second, held-out set of data the model hasn’t seen before.

You want to look at validation accuracy.


Accuracy is a bad metric for LLMs, especially since an LLM tokenizer can have thousands of "classes": 32,000 in the case of TinyLlama.
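
A toy illustration (plain PyTorch; the logits and targets are random, only the 32,000 vocab size comes from TinyLlama):

    import torch
    import torch.nn.functional as F

    vocab_size = 32_000
    logits = torch.randn(8, vocab_size)           # fake next-token logits for 8 positions
    targets = torch.randint(0, vocab_size, (8,))  # fake "correct" next tokens

    accuracy = (logits.argmax(dim=-1) == targets).float().mean()  # harsh 0/1 signal per position
    loss = F.cross_entropy(logits, targets)       # credits probability mass on the right token
    print(accuracy.item(), loss.item())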


I guess it comes down to whether your use case has a single correct answer vs. multiple possible ones. For example, a lot of what we do has one and only one correct sequence of tokens. You need to look at both, but so much of the learning material out there just focuses on loss. YMMV.


That is already accounted for with categorical cross-entropy loss.


> Wouldn’t we need more information on why they decided to stop training at this point to conclude that?

The experiment was fixed at 3 epochs on 1T tokens; they didn't decide to "stop" based on some criterion.

> we don’t know which without looking at the validation loss, which is computed on a second, held-out set of data the model hasn’t seen before.

The data I linked shows the validation loss, which has the same behavior as the training loss.


I'd love to see someone go for another few epochs in the future. Two of the benchmarks got a significant jump almost at the end of training. I wonder if there's a chance for more of that - looks like an interesting effect on its own.


The jump was due to them fixing a bug. There’s a footnote about it at the bottom of page 5.

In the Discord, they mentioned a TinyLlama v2; presumably that would have this bug (and another bug, see the footnote on page 4) fixed.


How crucial is it to freeze the learning rate schedule a priori, instead of tweaking it on the fly?


Constant learning rates were the default in older ML implementations, but linear decay became an obvious optimization, and now we have both warmup and cosine decay to handle common training patterns, especially with the AdamW optimizer.

If the learning rate is too high at a given point in training, it can result in either a) the model stopping learning or b) exploding gradients, which is very bad.
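
For reference, the warmup-plus-cosine schedule is only a few lines with a LambdaLR in PyTorch (a generic sketch, not the TinyLlama training script; the step counts and learning rate are made up):

    import math
    import torch

    model = torch.nn.Linear(10, 10)                      # stand-in model
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
    warmup_steps, total_steps, min_ratio = 2_000, 100_000, 0.1

    def lr_lambda(step):
        if step < warmup_steps:                          # linear warmup from 0 to peak LR
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    # per training step: optimizer.step(); scheduler.step()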


Adaptive learning rate is a thing. For example, one scheme I've used before is to decrease the learning rate if the validation loss stops decreasing.

It's not clear to me if this is applicable to LLMs though.
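
(For PyTorch specifically, the scheme described above ships as ReduceLROnPlateau. Rough sketch, with the validation pass stubbed out:)

    import torch

    model = torch.nn.Linear(10, 10)                      # stand-in model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=3)   # halve LR after 3 stalled evals

    def evaluate(model):
        return torch.rand(1).item()                      # placeholder for a real validation pass

    for epoch in range(20):
        # ... training steps would go here ...
        scheduler.step(evaluate(model))                  # LR only drops if val loss stops improving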


From the GitHub repo Readme:

> we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs

I knew the computational power required to train LLMs was absurd, but with the figures for larger networks (which are just too large to understand intuitively) it never really registered. With this one I could actually imagine the 16 A100 GPUs sitting in a server room running at full blast for 90 days, so it was more tangible... And now thinking about the larger ones is kinda scary

Edit: Did the math, and just the GPUs (at 250 W each) consumed around 8.64 MWh, which is in the same ballpark as the power consumption of the average US home in one year (10.5 MWh)
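
The arithmetic, for anyone who wants to check (GPU draw only, taking 250 W per card as above):

    gpus, watts_per_gpu, days = 16, 250, 90
    kwh = gpus * watts_per_gpu * days * 24 / 1000
    print(kwh)   # 8640.0 kWh, i.e. ~8.64 MWh, vs ~10.5 MWh/yr for an average US home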


So, four A100-years. Unit cost is $8,000 (from a quick search) and the electricity cost is under $2,000. If you reckon the useful life of an A100 to be four years, then that’s a training cost approaching $10,000. I have no idea of the forecast useful life of the GPU, but I’d hope it’d be a lot longer; if it were about ten years, then this training cost would be around $5,000.

Of course, we’re probably both simplifying things too much, but if these numbers are good enough it’s an interesting perspective.

At these sorts of costs and a final size of 2.2GB, each MB cost a few dollars to produce.
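
Spelling out that amortization (all inputs are the rough figures above):

    gpu_price, electricity = 8_000, 2_000      # USD per A100, total power cost
    gpu_years = 16 * 90 / 365                  # ~3.9 "A100-years" of compute

    for useful_life_years in (4, 10):
        hardware = gpu_price * gpu_years / useful_life_years
        print(useful_life_years, round(hardware + electricity))  # ~9,900 and ~5,200 USD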


I've been using one of the earlier checkpoints for benchmarking a Llama implementation. Completely anecdotally, I feel at least as good about this one as the earlier OpenLLaMA 3B, if not better. I wouldn't use either of them for RAG or anything requiring more power; I'm just saying it's competitive as a smaller model, whatever you use those for, and easy to run on CPU at FP16 (meaning without serious quantization).


Also, I should promote the code I wrote for running this. It runs models in GGML format; the one I made available is an older checkpoint, though, and it's easy to convert the newer one. It's written in Fortran, but it should be easy to get gfortran if you don't have it installed.

https://github.com/rbitr/llm.f90/tree/optimize16/purefortran


Here is another inference implementation in Python (only dependency is PyTorch).

https://github.com/99991/SimpleTinyLlama

The new checkpoints did not seem much better and they changed the chat format for some reason, so I did not port the new checkpoints yet. Perhaps I'll get to it this weekend.


Man, I didn't recognize your username, but once you said Fortran I recognized you immediately. You are an inspiration: an example of true software engineering, as opposed to what day-to-day work tends to reduce to, i.e. whatever you can hire for.

Edit: you have some rare knowledge, so I'm curious if you have any thoughts on small models good enough for RAG. Mistral 7B works in my testing, but it's laughably slow, and 7B is just too much for mobile; both iOS and Android get crashy (4 tkns/s on a Pixel Fold, similar on iOS). Similar problems on the web with a good-enough two-year-old i7.

I'd try Phi-2 but I want to charge for my app and the non-commercial usage license bars that. (all these hours building ain't free! And I can't responsibly give search away, scraping locally is too risky for the user, and the free search API I know of has laudable goals, but ultimately, is "trust me bro" as far as privacy goes)

I'm starting to think we might not get an open, RAG-capable model under 7B without a concerted open-source effort. Stability's distracted and spread thin, MS is all in on AI PCs(tm), and it's too commercially valuable for the big boys to give away.


So much changed in a day. What a field!


What use cases would you say it is good enough for?


That's the billion dollar question. These are all research models, the point was to see what happens when you keep training a smaller model.

My best guess (and if I had a concrete answer I'd be out building it) is that, absent a breakthrough, smaller models will mostly be used for downstream tasks, like classifiers, that aren't generative, or fine-tuned into specialized generative models that only know one domain. I don't know how well this works for real use cases, but way smaller models can certainly generate Shakespeare-like text, for example; I don't actually know why you'd want to do that, though.


>I wouldn't use either of them for RAG

What's RAG?


Since the models have a limited context size, you pre-process a bunch of data that might be related to the task (documentation, say) and generate a semantic vector for each piece. Then, when you ask a question, you look up just the few pieces that are semantically most similar and load them into the context along with the question. The LLM can then generate an answer using the most relevant pieces of data.
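
Bare-bones sketch of that flow (using sentence-transformers for the embeddings; the chunk text, embedding model name and prompt format are just placeholders):

    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")   # small embedding model

    chunks = ["chunk 1 of the docs ...", "chunk 2 ...", "chunk 3 ..."]
    chunk_vecs = embedder.encode(chunks, convert_to_tensor=True)    # done once, offline

    question = "How do I configure X?"
    q_vec = embedder.encode(question, convert_to_tensor=True)
    top = util.cos_sim(q_vec, chunk_vecs)[0].topk(2)                # 2 most similar chunks

    context = "\n".join(chunks[int(i)] for i in top.indices)
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    # prompt then goes to whatever LLM you're running (e.g. a TinyLlama checkpoint)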


Retrieval-augmented generation: basically giving the model some text passages and asking it questions about the text.


If you want more on RAG with a concrete example: https://neuml.hashnode.dev/build-rag-pipelines-with-txtai


What is good for RAG?


The smallest model your users agree meets their needs. It really depends.

The retrieval part is way more important.

I've used the original 13B instruction-tuned Llama 2, quantized, and found it gives coherent answers about the context provided, i.e. the bottleneck was mostly getting good context.

When I played with long context models (like 16k tokens, and this was a few months ago, maybe they improved) they sucked.


>The retrieval part is way more important.

I don't agree with this - at Intercom we've put a lot of work into our Fin chatbot, which uses a RAG architecture, and we're still using GPT-4 for the generation part.

GPT-4 is a really powerful and expensive model but we find we need this power to 1) reduce hallucinations acceptably, and 2) keep the quality of inferences made using the retrieved text high.

Now, our bot is answering customer support questions unsupervised - maybe it'd be different for a human in the loop system - but at least in our case, we feel we need a very powerful generation model to reduce errors, even after having benchmarked this thoroughly.

We've also done work on the retrieval end of things, including a customised model, but found the generation side is where we need the most capable models.


That's interesting, thanks. My experience is with technical documentation Q&A, returning summaries and relevant passages. My takeaway was that the summary is basically as good as the passages. I do think overall response quality is very subjective and really depends on how it's being used, so whatever users do best with wins the day.



GitHub repo with links to the checkpoints: https://github.com/jzhang38/TinyLlama


Needs an ONNX folder to use it with transformers.js out of the box.

Hopefully @xenova will make a copy with it soon.


Proud to see this work built using Lit-GPT coming through.


What would you use this for?


How does it compare to Phi-1?


OP here with a shameless plug: for anyone interested, I'm working on a site called Emergent Mind that surfaces trending AI/ML papers. This TinyLlama paper/repo is trending #1 right now and likely will be for a while due to how much attention it's getting across social media: https://www.emergentmind.com/papers/2401.02385. Emergent Mind also looks for and links to relevant discussions/resources on Reddit, X, HackerNews, GitHub, and YouTube for every new arXiv AI/ML paper. Feedback welcome!


I visit your site every day. Thank you for creating it and evolving it past simple summaries to show paper details!

I recall you were looking to sell it at some point. Was wondering what that process looked like, and why you ended up holding on to the site.


Hey, thanks for the kind words.

To answer your question: an earlier version of the site focused on surfacing AI news, but that space is super competitive and I don't think Emergent Mind did a better job than the other resources out there. I tried selling it instead of just shutting it down, but ultimately decided to keep it. I recently decided to pivot to covering arXiv papers, which is a much better fit than AI news. I think there's an opportunity with it to not only help surface trending papers, but help educate people about them too using AI (the GPT-4 summaries are just a start). A lot of the future work will be focused in that direction, but I'd also love any feedback folks have on what I could add to make it more useful.


Thank you for the detailed response!

Pivoting into arXiv is a good idea. It helps you have focused prompts and templates.

A natural progression is aggregation, categorization, and related paper suggestions. Since arXiv has HTML versions of papers now, you can also consider allowing deeplinked citations directly from the LLM summaries.

A GPT-curated comments section for papers would also be nice, automatically filtering out any spam that gets past the regular Disqus filters, then scoring/hiding comments based on usefulness or insight.


I am new to this space. Is it hard to fine tune this model?



