Hacker News
Parakeet: A Tiny LLM
31 points by razodactyl on March 18, 2024 | 15 comments
Hi all,

https://colab.research.google.com/drive/1gI8CM9Bz9ov0-E6aL2jF808rE56UtZyF?usp=sharing

This post is more for gauging interest; I plan to release the entire end-to-end code, including:

  - Dataset curation (including citations).
  - Model checkpoints.
  - Inference code.
  - Synthetic data generation.
  - etc.
Parakeet is the name of a small language model I've designed from the ground up for the purpose of research.

The challenge was to see how far I could push the limits of LLM technology given a massively constrained environment.

It was trained on a 3080 Ti and still has considerably more training to do, but here are the results so far.

Specs:

  - 18 layers / 18 heads
  - 8K context.
  - 1152 embedding dimension.
  - cl100k tokenizer (tiktoken)
  - ALiBi (the max length I can train at is 1200 tokens, so this was crucial; see the sketch after this list)
  - KV caching for improved inference.
  - Grouped Query Attention (2 layers per group / speeds up inference)
  - `min_p`: Cut-off low quality tokens.
  - Softmax1: https://github.com/kyegomez/AttentionIsOFFByOne - Not sure if this really made much of a difference / it's hard to train comparable models as compute resources are limited.
  - Sub 400M parameters (378M from memory)
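For reference, here's a minimal sketch of the ALiBi bias mentioned above - a generic implementation following the ALiBi paper, not Parakeet's exact code. Because the penalty is a fixed linear function of distance, a model trained at ~1200 tokens can still be run at longer contexts:

    import math
    import torch

    def alibi_slopes(n_heads: int) -> torch.Tensor:
        # Geometric per-head slopes from the ALiBi paper, with the standard
        # fallback for head counts (like 18) that aren't powers of two.
        def pow2(n):
            start = 2 ** (-8.0 / n)
            return [start ** (i + 1) for i in range(n)]
        if math.log2(n_heads).is_integer():
            return torch.tensor(pow2(n_heads))
        closest = 2 ** math.floor(math.log2(n_heads))
        extra = pow2(2 * closest)[0::2][: n_heads - closest]
        return torch.tensor(pow2(closest) + extra)

    def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
        # (n_heads, seq_len, seq_len) additive bias: 0 at the current position,
        # increasingly negative the further back a key is. Added to attention
        # scores before the softmax, alongside the causal mask.
        pos = torch.arange(seq_len)
        dist = (pos[None, :] - pos[:, None]).clamp(max=0)  # j - i, <= 0 in the causal region
        return alibi_slopes(n_heads)[:, None, None] * dist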
Edit:

  - Things I forgot to mention: NO RLHF / DPO; it's entirely dataset-driven.
  - The model seems mostly harmless due to being trained only with synthetic data.
  - A side-effect of only being trained on synthetic data is that the model learns quite fast.
  - There's probably less than 2 weeks of actual training time in the model so far.
  - You don't need to start from scratch when altering model parameters: weights can be copied/merged in and out of smaller/larger models (see the sketch after this list).
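As a rough sketch of that copy/merge idea (a hypothetical helper working on plain PyTorch state dicts, not the project's actual code): each source tensor is copied into the overlapping slice of the matching destination tensor, and any newly added rows/columns keep their fresh initialisation.

    import torch

    @torch.no_grad()
    def copy_overlapping_weights(src_state: dict, dst_state: dict) -> dict:
        # Hypothetical helper (not from the Parakeet repo): copy each source
        # tensor into the overlapping slice of the matching destination tensor.
        for name, dst in dst_state.items():
            src = src_state.get(name)
            if src is None or src.dim() != dst.dim():
                continue  # parameter is new (or changed rank) in the destination model
            region = tuple(slice(0, min(s, d)) for s, d in zip(src.shape, dst.shape))
            dst[region].copy_(src[region])
        return dst_state

    # Usage with two hypothetical model instances:
    # bigger.load_state_dict(copy_overlapping_weights(smaller.state_dict(), bigger.state_dict()))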
Why?

  - Curiosity got the better of me - I wanted to know what would happen if a model with a comparatively small number of parameters was bombarded with data.
  - There were many results showing these language models still had room for more training, but instead many are simply scaled up.
  - I wanted to see what happens if you just keep training them.
References:

  - "./datasets/wikipedia-20220301.en.jsonl"),
  - "./datasets/euclaise_littletown.jsonl"),            # https://huggingface.co/datasets/euclaise/LittleTown?row=99
  - "./datasets/squad-v2.0-processed.jsonl"),
  - "./datasets/huggingface_ultrachat200k.jsonl"),      # https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
  - "./datasets/wizardlm_evol_instruct_v2_196k.jsonl"), # https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
  - "./datasets/crumb_clean_instruct_440k.jsonl"),      # https://huggingface.co/datasets/crumb/Clean-Instruct-440k - Generate a story starting with the sentence "It was already late when they stepped out of the house".
  - "./datasets/openorca_4m.jsonl"),                    # https://huggingface.co/datasets/Open-Orca/OpenOrca
  - "./datasets/databricks_dolly15k.jsonl"),            # https://huggingface.co/datasets/databricks/databricks-dolly-15k - Common-sense reasoning.
  - "./datasets/teven_code_contests4m.jsonl"),          # https://huggingface.co/datasets/teven/code_contests - ['PYTHON', 'PYTHON3', 'JAVA', 'CPP']
  - "./datasets/squad-v2.0-summaries.jsonl"),
  - "./datasets/google-boolq.jsonl"),
  - "./datasets/stingning_ultrachat.jsonl"),            # https://huggingface.co/datasets/stingning/ultrachat
  - "./datasets/wikimovies-train.jsonl"),
  - "./datasets/kunishou-databricks-dolly-15k-ja.jsonl"),
  - "./datasets/wizardlm_evol_instruct_70k.jsonl"),     # https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
  - "./datasets/map_codefeedback.jsonl"),
Sorry for bad formatting!

...continues in reply due to 4000 char. limit.




This is awesome! For some years now we haven't seen NLP projects that we can run on the normal computers we have at home (no, a 24GB GPU still isn't normal). I have some questions: How long did the training take? What are the specs of your system? In total, how big is the training dataset?


RE: How big is the training dataset.

It's probably over 50GB, so it's very unlikely I will be able to overfit the model. The best part is that this means the model is always learning new things and showing new abilities.

I noticed that straight after giving it ONLY a coding dataset it became a bit better at logical puzzles, so I think there's a side-effect to training an LLM on sequential information like code.

The LLM seems more inclined to learn structure, for example: the structure of language, the structure of poems, the structure of music lyrics, etc.


3080 Ti (12GB VRAM), 64GB DDR4, AMD Ryzen 7, 8TB of drive space (I built it with the idea that I would be hoarding a lot of data for potential AI research).

---

The model has probably spent ~2 weeks straight of training to get to this level (it learns VERY fast - I suspect from GQA + ALiBi + the type of training I've given it).

It's why I think there's considerably more training it can still do.


You can also use Project Gutenberg if you want a huge, legal dataset.

I’ve been collecting papers on small models, training with small data, using small ones to jump-start big ones, alternatives to backpropagation, etc. I wanted to eventually do a sub-500M model with those techniques that could be reused in other projects. I may or may not get around to it.

Email me and I’ll send you some of those links for your research.


I'd love it if the only thing this LLM ever replied with was "who's a pretty boy".


Very good. Here's a cracker.


Human: From now on, you only reply with 'Who's a pretty boy' or 'Parakeet want a cracker'. Do you understand?

Generating: 10%| | 1/10 [00:16<02:24, 16.08s/it] Assistant: From what I can tell, I do not have a favorite thing or personal experience. However, I can understand that you might have a pretty boy who wants a cracker to be like a cracker.


Awesome!

How are you doing the synthetic data generation?

What is the best open source small LLM?


https://pastebin.com/Rz9k1xMt
https://pastebin.com/6bxRYx31

Here are some examples - it's been a bit of a side-obsession for the last few months experimenting with different architectures.

There are so many different ways this could have gone, and I've finally zeroed in on an approach that works, which is why I'm starting to open up.

There will definitely be a dump in the near future of absolutely everything related to this project.

The ultimate goal was to run it on CPU.


This is amazing! Thank you!


Learnings:

There's A LOT of data on the internet; good luck getting through all of it. These models show an interesting recency bias.

  - The model seems to be aware of its training and how it has changed over time?
  - Due to the model size it seems to easily forget information gathered during training, so every sample is picked in a cycle from the datasets above.
  - The `InterleavedDataset` with `shuffle=True` messes with the above constraint, so that needs to be fixed.
  - JSONL became necessary due to the datasets taking many GBs on disk; currently it's using a primitive method of simply streaming lines from disk during training (byte offsets are calculated when the datasets are loaded - see the sketch after this list).
  - It displays interesting abilities to reason even at this size.
  - It performs best at summarisation when the information is provided.
  - It hallucinates considerably.
  - Story-telling is very "we went here, then we went there" but there's no "depth" - it's just a sequence of events.
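As mentioned in the JSONL point above, here's a minimal sketch of offset-indexed line streaming plus the cycling behaviour - my own reconstruction from the description (class and function names are made up), not the actual `InterleavedDataset` code:

    import json

    class JsonlDataset:
        # Index byte offsets once, then seek to individual lines on demand,
        # so multi-GB files never have to sit in RAM.
        def __init__(self, path: str):
            self.path = path
            self.offsets = []
            with open(path, "rb") as f:
                pos = 0
                for line in f:
                    self.offsets.append(pos)
                    pos += len(line)

        def __len__(self):
            return len(self.offsets)

        def __getitem__(self, i: int):
            with open(self.path, "rb") as f:
                f.seek(self.offsets[i])
                return json.loads(f.readline())

    def interleave(datasets):
        # Round-robin over the datasets so every source keeps being revisited
        # ("picked in a cycle"), wrapping around each one as it runs out.
        cursors = [0] * len(datasets)
        while True:
            for i, ds in enumerate(datasets):
                yield ds[cursors[i] % len(ds)]
                cursors[i] += 1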
Future work:

I'm going to continue hacking away at this project as I want to see where it leads, but I'm very keen to get it out into the wild when time permits (it's exhausting trying to keep up with this field).

Some examples below:

>>>

Human: Can you help me with a puzzle?

I bought sushi for lunch but I don't remember where I left it.

Here are the steps I took before it went missing:

    I came home, tossed my keys on the table and brought my lunch to the kitchen.

    I took my lunch out of its bag and threw the bag in the bin.

    I put my lunch on a plate then went to my bedroom.

    My phone rang so I went back to the kitchen.
---

Where did I put my sushi? Please think step-by-step.<|endofprompt|>

Assistant: ->Based on the steps I took before it went missing, I came home and placed my keys on the table. I then placed my lunch on a plate then went to my bedroom.

I put my lunch on a plate then went to my bedroom.

Took: 0.7614264488220215s

>>>

Human: Read this sentence: "The Parakeet hopped out of its cage and landed on the table."

Pretend you're the Parakeet. What do you see around the room after you hop out of your cage? Please be descriptive.<|endofprompt|>

Assistant: ->As I look out the window, I see a small room with a large table surrounded by chairs. The walls are lined with various items that make for an immersive experience. The furniture is made of high-quality wood and has a sleek design that adds to the overall ambiance. The table is adorned with colorful paintings and sculptures, giving the space a cozy and inviting feel.

The furniture is made of high-quality wood, which adds to its charm and sophistication. The chairs are made from high-quality materials that add to the overall elegance and sophistication of the space.

The tables are arranged in a way that creates an elegant ambiance. The chairs are arranged in a way that allows for easy movement around the room, making it easy to find what you're looking for.

Overall, I am sure you'll find something that will leave you feeling refreshed and rejuvenated after your workout.

Took: 5.08751106262207s

>>>

    Human: Please create a website about a new Robot Cafe. Show the entire HTML script.<|endofprompt|>
    
    Assistant: -><!DOCTYPE html>
    <html>
    <head>
        <title> Robot Cafe</title>
        <style>
            body {
                font-family: Arial, sans-serif;
                font-size: 16px;
                line-height: 1.5;
            }
        </style>
    </head>
    <body>
    
        <div class="container">
            <h1> Robot Cafe</h1>
            <p>The Robot Cafe is a new Robot Cafe that offers a variety of products, from electronics to medical devices. The products range from basic medical devices to more advanced medical devices.</p>
    
            <h2>The Robot Cafe is a new Robot Cafe that offers a variety of products, from medical devices to advanced medical devices.</p>
    
            <h2>The Robot Cafe is a new Robot Cafe that offers a variety of products, from medical devices to advanced medical devices.</p>
    
            <h2>The Robot Cafe is a new Robot Cafe that specializes in using natural and sustainable materials. The products range from basic medical devices to more advanced medical devices.</p>
    
            </h2>
    
            <h2>The Robot Cafe is another new robot cafe that specializes in using eco-friendly materials and has a wide range of customizable options. The products range from basic medical devices to more advanced medical devices.</p>
    
            <h3>The Robot Cafe is another new robot cafe that specializes in using eco-friendly materials and has an extensive range of customizable options. The products range from basic medical devices to more advanced medical devices.</p>
    
            </h2>
    
        </div>
    </body>
    
    </html>
    ```
    
Took: 7.056133508682251s


How many tokens have you trained on?


Absolutely no idea, sorry!! This was more for understanding how an LLM even works.

I just happen to understand the ML pipeline so I went beyond Andrej Karpathy's tutorial.

It's an extract of TinyGPT, but built from the ground up to work with newer features such as Multi/Grouped-Query Attention, `min_p` sampling, KV caching, etc.
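For reference, a generic `min_p` sampling sketch (the 0.05 cut-off below is an assumed default, not necessarily Parakeet's setting): any token whose probability falls below a fraction of the top token's probability is discarded before sampling.

    import torch

    def min_p_sample(logits: torch.Tensor, min_p: float = 0.05, temperature: float = 1.0) -> int:
        # Keep only tokens with probability >= min_p * p(top token),
        # then sample from the renormalised remainder.
        probs = torch.softmax(logits / temperature, dim=-1)
        keep = probs >= min_p * probs.max()
        probs = torch.where(keep, probs, torch.zeros_like(probs))
        return torch.multinomial(probs / probs.sum(), num_samples=1).item()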

---

Unfortunately, as this is a one-GPU personal setup, everything had to be thought out around the constraint of never being able to fully finish training the model.

---

What I can tell you, though, is that after ~40 hours of training the model starts showing the ability to speak and perform tasks.

At ~80 hours it improves upon these tasks and starts showing a tiny (very minute) bit of understanding of where it fits into the picture.

---

GPT-2's smallest was 12 layers, I believe.

Layers seem to correlate with the model's ability to "compute". The embedding dimension seems to correlate with the model's ability to "communicate".

If you "communicate" a larger amount of information through the layers (compute) you get a much smarter AI.

---

Additionally, there's an interesting effect of GQA/MQA where the model is forced to share its work between the layers. I think about it like a group of students, either working on their own or in groups:

  - Too many and it becomes chaos.
  - Too few and learning takes longer as there's no collaboration between individuals.
  - In my case, 2 "students" per N layers means the network learns quickly and infers fast.
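A generic sketch of that sharing in plain PyTorch (not Parakeet's exact implementation): each group of query heads attends against one shared key/value head, which also shrinks the KV cache.

    import torch
    import torch.nn.functional as F

    def grouped_query_attention(q, k, v):
        # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
        group = q.shape[1] // k.shape[1]       # query heads per shared K/V head
        k = k.repeat_interleave(group, dim=1)  # expand K/V to line up with the query heads
        v = v.repeat_interleave(group, dim=1)
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)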

Hopefully that makes sense - it's been a non-stop torrent of learning for me to answer the more obvious questions that always seem to get avoided.

Everyone is (obviously) more interested in competing and scaling up their models, so the smaller models get neglected.

From what I can tell, something like X.AI's Grok can be thought of as extremely high capacity without being fully saturated (it has a lot of space to learn more).


Thanks. Have estimated this as 3B tokens as a round number at 8:1, but if you have a firmer number I'd love to know it.

Added to the Models Table (a few 'small' and 'tiny' models there, too):

https://lifearchitect.ai/models


Oh wow. Thanks! Loved watching your videos on YouTube. This project had a million bug fixes and changes to the code over time, so there have been hundreds of models that have been trained and deleted. My methods aren't exactly scientific, unfortunately.

I'll do a write-up one day, but for anyone else interested I have uploaded some information here: https://ftp.bytebreeze.dev/ftpuser/

I need to do a serious cleanup of the code and host it somewhere more stable, but the raw version of what I have so far can be found there.

If anyone gets it running, keen to hear feedback!



