Hi all,
https://colab.research.google.com/drive/1gI8CM9Bz9ov0-E6aL2jF808rE56UtZyF?usp=sharing
This post is mostly to gauge interest; I plan to release the entire end-to-end code, including:
- Dataset curation (including citations).
- Model checkpoints.
- Inference code.
- Synthetic data generation.
- etc.
Parakeet is a small language model I've designed from the ground up for research.
The challenge was to see how far I could push current LLM techniques in a massively constrained environment.
It was trained on a single 3080 Ti and still has a considerable amount of training left to do, but here are the results so far.
Specs:
- 18 layers / 18 heads
- 8K context.
- 1152 embedding dimension.
- cl100k_base tokenizer (tiktoken)
- ALiBi (the longest sequences I can train on are 1,200 tokens, so ALiBi's length extrapolation was crucial for reaching the 8K context; rough sketch after this list)
- KV caching for improved inference.
- Grouped Query Attention (2 layers per group / speeds up inference)
- `min_p` sampling: cuts off low-probability/low-quality tokens at inference time (sketch after this list).
- Softmax1: https://github.com/kyegomez/AttentionIsOFFByOne - not sure if this really made much of a difference; it's hard to train comparable models when compute is this limited (sketch after this list).
- Sub-400M parameters (378M, from memory).
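For anyone curious how ALiBi and Softmax1 fit together, here is a rough PyTorch sketch of the attention step. This is my own illustration of the general idea, not Parakeet's actual code; `alibi_slopes` uses the simple geometric formula from the ALiBi paper, and the reference implementation tweaks it slightly for head counts that aren't powers of two (like 18).

```python
import math
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Simple geometric per-head slopes 2^(-8/n), 2^(-16/n), ... (Press et al.).
    return torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])

def softmax1(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # "Attention Is Off By One": softmax with an extra +1 in the denominator,
    # so a head can assign near-zero total attention instead of being forced to sum to 1.
    m = x.max(dim=dim, keepdim=True).values.clamp(min=0.0)
    ex = torch.exp(x - m)
    return ex / (torch.exp(-m) + ex.sum(dim=dim, keepdim=True))

def attention_alibi_softmax1(q, k, v, slopes):
    # q, k, v: (batch, n_heads, seq_len, head_dim)
    B, H, T, D = q.shape
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(D)                  # (B, H, T, T)
    pos = torch.arange(T)
    dist = (pos[None, :] - pos[:, None]).float()                       # j - i, <= 0 for past tokens
    scores = scores + slopes.view(1, H, 1, 1) * dist.view(1, 1, T, T)  # linear distance penalty
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return softmax1(scores, dim=-1) @ v
```

Because the bias only depends on relative distance, sequences longer than the 1,200-token training length can be run at inference; the penalty simply keeps growing linearly.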
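And a minimal sketch of `min_p` filtering at sampling time (again my own illustration; the 0.05/0.1 values are placeholders, not Parakeet's settings): tokens whose probability falls below `min_p` times the top token's probability are masked out before sampling.

```python
import torch

def min_p_filter(logits: torch.Tensor, min_p: float = 0.05) -> torch.Tensor:
    # Drop tokens whose probability is below min_p * (probability of the most likely token).
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    return logits.masked_fill(probs < threshold, float("-inf"))

# Usage in a sampling loop:
# filtered = min_p_filter(logits[:, -1, :], min_p=0.1)
# next_token = torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1)
```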
Edit:
- Things I forgot to mention: NO RLHF / DPO; it's entirely dataset-driven.
- The model seems mostly harmless due to being trained only with synthetic data.
- A side-effect of only being trained on synthetic data is that the model learns quite fast.
- There's probably less than 2 weeks of actual training time in the model so far.
- You don't need to start from scratch when altering model parameters: weights can be copied/merged in and out of smaller/larger models (rough sketch below).
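To illustrate that last point, here is a naive sketch of the kind of weight surgery involved: copy the overlapping slice of every tensor from one checkpoint into the other and let training fill in the rest. The helper name and approach are mine, not necessarily how Parakeet does it.

```python
import torch

def copy_overlapping_weights(source_state_dict, target_model):
    # For every parameter present in both models, copy the overlapping region
    # of the source tensor into the target; anything outside that region keeps
    # its fresh initialisation. Works for growing or shrinking a model.
    target_sd = target_model.state_dict()
    for name, src in source_state_dict.items():
        if name not in target_sd or src.dim() != target_sd[name].dim():
            continue
        region = tuple(slice(0, min(s, t)) for s, t in zip(src.shape, target_sd[name].shape))
        target_sd[name][region] = src[region]
    target_model.load_state_dict(target_sd)
```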
Why?
- Curiosity got the better of me: I wanted to know what would happen if a model with a relatively small number of parameters was bombarded with data.
- Many results show these language models still have room for more training, yet instead they usually get scaled up.
- I wanted to see what happens if you just keep training them.
References:
- "./datasets/wikipedia-20220301.en.jsonl"),
- "./datasets/euclaise_littletown.jsonl"), # https://huggingface.co/datasets/euclaise/LittleTown?row=99
- "./datasets/squad-v2.0-processed.jsonl"),
- "./datasets/huggingface_ultrachat200k.jsonl"), # https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
- "./datasets/wizardlm_evol_instruct_v2_196k.jsonl"), # https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
- "./datasets/crumb_clean_instruct_440k.jsonl"), # https://huggingface.co/datasets/crumb/Clean-Instruct-440k - Generate a story starting with the sentence "It was already late when they stepped out of the house".
- "./datasets/openorca_4m.jsonl"), # https://huggingface.co/datasets/Open-Orca/OpenOrca
- "./datasets/databricks_dolly15k.jsonl"), # https://huggingface.co/datasets/databricks/databricks-dolly-15k - Common-sense reasoning.
- "./datasets/teven_code_contests4m.jsonl"), # https://huggingface.co/datasets/teven/code_contests - ['PYTHON', 'PYTHON3', 'JAVA', 'CPP']
- "./datasets/squad-v2.0-summaries.jsonl"),
- "./datasets/google-boolq.jsonl"),
- "./datasets/stingning_ultrachat.jsonl"), # https://huggingface.co/datasets/stingning/ultrachat
- "./datasets/wikimovies-train.jsonl"),
- "./datasets/kunishou-databricks-dolly-15k-ja.jsonl"),
- "./datasets/wizardlm_evol_instruct_70k.jsonl"), # https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
- "./datasets/map_codefeedback.jsonl"),
Sorry for bad formatting!
...continues in reply due to 4000 char. limit.