This repository contains the code and documentation for pre-training, fine-tuning, and evaluating PhoneLM, a family of highly capable and efficient small language models. An end-to-end demo of PhoneLM running on a smartphone is available in mllm.

Model Downloads

HuggingFace
PhoneLM-1.5B
PhoneLM-1.5B-Instruct
PhoneLM-1.5B-Call
PhoneLM-0.5B
PhoneLM-0.5B-Instruct

Evaluation Results

Comprehensive Evaluation

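| Model | HellaSwag | WinoGrande | PIQA | SciQ | BoolQ | ARC Easy | ARC Challenge | Average |
|---|---|---|---|---|---|---|---|---|
| PhoneLM-1.5B | 66.9 | 63.0 | 77.3 | 88.8 | 65.5 | 69.7 | 39.9 | 67.31 |
| Pythia-1.4B | 52.0 | 57.2 | 71.1 | 79.2 | 63.2 | 53.9 | 28.3 | 57.84 |
| OPT-1.3B | 53.7 | 59.0 | 71.0 | 78.1 | 57.2 | 51.3 | 28.0 | 56.90 |
| BLOOM-1.1B | 43.0 | 54.9 | 67.2 | 74.6 | 59.1 | 45.4 | 25.6 | 52.83 |
| TinyLlama-1.1B | 59.1 | 58.9 | 73.0 | 82.3 | 58.6 | 55.7 | 31.0 | 59.80 |
| MobileLLaMA-1.4B | 56.1 | 59.4 | 73.0 | 81.9 | 56.7 | 55.8 | 30.3 | 59.03 |
| MobiLlama-1B | 62.2 | 59.3 | 74.8 | 82.8 | 60.3 | 56.4 | 31.7 | 61.07 |
| OpenELM-1.1B | 64.8 | 61.7 | 75.6 | 83.6 | 63.6 | 55.4 | 32.3 | 62.43 |
| DCLM-1.4B | 53.6 | 66.3 | 77.0 | 94.0 | 71.4 | 74.8 | 41.2 | 68.33 |
| SmolLM-1.7B | 49.6 | 60.9 | 75.8 | 93.2 | 66.0 | 76.4 | 43.5 | 66.49 |
| Qwen 1.5-1.8B | 60.9 | 60.5 | 74.2 | 89.4 | 66.5 | 59.1 | 34.7 | 63.61 |
| Galactica-1.3B | 41.0 | 54.4 | 63.8 | 87.7 | 62.0 | 58.6 | 30.5 | 56.86 |
| StableLM 2-1.6B | 68.8 | 64.1 | 75.1 | 76.9 | 80.0 | 60.3 | 39.2 | 66.34 |
| Cerebras-GPT-1.3B | 38.4 | 51.9 | 66.8 | 73.0 | 59.3 | 45.8 | 25.3 | 51.50 |
| MiniCPM-1B | 67.5 | 63.7 | 75.1 | 91.0 | 70.5 | 62.9 | 38.1 | 66.97 |
| MiniCPM-2B | 67.2 | 63.9 | 76.1 | 92.5 | 74.6 | 69.0 | 42.7 | 69.43 |
| Gemma-2B | 71.4 | 65.2 | 78.4 | 91.4 | 69.9 | 72.3 | 42.0 | 70.09 |
| Gemma 2-2B | 55.0 | 68.7 | 78.7 | 96.0 | 73.6 | 80.3 | 46.9 | 71.31 |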
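
The Average column appears to be the unweighted mean of the seven task scores. As a quick check under that assumption, the sketch below recomputes it for two rows copied from the table. A difference in the last digit (for example 67.30 recomputed versus 67.31 reported for PhoneLM-1.5B) would be expected if the reported averages were computed from unrounded scores.

```python
from statistics import mean

# Task scores copied from the table, in column order:
# HellaSwag, WinoGrande, PIQA, SciQ, BoolQ, ARC Easy, ARC Challenge.
task_scores = {
    "PhoneLM-1.5B": [66.9, 63.0, 77.3, 88.8, 65.5, 69.7, 39.9],
    "Pythia-1.4B": [52.0, 57.2, 71.1, 79.2, 63.2, 53.9, 28.3],
}
reported_average = {"PhoneLM-1.5B": 67.31, "Pythia-1.4B": 57.84}

# Assumption: Average = unweighted mean over the seven tasks.
recomputed = {model: mean(vals) for model, vals in task_scores.items()}

for model, value in recomputed.items():
    print(f"{model}: recomputed {value:.2f}, reported {reported_average[model]:.2f}")
```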

Android Function Call

To enhance the model's capability in smartphone operation, we fine-tuned PhoneLM on DroidCall, a synthetic dataset of Android intent invocations generated by GPT-4.

We currently report two simple metrics that reflect function-calling ability, Accuracy and Soft Accuracy:

| Model | Accuracy | Soft Accuracy |
|---|---|---|
| PhoneLM-1.5B-Instruct | 17.5 | 17.8 |
| PhoneLM-1.5B-Call | 76.5 | 89.3 |
| Qwen2.5-Coder-1.5B | 50.0 | 63.5 |
| Qwen2.5-1.5B-Instruct | 58.5 | 75.3 |
| Phi-3.5-mini-instruct | 62.0 | 77.7 |
| MiniCPM3-4B | 70.0 | 85.7 |
| Gemma-2-2b-it | 56.5 | 75.8 |
| TinyLlama-1.1B-Chat-v1.0 | 18.0 | 18.7 |
| Llama-3.2-1B-Instruct | 36.0 | 43.8 |
| Llama-3.2-3B-Instruct | 47.5 | 57.9 |
| GPT-4o-mini | 71.0 | 86.1 |
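
This README does not spell out how the two metrics are computed. One plausible reading, shown here purely as an assumption, is that Accuracy requires the predicted call to match the reference exactly, while Soft Accuracy gives partial credit for each correctly predicted argument. The call structure (`name`, `arguments`), the helper names, and the `set_alarm` example are hypothetical and are not taken from the DroidCall schema.

```python
def exact_accuracy(pred: dict, gold: dict) -> float:
    """Return 1.0 only when the function name and every argument match."""
    same_name = pred.get("name") == gold["name"]
    same_args = pred.get("arguments", {}) == gold["arguments"]
    return 1.0 if same_name and same_args else 0.0


def soft_accuracy(pred: dict, gold: dict) -> float:
    """Return the fraction of reference arguments reproduced (0.0 if the name is wrong)."""
    if pred.get("name") != gold["name"]:
        return 0.0
    gold_args = gold["arguments"]
    if not gold_args:
        return 1.0
    pred_args = pred.get("arguments", {})
    hits = sum(1 for key, value in gold_args.items()
               if key in pred_args and pred_args[key] == value)
    return hits / len(gold_args)


# Hypothetical example: right function, one of two arguments wrong.
gold = {"name": "set_alarm", "arguments": {"hour": 7, "minutes": 30}}
pred = {"name": "set_alarm", "arguments": {"hour": 7, "minutes": 0}}

print(exact_accuracy(pred, gold))  # 0.0
print(soft_accuracy(pred, gold))   # 0.5
```

Averaging each function over an evaluation set would give one number per metric, which is consistent with Soft Accuracy never falling below Accuracy in the table.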

Running PhoneLM

HuggingFace

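from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'mllmTeam/PhoneLM-1.5B-Instruct'
question = "Hello, who are you?"
prompt = [{"role": "user", "content": question}]

# trust_remote_code=True allows custom model code from the model repository to run.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cuda', trust_remote_code=True)

# Render the chat messages as one prompt string; add_generation_prompt=True
# appends the tokens that open the assistant turn.
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)

# Tokenize the prompt and move the tensors to the GPU alongside the model.
inp = tokenizer(input_text, return_tensors="pt")
inp = {k: v.to('cuda') for k, v in inp.items()}
# max_length caps the total sequence length (prompt plus generated tokens).
out = model.generate(**inp,
                     max_length=256,
                     do_sample=True,
                     temperature=0.7,
                     top_p=0.7
                     )
# out[0] contains the prompt followed by the generated reply.
text = tokenizer.decode(out[0], skip_special_tokens=True)
print(text)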

mllm

We provide PhoneLM in the mllm model format, which can be loaded directly by mllm.

Install mllm

git clone https://github.com/UbiquitousLearning/mllm.git
cd mllm/scripts/
./build.sh

Inference

cd ../bin
./demo_phonelm -m /path/to/model.mllm 

Training PhoneLM

Install Python Environment

pip install -r requirement.txt

Stable Training Stage

We use the following datasets in the stable training stage.

| type | dataset | token |
|---|---|---|
| web | DCLM-baseline | 1.35T |
| code | StarCoderData | 112.75B |
| math | OpenWebMath | 13.25B |
| academic | Dolma-algebraic | 12.75B |
| academic | Dolma-arxiv | 29B |
| total | | 1.5T |
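
To see how the mixture is weighted, the short sketch below sums the per-dataset token counts from the table and prints each dataset's share. The listed rows add up to 1517.75B tokens, which the table rounds to 1.5T. The variable names are illustrative only.

```python
# Token counts from the table, in billions of tokens.
stable_tokens_b = {
    "DCLM-baseline": 1350.0,  # 1.35T
    "StarCoderData": 112.75,
    "OpenWebMath": 13.25,
    "Dolma-algebraic": 12.75,
    "Dolma-arxiv": 29.0,
}

total_b = sum(stable_tokens_b.values())
shares = {name: tokens / total_b for name, tokens in stable_tokens_b.items()}

print(f"total: {total_b:.2f}B tokens")
for name, share in shares.items():
    print(f"{name:16s} {share:6.1%}")
```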

Download The Original Data

You can download the datasets from the links in the table above using any method. As an example, the following command uses huggingface-cli to download DCLM-baseline:

huggingface-cli download --repo-type dataset --local-dir ./dclm-baseline --local-dir-use-symlinks False --resume-download mlfoundations/dclm-baseline-1.0-parquet

The other datasets can be downloaded in the same way, or with any other tool you prefer.

Preprocess the dataset

Before pretraining, the dataset must be tokenized in advance. To do so, you first need to know the file format of the dataset and which field is used for pretraining. Take dclm-baseline as an example: its data files are in parquet format, and its Dataset Card shows that the text field of each entry is used for pretraining. With this information, you can tokenize the data using the following command:

python pretokenize.py path/to/dataset path/to/output_dir \
  --prefix prefix_of_output_file \
  --handler file_format \
  --field field_used_to_pretrain \
  --num_workers workers_to_process \
  --tokenizer_path path/to/tokenizer \
  --max_size max_tokens_of_each_output_file

For example, to tokenize dclm-baseline, run the following command in the PhoneLM directory:

python pretokenize.py path/to/dclm-baseline ./train_datasets/dclm-baseline \
  --prefix dclm-baseline \
  --handler parquet \
  --field text \
  --tokenizer_path tokenizer

The output will look like:

train_datasets/
└── dclm-baseline
    ├── dclm-baseline-000-00000.data
    ├── dclm-baseline-001-00000.data
    ├── dclm-baseline-002-00000.data
    ├── dclm-baseline-003-00000.data
    ...
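
To make the roles of `--field`, `--prefix`, and `--max_size` concrete, here is a minimal sketch of the idea behind pretokenization: read one text field per record, convert it to token IDs, and write the IDs into shard files capped at a maximum token count. It is not the implementation in pretokenize.py. The whitespace tokenizer, the raw unsigned-int file layout, and the simplified shard naming are all assumptions made for illustration.

```python
import tempfile
from array import array
from pathlib import Path


def toy_tokenize(text, vocab):
    """Stand-in tokenizer: split on whitespace and assign IDs on first sight."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def pretokenize(records, out_dir, prefix, field="text", max_size=8):
    """Tokenize `field` of each record and write shards of at most `max_size` tokens.

    A single record longer than `max_size` is kept whole in its own shard.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    vocab, shard, paths = {}, array("I"), []

    def flush():
        # Assumed on-disk layout: raw unsigned-int token IDs and nothing else.
        path = out_dir / f"{prefix}-{len(paths):05d}.data"
        with open(path, "wb") as handle:
            shard.tofile(handle)
        paths.append(path)

    for record in records:
        ids = toy_tokenize(record[field], vocab)
        if shard and len(shard) + len(ids) > max_size:
            flush()
            shard = array("I")  # start a new shard
        shard.extend(ids)
    if shard:
        flush()
    return paths


records = [{"text": "a b c"}, {"text": "d e f"}, {"text": "g h"}]
with tempfile.TemporaryDirectory() as tmp:
    paths = pretokenize(records, tmp, prefix="demo", max_size=5)
    shard_sizes = [p.stat().st_size // array("I").itemsize for p in paths]
print(shard_sizes)  # [3, 5]
```

Records are never split across shards in this sketch, which keeps each document contiguous at the cost of slightly uneven shard sizes.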

Train

After you repeat this step for every dataset, the tokenized data is stored in train_datasets. You can then start pretraining with the following command:

deepspeed train.py --config config_phonelm_1.5b.yaml

Decay Stage

In the decay stage, the data includes some of the datasets from the stable training stage (DCLM-baseline, StarCoderData, and Dolma) together with high-quality fine-tuning data that is also used in the fine-tuning stage. The following table shows the data:

| Type | Dataset | Token |
|---|---|---|
| web | DCLM-baseline | 10B |
| code | StarCoderData | 1.575B |
| code | The Stack Smol | 0.95B |
| academic | Dolma-arxiv | 2.325B |
| academic | Dolma-pes2o | 2.35B |
| math instruct | MathInstruct | 65.25M |
| chat instruct | UltraChat | 1.775B |
| chat instruct | OpenAssistant 2 | 42.25M |
| chat instruct | OpenHermes | 77.25M |
| code instruct | Magicoder Evol Instruct | 30.25M |
| code instruct | CommitPackFT | 0.35B |
| code instruct | Magicoder OSS Instruct | 43.5M |
| function calling | SlimOrca | 209.75M |
| function calling | APIGen | 48.25M |
| function calling | Glaive Function Calling | 57.5M |
| total | | 20B |
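
Because the table mixes M and B units, it can help to normalize the counts before comparing them. The sketch below parses the token strings, sums them, and splits the total into pretraining-style data (web, code, academic) and instruction-style data. It assumes the OpenAssistant row reads as "OpenAssistant 2" with 42.25M tokens. The grouping into two buckets is also an illustrative choice, not something the table states.

```python
UNIT = {"M": 1e6, "B": 1e9, "T": 1e12}


def parse_tokens(text: str) -> float:
    """Convert a string such as '65.25M' or '1.575B' to a token count."""
    return float(text[:-1]) * UNIT[text[-1]]


decay_mix = [
    ("web", "DCLM-baseline", "10B"),
    ("code", "StarCoderData", "1.575B"),
    ("code", "The Stack Smol", "0.95B"),
    ("academic", "Dolma-arxiv", "2.325B"),
    ("academic", "Dolma-pes2o", "2.35B"),
    ("math instruct", "MathInstruct", "65.25M"),
    ("chat instruct", "UltraChat", "1.775B"),
    ("chat instruct", "OpenAssistant 2", "42.25M"),  # assumed reading of the row
    ("chat instruct", "OpenHermes", "77.25M"),
    ("code instruct", "Magicoder Evol Instruct", "30.25M"),
    ("code instruct", "CommitPackFT", "0.35B"),
    ("code instruct", "Magicoder OSS Instruct", "43.5M"),
    ("function calling", "SlimOrca", "209.75M"),
    ("function calling", "APIGen", "48.25M"),
    ("function calling", "Glaive Function Calling", "57.5M"),
]

PRETRAIN_TYPES = {"web", "code", "academic"}

total = sum(parse_tokens(t) for _, _, t in decay_mix)
pretrain = sum(parse_tokens(t) for kind, _, t in decay_mix if kind in PRETRAIN_TYPES)
instruct = total - pretrain

print(f"total: {total / 1e9:.3f}B tokens")
print(f"pretraining-style: {pretrain / total:.1%}, instruction-style: {instruct / total:.1%}")
```

Under these assumptions the rows sum to about 19.9B tokens, in line with the 20B total in the table.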

The datasets in the table above, other than those used for pretraining, each have their own format. To standardize them in this phase, we convert all SFT data into a chat format and render it as text using a unified template.

Here is an example. First, download the dataset as shown above. Then use the following commands to process it:

python prepare_chat.py path/to/MathInstruct chat/MathInstruct --dataset_name MathInstruct # process MathInstruct

python prepare_chat.py ../datasets/Magicoder-OSS-Instruct-75K/ chat/Magicoder --dataset_name Magicoder # process Magicoder

After processing the dataset, the chat directory looks like this:

chat/
├── Magicoder
│   └── 000_Magicoder_00000.parquet
└── MathInstruct
    └── 000_MathInstruct_00000.parquet

The processed data has the following format:

{
  "text": "pretrain data",
  "chat": [
    {"role": "...", "content": "..."},
    ...
  ]
}
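
As an illustration of this record layout, the sketch below turns a single instruction/response pair into a record with both a `chat` list and a rendered `text` string. The `<|role|>` template and the helper names are hypothetical. The unified template that prepare_chat.py actually applies is not described here, so treat this only as an example of the shape of the data.

```python
import json


def render_chat(chat):
    """Hypothetical template: one '<|role|>' header line per message."""
    return "".join(f"<|{m['role']}|>\n{m['content']}\n" for m in chat)


def to_record(instruction, response):
    """Build a record with the same two top-level fields as the format above."""
    chat = [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": response},
    ]
    return {"text": render_chat(chat), "chat": chat}


record = to_record("What is 2 + 3?", "2 + 3 = 5")
print(json.dumps(record, indent=2))
```

Keeping both fields means the same file can feed the decay stage (through `text`) and the later instruction tuning (through `chat`).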

You can then tokenize the text field with pretokenize.py to obtain the decay-stage pretraining data.

Train

Subsequently, you can start decay stage training with the following command:

deepspeed train.py --config config_phonelm_1.5b_stage2.yaml

Instruct Following Tuning

In this stage, you need to set up the dataset directory structure as follows:

train_datasets_instruct/
├── commitpackft
│   ├── 000_commitpackft_00000.parquet
│   └── ...
└── ...

The datasets are constructed in the same way as in the decay stage.

Train

Launch training with the following command:

deepspeed train_instruct.py --config config_phonelm_1.5b_instruct.yaml

The first time train_datasets_instruct is loaded, two directories, train_dataset_test and val_dataset_test, are generated inside the train_datasets_instruct directory. Subsequent runs read the data directly from these two directories.
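
The caching behaviour described above can be sketched as a "split once, then reuse" pattern. This is not the code in train_instruct.py. The JSON Lines storage, the split ratio, and the function names are assumptions. Only the two directory names are taken from the text.

```python
import json
import random
import tempfile
from pathlib import Path


def _write(directory, samples):
    directory.mkdir(parents=True, exist_ok=True)
    with open(directory / "data.jsonl", "w", encoding="utf-8") as handle:
        for sample in samples:
            handle.write(json.dumps(sample) + "\n")


def _read(directory):
    with open(directory / "data.jsonl", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


def load_or_create_splits(root, samples, val_fraction=0.2, seed=0):
    """Return (train, val, from_cache); split and cache on the first call only."""
    root = Path(root)
    train_dir, val_dir = root / "train_dataset_test", root / "val_dataset_test"
    if train_dir.is_dir() and val_dir.is_dir():
        return _read(train_dir), _read(val_dir), True
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    val, train = shuffled[:n_val], shuffled[n_val:]
    _write(train_dir, train)
    _write(val_dir, val)
    return train, val, False


with tempfile.TemporaryDirectory() as tmp:
    data = [{"id": i} for i in range(10)]
    first = load_or_create_splits(tmp, data)
    second = load_or_create_splits(tmp, data)
print(first[2], second[2], len(first[0]), len(first[1]))  # False True 8 2
```

One practical consequence of this pattern is that changing the source data has no effect until the two cached directories are deleted.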

Function Call Tuning

We fine-tuned our model on the DroidCall dataset to equip it with the capability to operate Android phones. We provide an example of fine-tuning on DroidCall; you can also fine-tune the model in your own way.

First, download the DroidCall dataset and rename it to train_datasets_DroidCall. The dataset structure is as follows:

train_datasets_DroidCall/
└── DroidCall_code_short.jsonl

Train

We provide a simple config for fine-tuning on DroidCall. You can start training with the following command:

deepspeed train_instruct.py --config config_phonelm_1.5b_call.yaml

License

The source code of PhoneLM is licensed under GPL-2.0.

Citation

@misc{yi2024phonelmanefficientcapablesmall,
      title={PhoneLM: an Efficient and Capable Small Language Model Family through Principled Pre-training},
      author={Rongjie Yi and Xiang Li and Weikai Xie and Zhenyan Lu and Chenghua Wang and Ao Zhou and Shangguang Wang and Xiwen Zhang and Mengwei Xu},
      year={2024},
      eprint={2411.05046},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.05046}, 
}