PhoneLM
This repository contains the code and documents for pre-training, fine-tuning, and evaluating PhoneLM, a highly capable and efficient family of small language models. An end-to-end demo of PhoneLM running on a smartphone is available in mllm.
Model Downloads
Model | HuggingFace |
---|---|
PhoneLM-1.5B | https://huggingface.co/mllmTeam/PhoneLM-1.5B |
PhoneLM-1.5B-Instruct | https://huggingface.co/mllmTeam/PhoneLM-1.5B-Instruct |
PhoneLM-1.5B-Call | https://huggingface.co/mllmTeam/PhoneLM-1.5B-Call |
PhoneLM-0.5B | https://huggingface.co/mllmTeam/PhoneLM-0.5B |
PhoneLM-0.5B-Instruct | https://huggingface.co/mllmTeam/PhoneLM-0.5B-Instruct |
Evaluation Results
Comprehensive Evaluation
Model | HellaSwag | WinoGrande | PIQA | SciQ | BoolQ | ARC Easy | ARC Challenge | Average |
---|---|---|---|---|---|---|---|---|
PhoneLM-1.5B | 66.9 | 63.0 | 77.3 | 88.8 | 65.5 | 69.7 | 39.9 | 67.31 |
Pythia-1.4B | 52.0 | 57.2 | 71.1 | 79.2 | 63.2 | 53.9 | 28.3 | 57.84 |
OPT-1.3B | 53.7 | 59.0 | 71.0 | 78.1 | 57.2 | 51.3 | 28.0 | 56.90 |
BLOOM-1.1B | 43.0 | 54.9 | 67.2 | 74.6 | 59.1 | 45.4 | 25.6 | 52.83 |
TinyLlama-1.1B | 59.1 | 58.9 | 73.0 | 82.3 | 58.6 | 55.7 | 31.0 | 59.80 |
MobileLLaMA-1.4B | 56.1 | 59.4 | 73.0 | 81.9 | 56.7 | 55.8 | 30.3 | 59.03 |
MobiLlama-1B | 62.2 | 59.3 | 74.8 | 82.8 | 60.3 | 56.4 | 31.7 | 61.07 |
OpenELM-1.1B | 64.8 | 61.7 | 75.6 | 83.6 | 63.6 | 55.4 | 32.3 | 62.43 |
DCLM-1.4B | 53.6 | 66.3 | 77.0 | 94.0 | 71.4 | 74.8 | 41.2 | 68.33 |
SmolLM-1.7B | 49.6 | 60.9 | 75.8 | 93.2 | 66.0 | 76.4 | 43.5 | 66.49 |
Qwen 1.5-1.8B | 60.9 | 60.5 | 74.2 | 89.4 | 66.5 | 59.1 | 34.7 | 63.61 |
Galactica-1.3B | 41.0 | 54.4 | 63.8 | 87.7 | 62.0 | 58.6 | 30.5 | 56.86 |
StableLM 2-1.6B | 68.8 | 64.1 | 75.1 | 76.9 | 80.0 | 60.3 | 39.2 | 66.34 |
Cerebras-GPT-1.3B | 38.4 | 51.9 | 66.8 | 73.0 | 59.3 | 45.8 | 25.3 | 51.50 |
MiniCPM-1B | 67.5 | 63.7 | 75.1 | 91.0 | 70.5 | 62.9 | 38.1 | 66.97 |
MiniCPM-2B | 67.2 | 63.9 | 76.1 | 92.5 | 74.6 | 69.0 | 42.7 | 69.43 |
Gemma-2B | 71.4 | 65.2 | 78.4 | 91.4 | 69.9 | 72.3 | 42.0 | 70.09 |
Gemma 2-2B | 55.0 | 68.7 | 78.7 | 96.0 | 73.6 | 80.3 | 46.9 | 71.31 |
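All seven tasks are available in the open-source lm-evaluation-harness. The README does not state which tool produced the numbers above, so the following command is only a hedged sketch of how a comparable run for PhoneLM-1.5B could look (task names and flags follow lm-evaluation-harness conventions, not this repository):

pip install lm-eval
lm_eval --model hf \
    --model_args pretrained=mllmTeam/PhoneLM-1.5B,trust_remote_code=True \
    --tasks hellaswag,winogrande,piqa,sciq,boolq,arc_easy,arc_challenge \
    --batch_size 8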
Android Function Call
To enhance the model's capability in smartphone operation, we fine-tuned PhoneLM on the DroidCall dataset, a synthetic dataset of Android intent invocations generated with GPT-4.
Currently we use two simple metrics to measure function-calling ability (a small sketch of both follows this list):
- Accuracy: a sample contains a user query and its corresponding ground-truth function calls. A sample is considered correct only if the model generates all function calls with both correct functions and correct parameters. Accuracy is the ratio of correctly predicted samples to the total number of samples.
- Soft Accuracy: to provide a finer-grained evaluation when the model generates partially correct results (i.e., correct functions with partially correct parameters), we define soft accuracy. For each function call, a score is calculated as the ratio of correctly predicted parameters to the total number of parameters. Soft accuracy is then the average of these scores across all function calls.
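For concreteness, here is a minimal Python sketch of the two metrics above. Representing a call as a dict with name and arguments keys, and pairing predicted calls with ground-truth calls by position, are simplifying assumptions for illustration, not the actual DroidCall evaluation code.

def call_score(pred, gold):
    # Fraction of ground-truth parameters that the predicted call got right;
    # 0 if the function name is wrong or the call is missing entirely.
    if pred is None or pred["name"] != gold["name"]:
        return 0.0
    if not gold["arguments"]:
        return 1.0
    hits = sum(1 for k, v in gold["arguments"].items()
               if pred["arguments"].get(k) == v)
    return hits / len(gold["arguments"])

def accuracy(samples):
    # samples: list of (predicted_calls, ground_truth_calls) pairs.
    # A sample counts only if every call matches exactly.
    correct = sum(
        1 for preds, golds in samples
        if len(preds) == len(golds)
        and all(call_score(p, g) == 1.0 for p, g in zip(preds, golds))
    )
    return correct / len(samples)

def soft_accuracy(samples):
    # Average per-call parameter score over all ground-truth calls.
    scores = [
        call_score(preds[i] if i < len(preds) else None, gold)
        for preds, golds in samples
        for i, gold in enumerate(golds)
    ]
    return sum(scores) / len(scores)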
Model | Accuracy | Soft Accuracy |
---|---|---|
PhoneLM-1.5B-Instruct | 17.5 | 17.8 |
PhoneLM-1.5B-Call | 76.5 | 89.3 |
Qwen2.5-Coder-1.5B | 50.0 | 63.5 |
Qwen2.5-1.5B-Instruct | 58.5 | 75.3 |
Phi-3.5-mini-instruct | 62.0 | 77.7 |
MiniCPM3-4B | 70.0 | 85.7 |
Gemma-2-2b-it | 56.5 | 75.8 |
TinyLlama-1.1B-Chat-v1.0 | 18.0 | 18.7 |
Llama-3.2-1B-Instruct | 36.0 | 43.8 |
Llama-3.2-3B-Instruct | 47.5 | 57.9 |
GPT-4o-mini | 71.0 | 86.1 |
Running PhoneLM
HuggingFace
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'mllmTeam/PhoneLM-1.5B-Instruct'
question = "Hello, who are you?"
prompt = [{"role": "user", "content": question}]

# Load the model and tokenizer; trust_remote_code is required for PhoneLM.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cuda', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Render the chat into the model's prompt format and move the inputs to the GPU.
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
inp = tokenizer(input_text, return_tensors="pt")
inp = {k: v.to('cuda') for k, v in inp.items()}

# Sample a response and decode it back to text.
out = model.generate(**inp,
                     max_length=256,
                     do_sample=True,
                     temperature=0.7,
                     top_p=0.7)
text = tokenizer.decode(out[0], skip_special_tokens=True)
print(text)
mllm
We provide PhoneLM in the mllm format, so it can be run with the mllm framework.
Install mllm
git clone https://github.com/UbiquitousLearning/mllm.git
cd mllm/scripts/
bash build.sh
Inference
cd ../bin
./demo_phonelm -m /path/to/model.mllm
Training PhoneLM
Install Python Environment
pip install -r requirement.txt
Stable Training Stage
We use the following datasets in the stable training stage.
Type | Dataset | Tokens |
---|---|---|
web | DCLM-baseline | 1.35T |
code | StarCoderData | 112.75B |
math | OpenWebMath | 13.25B |
academic | Dolma-algebraic | 12.75B |
academic | Dolma-arxiv | 29B |
total | | 1.5T |
Download The Original Data
You can download each dataset from the link in the table above using any method. As an example, we use huggingface-cli to download DCLM-baseline:
huggingface-cli download --repo-type dataset --local-dir ./dclm-baseline --local-dir-use-symlinks False --resume-download mlfoundations/dclm-baseline-1.0-parquet
Preprocess the Dataset
Before pretraining, the dataset must be tokenized in advance. To do so, you first need to know the dataset's file format and which field of each entry is used for pretraining. Take dclm-baseline as an example: its data files are in Parquet format, and its Dataset Card shows that the text field of each entry is used for pretraining. With that known, use the following command to tokenize the data in advance:
python pretokenize.py path/to/dataset path/to/output_dir \
    --prefix prefix_of_output_file \
    --handler file_format \
    --field field_used_to_pretrain \
    --num_workers workers_to_process \
    --tokenizer_path path/to/tokenizer \
    --max_size max_tokens_of_each_output_file
For example, to tokenize dclm-baseline, run the following command from the PhoneLM repository root:
python pretokenize.py path/to/dclm-baseline ./train_datasets/dclm-baseline \
    --prefix dclm-baseline \
    --handler parquet \
    --field text \
    --tokenizer_path tokenizer
The output will look like:
train_datasets/
└── dclm-baseline
├── dclm-baseline-000-00000.data
├── dclm-baseline-001-00000.data
├── dclm-baseline-002-00000.data
├── dclm-baseline-003-00000.data
...
Train
After performing the same operation on all datasets, the tokenized datasets are stored in train_datasets. Subsequently, you can start pretraining with the following command:
deepspeed train.py --config config_phonelm_1.5b.yaml
Decay Stage
In the decay stage, the data mix retains some datasets from the stable training stage, including DCLM-baseline, StarCoderData, and Dolma. It also adds high-quality instruction data that is used in the fine-tuning stage. The following table shows the data mix.
Type | Dataset | Tokens |
---|---|---|
web | DCLM-baseline | 10B |
code | StarCoderData | 1.575B |
code | The Stack Smol | 0.95B |
academic | Dolma-arxiv | 2.325B |
academic | Dolma-pes2o | 2.35B |
math instruct | MathInstruct | 65.25M |
chat instruct | UltraChat | 1.775B |
chat instruct | OpenAssistant 2 | 42.25M |
chat instruct | OpenHermes | 77.25M |
code instruct | Magicoder Evol Instruct | 30.25M |
code instruct | CommitPackFT | 0.35B |
code instruct | Magicoder OSS Instruct | 43.5M |
function calling | SlimOrca | 209.75M |
function calling | APIGen | 48.25M |
function calling | Glaive Function Calling | 57.5M |
total | | 20B |
Unfortunately, the datasets in the table above, excluding those used for pretraining, each have their own format. To standardize the datasets in this phase, we processed all SFT data into a chat format and rendered it as text using a unified template.
We will show an example. First, download the dataset as shown above. Then use the following commands to process it:
python prepare_chat.py path/to/MathInstruct chat/MathInstruct --dataset_name MathInstruct # process MathInstruct
python prepare_chat.py ../datasets/Magicoder-OSS-Instruct-75K/ chat/Magicoder --dataset_name Magicoder # process Magicoder
After processing the datasets, the chat directory will look like:
chat/
├── Magicoder
│ └── 000_Magicoder_00000.parquet
└── MathInstruct
└── 000_MathInstruct_00000.parquet
The processed data has the following format:
{
"text": "pretrain data",
"chat": [
{"role": "...", "content": "..."},
...
]
}
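The unified template itself is not reproduced in this README. As a minimal sketch, assuming it matches the chat template shipped with the PhoneLM tokenizer, the text field could be reproduced from the chat field like this:

from transformers import AutoTokenizer

# Assumption: the unified template is the tokenizer's built-in chat template.
tokenizer = AutoTokenizer.from_pretrained('mllmTeam/PhoneLM-1.5B-Instruct')
record = {
    "chat": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ]
}
# Render the chat list into flat pretraining text.
record["text"] = tokenizer.apply_chat_template(record["chat"], tokenize=False)
print(record["text"])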
Then tokenize the text field with pretokenize.py, as in the stable training stage, to obtain the decay-stage pretraining data.
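For example, for the processed MathInstruct split above (the output directory and prefix here are illustrative choices, mirroring the dclm-baseline example):

python pretokenize.py chat/MathInstruct ./train_datasets/MathInstruct \
    --prefix MathInstruct \
    --handler parquet \
    --field text \
    --tokenizer_path tokenizer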
Train
Subsequently, you can start decay stage training with the following command:
deepspeed train.py --config config_phonelm_1.5b_stage2.yaml
Instruct Following Tuning
In this stage, the initial dataset structure should be as follows:
train_datasets_instructs/
├── commitpackft
│ ├── 000_commitpackft_00000.parquet
│ └── ...
└── ...
The dataset construction is the same as in Decay Stage.
Train
Launch the training command:
deepspeed train_instruct.py --config config_phonelm_1.5b_instruct.yaml
If it is the first time loading train_datasets_instructs, two directories, train_dataset_test and val_dataset_test, will be generated inside it. Subsequently, data is read directly from these two directories.
Function Call Tuning
We fine-tuned our model on the DroidCall dataset to equip it with the capability to operate Android phones. We provide an example of fine-tuning on DroidCall; you can also fine-tune in your own way.
First, download the DroidCall dataset and rename it to train_datasets_DroidCall. The dataset structure is as follows:
train_datasets_DroidCall/
└── DroidCall_code_short.jsonl
Train
We provide a simple config for fine-tuning on DroidCall; you can start training with the following command:
deepspeed train_instruct.py --config config_phonelm_1.5b_call.yaml
License
The source code of PhoneLM is licensed under GPL-2.0.
Citation
@misc{yi2024phonelmanefficientcapablesmall,
      title={PhoneLM: an Efficient and Capable Small Language Model Family through Principled Pre-training},
      author={Rongjie Yi and Xiang Li and Weikai Xie and Zhenyan Lu and Chenghua Wang and Ao Zhou and Shangguang Wang and Xiwen Zhang and Mengwei Xu},
      year={2024},
      eprint={2411.05046},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.05046},
}