Home

Awesome

PFLlib: Personalized Federated Learning Library

🎯We create a beginner-friendly algorithm library and benchmark platform for those new to federated learning. Join us in expanding the FL community by contributing your algorithms, datasets, and metrics to this project.

πŸ‘ PFLlib now has its official website and domain name: https://www.pfllib.com/!!!

πŸ‘ The Leaderboard is live! Our methodsβ€”FedCP, GPFL, and FedDBEβ€”lead the way. Notably, FedDBE stands out with robust performance across varying data heterogeneity levels.

πŸ‘ We will change the license to Apache-2.0 in the next release.

πŸ”₯ Four new datasets have been added, two of which address real-world scenarios: (1) tumor tissue patches from breast cancer metastases in lymph node sections sourced from different hospitals, and (2) wildlife photos captured by different camera traps. The other two datasets focus on the label-skew scenario: chest X-ray images from hospitals for COVID-19 and endoscopic images from hospitals for gastrointestinal disease detection. These datasets are also compatible with our HtFLlib

arXiv License: GPL v2

Figure 1: An Example for FedAvg. You can create a scenario using generate_DATA.py and run an algorithm using main.py, clientNAME.py, and serverNAME.py. For a new algorithm, you only need to add new features in clientNAME.py and serverNAME.py.

🎯If you find our repository useful, please cite the corresponding paper:

@article{zhang2023pfllib,
  title={PFLlib: Personalized Federated Learning Algorithm Library},
  author={Zhang, Jianqing and Liu, Yang and Hua, Yang and Wang, Hao and Song, Tao and Xue, Zhengui and Ma, Ruhui and Cao, Jian},
  journal={arXiv preprint arXiv:2312.04992},
  year={2023}
}

Key Features

The origin of the data heterogeneity phenomenon is the characteristics of users, who generate non-IID (not Independent and Identically Distributed) and unbalanced data. With data heterogeneity existing in the FL scenario, a myriad of approaches have been proposed to crack this hard nut. In contrast, the personalized FL (pFL) may take advantage of the statistically heterogeneous data to learn the personalized model for each user.

Algorithms with code (updating)

Traditional FL (tFL)

Basic tFL

Personalized FL (pFL)

Meta-learning-based pFL

Datasets and scenarios (updating)

We support 3 types of scenarios with various datasets and move the common dataset splitting code into ./dataset/utils for easy extension. If you need another data set, just write another code to download it and then use the utils.

label skew scenario

For the label skew scenario, we introduce 16 famous datasets:

The datasets can be easily split into IID and non-IID versions. In the non-IID scenario, we distinguish between two types of distribution:

  1. Pathological non-IID: In this case, each client only holds a subset of the labels, for example, just 2 out of 10 labels from the MNIST dataset, even though the overall dataset contains all 10 labels. This leads to a highly skewed distribution of data across clients.

  2. Practical non-IID: Here, we model the data distribution using a Dirichlet distribution, which results in a more realistic and less extreme imbalance. For more details on this, refer to this paper.

Additionally, we offer a balance option, where data amount is evenly distributed across all clients.

feature shift scenario

For the feature shift scenario, we utilize 3 widely used datasets in Domain Adaptation:

real-world scenario

For the real-world scenario, we introduce 5 naturally separated datasets:

For more details on datasets and FL algorithms in IoT, please refer to FL-IoT.

Examples for MNIST in the label skew scenario

cd ./dataset
# python generate_MNIST.py iid - - # for iid and unbalanced scenario
# python generate_MNIST.py iid balance - # for iid and balanced scenario
# python generate_MNIST.py noniid - pat # for pathological noniid and unbalanced scenario
python generate_MNIST.py noniid - dir # for practical noniid and unbalanced scenario
# python generate_MNIST.py noniid - exdir # for Extended Dirichlet strategy 

The command line output of running python generate_MNIST.py noniid - dir

Number of classes: 10
Client 0         Size of data: 2630      Labels:  [0 1 4 5 7 8 9]
                 Samples of labels:  [(0, 140), (1, 890), (4, 1), (5, 319), (7, 29), (8, 1067), (9, 184)]
--------------------------------------------------
Client 1         Size of data: 499       Labels:  [0 2 5 6 8 9]
                 Samples of labels:  [(0, 5), (2, 27), (5, 19), (6, 335), (8, 6), (9, 107)]
--------------------------------------------------
Client 2         Size of data: 1630      Labels:  [0 3 6 9]
                 Samples of labels:  [(0, 3), (3, 143), (6, 1461), (9, 23)]
--------------------------------------------------
<details> <summary>Show more</summary>
Client 3         Size of data: 2541      Labels:  [0 4 7 8]
                 Samples of labels:  [(0, 155), (4, 1), (7, 2381), (8, 4)]
--------------------------------------------------
Client 4         Size of data: 1917      Labels:  [0 1 3 5 6 8 9]
                 Samples of labels:  [(0, 71), (1, 13), (3, 207), (5, 1129), (6, 6), (8, 40), (9, 451)]
--------------------------------------------------
Client 5         Size of data: 6189      Labels:  [1 3 4 8 9]
                 Samples of labels:  [(1, 38), (3, 1), (4, 39), (8, 25), (9, 6086)]
--------------------------------------------------
Client 6         Size of data: 1256      Labels:  [1 2 3 6 8 9]
                 Samples of labels:  [(1, 873), (2, 176), (3, 46), (6, 42), (8, 13), (9, 106)]
--------------------------------------------------
Client 7         Size of data: 1269      Labels:  [1 2 3 5 7 8]
                 Samples of labels:  [(1, 21), (2, 5), (3, 11), (5, 787), (7, 4), (8, 441)]
--------------------------------------------------
Client 8         Size of data: 3600      Labels:  [0 1]
                 Samples of labels:  [(0, 1), (1, 3599)]
--------------------------------------------------
Client 9         Size of data: 4006      Labels:  [0 1 2 4 6]
                 Samples of labels:  [(0, 633), (1, 1997), (2, 89), (4, 519), (6, 768)]
--------------------------------------------------
Client 10        Size of data: 3116      Labels:  [0 1 2 3 4 5]
                 Samples of labels:  [(0, 920), (1, 2), (2, 1450), (3, 513), (4, 134), (5, 97)]
--------------------------------------------------
Client 11        Size of data: 3772      Labels:  [2 3 5]
                 Samples of labels:  [(2, 159), (3, 3055), (5, 558)]
--------------------------------------------------
Client 12        Size of data: 3613      Labels:  [0 1 2 5]
                 Samples of labels:  [(0, 8), (1, 180), (2, 3277), (5, 148)]
--------------------------------------------------
Client 13        Size of data: 2134      Labels:  [1 2 4 5 7]
                 Samples of labels:  [(1, 237), (2, 343), (4, 6), (5, 453), (7, 1095)]
--------------------------------------------------
Client 14        Size of data: 5730      Labels:  [5 7]
                 Samples of labels:  [(5, 2719), (7, 3011)]
--------------------------------------------------
Client 15        Size of data: 5448      Labels:  [0 3 5 6 7 8]
                 Samples of labels:  [(0, 31), (3, 1785), (5, 16), (6, 4), (7, 756), (8, 2856)]
--------------------------------------------------
Client 16        Size of data: 3628      Labels:  [0]
                 Samples of labels:  [(0, 3628)]
--------------------------------------------------
Client 17        Size of data: 5653      Labels:  [1 2 3 4 5 7 8]
                 Samples of labels:  [(1, 26), (2, 1463), (3, 1379), (4, 335), (5, 60), (7, 17), (8, 2373)]
--------------------------------------------------
Client 18        Size of data: 5266      Labels:  [0 5 6]
                 Samples of labels:  [(0, 998), (5, 8), (6, 4260)]
--------------------------------------------------
Client 19        Size of data: 6103      Labels:  [0 1 2 3 4 9]
                 Samples of labels:  [(0, 310), (1, 1), (2, 1), (3, 1), (4, 5789), (9, 1)]
--------------------------------------------------
Total number of samples: 70000
The number of train samples: [1972, 374, 1222, 1905, 1437, 4641, 942, 951, 2700, 3004, 2337, 2829, 2709, 1600, 4297, 4086, 2721, 4239, 3949, 4577]
The number of test samples: [658, 125, 408, 636, 480, 1548, 314, 318, 900, 1002, 779, 943, 904, 534, 1433, 1362, 907, 1414, 1317, 1526]

Saving to disk.

Finish generating dataset.
</details>

Models

Environments

Install CUDA v11.6.

Install conda latest and activate conda.

conda env create -f env_cuda_latest.yaml # You may need to downgrade the torch using pip to match the CUDA version

How to start simulating (examples for FedAvg)

Note: It is preferable to tune algorithm-specific hyper-parameters before using any algorithm on a new machine.

Easy to extend

This library is designed to be easily extendable with new algorithms and datasets. Here’s how you can add them:

Privacy Evaluation

You can use the following privacy evaluation methods to assess the privacy-preserving capabilities of tFL/pFL algorithms in PFLlib. Please refer to ./system/flcore/servers/serveravg.py for an example. Note that most of these evaluations are not typically considered in the original papers. We encourage you to add more attacks and metrics for privacy evaluation.

Currently supported attacks:

Currently supported metrics:

Systematical research supprot

To simulate Federated Learning (FL) under practical conditions, such as client dropout, slow trainers, slow senders, and network TTL (Time-To-Live), you can adjust the following parameters:

Thanks to @Stonesjtu, this library can also record the GPU memory usage for the model.

Experimental Results

If you're interested in experimental results (e.g., accuracy) for the algorithms mentioned above, you can find results in our accepted FL papers, which also utilize this library. These papers include:

Please note that while these results were based on this library, reproducing the exact results may be challenging as some settings might have changed in response to community feedback. For example, in earlier versions, we set shuffle=False in clientbase.py.

Here are the relevant papers for your reference:

@inproceedings{zhang2023fedala,
  title={Fedala: Adaptive local aggregation for personalized federated learning},
  author={Zhang, Jianqing and Hua, Yang and Wang, Hao and Song, Tao and Xue, Zhengui and Ma, Ruhui and Guan, Haibing},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={37},
  number={9},
  pages={11237--11244},
  year={2023}
}

@inproceedings{Zhang2023fedcp,
  author = {Zhang, Jianqing and Hua, Yang and Wang, Hao and Song, Tao and Xue, Zhengui and Ma, Ruhui and Guan, Haibing},
  title = {FedCP: Separating Feature Information for Personalized Federated Learning via Conditional Policy},
  year = {2023},
  booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}
}

@inproceedings{zhang2023gpfl,
  title={GPFL: Simultaneously Learning Global and Personalized Feature Information for Personalized Federated Learning},
  author={Zhang, Jianqing and Hua, Yang and Wang, Hao and Song, Tao and Xue, Zhengui and Ma, Ruhui and Cao, Jian and Guan, Haibing},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={5041--5051},
  year={2023}
}

@inproceedings{zhang2023eliminating,
  title={Eliminating Domain Bias for Federated Learning in Representation Space},
  author={Jianqing Zhang and Yang Hua and Jian Cao and Hao Wang and Tao Song and Zhengui XUE and Ruhui Ma and Haibing Guan},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=nO5i1XdUS0}
}