FedPD: personalized federated learning based on partial distillation

Xu YANG , Ji-Yuan FENG , Song-Yue GUO , Bin-Xing FANG , Qing LIAO

Front. Comput. Sci., 2026, 20(3): 2003604
DOI: 10.1007/s11704-025-40840-4

Information Systems
RESEARCH ARTICLE

Abstract

In recent years, personalized federated learning (PFL) has gained widespread attention for its robust performance in handling heterogeneous data. However, most PFL methods require client models to share the same architecture, which is impractical in real-world scenarios. Federated distillation learning has therefore been proposed, allowing clients to use models with different architectures for FL training. Nevertheless, these methods do not consider the importance of the different distillation knowledge aggregated from client knowledge, resulting in poor client collaboration. In this paper, we propose a novel personalized federated learning method based on partial distillation (FedPD) that assesses the importance of different distillation knowledge and ensembles knowledge for each client, thereby achieving selective knowledge transfer. Specifically, FedPD contains two key modules. One is partial knowledge transfer (PKT), which uses a partial distillation coefficient to identify the importance of each piece of distillation knowledge and select the more valuable knowledge. The other is partial knowledge ensemble (PKE), which maintains a server model for each client to extract distillation knowledge that guides the client. Extensive experiments on real-world datasets under various experimental settings show that FedPD significantly improves client model performance compared to state-of-the-art federated learning methods.

Keywords

federated learning / knowledge distillation / model heterogeneity

Cite this article

Xu YANG, Ji-Yuan FENG, Song-Yue GUO, Bin-Xing FANG, Qing LIAO. FedPD: personalized federated learning based on partial distillation. Front. Comput. Sci., 2026, 20(3): 2003604 DOI:10.1007/s11704-025-40840-4


1 Introduction

Federated learning (FL) jointly trains models through the collaboration of various clients while protecting each client's data privacy [1]. In FL, clients collaborate to train models without sharing local private data. During FL training, participating clients upload their model parameters trained on local data to the server, which then aggregates the client models. Currently, FL is widely used in fields such as healthcare [2,3], smart cities [4,5], and recommendation systems [6−8].

Although FL has been successful in practical applications, it still faces some challenges [8−14]. For example, since clients' datasets are often non-independent and identically distributed (non-IID), personalized federated learning (PFL) has been proposed to train a personalized local model for each client. Recent PFL works mainly include regularization-based methods [15−17] that add a regularization term of the global model to improve local model performance, similarity-based methods [18,19] that reinforce collaboration between clients with similar data distributions, and model architecture-based methods [20−22] that aggregate only part of the model parameters to improve local models. Despite the promising results achieved by these PFL methods, they require all client models to share the same architecture, which is impractical in the real world.

To break the constraint of identical model architectures, many federated distillation methods have been proposed. Federated distillation methods use a public dataset to achieve knowledge transfer between heterogeneous client models, thereby addressing the issue of model heterogeneity [23−28]. For example, FedMD [25] utilizes the client model to generate soft predictions for public dataset samples as local knowledge, then distills the average-aggregated global knowledge into the local model. KT-pFL [28] enhances distillation by aggregating personalized global knowledge for each client based on the similarity of local knowledge. FedHeNN [26] takes the representations generated from distillation samples as client local knowledge, achieving client collaboration by aligning client knowledge with the average-aggregated global knowledge. However, these methods do not take into account the importance of different distillation knowledge to each client, leading to inefficient collaboration among client models.

We propose a novel personalized federated learning method based on partial distillation (FedPD), which uses partial distillation to selectively transfer knowledge based on the client data distribution, thereby enhancing the heterogeneous personalized model performance. FedPD tackles the challenges of model heterogeneity with two key modules. First, we propose the partial knowledge transfer (PKT) module which utilizes the partial distillation coefficient to identify the importance of each distillation sample, filtering global knowledge to benefit the client. Second, we propose the partial knowledge ensemble (PKE) module which extracts distillation knowledge for each client to improve client model performance. The main contributions are summarized as follows:

● We propose a novel personalized federated learning method called FedPD, which leverages partial distillation to achieve selective knowledge transfer, thereby improving the performance of client models.

● We propose the PKT module, which uses the partial distillation coefficient to measure the importance of different distillation knowledge, enabling more effective distillation for clients.

● We propose the PKE module, which integrates distillation knowledge for each client to guide their training, thereby enhancing client model performance.

● We evaluate FedPD on various datasets and experimental settings. Extensive experiments show that the proposed method achieves the best performance.

2 Related work

Personalized federated learning has attracted much attention in addressing non-IID challenges in FL. For example, some regularization-based methods improve client model performance by adding regularization terms about the global model during local training. Li et al. [15] proposed Ditto, which enhances client model performance by adding regularization terms between the client model and the global model. Dinh et al. [16] proposed pFedMe, which optimizes both the local and global models by adding L2 loss between the client model and global model, thereby improving the generalization performance of the client model. Some similarity-based methods improve client model performance by encouraging collaboration among clients with similar data distributions. Ghosh et al. [29] proposed IFCA, which groups clients according to the minimum loss of client data on each group model, and the clients in the group share the model. Liu et al. [30] proposed PFA, which uses the distance between the privacy representations of local data to build a data similarity matrix for each client and groups clients according to the similarity matrix. Huang et al. [18] proposed FedAMP, which maintains a personalized cloud model based on similarity aggregation for each client on the server and trains a personalized model for each client based on these cloud models. Another type of solution is architecture-based methods, which improve client model performance by dividing the model into private and shared parts. Collins et al. [20] proposed FedRep, which splits the client model into a private classifier and a shared feature extractor, enhancing client models by sharing the global feature representation. Sun et al. [31] proposed PartialFed, where each client uses only part of the global model parameters, selected through an automatically generated strategy, to boost performance. Zhang et al. [22] proposed FedALA, which adaptively aggregates the global model and the local model according to each client’s objective to initialize the local model, thereby achieving better client model performance. However, these approaches require all clients to use the same architecture model.

Recent research has started focusing on FL with different model architectures. Federated distillation learning is proposed to transfer knowledge by aligning local client knowledge with global knowledge, enabling clients to share useful information and improve model performance [32−34]. For example, Lin et al. [24] proposed FedDF, which aggregates models with the same architecture and uses average-aggregated knowledge for distillation to update models of the same architecture. Makhija et al. [26] proposed FedHeNN, which uses the Centered Kernel Alignment (CKA) distance as the distillation loss, aiming to align client representations with the average-aggregated global representations. Cho et al. [27] proposed FedET, which uses distillation to transfer knowledge from multiple local models to a global model on an unlabeled proxy dataset. He et al. [35] proposed FedGKT, which uses knowledge distillation to transfer client model knowledge to a large server model. Zhang et al. [28] proposed KT-pFL, which utilizes a knowledge collaboration matrix on the server to aggregate personalized group knowledge for each client and uses this personalized global knowledge as guidance. Wang et al. [32] proposed DaFKD, which optimizes the aggregation of client predictions through a discriminator that identifies the correlation between distillation samples and client models, aligning the client's predictions accordingly during distillation. Tan et al. [36] used the average representation of local data samples as prototypes and introduced a prototype regularization term in the local loss function. FedGKT [35] achieves knowledge transfer by minimizing the difference between the server model's predictions and the client's soft predictions. However, most existing federated distillation learning methods do not consider the importance of different global knowledge in local distillation, which may introduce redundant information for the client. In this paper, we propose FedPD, which selects valuable knowledge for distillation to adapt to the local goals of different clients, thereby improving client model performance.

3 Methodology

3.1 Preliminary

We aim to collaboratively train personalized models for $N$ total clients with different model architectures in FL. Each client $n$ can only access its local dataset $D_n=\{x_i, y_i\}$, where $x_i$ is the $i$th sample with label $y_i$. The full dataset $D=\bigcup_{n=1}^{N} D_n$ is composed of all client datasets and contains $K$ classes. The goal of PFL is to learn a local model for each client that minimizes the empirical loss on that client's dataset. For client $n$, we denote its personalized objective as:

$F_n := \mathbb{E}_{(x_i, y_i)\sim D_n}\, L_n(x_i, y_i; \omega_n),$

where $L_n(\cdot)$ and $\omega_n$ are the local loss function and the model parameters of client $n$, respectively. For all clients, the joint objective is:

$\{\omega_1, \ldots, \omega_N\} = \arg\min G(F_1, \ldots, F_N),$

where $G$ is a function that aggregates the local objectives $\{F_n\}_{n\in[N]}$ of each client.

We refer to the features extracted by client models from public dataset samples as client knowledge, and the features extracted by the server model as server knowledge. The public dataset $\hat{D}$ is stored on the server and on each client, as in the typical federated distillation setting [25,28]. For the $i$th sample $\hat{x}_i \in \hat{D}$, the distillation knowledge $z_{n,i}$ and $z_{n,i}^s$ extracted by the client model $\omega_n$ and the server model $\omega_n^s$, respectively, are defined as follows:

$z_{n,i} = f(\hat{x}_i;\, \omega_n^f),$

$z_{n,i}^s = H(\hat{x}_i;\, \omega_n^s),$

where $f(\cdot)$ is the feature extractor of client model $\omega_n$, with parameters $\omega_n^f$, and $H(\cdot)$ is the output function of the server model with parameters $\omega_n^s$. Note that the server model consists of a feature extractor and an output layer, without a classifier; its output is the server knowledge. All server models have the same architecture, and the size of the last output layer matches the length of the corresponding client knowledge. The local knowledge of client $n$ is $Z_n=\{z_{n,i}\}_{i=1}^{|\hat{D}|}$. Similarly, the server model extracts the distillation knowledge that guides client $n$ as $Z_n^s=\{z_{n,i}^s\}_{i=1}^{|\hat{D}|}$.
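
As a concrete illustration (not the authors' code), the following minimal PyTorch sketch shows how client and server knowledge could be extracted from a shared public dataset. The names ClientModel, make_server_model, and extract_knowledge, as well as all layer sizes, are toy assumptions that only mirror the structure described above.

```python
import torch
import torch.nn as nn

class ClientModel(nn.Module):
    """Toy heterogeneous client model: a feature extractor plus a classifier."""
    def __init__(self, in_dim=32, feat_dim=16, num_classes=10):
        super().__init__()
        self.feature_extractor = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        return self.classifier(self.feature_extractor(x))

def make_server_model(in_dim=32, hidden=64, client_feat_dim=16):
    # Feature extractor + output layer sized to the client's knowledge length; no classifier.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, client_feat_dim))

@torch.no_grad()
def extract_knowledge(model, public_x, use_feature_extractor=False):
    # Client knowledge z_{n,i} = f(x_i; w_n^f) or server knowledge z^s_{n,i} = H(x_i; w_n^s).
    model.eval()
    if use_feature_extractor:
        return model.feature_extractor(public_x)
    return model(public_x)

public_x = torch.randn(100, 32)          # stand-in for the public dataset (one row per sample)
client, server = ClientModel(), make_server_model()
Z_client = extract_knowledge(client, public_x, use_feature_extractor=True)   # Z_n
Z_server = extract_knowledge(server, public_x)                               # Z_n^s
print(Z_client.shape, Z_server.shape)    # torch.Size([100, 16]) torch.Size([100, 16])
```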

3.2 FedPD

In this section, we introduce the proposed FedPD in detail. The proposed FedPD consists of two key components: partial knowledge transfer (PKT) and partial knowledge ensemble (PKE). Fig.1 shows the workflow of FedPD. First, clients extract features from public dataset samples and upload them to the server. Second, the server uses PKE to ensemble knowledge from various clients and generate distilled knowledge for each client. Finally, each client uses PKT to assess the importance of the received server knowledge and update its local model parameters.

3.2.1 Partial knowledge transfer (PKT)

Global knowledge may contain useless information, such as missing class samples from other clients. Therefore, clients need to distinguish useful global knowledge based on their local models during distillation, rather than aligning with global knowledge indiscriminately. To address this, we propose the PKT module, which identifies the importance of each piece of distillation knowledge and transfers only the useful portion to each client.

1) Partial distillation coefficient.

To identify the importance of different distilled knowledge, we introduce a partial distillation coefficient α into the distillation loss. For client n, the partial distillation loss is defined as follows:

$\min_{\omega_n, \alpha_n} L_D(\omega_n, \alpha_n) = \frac{1}{|\hat{D}|} \sum_{i=1}^{|\hat{D}|} \alpha_{n,i}\, L_1\big(f(\hat{x}_i; \omega_n^f),\, z_{n,i}^s\big) + \frac{\tau}{2}\,\|\alpha_n - \mathbf{1}\|^2,$

where $\tau > 0$ is a regularization hyperparameter; $L_1(\cdot)$ denotes the L1 loss; $\mathbf{1}$ is the all-ones vector; and $\alpha_n$ is the partial distillation coefficient vector of client $n$, with $|\alpha_n| = |\hat{D}|$, indicating the importance of each piece of distillation knowledge. The regularization term $\|\alpha_n - \mathbf{1}\|^2$ improves generalization and prevents client models from overfitting to the knowledge of their corresponding server models: it constrains the importance weights assigned to different distillation knowledge to stay close to 1, since without it clients tend to assign excessively high weights to knowledge that differs significantly from their local knowledge.
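
To make Eq. (5) concrete, the sketch below implements the partial distillation loss as described, with a learnable per-sample coefficient vector. The function name, the default value of $\tau$, and the random tensors are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def partial_distillation_loss(client_features, server_knowledge, alpha_n, tau=0.5):
    # L_D = (1/|D|) * sum_i alpha_{n,i} * L1(f(x_i; w_n^f), z^s_{n,i}) + (tau/2) * ||alpha_n - 1||^2
    per_sample_l1 = F.l1_loss(client_features, server_knowledge, reduction="none").mean(dim=1)
    weighted = (alpha_n * per_sample_l1).mean()          # 1/|D| times the weighted sum
    reg = 0.5 * tau * torch.sum((alpha_n - 1.0) ** 2)    # keeps coefficients close to 1
    return weighted + reg

alpha_n = torch.ones(100, requires_grad=True)            # one coefficient per public sample
feats = torch.randn(100, 16, requires_grad=True)         # client features on the public set
z_s = torch.randn(100, 16)                               # guidance knowledge Z_n^s from the server
loss = partial_distillation_loss(feats, z_s, alpha_n)
loss.backward()                                          # gradients flow to both feats and alpha_n
```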

2) Optimization objective.

The objective of FedPD is to solve the joint optimization problem of clients with data and model heterogeneity. FedPD achieves collaboration among heterogeneous clients by introducing partial distillation loss while minimizing the loss of local training. The personalized loss function of client n is defined as follows:

$\min_{\omega_n} L_n(\omega_n) = L_{CE}(\omega_n) + \lambda\, L_D(\omega_n; \alpha_n),$

where $\lambda$ is a distillation hyperparameter and $L_{CE}(\cdot)$ is the cross-entropy loss function, defined as follows:

$L_{CE}(\omega_n) = -\sum_{(x_i, y_i)\in D_n} \log p(y_i \mid x_i; \omega_n),$

where $p(y_i \mid x_i; \omega_n)$ is the probability that sample $x_i$ is predicted to be $y_i$ by client model $\omega_n$.

3) Update partial distillation coefficient.

Furthermore, to adaptively adjust the partial distillation coefficients $\alpha_n$ in each training round, we update $\alpha_n$ alternately with the model parameters, so that $\alpha_n$ is computed for different knowledge based on the current model parameters. With this alternating optimization, the local model can effectively converge towards an optimal solution.

Specifically, we first fix $\omega_n$ on client $n$ to calculate the partial distillation coefficient $\alpha_n$. According to Eq. (5), the update of $\alpha_n$ is written as follows:

$\alpha_n \leftarrow \alpha_n - \eta_{\alpha_n} \nabla_{\alpha_n} L_D(\omega_n, \alpha_n),$

where $\eta_{\alpha_n}$ is the learning rate of $\alpha_n$. After updating $\alpha_n$, we fix $\alpha_n$ and solve for the model parameters $\omega_n$.

4) Update client model.

The local model parameters $\omega_n$ are then updated through local private data training and partial distillation. According to Eq. (6), the update of client $n$'s local model parameters is written as follows:

$\omega_n \leftarrow \omega_n - \eta_{\omega_n} \nabla_{\omega_n} L_n(\omega_n),$

where $\eta_{\omega_n}$ is the learning rate of $\omega_n$. FedPD thus builds a positive feedback loop that adaptively selects more valuable distillation knowledge for the client model in each round and optimizes the local model parameters, improving overall model performance.
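
The alternating updates of Eqs. (8) and (9) could look like the following sketch, which reuses the toy ClientModel and partial_distillation_loss from the sketches above. The learning rates follow the values reported in Subsection 4.1.5, while the data tensors and $\lambda$ are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

client = ClientModel()                                    # toy model from the earlier sketch
alpha_n = torch.ones(100, requires_grad=True)
opt_alpha = torch.optim.SGD([alpha_n], lr=0.05)           # eta_{alpha_n}
opt_model = torch.optim.SGD(client.parameters(), lr=0.005, momentum=0.9)   # eta_{omega_n}

public_x = torch.randn(100, 32)                           # public samples
z_s = torch.randn(100, 16)                                # server guidance knowledge Z_n^s
local_x, local_y = torch.randn(40, 32), torch.randint(0, 10, (40,))        # private batch
lam = 1.0                                                 # distillation weight lambda

# Step 1 (Eq. (8)): fix omega_n, update alpha_n on the partial distillation loss.
opt_alpha.zero_grad()
with torch.no_grad():
    feats = client.feature_extractor(public_x)
partial_distillation_loss(feats, z_s, alpha_n).backward()
opt_alpha.step()

# Step 2 (Eq. (9)): fix alpha_n, update omega_n on L_CE + lambda * L_D.
opt_model.zero_grad()
ce = F.cross_entropy(client(local_x), local_y)
ld = partial_distillation_loss(client.feature_extractor(public_x), z_s, alpha_n.detach())
(ce + lam * ld).backward()
opt_model.step()
```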

3.2.2 Partial knowledge ensemble (PKE)

Considering the challenge of data heterogeneity, a single aggregated global knowledge cannot meet the personalized needs of all clients, and significant biases may exist in local knowledge, potentially compromising the quality of linearly aggregated global knowledge. As a result, generating individual global knowledge for each client is necessary. In this work, we propose the PKE module to partially ensemble distillation knowledge and extract distillation knowledge for each client to guide the client model training.

1) Server model.

First, we establish a corresponding server model for each client on the server side. All server models have feature extractors with the same architecture. The purpose of the server model is to extract knowledge from the corresponding client model and convert heterogeneous knowledge into a unified representation with the same architecture.

Then, we utilize client $n$'s knowledge $Z_n$ to train the server model $\omega_n^s$, transferring the representation knowledge from the client model to the server model. Since the feature extractors of all server models share the same architecture, we can use the server models to achieve collaboration between heterogeneous client models. To collaborate with different clients while retaining certain client-specific information, a regularization term is introduced into the loss function. The loss function of the server model $\omega_n^s$ is defined as follows:

$\min_{\omega_n^s} L(\omega_n^s) = \frac{1}{|\hat{D}|} \sum_{i=1}^{|\hat{D}|} L_s\big(H(\hat{x}_i; \omega_n^s),\, z_{n,i}\big) + \mu\, L_R(\omega_n^{s,f}, \bar{\omega}^{s,f}),$

where $\mu$ is a server regularization hyperparameter; $\omega_n^{s,f}$ represents the feature extractor parameters of the server model $\omega_n^s$; $\bar{\omega}^{s,f}$ is the global basic model, which contains the global representation knowledge of all clients; $L_s(\cdot)$ is the loss function for server model training, for which we use the L1 loss; and $L_R(\cdot)$ is the server regularization term used to achieve collaboration between $\omega_n^{s,f}$ and $\bar{\omega}^{s,f}$, defined as:

$L_R(\omega_n^{s,f}, \bar{\omega}^{s,f}) = \|\omega_n^{s,f} - \bar{\omega}^{s,f}\|^2.$
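
A minimal sketch of the server-side objective in Eqs. (10) and (11) is given below, reusing the toy make_server_model from the earlier sketch. Treating all but the last layer as the feature extractor, the default $\mu$, and the random tensors are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def server_loss(server_model, global_basic_extractor, public_x, Z_client, mu=0.1):
    # L(w_n^s) = mean L1(H(x_i; w_n^s), z_{n,i}) + mu * ||w_n^{s,f} - w_bar^{s,f}||^2
    fit = F.l1_loss(server_model(public_x), Z_client)     # L_s over the public set (mean L1)
    reg = sum(((p - g) ** 2).sum()                        # L_R on the feature-extractor part
              for p, g in zip(server_model[:-1].parameters(),
                              global_basic_extractor.parameters()))
    return fit + mu * reg

server = make_server_model()                              # toy server model from the earlier sketch
global_basic = make_server_model()[:-1]                   # placeholder w_bar^{s,f}; averaged in practice
loss = server_loss(server, global_basic, torch.randn(100, 32), torch.randn(100, 16))
loss.backward()
```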

2) Global basic model.

The global basic model is defined as follows:

$\bar{\omega}^{s,f} = \frac{1}{N}\sum_{n=1}^{N} \omega_n^{s,f}.$

Based on previous work [20,37], the global basic model $\bar{\omega}^{s,f}$, aggregated from the feature extractors of all server models, naturally contains the global representation information of all clients. The regularization term $L_R(\cdot)$ allows the server model to retain personalized preferences while approaching the global representation knowledge, thereby achieving the ensemble of partial global knowledge.
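
Eq. (12) amounts to an element-wise average of matching parameter tensors, as in the sketch below; the helper name and the use of state dicts are illustrative assumptions rather than the authors' code.

```python
import torch

def average_feature_extractors(server_extractors):
    # w_bar^{s,f} = (1/N) * sum_n w_n^{s,f}: average each parameter tensor across server models.
    keys = server_extractors[0].state_dict().keys()
    return {k: torch.stack([m.state_dict()[k].float() for m in server_extractors]).mean(0)
            for k in keys}

# Usage with the toy server models from the earlier sketch (feature extractor = all but the last layer).
servers = [make_server_model() for _ in range(4)]
avg_state = average_feature_extractors([s[:-1] for s in servers])
global_basic = make_server_model()[:-1]
global_basic.load_state_dict(avg_state)
```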

3) Update server model.

According to Eq. (10), the update of the server model $\omega_n^s$ is as follows:

$\omega_n^s \leftarrow \omega_n^s - \eta_{\omega_n^s} \nabla_{\omega_n^s} L(\omega_n^s),$

where $\eta_{\omega_n^s}$ is the learning rate of the server model.

After the server models are trained, server model $\omega_n^s$ extracts partial global knowledge $Z_n^s$ to guide client $n$. Compared with global knowledge generated by a linear combination of local knowledge, PKE captures more complex relationships between local knowledge and makes full use of the rich information within client knowledge.

We summarize all the procedures of FedPD on the server and client in Algorithm 1. Each client finally gets a personalized model after FL training. In each training round, clients first extract local knowledge for public distillation data (Line 4). Then the server partially ensembles the client knowledge and extracts guiding knowledge for each client (Line 5). Finally, the clients utilize local data to update model parameters based on the server knowledge guidance (Line 6). Partial distillation is implemented in two parts: first, the PKE module partially ensembles knowledge for each client (Lines 11−14), and second, the PKT module achieves selective distillation based on partial distillation coefficients (Lines 17−19). These two parts aim to filter valuable collaborative information for clients as much as possible, thereby enhancing collaboration effectiveness.
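
For orientation, the following high-level sketch wires the toy helpers from the previous sketches into one FedPD-style communication round. It is a simplified reading of Algorithm 1 (full participation, no client sampling), not the authors' implementation.

```python
import torch

def fedpd_round(client_models, server_models, global_basic, public_x,
                server_lr=0.001, server_epochs=5, mu=0.1):
    # 1) Each client extracts local knowledge Z_n on the public data and uploads it.
    Z = [extract_knowledge(c, public_x, use_feature_extractor=True) for c in client_models]

    # 2) PKE (server side): each server model fits its client's knowledge under the
    #    proximal term; the global basic model is then refreshed by averaging extractors.
    for n, srv in enumerate(server_models):
        opt = torch.optim.SGD(srv.parameters(), lr=server_lr)
        for _ in range(server_epochs):
            opt.zero_grad()
            server_loss(srv, global_basic, public_x, Z[n], mu).backward()
            opt.step()
    global_basic.load_state_dict(average_feature_extractors([s[:-1] for s in server_models]))

    # 3) PKT (client side): each client downloads its guidance knowledge Z_n^s and runs
    #    the alternating alpha_n / omega_n updates shown after Eq. (9) on its private data.
    guidance = [extract_knowledge(srv, public_x) for srv in server_models]
    return guidance   # fed into each client's local alternating-update step
```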

4) Complexity analysis.

FedPD consists of two key modules: PKT and PKE. Assume that the size of the public dataset is M, the number of client model parameters is W, the number of server model parameters is P, and the feature size of each sample is d. When all clients participate in training, in the PKT module, the computational complexity for calculating the importance coefficient αn is approximately O(dMN) and the complexity of extracting client knowledge is about O(WMN). In the PKE module, the computational complexity for model training and knowledge extraction is about O(PMN). Therefore, the total computational complexity of FedPD is O((d+W+P)MN). In each round, clients need to upload and download client knowledge and server knowledge respectively. The communication cost for each client in each communication round is about 2dM.

4 Experiments

4.1 Experimental setup

4.1.1 Datasets

We evaluate the effectiveness of the proposed method through image classification tasks on the following four datasets:

● CIFAR-10 [38] consists of 60,000 32 × 32 images in 10 classes, with 6,000 images per class. The training set and test set have 50,000 and 10,000 images, respectively.

● CIFAR100 [38] is similar to CIFAR-10, but it has 100 classes. There are 60,000 images in total, each of size 32 × 32 pixels. Each class contains 500 training images and 100 test images.

● EMNIST [39] is a handwritten character dataset derived from NIST Special Database 19, comprising 62 classes. Each image represents a handwritten character in a 28 × 28 format. There are 814,255 images in total.

● Fashion-MNIST (FMNIST) [40] contains a training set of 60,000 samples and a test set of 10,000 samples. Each image is a 28 × 28 grayscale image, and there are 10 classes.

For each dataset, we randomly allocate data to each client. Following [20,41], we use the data of the first 10 classes of CIFAR100 and EMNIST. Since it is easy to collect various public data samples in the real world, we select 100 samples from each class as the public dataset. The public dataset samples do not appear in the clients' training or testing sets. Additionally, to evaluate the impact of different public datasets on the proposed method, we conduct experiments using samples from different classes of other datasets as public dataset samples (Subsection 4.4).

4.1.2 Data segmentation

Following previous works [24,27,32,42], we use the Dirichlet distribution to allocate data to each client. To verify the impact of different levels of data heterogeneity on model performance, we adopt two non-IID experimental settings, $\beta=0.1$ and $\beta=0.5$. The Dirichlet concentration parameter $\beta$ controls the degree of data heterogeneity: as shown in Fig.2, the smaller $\beta$ is, the higher the data heterogeneity, and vice versa. Fig.2 illustrates the data distribution of 20 clients on FMNIST, where each column represents the per-class sample counts of one client and the size of each point represents the number of samples.
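
For reference, the sketch below shows the standard Dirichlet label-skew partition this setup refers to; the exact sampling details used in the paper may differ.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=20, beta=0.1, seed=0):
    """Split sample indices among clients; smaller beta -> more heterogeneous splits."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for k in np.unique(labels):
        idx_k = rng.permutation(np.where(labels == k)[0])          # samples of class k
        proportions = rng.dirichlet(np.repeat(beta, num_clients))  # class share per client
        splits = (np.cumsum(proportions)[:-1] * len(idx_k)).astype(int)
        for cid, part in enumerate(np.split(idx_k, splits)):
            client_indices[cid].extend(part.tolist())
    return client_indices

# Example: 20 clients over synthetic labels drawn from 10 classes.
parts = dirichlet_partition(np.random.randint(0, 10, size=5000), num_clients=20, beta=0.1)
```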

4.1.3 Models

We set up models of different scales depending on the difficulty of the client classification task. Following the settings of [28], for the CIFAR10 and CIFAR100 datasets, we select four model architectures, LeNet [43], ResNet-18 [44], MobileNetV2 [45], and ShuffleNetV2 [46], as client models, and the server models use ResNet-34 [44]. For the EMNIST and FMNIST datasets, we use four types of client models, including two MLP models and two CNN models, as in previous works [20,35,47], and the server model is a larger-scale CNN. The four types of client models are evenly assigned among the clients.

4.1.4 Comparison methods

We compare FedPD with six popular baseline methods that support heterogeneous models, including:

● FedMD [25] averages the predictions of distillation samples uploaded by the client as global knowledge and aligns local knowledge with global knowledge during the distillation process.

● FedDF [24] averages the model parameters of models with the same model architecture and updates the average global model based on the predictions of all clients.

● KT-PFL [28] uses the similarity of client soft predictions to aggregate other clients’ predictions as the client’s global knowledge.

● FedHeNN [26] utilizes CKA distance to align local knowledge with global knowledge based on the weighted average of local knowledge.

● FedProto [36] uses the average representation generated by each type of sample of local data as the prototype. The client introduces the regularization term of the prototype in the local update to update the model.

● FedHKD [42] aggregates each class’s prototype and soft prediction as hyper-knowledge and adds hyper-knowledge regularization terms to the local model update.

4.1.5 Implementation details

We evaluate FedPD under various experimental settings. For all datasets, the number of clients is set to 20 and 50. Considering that clients may not be able to participate in every training round in real-world scenarios, we randomly select only 20% of the clients to participate in FL training in each round. All experiments are conducted on eight NVIDIA Tesla V100 GPUs with 32 GB of memory. The proposed FedPD is implemented with PyTorch 1.12 in a Python 3.7 environment. All additional hyperparameters of the compared methods use their original settings. We run all methods for 200 communication rounds and execute five local epochs of SGD with momentum to train the local model. For the CIFAR10 and CIFAR100 experiments, we set the client model learning rate $\eta_{\omega_n}=0.005$ and batch size $b=40$. For the EMNIST and FMNIST datasets, the client model learning rate is $\eta_{\omega_n}=0.01$ and the batch size is $b=20$. In all experiments, FedPD obtains its best performance with the server model learning rate $\eta_{\omega_n^s}=0.001$, batch size $b=40$, and partial distillation coefficient learning rate $\eta_{\alpha_n}=0.05$. Following [20,21], we evaluate all methods by the average accuracy of the last 10 training epochs for all clients on the local test set.

4.2 Experimental results and analysis

4.2.1 Performance comparison

Tab.1 shows the average client model test accuracy (ACC), precision (PRE), recall (REC), and AUC-ROC (AUC) of the comparison methods and FedPD under the low data heterogeneity ($\beta=0.5$) setting on the CIFAR10, CIFAR100, EMNIST, and FMNIST datasets. The best-performing method is marked in bold, and the second-best is underlined. When the number of clients is 20, on the CIFAR10 dataset, FedPD outperforms the best baseline method, KT-pFL, by 2.78% in the ACC metric and outperforms FedHKD by 8.02% in the AUC metric. Additionally, our method outperforms the best baseline, FedHKD, by 1.74% and 2.23% in the ACC and PRE metrics on the FMNIST dataset, respectively. When the number of clients is 50, FedPD outperforms the best method, FedHKD, by 6.93% and 4.57% in the ACC and REC metrics on CIFAR10, respectively, and by 1.61% and 1.82% on FMNIST. The improvements are larger on CIFAR10 and CIFAR100 because these tasks are more challenging than those on the other datasets.

Tab.2 shows the results of the baseline methods and FedPD in high data heterogeneity (β=0.1) settings on four datasets. When the number of clients is 20, FedPD outperforms the best baseline, FedHeNN, by 2.49% and 4.46% on CIFAR100, and surpasses FedHKD by 2.16% and 3.86% on FMNIST in terms of ACC and PRE, respectively. When the number of clients is 50, on the CIFAR100 dataset, FedPD outperforms the best baseline FedHKD by 5.44% in the ACC and outperforms FedHeNN by 6.83% in the AUC metric. FedPD consistently outperforms other methods in various heterogeneous settings because it allows clients to select more valuable knowledge during the distillation process, thereby enhancing model performance.

Tab.3 shows the performance of FedPD under different non-IID settings. FedPD outperforms the best baseline methods by 2.78%, 2.05%, 3.30%, and 2.23% under $\beta=0.5$, 0.1, 0.05, and 0.01, respectively. FedPD consistently achieves improvements across various experimental settings, demonstrating its effectiveness in addressing non-IID data challenges.

In addition to the client model heterogeneous setting, we also explore the performance of FedPD in the homogeneous model setting. We compare the performance of Scaffold [48], FedDC [49], and FedFed [50] on the FMNIST and EMNIST datasets with 20 clients. The results are shown in Tab.4. It can be observed that FedPD outperforms other baseline methods in terms of the ACC metric and achieves better performance in most cases on other metrics, such as PRE, REC, and AUC.

4.2.2 Convergence evaluation

Fig.3 shows the average test accuracy curves during training on the four datasets under high data heterogeneity ($\beta=0.1$) with 20 clients, including all baseline methods and FedPD. The horizontal axis represents the communication round, and the vertical axis represents the test accuracy. As shown in Fig.3, where FedHeNN is the best baseline, our method reaches a higher accuracy of 68.14%, and where FedHKD is the best baseline, FedPD still surpasses it. This indicates that our method not only performs better but also converges faster. The reason is that FedPD adaptively selects distilled knowledge to adjust local models, alleviating negative effects during the distillation process.

4.2.3 Impact of public dataset

Tab.5 shows the impact of using different public datasets on FedPD performance. We compare the test accuracy of FedPD when different datasets are used as the public dataset. Note that even when the task dataset and the public dataset are the same, the samples used for distillation do not appear in the clients' private datasets. From Tab.5, it is observed that replacing the public dataset in FedPD does not cause significant accuracy changes, with accuracy decreasing by only 0.06% on EMNIST and 1.1% on FMNIST. This shows that FedPD does not rely heavily on unbiased public datasets and is robust to the choice of public dataset. On CIFAR100, performance even increases by 0.81%. The reason is that the partial distillation coefficient can adaptively assign different importance weights to different distillation knowledge, even if the public dataset samples differ significantly from the local data.

4.2.4 Ablation study

To verify the effect of PKT and PKE, we conduct the ablation experiments shown in Tab.6. Removing PKT corresponds to assigning the same importance to all global knowledge, while removing PKE corresponds to averaging all client knowledge to obtain global knowledge. From Tab.6, it can be observed that the accuracy of FedPD-V1, which uses neither component, is the lowest at only 82.81%. FedPD-V2, which uses only PKT, shows a 3.14% increase over FedPD-V1, while using only PKE increases accuracy by 2.52%. By utilizing both components simultaneously, FedPD improves performance by 10% over the baseline. This indicates: 1) the effectiveness of PKT in generating importance coefficients for different distillation knowledge, and 2) the effectiveness of PKE in ensembling knowledge for clients. Using both components together yields the greatest improvement.

4.2.5 Impact of hyperparameters

To evaluate the impact of hyperparameters on FedPD training, we conduct multiple experiments on four datasets.

Regularization parameter $\tau$. As shown in Eq. (5), $\tau$ balances the client model's personalization and generalization capabilities. As $\tau$ increases, the partial distillation coefficients approach 1, making the client's predictions closer to the server model's predictions. Tab.7 shows the FedPD results for different $\tau$ values. The results indicate that an appropriate $\tau$ value needs to be selected to achieve better FedPD performance.

Server epoch $e$. The server model requires multiple epochs to learn client knowledge and perform partial knowledge ensemble to facilitate collaboration between clients. We therefore evaluate the impact of the number of server epochs under various settings, with results shown in Tab.8. The results show that a larger $e$ benefits the knowledge ensemble, but an excessively large $e$ is likely to hurt FedPD performance.

Server regularization parameter $\mu$. In Eq. (10), the server model's predictions are directly related to the server regularization parameter $\mu$. The larger $\mu$ is, the more the knowledge extracted by the server model tends toward the average global knowledge, and vice versa. Tab.9 shows the FedPD results under different $\mu$ settings. The results show that a larger $\mu$ does not bring better performance, which means a trade-off between local and global knowledge is needed to achieve the best performance.

Model learning rates. Since the learning rate directly influences training performance, we evaluate the client and server models using multiple learning rates $\eta_{\omega_n}$ and $\eta_{\omega_n^s}$ in Eqs. (9) and (13). Tab.10 shows the performance of FedPD under different server model learning rates; a smaller server model learning rate significantly affects the performance of FedPD. Tab.11 shows the performance of FedPD under different client learning rates; an appropriate client model learning rate achieves the best performance, while a learning rate that is too low or too high hurts model performance.

Partial distillation coefficient learning rate $\eta_{\alpha_n}$. We evaluate the performance of FedPD under different learning rates $\eta_{\alpha_n}$ for the partial distillation coefficient $\alpha_n$ in Eq. (8), as shown in Tab.12. It can be observed that the proposed method is insensitive to the learning rate of the partial distillation coefficient. We select 0.05 as its learning rate.

4.2.6 Case study

To demonstrate the effectiveness of FedPD, we conduct a case study, as shown in Fig.4. The separability of the features extracted for each class is a key indicator for evaluating a classification model: for client models, extracting more separable features benefits subsequent classification. Fig.4 shows the features extracted by client models trained locally, by client models trained with averaged knowledge (without FedPD), and by client models trained with FedPD. The features extracted by clients trained with FedPD are more distinguishable, which further confirms the effectiveness of FedPD.

4.2.7 Scalability analysis

In this experiment, we evaluate the performance of FedPD under different client scales. Fig.5 shows the results on the FMNIST and EMNIST datasets with $\beta=0.1$ when the number of clients is 5, 20, 35, and 50. It can be observed that as the number of clients increases, the accuracy of FedPD improves and then stabilizes. While computational costs rise with the number of clients, the accuracy improvement gradually slows. This is because the proportion of clients participating in each round of training remains unchanged, so the number of inactive clients grows in scenarios with more clients. These results demonstrate the scalability and adaptability of FedPD to varying client scales in FL training.

5 Conclusion

In this paper, we proposed a novel FL method, personalized federated learning based on partial distillation (FedPD), to address the challenges of data heterogeneity and model heterogeneity. In FedPD, we design two key modules, partial knowledge transfer (PKT) and partial knowledge ensemble (PKE), to enable collaboration and improve client model performance. PKT assigns different importance to distillation samples to improve distillation performance, and PKE extracts personalized global knowledge for each client based on client knowledge. Furthermore, FedPD allows clients to use completely different model architectures without further constraints. Extensive experiments conducted on multiple datasets indicate that, compared to existing FL methods, FedPD consistently achieves higher accuracy in various settings involving multiple models and data heterogeneity.

References

[1]

McMahan B, Moore E, Ramage D, Hampson S, Arcas B A Y. Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. 2017, 1273−1282

[2]

Antunes R S, da Costa A C, Küderle A, Yari I A, Eskofier B . Federated learning for healthcare: systematic review and architecture proposal. ACM Transactions on Intelligent Systems and Technology (TIST), 2022, 13( 4): 54

[3]

Xu J, Glicksberg B S, Su C, Walker P, Bian J, Wang F . Federated learning for healthcare informatics. Journal of Healthcare Informatics Research, 2021, 5( 1): 1–19

[4]

Jiang J C, Kantarci B, Oktug S, Soyata T . Federated learning in smart city sensing: challenges and opportunities. Sensors, 2020, 20( 21): 6230

[5]

Pandya S, Srivastava G, Jhaveri R, Babu M R, Bhattacharya S, Maddikunta P K R, Mastorakis S, Piran M J, Gadekallu T R . Federated learning for smart cities: a comprehensive survey. Sustainable Energy Technologies and Assessments, 2023, 55: 102987

[6]

Yang L, Tan B, Zheng V W, Chen K, Yang Q. Federated recommendation systems. In: Yang Q, Fan L, Yu H, eds. Federated Learning: Privacy and Incentive. Cham: Springer, 2020, 225−239

[7]

Wang Q, Yin H, Chen T, Yu J, Zhou A, Zhang X . Fast-adapting and privacy-preserving federated recommender system. The VLDB Journal, 2022, 31( 5): 877–896

[8]

Tan A Z, Yu H, Cui L, Yang Q . Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34( 12): 9587–9603

[9]

Luo M, Chen F, Hu D, Zhang Y, Liang J, Feng J. No fear of heterogeneity: classifier calibration for federated learning with non-IID data. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 5972−5984

[10]

Ye M, Fang X, Du B, Yuen P C, Tao D . Heterogeneous federated learning: state-of-the-art and research challenges. ACM Computing Surveys, 2024, 56( 3): 79

[11]

Sun N, Wang W, Tong Y, Liu K . Blockchain based federated learning for intrusion detection for internet of things. Frontiers of Computer Science, 2024, 18( 5): 185328

[12]

Liu F, Zheng Z, Shi Y, Tong Y, Zhang Y . A survey on federated learning: a perspective from multi-party computation. Frontiers of Computer Science, 2024, 18( 1): 181336

[13]

Xu J, Wong R C W. Efficiently answering top-k window aggregate queries: calculating coverage number sequences over hierarchical structures. In: Proceedings of the 39th IEEE International Conference on Data Engineering. 2023, 1300−1312

[14]

Dhasarathan C, Hasan M K, Islam S, Abdullah S, Khapre S, Singh D, Alsulami A A, Alqahtani A. User privacy prevention model using supervised federated learning-based block chain approach for internet of medical things. CAAI Transactions on Intelligence Technology, 2023

[15]

Li T, Hu S, Beirami A, Smith V. Ditto: fair and robust federated learning through personalization. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 6357−6368

[16]

Dinh C T, Tran N H, Nguyen T D. Personalized federated learning with Moreau envelopes. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 1796

[17]

Li Q, He B, Song D. Model-contrastive federated learning. In: Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 10708−10717

[18]

Huang Y, Chu L, Zhou Z, Wang L, Liu J, Pei J, Zhang Y. Personalized cross-silo federated learning on non-IID data. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021, 7865−7873

[19]

Ruan Y, Joe-Wong C. FedSoft: soft clustered federated learning with proximal local updating. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence. 2022, 8124−8131

[20]

Collins L, Hassani H, Mokhtari A, Shakkottai S. Exploiting shared representations for personalized federated learning. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 2089−2099

[21]

Oh J, Kim S, Yun S Y. FedBABU: toward enhanced representation for federated image classification. In: Proceedings of the 10th International Conference on Learning Representations. 2022, 1−29

[22]

Zhang J, Hua Y, Wang H, Song T, Xue Z, Ma R, Guan H. FedALA: adaptive local aggregation for personalized federated learning. In: Proceedings of the 37th AAAI Conference on Artificial Intelligence. 2023, 11237−11244

[23]

Li L, Gou J, Yu B, Du L, Tao Z Y D. Federated distillation: a survey. 2024, arXiv preprint arXiv: 2404.08564

[24]

Lin T, Kong L, Stich S U, Jaggi M. Ensemble distillation for robust model fusion in federated learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 198

[25]

Li D, Wang J. FedMD: heterogenous federated learning via model distillation. In: Proceedings of the 33rd Conference on Neural Information Processing Systems. 2019, 1–8

[26]

Makhija D, Han X, Ho N, Ghosh J. Architecture agnostic federated learning for neural networks. In: Proceedings of the 39th International Conference on Machine Learning. 2022, 14860−14870

[27]

Cho Y J, Manoel A, Joshi G, Sim R, Dimitriadis D. Heterogeneous ensemble knowledge transfer for training large models in federated learning. In: Proceedings of the 31st International Joint Conference on Artificial Intelligence. 2022, 2881−2887

[28]

Zhang J, Guo S, Ma X, Wang H, Xu W, Wu F. Parameterized knowledge transfer for personalized federated learning. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 10092−10104

[29]

Ghosh A, Chung J, Yin D, Ramchandran K . An efficient framework for clustered federated learning. IEEE Transactions on Information Theory, 2022, 68( 12): 8076–8091

[30]

Liu B, Guo Y, Chen X. PFA: privacy-preserving federated adaptation for effective model personalization. In: Proceedings of the Web Conference 2021. 2021, 923−934

[31]

Sun B, Huo H, Yang Y, Bai B. PartialFed: cross-domain personalized federated learning via partial initialization. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 23309−23320

[32]

Wang H, Li Y, Xu W, Li R, Zhan Y, Zeng Z. DaFKD: domain-aware federated knowledge distillation. In: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 20412−20421

[33]

Pfeiffer K, Rapp M, Khalili R, Henkel J . Federated learning for computationally constrained heterogeneous devices: a survey. ACM Computing Surveys, 2023, 55( 14s): 334

[34]

Gao D, Yao X, Yang Q. A survey on heterogeneous federated learning. 2022, arXiv preprint arXiv: 2210.04505

[35]

He C, Annavaram M, Avestimehr S. Group knowledge transfer: federated learning of large CNNs at the edge. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 1180

[36]

Tan Y, Long G, Liu L, Zhou T, Lu Q, Jiang J, Zhang C. FedProto: federated prototype learning across heterogeneous clients. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence. 2022, 8432−8440

[37]

Chen H Y, Chao W L. On bridging generic and personalized federated learning for image classification. In: Proceedings of the 10th International Conference on Learning Representations. 2022, 1−32

[38]

Krizhevsky A. Learning multiple layers of features from tiny images. University of Toronto, Dissertation, 2009

[39]

Cohen G, Afshar S, Tapson J, van Schaik A. EMNIST: extending MNIST to handwritten letters. In: Proceedings of 2017 International Joint Conference on Neural Networks. 2017, 2921−2926

[40]

Xiao H, Rasul K, Vollgraf R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. 2017, arXiv preprint arXiv: 1708.07747

[41]

Achituve I, Shamsian A, Navon A, Chechik G, Fetaya E. Personalized federated learning with Gaussian processes. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 8392−8406

[42]

Chen H, Wang C, Vikalo H. The best of both worlds: accurate global and personalized models through federated learning with data-free hyper-knowledge distillation. In: Proceedings of the 11th International Conference on Learning Representations. 2023, 1−24

[43]

LeCun Y, Bottou L, Bengio Y, Haffner P . Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86( 11): 2278–2324

[44]

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770−778

[45]

Sandler M, Howard A G, Zhu M, Zhmoginov A, Chen L C. MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 4510−4520

[46]

Ma N, Zhang X, Zheng H T, Sun J. ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 122−138

[47]

Qin Z, Deng S, Zhao M, Yan X. FedAPEN: personalized cross-silo federated learning with adaptability to statistical heterogeneity. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023, 1954−1964

[48]

Karimireddy S P, Kale S, Mohri M, Reddi S J, Stich S U, Suresh A T. Scaffold: stochastic controlled averaging for federated learning. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 476

[49]

Gao L, Fu H, Li L, Chen Y, Xu M, Xu C Z. FedDC: federated learning with non-iid data via local drift decoupling and correction. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 10102−10111

[50]

Yang Z, Zhang Y, Zheng Y, Tian X, Peng H, Liu T, Han B. FedFed: feature distillation against data heterogeneity in federated learning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2639

RIGHTS & PERMISSIONS

© The Author(s) 2025. This article is published with open access at link.springer.com and journal.hep.com.cn.
