Federated learning is a distributed framework that trains a centralised model using data from multiple clients without transferring that data to a central server. Despite rapid progress, federated learning still faces several unsolved challenges. Specifically, communication costs and system heterogeneity, such as nonidentical data distributions, hinder federated learning's progress. Several approaches have recently emerged for federated learning involving heterogeneous clients with varying computational capabilities (namely, heterogeneous federated learning). However, heterogeneous federated learning faces two key challenges: optimising model size and determining client selection ratios. Moreover, efficiently aggregating local models from clients with diverse capabilities is crucial for addressing system heterogeneity and communication efficiency. This paper proposes an evolutionary multiobjective optimisation framework for heterogeneous federated learning (MOHFL) to address these issues. Our approach formulates and solves a biobjective optimisation problem that minimises communication cost and model error rate. The decision variables comprise the model size and client selection ratio for each of the Q client clusters, yielding a total of 2 × Q optimisation parameters to be tuned. We develop a partition-based strategy for MOHFL that segregates clients into clusters based on their communication and computation capabilities. Additionally, we implement an adaptive model sizing mechanism that dynamically assigns appropriate subnetwork architectures to clients based on their computational constraints. We also propose a unified aggregation framework to combine models of varying sizes from heterogeneous clients effectively. Extensive experiments on multiple datasets demonstrate the effectiveness and superiority of our proposed method compared to existing approaches.
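As an illustration of the biobjective search space described above, the sketch below encodes a candidate as 2 × Q decision variables (one model size and one selection ratio per cluster) and filters a random population for Pareto-nondominated solutions. The surrogate objective formulas are our own stand-ins for illustration only; in MOHFL the objectives come from simulated federated training, not closed-form expressions.

```python
import random

random.seed(0)
Q = 4  # number of client clusters (illustrative choice)

def evaluate(candidate):
    # candidate = [model size per cluster] + [selection ratio per cluster],
    # i.e. 2*Q decision variables. These surrogate objectives are NOT the
    # paper's: larger models sent to more clients raise communication cost
    # but lower the error rate, giving a simple trade-off to visualise.
    sizes, ratios = candidate[:Q], candidate[Q:]
    capacity = sum(s * r for s, r in zip(sizes, ratios))
    comm_cost = capacity                  # objective 1: minimise
    error_rate = 1.0 / (1.0 + capacity)  # objective 2: minimise
    return comm_cost, error_rate

def dominates(a, b):
    # Pareto dominance for two minimisation objectives
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# random population of 2*Q-dimensional candidates
pop = [[random.random() for _ in range(2 * Q)] for _ in range(50)]
front = [p for p in pop
         if not any(dominates(evaluate(q), evaluate(p)) for q in pop if q is not p)]
```

With these perfectly anti-correlated surrogates, every candidate lies on the trade-off curve, so the whole population is nondominated; an evolutionary multiobjective solver such as NSGA-II would then evolve this population towards a better-spread front.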
Monge-Ampère equations (MAEs) are fully nonlinear second-order partial differential equations (PDEs) that are closely related to various fields, including optimal transport (OT) theory, geometrical optics and affine geometry. Despite their significance, MAEs are extremely challenging to solve. Although some classical numerical approaches can solve MAEs, their computational efficiency deteriorates significantly on fine grids, and their convergence often depends heavily on the quality of the initial estimate. Research on deep learning methods for solving MAEs is still in its early stages and predominantly addresses simple formulations with basic Dirichlet boundary conditions. Here, we propose a deep learning method based on physics-driven deep neural networks, enabling the solution of both simple and generalised MAEs with transport boundary conditions. In this method, we deal with two first-order sub-equations separated from the MAE instead of solving the single MAE directly, which facilitates the imposition of transport boundary conditions and simplifies the training of neural networks. Moreover, we constrain the convexity of the solution using the Lagrange multiplier method and keep the optimisation process differentiable with bilinear interpolation. We provide three progressively complex examples, ranging from a simple MAE with an analytical solution to a highly nonlinear variant arising in phase retrieval, to validate the effectiveness of our method. For comparison, we benchmark against state-of-the-art deep learning approaches that have been systematically adapted to accommodate the specific requirements of each example.
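To fix ideas, the prototypical MAE and one common first-order splitting are sketched below; this is an illustrative form of the idea, not necessarily the paper's exact decomposition or boundary treatment.

```latex
% Prototypical Monge–Ampère equation on a convex domain $\Omega \subset \mathbb{R}^2$:
\det\bigl(D^2 u(\mathbf{x})\bigr) = f(\mathbf{x}), \quad \mathbf{x} \in \Omega, \qquad u \ \text{convex}.
% Introducing the auxiliary field $\mathbf{p} = \nabla u$ yields two first-order sub-equations,
\nabla u = \mathbf{p}, \qquad \det\bigl(\nabla \mathbf{p}\bigr) = f,
% while the transport boundary condition prescribes the image of the boundary
% under the gradient map:
\nabla u(\partial\Omega) = \partial\Omega^{*}.
```

Working with the first-order pair rather than the second-order equation is what makes the transport boundary condition, which constrains the gradient map rather than the solution values, straightforward to impose.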
Large language model-based (LLM-based) text-to-SQL methods have made important progress in generating SQL queries for real-world applications. When confronted with table content-aware questions in real-world scenarios, ambiguous data content keywords and nonexistent database schema column names within the question lead to the poor performance of existing methods. To solve this problem, we propose a novel approach to table content-aware text-to-SQL with self-retrieval (TCSR-SQL). It leverages the LLM's in-context learning capability to extract data content keywords from the question and infer the possibly related database schema, which is used to generate a Seed SQL for fuzzy searching of the database. The search results are further used to confirm the encoding knowledge with the designed encoding knowledge table, including the column names and exact stored content values used in the SQL. The encoding knowledge is then used to obtain the final Precise SQL through multiple rounds of a generation-execution-revision process. To validate our approach, we introduce a table-content-aware, question-related benchmark dataset containing 2115 question-SQL pairs. Comprehensive experiments conducted on this benchmark demonstrate the remarkable performance of TCSR-SQL, which achieves an improvement of at least 27.8% in execution accuracy compared to other state-of-the-art methods.
Although Transformer-based image restoration methods have demonstrated impressive performance, existing Transformers still insufficiently exploit multiscale information. Previous non-Transformer-based studies have shown that incorporating multiscale features is crucial for improving restoration results. In this paper, we propose a multiscale Transformer (MST) that captures cross-scale attention among tokens, thereby effectively leveraging the multiscale patch recurrence prior of natural images. Furthermore, we introduce a channel-gate feed-forward network (CGFN) to enhance inter-channel information aggregation and reduce channel redundancy. To simultaneously utilise global, local and multiscale features, we design a multitype feature integration block (MFIB). Extensive experiments on both image super-resolution and HEVC compressed video artefact reduction demonstrate that the proposed MST achieves state-of-the-art performance. Ablation studies further verify the effectiveness of each proposed module.
Increased awareness of Tibetan cultural preservation, along with technological advancements, has led to significant academic research on Tibetan. However, the structural complexity of the Tibetan language and limited labelled handwriting data impede advancements in Optical Character Recognition (OCR) and other applications. To address these challenges, this paper proposes an innovative Tibetan data augmentation technique that uses Generative Adversarial Networks (GANs) to synthesise arbitrary handwriting images in variable calligraphic styles based on the inputs. Moreover, our method leverages a Real-Fake Cross Inputs Strategy during training to enhance generation diversity and improve the model's generalisability in generating handwritten text beyond the training set and pre-defined corpus. The model was trained on three Tibetan handwriting datasets, comprising Umê style numerals, Uchen style consonants, and Khyug-yig style words. Experimental results demonstrate that the model successfully generates realistic and recognisable Tibetan numeral and consonant handwriting, achieving Fréchet Inception Distance (FID) scores of 14.45 and 27.63, respectively. The proposed method's effectiveness in augmenting OCR models was validated by a reduced Word Error Rate (WER) on the augmented datasets.
Anomaly detection (AD) aims to identify abnormal patterns that deviate from normal behaviour, playing a critical role in applications such as industrial inspection, medical imaging and autonomous driving. However, AD often faces a scarcity of labelled data. To address this challenge, we propose a novel semi-supervised anomaly detection method, DASAD (Deviation-Guided Attention for Semi-Supervised Anomaly Detection), which integrates deviation-guided attention with contrastive regularisation to reduce the unreliability of pseudo-labels. Specifically, a deviation-guided attention mechanism is designed to combine three types of deviations (latent embeddings, residual direction vectors and hierarchical reconstruction errors) to capture anomaly-specific cues effectively, thereby enhancing the credibility of pseudo-labels for unlabelled samples. Furthermore, a class-asymmetric contrastive loss is constructed to promote compact representations of normal instances while preserving the structural diversity of anomalies. Extensive experiments on eight benchmark datasets demonstrate that DASAD consistently outperforms state-of-the-art methods and exhibits strong generalisation across six anomaly detection domains.
Mobile wheel-legged robots exhibiting mobility, stability and reliability have garnered heightened research attention in demanding real-world scenarios, especially in material transport, emergency response and space exploration. A kinematic model merely delineates the geometric relationships of the controlled object, disregarding force feedback. This study investigates model predictive trajectory tracking control utilising the robot dynamic model (DRMPC) in the context of unpredictable interactions. The predictive tracking controller for the wheel-legged robot is introduced in the context of position tracking. A dynamic approximator is employed to address the uncertain interactions in the tracking process. Ultimately, co-simulation and empirical tests are conducted to demonstrate the efficacy of the devised control methodology, which achieves high precision and dependable robustness. This work addresses technical and practical gaps in autonomous movement in complicated environments and enhances the robot's manoeuvrability and flexibility.
Compared to monocular depth estimation, multi-view depth estimation often yields more accurate results. However, traditional multi-view depth estimation methods often fail to fully leverage semantic information and struggle to effectively fuse information from multiple views, leading to suboptimal prediction performance in challenging scenarios such as texture-less regions and reflective surfaces. To address these limitations, we present MVI-Depth, a novel framework with two core innovations: (1) a Semantic Fusion Module (SFM) that establishes semantic correspondence, and (2) a Depth Updating Module (DUM) enabling iterative depth refinement. Specifically, MVI-Depth initially establishes a main-view representation that integrates single-view depth, depth features, and semantic features. Subsequent feature extraction from neighbouring views enables the construction of the original cost volume. Recognising the inherent limitations of direct cost volume utilisation in complex scenes, the proposed SFM constructs an aligned semantic cost volume to exploit the complementarity between semantic and depth information, forming an improved final cost volume. The final cost volume is updated through the proposed DUM to achieve iterative depth optimisation. Comprehensive evaluations demonstrate that MVI-Depth achieves superior performance across all standard metrics on both the ScanNet and KITTI benchmarks, outperforming existing methods. Additional experiments on the 7-Scenes dataset further confirm the framework's robust generalisation capabilities in diverse environments.
Nonconvex optimisation plays a crucial role in science and industry. However, existing methods often become trapped in local optima or provide inferior solutions when solving nonconvex optimisation problems, and they lack robustness in noisy scenarios. To address these limitations, we aim to develop a robust, efficient and globally convergent solver for nonconvex optimisation. This is achieved by combining the efficient local exploitation ability of a parameter-learnt neural dynamics (PLND) model with the global search capability of a coevolutionary mechanism, yielding a coevolutionary neural dynamics with learnable parameters (CNDLP) model. Gradient information guides the search towards the optimal solution more effectively, while the inherent robustness of neural dynamics models ensures that the influence of noise is effectively suppressed during computation. Theoretical analyses establish the global convergence and robustness of the designed CNDLP model. Numerical experiments on nine benchmark functions and a practical engineering design example are conducted in comparison with five existing meta-heuristic algorithms. The benchmarks cover diverse problems, from classical landscapes such as the Shubert function to high-dimensional cases such as the 30-dimensional Rosenbrock function. The results confirm CNDLP's excellent performance in both solution quality and convergence speed under noise.
Humans can learn complex and dexterous manipulation tasks by observing videos, imitating and exploring. Manipulating free micron-sized deformable cells with multiple end-effectors is one of the most challenging tasks in robotic micromanipulation. We propose an imitation-enhanced reinforcement learning method, inspired by the human learning process, that enables robots to learn cell micromanipulation skills from videos. First, for microscopic robot micromanipulation videos, a multi-task observation (MTO) network is designed to identify the two end-effectors and the manipulated objects and obtain their spatiotemporal trajectories. The spatiotemporal constraints on the robot's actions are obtained with a task-parameterised hidden Markov model (THMM). To simultaneously address the safety and dexterity of robot micromanipulation, an imitation learning optimisation-based soft actor-critic (ILOSAC) algorithm is proposed, in which the robot performs skill learning by demonstration and exploration. The proposed method is capable of performing complex cell manipulation tasks in a realistic physical environment. Experiments indicate that, compared with current methods and manual remote manipulation, the proposed framework achieves a shorter operation time and less deformation of cells, which is expected to facilitate the development of robot skill learning.
Discovering meaningful face morphing is critical for applications in image synthesis. Traditional unsupervised methods rely on global or layer-wise representations, neglecting finer local details and thus limiting the control over specific facial attributes. In this work, we introduce an improved unsupervised approach that leverages contrastive learning and K-means clustering to learn both layer-wise and local features (LLF) in the latent space of StyleGAN. Our method segments latent representations into multiple local components across different layers, enabling fine-grained control over attributes such as hair, eyes, and mouth. Experimental results demonstrate that LLF outperforms existing methods by providing more interpretable facial transformations while preserving high image realism, offering a promising solution for enhanced unsupervised face morphing applications. The code is available at https://github.com/disanda/LLF.
As semiconductor integration density increases, wafer defects grow in complexity and variety, and recognition accuracy remains low for overlapping mixed defects and unknown wafer defects. To cope with these problems, this study proposes a lightweight model for wafer defect detection called LightWMNet. First, using a hierarchical attention encoder-decoder architecture, the features of the wafer defect pattern (WDP) are channel-recalibrated to generate high-resolution fine-grained features and low-resolution coarse-grained features. Second, the backbone network incorporates two novel attention modules, feedforward spatial attention (FFSa) and feedforward channel attention (FFCa), to amplify responses in critical defect regions and suppress noise from stochastic discrete pixels. These mechanisms synergistically enhance feature discriminability without introducing significant parametric overhead. Finally, the Dice loss function and the cross-entropy loss function are combined to jointly evaluate the segmentation and classification accuracy of the model. Experimental results on the public mixed wafer defect dataset MixedWM38 show that the pixel accuracy (PA), intersection over union (IoU) and Dice coefficient of the proposed network reach 98.26%, 94.83% and 97.22%, respectively. Without significantly increasing the computational complexity or size of the model, the classification accuracy of LightWMNet on single defects, three mixed defects and four mixed defects is improved by 0.5%, 0.25% and 0.89%, respectively, compared with the existing state-of-the-art (SOTA) model. Furthermore, we used transfer learning for the first time to evaluate the model's generalisation ability on unseen defect categories. The results show that LightWMNet retains a degree of recognition ability even on untrained wafer defects.
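The combined loss described above can be sketched as follows; the equal 0.5/0.5 weighting and the binary-segmentation setting are our assumptions for illustration, not LightWMNet's published configuration:

```python
import numpy as np

def dice_ce_loss(probs, target, w_dice=0.5, w_ce=0.5, eps=1e-6):
    """Combined Dice + cross-entropy loss for binary segmentation.

    `probs` holds predicted foreground probabilities in (0, 1) and `target`
    is a binary mask of the same shape. The 0.5/0.5 weighting is an
    illustrative assumption, not the paper's tuned setting.
    """
    probs = probs.ravel().astype(float)
    target = target.ravel().astype(float)
    # Dice term: penalises poor region overlap (eps guards empty masks)
    inter = (probs * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)
    # Cross-entropy term: penalises per-pixel misclassification
    ce = -np.mean(target * np.log(probs + eps)
                  + (1.0 - target) * np.log(1.0 - probs + eps))
    return w_dice * dice + w_ce * ce
```

The Dice term rewards region-level overlap (useful for small defect areas, where cross-entropy alone is dominated by the background), while the cross-entropy term keeps per-pixel classification sharp.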
The global shift towards sustainable energy has intensified research into renewable sources, particularly wave energy. Pakistan, with its long coastline, holds significant potential for wave energy development. However, identifying optimal locations for wave energy plants involves evaluating complex, multi-faceted criteria. This study employs a multi-criteria group decision-making (MCGDM) approach using single-valued neutrosophic numbers (SVNNs) to address both the qualitative and quantitative uncertainties inherent in real-world scenarios. To enhance decision quality, we introduce two novel operators: the single-valued neutrosophic prioritised averaging (SVNPAd) operator and the single-valued neutrosophic prioritised geometric (SVNPGd) operator, both incorporating priority degrees. These tools allow decision-makers to better express preferences and handle ambiguous data. The proposed model is validated through comparative analysis with prior studies and demonstrates improved robustness in site selection. Furthermore, we analyse how variations in priority degrees influence decision outcomes, enabling a more dynamic and tailored decision-making process. Our method contributes a more holistic and adaptive framework for selecting locations for wave energy projects, ultimately supporting informed investments in renewable energy infrastructure and improving energy access in underserved coastal regions.
This paper presents an adaptive formation control method for a heterogeneous robot swarm, utilising a multilevel formation task tree to model various types of formation tasks and a single-state distributed k-winner-take-all (S-DKWTA) algorithm to address the multirobot task allocation (MRTA) problem. In addition, we propose an enhanced load reassignment algorithm to resolve conflicts when using S-DKWTA. The S-DKWTA algorithm can manage multiple objectives and dynamically select leaders in real time, thereby optimising formation efficiency and reducing energy consumption. The proposed approach integrates an enhanced artificial potential field (APF) to govern the motion of heterogeneous robot systems encompassing both unmanned ground vehicles (UGVs) and unmanned aerial vehicles (UAVs), thereby achieving collision and obstacle avoidance. Simulations employing a swarm of UGVs and UAVs to achieve formation movement demonstrate the efficacy of this approach. The combination of S-DKWTA and the improved APF ensures stable and adaptable formation control, underscoring its potential for diverse multirobot applications.
This work presents UNO, a unified monocular visual odometry framework that enables robust and adaptable pose estimation across diverse environments, platforms and motion patterns. Unlike traditional methods that rely on deployment-specific tuning or predefined motion priors, our approach generalises effectively across a wide range of real-world scenarios, including autonomous vehicles, aerial drones, mobile robots and handheld devices. To this end, we introduce a mixture-of-experts strategy for local state estimation, with several specialised decoders that each handle a distinct class of ego-motion patterns. Moreover, we introduce a fully differentiable Gumbel-softmax module that constructs a robust inter-frame correlation graph, selects the optimal expert decoder and prunes erroneous estimates. These cues are then fed into a unified back-end that combines pretrained scale-independent depth priors with lightweight bundle adjustment to enforce geometric consistency. We extensively evaluate our method on three major benchmark datasets: KITTI (outdoor/autonomous driving), EuRoC-MAV (indoor/aerial drones) and TUM-RGBD (indoor/handheld), demonstrating state-of-the-art performance.
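The Gumbel-softmax routing idea behind such a mixture-of-experts selector can be sketched as below. The three-expert setup and the router scores are hypothetical names of ours, not UNO's actual modules, and this numpy version shows only the sampling step (in training, the soft weights keep the selection differentiable):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0, rng=rng):
    """Differentiable (soft) sample from a categorical distribution.

    Returns a weight vector over categories that approaches one-hot as
    the temperature tau -> 0, while remaining differentiable in `logits`.
    """
    # Gumbel(0, 1) noise via the inverse-CDF trick
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max())  # numerically stable softmax
    return y / y.sum()

# Hypothetical router scores over three expert decoders
# (e.g. driving / aerial / handheld ego-motion patterns)
logits = np.array([2.0, 0.5, -1.0])
weights = gumbel_softmax(logits, tau=0.5)
expert = int(np.argmax(weights))  # hard selection in the forward pass
```

At low temperature the weights concentrate on one expert, so the forward pass behaves like a hard selection while gradients still flow to the router through the soft weights.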
Brain tumours disrupt the normal functioning of the brain and, if left untreated, can invade surrounding tissues, blood vessels, and nerves, posing a severe threat. Consequently, early detection is crucial to prevent tragic outcomes. Distinguishing brain tumours through manual detection poses a significant challenge given their diverse features, such as differing shapes, sizes, and nucleus characteristics. Therefore, this research introduces an improved architecture for tumour detection named Brain-RetinaNet, an extension of the RetinaNet model. Brain-RetinaNet is specifically designed for the automated detection and identification of brain tumours in MRI images. It utilises an advanced multiscale feature fusion mechanism within the X-module, complemented by a channel attention module. The feature fusion mechanism within the X-module progressively merges features from different scales, producing enriched feature maps that encompass valuable information. At the same time, the attention module dynamically allocates optimal weights to individual channels within the feature map, enabling the network to prioritise relevant features while reducing interference from unnecessary ones. Moreover, this study employs data augmentation techniques to address the limited number of available samples. Experimental results indicate that Brain-RetinaNet outperforms existing detectors such as YOLO, SSD, CenterNet, EfficientNet, and M2Det for brain tumour detection from MRI images.
Audio-visual speaker tracking aims to determine the locations of multiple speakers in the scene by leveraging signals captured from multisensor platforms. Multimodal fusion methods can improve both the accuracy and robustness of speaker tracking. However, in complex multispeaker tracking scenarios, critical challenges such as cross-modal feature discrepancy, weak sound source localisation ambiguity and frequent identity switch errors remain unresolved; these severely hinder the modelling of speaker identity consistency and consequently lead to degraded tracking accuracy and unstable tracking trajectories. To this end, this paper proposes a multimodal multispeaker tracking network using audio-visual contrastive learning (AVCLNet). AVCLNet integrates heterogeneous modal representations into a unified space through audio-visual contrastive learning, which facilitates cross-modal feature alignment, mitigates cross-modal feature bias and enhances identity-consistent representations. In the audio-visual measurement stage, we design a vision-guided weighted enhancement method for weak sound sources, which leverages visual cues to establish cross-modal mappings and employs a spatiotemporal dynamic weighting mechanism to improve the detectability of weak sound sources. Furthermore, in the data association phase, a dual geometric constraint strategy is introduced that combines 2D and 3D spatial geometric information, reducing frequent identity switch errors. Experiments on the AV16.3 and CAV3D datasets show that AVCLNet outperforms state-of-the-art methods, demonstrating superior robustness in multispeaker scenarios.
Recently, the zeroing neural network (ZNN) has demonstrated remarkable effectiveness in tackling time-varying problems, delivering robust performance across both noise-free and noisy environments. However, existing ZNN models are limited in their ability to actively suppress noise, which constrains their robustness and precision in solving time-varying problems. This paper introduces a novel active noise rejection ZNN (ANR-ZNN) design that enhances noise suppression by integrating computational error dynamics and harmonic behaviour. Through rigorous theoretical analysis, we demonstrate that the proposed ANR-ZNN maintains robust convergence in computational error performance under environmental noise. As a case study, the ANR-ZNN model is specifically applied to time-varying matrix inversion. Comprehensive computer simulations and robotic experiments further validate the ANR-ZNN's effectiveness, emphasising the proposed design's superiority and potential for solving time-varying problems.
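For readers unfamiliar with the underlying design formula, the textbook ZNN baseline (not the noise-rejecting ANR-ZNN itself) can be sketched on the scalar time-varying inversion a(t)·x(t) = 1: enforcing the error dynamics de/dt = -gamma·e on the error e(t) = a(t)·x(t) - 1 yields an ODE whose state x(t) tracks 1/a(t).

```python
import math

# Classical ZNN design on scalar time-varying inversion a(t) * x(t) = 1.
# Design formula: de/dt = -gamma * e with e(t) = a(t) * x(t) - 1,
# which rearranges to dx/dt = (-a'(t) * x - gamma * e) / a(t).
# This is the baseline that ANR-ZNN builds on, not the paper's model.

gamma, dt, T = 10.0, 1e-4, 2.0
a  = lambda t: 2.0 + math.sin(t)  # time-varying coefficient, bounded away from 0
da = lambda t: math.cos(t)        # its analytic derivative

x, t = 0.0, 0.0                   # deliberately wrong initial state
while t < T:                      # forward-Euler integration of the ZNN ODE
    e = a(t) * x - 1.0
    x += dt * (-da(t) * x - gamma * e) / a(t)
    t += dt

residual = abs(a(T) * x - 1.0)    # tracking error at the final time
```

Because the error obeys de/dt = -gamma·e exactly along the ODE, it decays as e(0)·exp(-gamma·t) regardless of how a(t) varies; the paper's ANR-ZNN augments these dynamics to keep that decay intact under additive noise.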
This paper presents an adaptive multi-agent coordination (AMAC) strategy suitable for complex scenarios, which only requires information exchange between neighbouring robots. Unlike traditional multi-agent coordination methods that are solved by neural dynamics, the proposed strategy displays greater flexibility, adaptability and scalability. Furthermore, the proposed AMAC strategy is reconstructed as a time-varying complex-valued matrix equation. By introducing a dynamic error function, a fixed-time convergent zeroing neural network (FTCZNN) model is designed for the online solution of the AMAC strategy, with its convergence time upper bound derived theoretically. Finally, the effectiveness and applicability of the coordination control method are demonstrated by numerical simulations and physical experiments. Numerical results indicate that this method can reduce the formation error to the order of 10⁻⁶ within 1.8 s.
In intelligent transportation systems, object detection in surveillance video is one of the important functions. The performance of existing surveillance video object detection algorithms is affected by conflicts between the features of the objects, which leads to a decline in precision. Therefore, an object detection algorithm based on deep learning and salient feature fusion is proposed. The proposed method introduces a non-weight-sharing network to process the salient features of the image and fuse them with the features extracted from the red-blue-green branch. Unlike previous solutions, the salient feature extraction branch uses the boundary features and statistical features of the image, and the features of the two branches are fused in the efficient layer aggregation network structure. At the same time, an attention module is used in the efficient layer aggregation network with a convolutional block attention module to improve the efficiency of feature utilisation. Training and evaluation are carried out on the constructed surveillance video feature conflict dataset, in which eight scenes are constructed by means of orthogonal experiments. The experimental results show that the performance of object detection can be significantly improved by using the proposed method in the object detection task for intelligent transportation system surveillance video feature conflict scenes.