2025, Volume 34, Issue 1 (2025-04-02)

  • Single-photon sensors are novel devices with extremely high single-photon sensitivity and temporal resolution. However, these advantages also make them highly susceptible to noise. Moreover, single-photon cameras suffer from severe quantization, as low as 1 bit/frame. These factors make it a daunting task to recover high-quality scene information from noisy single-photon data. Most current image reconstruction methods for single-photon data are mathematical approaches, which limits information utilization and algorithm performance. In this work, we propose a hybrid information enhancement model that significantly improves the efficiency of information utilization by leveraging attention mechanisms from both spatial and channel branches. Furthermore, we introduce a structural feature enhancement module for the FFN of the transformer, which explicitly improves the model's ability to extract and enhance high-frequency structural information through two symmetric convolution branches (an illustrative sketch of the attention and FFN components follows this list). Additionally, we propose a single-photon data simulation pipeline based on RAW images to address the lack of single-photon datasets. Experimental results show that the proposed method outperforms state-of-the-art methods at various noise levels and recovers high-frequency structures and extracts information more efficiently.
  • Depth maps play a crucial role in practical applications such as computer vision, augmented reality, and autonomous driving. Obtaining clear and accurate depth information in video depth estimation remains a significant challenge, and existing monocular video depth estimation models tend to produce blurred or inaccurate depth in regions around object edges and in low-texture areas. To address this issue, we propose a monocular depth estimation architecture guided by semantic segmentation masks, which introduces semantic information into the model to correct ambiguous depth regions. Experimental results show that our method improves the accuracy of depth at edges, demonstrating the effectiveness of our approach.
  • Due to the limitations of existing imaging hardware, obtaining high-resolution hyperspectral images is challenging. Hyperspectral image super-resolution (HSI SR) has therefore become an attractive research topic in computer vision. However, most HSI SR methods focus on the tradeoff between spatial resolution and spectral information and cannot guarantee efficient extraction of image information. In this paper, a multidimensional features network (MFNet) for HSI SR is proposed, which simultaneously learns and fuses the spatial, spectral, and frequency features of the HSI. Spatial features contain rich local details, spectral features capture the information in and correlation between spectral bands, and frequency features reflect global information and provide the global context of the HSI. Fusing the three types of features better guides super-resolution and yields higher-quality high-resolution hyperspectral images. In MFNet, a frequency feature extraction module (FFEM) extracts the frequency features (a minimal FFT-based sketch follows this list); on this basis, a multidimensional features extraction module (MFEM) is designed to learn and fuse the multidimensional features. Experimental results on two public datasets demonstrate that MFNet achieves state-of-the-art performance.
  • Among hyperspectral imaging technologies, interferometric spectral imaging is widely used in remote sensing owing to its large luminous flux and high resolution. However, because of its complicated mechanism, interferometric imaging suffers from multi-stage degradation. Most existing interferometric spectrum reconstruction methods are based on traditional model-based frameworks with multiple steps, showing poor efficiency and limited performance. Thus, we propose an interferometric spectrum reconstruction method based on degradation synthesis and deep learning. First, based on the imaging mechanism, we propose a mathematical model of interferometric imaging that characterizes the degradation components (noise and trends) introduced during imaging. The model consists of three stages, namely instrument degradation, sensing degradation, and a signal-independent degradation process (a toy degradation-synthesis sketch follows this list). Then, we design a calibration-based method to estimate the model parameters, whose results are used to synthesize a realistic dataset for learning-based algorithms. In addition, we propose a dual-stage interferogram spectrum reconstruction framework, which supports pre-training and integration of denoising DNNs. Experiments demonstrate the reliability of our degradation model and synthesized data, as well as the effectiveness of the proposed reconstruction method.
  • Top-view fisheye cameras are widely used in personnel surveillance for their broad field of view, but their unique imaging characteristics pose challenges such as distortion, complex scenes, scale variations, and small objects near image edges. To tackle these challenges, we propose peripheral focus you only look once (PF-YOLO), an enhanced YOLOv8n-based method. First, we introduce a cutting-patch data augmentation strategy to mitigate the shortage of small-object samples in various scenes. Second, to sharpen the model's focus on small objects near the edges, we design the peripheral focus loss, which uses dynamic focus coefficients to provide greater gradient gains for these objects, improving their regression accuracy (an illustrative weighting sketch follows this list). Finally, we design the three-dimensional (3D) spatial-channel coordinate attention C2f module, which enhances spatial and channel perception, suppresses noise, and improves personnel detection. Experimental results demonstrate that PF-YOLO achieves strong performance on the challenging events for person detection from overhead fisheye images (CEPDTOF) and in-the-wild events for people detection and tracking from overhead fisheye cameras (WEPDTOF) datasets. Compared to the original YOLOv8n model, PF-YOLO improves mean average precision at an IoU threshold of 0.5 (mAP50), mAP50-95, and recall on CEPDTOF by 2.1%, 1.7%, and 2.9%, reaching 95.7%, 65.8%, and 95.5%, respectively. On WEPDTOF, PF-YOLO achieves substantial improvements of 31.4%, 14.9%, 61.1%, and 21.0% in mAP50, mAP50-95, precision, and recall, reaching 53.1%, 22.8%, 91.2%, and 57.2%, respectively.
  • This paper introduces a lightweight remote sensing image dehazing network called the multi-dimensional weight regulation network (MDWR-Net), which addresses the high computational cost of existing methods. Previous works, often based on encoder-decoder structures with multiple upsampling and downsampling layers, are computationally expensive. To improve efficiency, the paper proposes two modules: the efficient spatial resolution recovery module (ESRR) for upsampling and the efficient depth information augmentation module (EDIA) for downsampling. These modules not only reduce model complexity but also enhance performance. Additionally, the partial feature weight learning module (PFWL) is introduced to reduce the computational burden by applying weight learning across only part of the feature dimensions rather than using full-channel convolution (a minimal sketch of this idea follows this list). To overcome the limitations of convolutional neural network (CNN)-based architectures, the haze distribution index transformer (HDIT) is integrated into the decoder. We also propose the physical-based non-adjacent feature fusion module (PNFF), which leverages the atmospheric scattering model to improve the generalization of MDWR-Net. MDWR-Net achieves superior dehazing performance at a computational cost of just 2.98×10⁹ multiply-accumulate operations (MACs), less than one-tenth that of previous methods. Experimental results validate its effectiveness in balancing performance and computational efficiency.
  • Panoramic images, offering a 360-degree view, are essential in virtual reality (VR) and augmented reality (AR), where high-quality textures enhance realism. However, acquiring complete, high-quality panoramic textures is challenging. This paper introduces a method that uses generative adversarial networks (GANs) and the contrastive language-image pretraining (CLIP) model to restore and control texture in panoramic images. The GAN model captures complex structures and maintains consistency, while CLIP enables fine-grained texture control via semantic text-image associations. GAN inversion optimizes latent codes for precise texture details (a sketch of this optimization loop follows this list). The resulting low dynamic range (LDR) images are converted to high dynamic range (HDR) using the Blender engine for seamless texture blending. Experimental results demonstrate the effectiveness and flexibility of this method in panoramic texture restoration and generation.
  • Video snapshot compressive imaging (Video SCI) modulates a scene with varying encoding masks and captures compressed measurements with a low-speed camera during a single exposure; reconstruction algorithms then restore the image sequence of the dynamic scene, offering advantages such as reduced bandwidth and storage requirements. Temporal correlation in video data is crucial for Video SCI, as reconstruction algorithms exploit the relationships among frames to improve efficiency and quality, particularly for fast-moving objects. This paper discretizes video frames to create datasets with the same data volume but differing temporal correlations (a sketch of this construction follows this list). We use the state-of-the-art (SOTA) reconstruction framework EfficientSCI++ to train reconstruction models on these datasets. Evaluating the resulting reconstructions, our simulation experiments confirm that reduced temporal correlation leads to decreased reconstruction accuracy. We also simulate reconstruction on datasets devoid of temporal correlation, showing that training on non-temporal data alters the temporal feature extraction capability of the transformer while having a negligible impact on the evaluation of reconstruction results for test datasets without temporal correlation.
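
The hybrid spatial/channel attention and the structural feature enhancement FFN from the single-photon reconstruction abstract can be pictured with the minimal PyTorch sketch below. The layer choices (squeeze-and-excitation-style channel gating, depthwise spatial gating, and two parallel depthwise 3×3 branches inside the FFN) and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Illustrative fusion of a channel-attention branch and a spatial-attention branch."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        # Channel branch: squeeze-and-excitation style gating.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )
        # Spatial branch: per-pixel gating from a depthwise convolution.
        self.spatial = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.channel(x) + x * self.spatial(x)

class StructureEnhancedFFN(nn.Module):
    """FFN with two symmetric 3x3 convolution branches re-injecting high-frequency structure."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.proj_in = nn.Conv2d(dim, hidden, 1)
        self.branch_a = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.branch_b = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.proj_out = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        h = self.act(self.proj_in(x))
        h = h + self.branch_a(h) + self.branch_b(h)  # symmetric structural branches
        return self.proj_out(h)

x = torch.randn(1, 32, 64, 64)
print(StructureEnhancedFFN(32)(HybridAttention(32)(x)).shape)  # torch.Size([1, 32, 64, 64])
```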
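
For the MFNet abstract, the sketch below shows one common way an FFEM-like frequency branch could expose global spectrum information: a 2D FFT over the spatial dimensions, a 1×1 convolution over the stacked real/imaginary parts, and an inverse FFT with a residual connection. The exact design is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class FrequencyFeature(nn.Module):
    """Illustrative frequency-domain feature extractor (FFEM-like sketch)."""
    def __init__(self, channels):
        super().__init__()
        # Mixes the stacked real/imaginary parts of the 2D spectrum.
        self.mix = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels, 2 * channels, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")           # (B, C, H, W//2+1), complex
        feat = torch.cat([spec.real, spec.imag], dim=1)   # (B, 2C, H, W//2+1)
        feat = self.mix(feat)
        real, imag = feat.chunk(2, dim=1)
        out = torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")
        return out + x                                    # residual keeps local spatial detail

print(FrequencyFeature(31)(torch.randn(2, 31, 32, 32)).shape)
```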
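
The three-stage degradation model in the interferometric spectrum reconstruction abstract is only named, not specified; the NumPy sketch below shows the general shape such a synthesis pipeline could take, with a slowly varying instrument trend, signal-dependent sensing noise, and additive signal-independent noise. The functional forms and parameter values are assumptions.

```python
import numpy as np

def synthesize_degraded_interferogram(clean, gain=0.95, trend_strength=0.02,
                                       shot_scale=0.01, read_sigma=0.002, rng=None):
    """Toy three-stage degradation: instrument trend -> signal-dependent noise -> additive noise."""
    rng = np.random.default_rng() if rng is None else rng
    n = clean.shape[-1]
    # Stage 1: instrument degradation, modeled here as a gain plus a slow quadratic trend.
    opd = np.linspace(-1.0, 1.0, n)
    degraded = gain * clean + trend_strength * (opd ** 2 - (opd ** 2).mean())
    # Stage 2: sensing degradation, signal-dependent (shot-like) noise.
    degraded = degraded + rng.normal(scale=shot_scale * np.sqrt(np.abs(degraded) + 1e-8))
    # Stage 3: signal-independent degradation, additive read noise.
    degraded = degraded + rng.normal(scale=read_sigma, size=degraded.shape)
    return degraded

interferogram = np.cos(np.linspace(0, 40 * np.pi, 512)) * np.hanning(512)
noisy = synthesize_degraded_interferogram(interferogram)
```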
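
For PF-YOLO, the peripheral focus loss is described only as assigning larger gradient gains to objects near the fisheye image edge through dynamic focus coefficients. The sketch below shows one plausible radial weighting applied to an IoU-style regression loss; the coefficient formula is an assumption, not the published definition.

```python
import torch

def peripheral_focus_weights(box_centers, image_size, gamma=1.0):
    """Dynamic focus coefficients that grow with distance from the fisheye image centre.

    box_centers: (N, 2) tensor of (x, y) box centres in pixels.
    image_size:  (H, W) of the top-view fisheye frame.
    """
    h, w = image_size
    center = box_centers.new_tensor([w / 2.0, h / 2.0])
    radius = torch.linalg.norm(box_centers - center, dim=1)
    max_radius = torch.linalg.norm(center)           # distance from the centre to a corner
    return 1.0 + gamma * (radius / max_radius)       # peripheral boxes get larger weights

def weighted_regression_loss(iou, box_centers, image_size, gamma=1.0):
    """IoU-style box regression loss scaled by the peripheral focus coefficients."""
    weights = peripheral_focus_weights(box_centers, image_size, gamma)
    return ((1.0 - iou) * weights).mean()

iou = torch.tensor([0.8, 0.5, 0.3])
centers = torch.tensor([[320.0, 320.0], [600.0, 120.0], [20.0, 630.0]])
print(weighted_regression_loss(iou, centers, image_size=(640, 640)))
```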
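
The PFWL module in MDWR-Net is described as learning weights over only part of the feature dimensions instead of a full-channel convolution. The sketch below illustrates that idea by learning a gating map on a channel slice and passing the remaining channels through unchanged; the split ratio and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class PartialFeatureWeighting(nn.Module):
    """Illustrative PFWL-like block: learn weights on a channel subset, pass the rest through."""
    def __init__(self, channels, ratio=0.25):
        super().__init__()
        self.n_active = max(1, int(channels * ratio))
        self.weighting = nn.Sequential(
            nn.Conv2d(self.n_active, self.n_active, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        active, passive = torch.split(
            x, [self.n_active, x.shape[1] - self.n_active], dim=1)
        active = active * self.weighting(active)   # weights learned on the partial slice only
        return torch.cat([active, passive], dim=1)

x = torch.randn(1, 64, 128, 128)
print(PartialFeatureWeighting(64)(x).shape)
```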
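
The panoramic texture method combines GAN inversion with CLIP-based text guidance; the sketch below shows the typical structure of such a latent-code optimization loop. `generator` and `clip_model` are placeholders for pretrained models (not a specific library API), and the loss terms and weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def invert_with_clip(generator, clip_model, target_image, text_features,
                     latent_dim=512, steps=200, lr=0.05, clip_weight=0.5):
    """Sketch of CLIP-guided GAN inversion: optimize a latent code so the generated panorama
    matches the target image while agreeing with a text prompt in CLIP embedding space."""
    latent = torch.zeros(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        generated = generator(latent)                        # (1, 3, H, W) LDR panorama
        recon_loss = F.l1_loss(generated, target_image)      # keep structure consistent
        image_feat = F.normalize(clip_model.encode_image(generated), dim=-1)
        clip_loss = 1.0 - (image_feat * text_features).sum(dim=-1).mean()
        loss = recon_loss + clip_weight * clip_loss
        loss.backward()
        optimizer.step()
    return latent.detach()
```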
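
For the Video SCI study, one simple way to obtain datasets with the same data volume but differing temporal correlations is to assemble fixed-length clips from frames sampled at increasing temporal strides, or at random for the zero-correlation case. The NumPy sketch below follows that idea; the stride-based construction is an assumption about the general approach, not the paper's exact protocol.

```python
import numpy as np

def make_clips(video, clip_len=8, stride=1, shuffle=False, rng=None):
    """Build fixed-length clips whose temporal correlation drops as `stride` grows.

    video:   (T, H, W) grayscale frame stack.
    stride:  temporal gap between consecutive frames in a clip (1 = original correlation).
    shuffle: if True, draw frames at random, removing temporal correlation entirely.
    """
    rng = np.random.default_rng() if rng is None else rng
    t = video.shape[0]
    span = clip_len * stride
    clips = []
    for start in range(0, t - span + 1, span):
        if shuffle:
            idx = rng.choice(t, size=clip_len, replace=False)  # no temporal correlation
        else:
            idx = np.arange(start, start + span, stride)        # weakened correlation
        clips.append(video[idx])
    return np.stack(clips)                                      # (N, clip_len, H, W)

video = np.random.rand(128, 64, 64)
print(make_clips(video, stride=1).shape, make_clips(video, stride=4).shape)
```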