ColorAlignNet: a reference-based video colorization network with temporal aggregation

Wenzhi ZHU; Tong WANG

Journal of Donghua University (English Edition), 2026, Vol. 43, Issue 2: 94-102. DOI: 10.19884/j.1672-5220.202501012

Information Technology and Artificial Intelligence

research-article

Wenzhi ZHU ¹,², Tong WANG ¹,²,*

Abstract

Video colorization is an important technique for restoring old films. Current colorization methods work well on still images and low-motion video, but they often struggle with complex dynamic scenes. To address this problem, this study proposes ColorAlignNet, a reference-based video colorization network with temporal aggregation. The network uses source-reference attention to propagate color information from reference frames to grayscale frames, which improves color accuracy, and uses deformable convolution to align the features of adjacent frames, which enhances temporal consistency. Finally, a cyclic transformer module reconstructs the final prediction. Extensive experiments on the DAVIS and Videvo datasets show that ColorAlignNet outperforms other state-of-the-art methods on both the learned perceptual image patch similarity (LPIPS) and color distribution consistency (CDC) metrics.
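The color propagation step named in the abstract can be illustrated with a minimal sketch of source-reference attention. This is not the authors' implementation: the function name `source_reference_attention`, the cosine-similarity scoring, the `temperature` parameter, and the use of flattened per-position feature vectors are all assumptions made for illustration. Each grayscale-frame position acts as a query, each reference-frame position as a key, and the reference colors are the values that get blended.

```python
import numpy as np

def source_reference_attention(src_feat, ref_feat, ref_color, temperature=0.01):
    """Blend reference colors into source positions with softmax attention.

    src_feat:  (N_src, D) features at grayscale-frame positions (queries)
    ref_feat:  (N_ref, D) features at reference-frame positions (keys)
    ref_color: (N_ref, C) colors at reference positions (values)
    Hypothetical sketch; the scoring and temperature are assumptions.
    """
    # L2-normalize so the dot product is a cosine similarity in [-1, 1].
    q = src_feat / np.linalg.norm(src_feat, axis=1, keepdims=True)
    k = ref_feat / np.linalg.norm(ref_feat, axis=1, keepdims=True)
    scores = (q @ k.T) / temperature
    # Numerically stable softmax over the reference positions.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each source position receives a weighted average of reference colors.
    return weights @ ref_color, weights

rng = np.random.default_rng(0)
ref_feat = rng.normal(size=(6, 8))            # 6 reference positions, 8-dim features
ref_color = rng.uniform(-1, 1, size=(6, 2))   # e.g. two chroma channels per position
src_feat = ref_feat[[2, 5]]                   # source positions that match refs 2 and 5
colors, weights = source_reference_attention(src_feat, ref_feat, ref_color)
print(weights.argmax(axis=1), colors.shape)
```

A low temperature makes each source position copy color almost entirely from its best-matching reference position, while a higher value averages over several candidates. In a real network the features would come from learned encoders and the attention would span spatial and temporal positions.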

Keywords

deformable convolution / video colorization / Swin-transformer

Cite this article

Wenzhi ZHU, Tong WANG. ColorAlignNet: a reference-based video colorization network with temporal aggregation. Journal of Donghua University (English Edition), 2026, 43(2): 94-102. DOI: 10.19884/j.1672-5220.202501012
