ViT-LPATA: a vision transformer model for autism detection in children using facial images
Li Deng, Wenqiu Zhu, Yingbo Wu
Optoelectronics Letters, 2026, Vol. 22, Issue 6: 379-384.
To address the difficulty of recognizing subtle differences in facial biomarkers in children with autism, a vision transformer with learnable positional encoding and adaptive token aggregation (ViT-LPATA) was proposed as a predictive model for autism. ViT-LPATA combines a learnable positional encoding enhancement (LPEE) module with an adaptive token aggregation (ATA) module. The LPEE module dynamically captures facial geometric deformation features, and the ATA module strengthens the representation of features from pathological regions; together, they establish precise mappings of biomarker differences. Experiments on a publicly available autism facial dataset show that ViT-LPATA achieved the best performance in these experiments, with an accuracy of 99.2% and an area under the curve (AUC) of 0.940.
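The abstract does not specify the internals of LPEE or ATA, so the following is only a minimal, stdlib-only sketch under stated assumptions, not the authors' implementation. It assumes that LPEE behaves like the standard ViT learnable position embedding (one trainable vector added to each patch token) and that ATA can be approximated by softmax-weighted pooling of tokens with a learned scoring vector. The function names `add_learnable_positions`, `adaptive_token_aggregation`, and `predict_autism_prob`, as well as the linear classification head, are hypothetical, and parameter training is omitted.

```python
import math


def add_learnable_positions(tokens, pos_table):
    """Add one learnable positional vector to each patch token (ViT-style).

    Assumption: this stands in for LPEE; the paper's module may do more.
    tokens:    N token vectors, each a list of D floats (patch embeddings)
    pos_table: N positional vectors of length D, assumed to be trained jointly
               with the rest of the network (training is not shown here)
    """
    if len(tokens) != len(pos_table):
        raise ValueError("need exactly one positional vector per token")
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_table)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_token_aggregation(tokens, score_vec):
    """Pool N tokens into one vector with input-dependent weights.

    Assumption: this stands in for ATA. Each token is scored by its dot
    product with a learned vector; a softmax over the N scores gives weights
    that sum to 1, so higher-scoring tokens dominate the pooled vector.
    Returns (pooled_vector, weights).
    """
    scores = [sum(t * w for t, w in zip(tok, score_vec)) for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights


def predict_autism_prob(pooled, head_w, head_b):
    """Hypothetical linear head plus a numerically stable sigmoid."""
    z = sum(x * w for x, w in zip(pooled, head_w)) + head_b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Toy input: 4 patch tokens of dimension 3 (real ViT tokens are far larger).
    patches = [[0.2, 0.1, 0.0], [0.0, 0.5, 0.3], [0.9, 0.1, 0.4], [0.1, 0.1, 0.1]]
    positions = [[0.01 * i] * 3 for i in range(len(patches))]
    tokens = add_learnable_positions(patches, positions)
    pooled, weights = adaptive_token_aggregation(tokens, score_vec=[1.0, 0.0, 2.0])
    prob = predict_autism_prob(pooled, head_w=[0.5, -0.3, 0.8], head_b=-0.1)
    print("token weights:", [round(w, 3) for w in weights])
    print("P(autism) =", round(prob, 3))
```

With an all-zero scoring vector the weights are uniform and the pooled vector reduces to the plain token mean; a learned `score_vec` lets tokens from some facial regions outweigh others, which is one plausible reading of how an aggregation step could emphasize pathological regions.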
Tianjin University of Technology