Aerial images captured by unmanned aerial vehicles (UAVs) often feature rapidly changing perspectives, extremely small targets, heavy occlusion, and background interference, which challenge both the accuracy and the stability of real-time detection. To address these issues, this paper proposes frequency-affine-lightweight detection (FALDet), built on the you only look once version 8 (YOLOv8) framework, which improves detection in three ways. First, it replaces spatial pyramid pooling-fast (SPPF) with attention-based intra-scale feature interaction with high-low frequency attention (AIFI-HiLo), which models high-frequency local attention and low-frequency global attention in parallel, balancing edge details against long-range semantics at low computational overhead. Second, it replaces some of the C2f modules with local affine deformable convolution (LA-DCN), which uses a unified local affine sampling grid to reduce the degrees of freedom and improve stability under rotation, scaling, and translation. Third, it introduces a lightweight cross-scale dynamic detection head (LiteX-DyHead), which improves recall and localization consistency for dense small objects through lightweight preprocessing, dynamic/deformable alignment, and multi-scale fusion. Ablation and comparative experiments were conducted on VisDrone2019, the primary evaluation dataset, under a unified training strategy and input resolution. The results show that FALDet consistently improves on YOLOv8s in both mAP@0.5 and mAP@0.5:0.95 while maintaining a high frame rate (frames per second), supporting the effectiveness and practicality of the proposed method. Experiments on the SIMD dataset further support the method’s effectiveness.
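To make the degrees-of-freedom argument behind the local affine sampling grid concrete, the following is a minimal sketch, not the LA-DCN implementation. It assumes a 3×3 kernel and a hypothetical six-parameter encoding `theta = (a, b, c, d, ty, tx)` shared by all kernel taps; the function names and the parameter layout are illustrative assumptions and do not come from the paper. An unconstrained deformable convolution predicts one 2-D offset per tap (18 values for a 3×3 kernel), whereas a shared affine map needs only 6.

```python
import math

def base_grid(k=3):
    """Regular k x k kernel sampling positions (dy, dx), centered at 0."""
    r = k // 2
    return [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]

def affine_offsets(theta, k=3):
    """Per-tap offsets induced when all taps share one affine map.

    theta = (a, b, c, d, ty, tx) is a hypothetical encoding (an assumption):
    a tap at p = (dy, dx) moves to A @ p + t, with A = [[a, b], [c, d]]
    and t = (ty, tx). The returned offset is the new position minus p.
    """
    a, b, c, d, ty, tx = theta
    offsets = []
    for dy, dx in base_grid(k):
        new_y = a * dy + b * dx + ty
        new_x = c * dy + d * dx + tx
        offsets.append((new_y - dy, new_x - dx))
    return offsets

def rotation_scale(angle_rad, scale=1.0, ty=0.0, tx=0.0):
    """Build theta for a rotation in the (y, x) plane with uniform scaling."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    return (scale * cos_a, -scale * sin_a, scale * sin_a, scale * cos_a, ty, tx)

identity = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
free_params = 2 * len(base_grid(3))  # one (dy, dx) offset per tap
affine_params = len(identity)        # one affine map shared by all taps
```

With the identity parameters, every offset is zero and the sampling grid reduces to that of a regular convolution; rotation, scaling, and translation each move all taps coherently rather than independently.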