Text-based motion generation enhances the flexibility of human motion design and editing, enabling applications in animation, virtual reality, and beyond. However, diffusion-based methods for text-to-motion generation often produce low-quality results. Conditional autoregressive approaches built on vector-quantized variational autoencoders (VQ-VAEs) struggle with quantization error and therefore resort to hierarchical or residual quantization. This lengthens the quantized token sequences, forcing the model to predict more tokens from the text input and complicating high-quality generation. To address this, we introduce HyT2M, a novel text-to-motion model based on a hybrid VQ-VAE framework. Our approach decomposes motion into global and local components: local motion is quantized with a single vector quantization layer to preserve fine details, while global motion is reconstructed via residual vector quantization (RVQ) to compensate for errors caused by the limited perceptual range of the local components. This hybrid strategy shortens token sequences while maintaining high reconstruction quality, easing the burden on the second-stage model. Furthermore, we develop a conditional masked transformer with a hybrid cross-guidance module that leverages global motion tokens to enhance local motion predictions, improving accuracy and usability for motion editing. Experiments on the HumanML3D, KIT-ML, and Motion-X datasets indicate that HyT2M achieves competitive results and excels in tasks such as motion completion and long-motion generation.
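The residual quantization idea underlying the trade-off above can be sketched in a few lines: each stage quantizes the residual left by the previous stage, so reconstruction error shrinks per stage at the cost of one extra token per stage. This is an illustrative toy with hand-written codebooks, not the HyT2M implementation.

```python
# Toy residual vector quantization (RVQ) sketch: more stages = lower
# error but a longer token sequence per motion frame.

def nearest(codebook, v):
    """Return (index, codeword) of the codeword closest to v (squared L2)."""
    best = min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))
    return best, codebook[best]

def rvq_encode(v, codebooks):
    """Quantize v with a stack of codebooks; return one token index per stage."""
    tokens, residual = [], list(v)
    for cb in codebooks:
        i, cw = nearest(cb, residual)
        tokens.append(i)
        residual = [r - c for r, c in zip(residual, cw)]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected codewords across stages to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for t, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[t])]
    return out
```

With a coarse first codebook and a fine second one, the second stage corrects the residual left by the first, which is exactly why RVQ improves reconstruction at the price of longer token sequences.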
Recently, 3D Gaussian Splatting has achieved impressive performance by explicitly representing scenes and synthesizing high-quality novel views. However, reconstructing accurate Gaussian geometry becomes extremely challenging when using pure RGB images with few-shot inputs. We propose 2.5D-GS, which projects Gaussians into structured 2D spaces and utilizes the 2.5D representations from monocular models to separately optimize the projected depth and normal maps, ultimately achieving consistent and accurate Gaussian geometry. First, we ensure the spatial accuracy of Gaussians with Depth Plane Constraints. Since monocular depth maps capture only rough shapes, Normal Plane Constraints are then applied to refine the orientations of the Gaussians and enhance surface connectivity. Additionally, we introduce Density Ratio-Based Pruning to eliminate redundant Gaussians generated during optimization, leading to compact and efficient scene representations. Extensive experiments on the LLFF, DTU, Blender, and Mip-NeRF360 datasets demonstrate that 2.5D-GS accurately reconstructs scene geometry and renders high-quality novel views with sparse inputs.
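One plausible reading of density-ratio pruning is a relative-contribution test: a Gaussian is dropped when its opacity is only a small fraction of the total density in its neighborhood. The grid bucketing and the opacity-ratio threshold below are assumptions for illustration; the paper's exact criterion may differ.

```python
# Hypothetical density-ratio pruning sketch: bucket Gaussians into grid
# cells, then drop those whose opacity is a small fraction of the cell's
# total (ratio < tau). Illustrative only, not the 2.5D-GS criterion.
from collections import defaultdict

def prune_density_ratio(gaussians, cell, tau):
    """gaussians: list of {'pos': (x, y, z), 'opacity': float}."""
    cells = defaultdict(list)
    for g in gaussians:
        key = tuple(int(c // cell) for c in g["pos"])
        cells[key].append(g)
    kept = []
    for group in cells.values():
        total = sum(g["opacity"] for g in group)
        kept.extend(g for g in group if g["opacity"] / total >= tau)
    return kept
```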
Point cloud completion is a fundamental task in 3D perception and 3D vision. Existing point cloud completion methods typically rely on supervised learning with limited 3D data, resulting in poor generalization and suboptimal recovery in scenarios involving complex shape structures or large missing regions. To overcome these limitations, we propose HybridPC, a novel zero-shot point cloud completion framework that achieves high-fidelity 3D reconstruction without any 3D supervision or task-specific training. HybridPC leverages powerful 2D diffusion priors and a progressive implicit-explicit architecture to address severe incompleteness and complex geometries. The framework comprises three key stages: 1) Edge-aware neural field initialization: ControlNet-guided Stable Diffusion synthesizes multi-view images conditioned on text prompts and orthographic edge projections of the incomplete point cloud, providing strong shape constraints to initialize a coarse NeRF field via Score Distillation Sampling (SDS). 2) Multi-view diffusion collaborative completion: A pre-trained multi-view diffusion model enforces cross-view consistency, collaboratively completing the entire neural radiance field (NeRF) with globally coherent geometry. To reconcile gradient conflicts between ControlNet and multi-view diffusion during joint SDS optimization, a PCGrad-based multi-objective optimization strategy is introduced to balance the structural and semantic guidance, yielding higher-fidelity shape completion. 3) Geometry-aware tetrahedral refinement: The implicit field is converted into a tetrahedral mesh using DMTet, which is further refined via implicit SDS-based normal optimization and explicit geometric constraints on the mesh surface, ensuring structural fidelity to the partial input. Extensive experiments on the ShapeNetPart and Redwood datasets demonstrate that HybridPC outperforms existing supervised and zero-shot methods in both qualitative and quantitative comparisons.
Specifically, HybridPC preserves the input structure more faithfully, completes missing regions more accurately, and shows stronger generalization ability, with particularly significant improvements on real-world scans from the Redwood dataset. Our results show the strong potential of coupling 2D diffusion priors with 3D geometric modeling for scalable, training-free point cloud completion.
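The PCGrad-based strategy in stage 2 resolves gradient conflicts by projection: when two objectives' gradients point in opposing directions (negative dot product), one is projected onto the normal plane of the other so the conflicting component is removed. The standard two-objective PCGrad rule can be sketched as follows (flat Python lists stand in for parameter tensors):

```python
# Standard PCGrad projection for two objectives (here: the structural
# and semantic SDS gradients). Gradients are flat lists of floats.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pcgrad(g1, g2):
    """If g1 conflicts with g2 (dot < 0), project g1 onto the normal
    plane of g2; otherwise return g1 unchanged."""
    d = dot(g1, g2)
    if d < 0:
        scale = d / dot(g2, g2)
        return [a - scale * b for a, b in zip(g1, g2)]
    return list(g1)
```

After projection, the returned gradient is orthogonal to the conflicting objective, so a step along it no longer directly increases that objective's loss.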
The visually-rich long document understanding task requires accurate extraction of answers from documents such as manuals and academic papers, which often consist of dozens of text-rich images. Recently, multimodal large language models (MLLMs) have demonstrated strong performance on this task. To alleviate the inefficiency of MLLMs as document length grows, retrieval-augmented methods select key pages and conduct answer generation only on the retrieved pages, reducing computational cost. Despite this significant progress, existing methods still face two inherent challenges. First, relevant pages are usually retrieved based on textual content alone, neglecting spatial layout information. Second, coarse-grained retrieval at the page level leaves a semantic gap between the retrieved pages and the query. In this paper, we propose PDU, a position-aware fine-grained retrieval-augmented model for long document understanding. Specifically, to bridge the semantic gap between the query and full pages, we first develop a fine-grained document encoding module that partitions each document page into chunks and encodes them with MLLMs. Then, we design a position-enhanced similarity calculation approach to compute the similarity between the query and each document chunk for retrieving the most relevant ones. To improve the model's understanding of document layout and structure, we further encode the bounding-box coordinates and page number of each document chunk and add them to the MLLM-derived visual features. Next, we propose a chunk-to-page answer generation method that maps the retrieved chunks back to their corresponding pages and generates the final answer. To support training, we construct a minimal answerable region (MAR) dataset using a bidirectional approximation algorithm to precisely link queries to relevant document chunks.
Our method achieves strong results on public benchmarks, highlighting the value of incorporating layout information in retrieval-augmented document understanding.
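The position-enhanced retrieval step can be sketched minimally: augment each chunk's content embedding with normalized layout features, then rank chunks by cosine similarity to the query. The concatenation encoding below is an assumption for illustration; PDU fuses learned position encodings into MLLM-derived features rather than raw coordinates.

```python
# Minimal sketch of position-aware chunk retrieval. The layout encoding
# (plain concatenation of normalized bbox + page) is a stand-in for the
# learned position encoding described in the abstract.
import math

def with_position(feat, bbox, page, page_count):
    """Append normalized bounding-box coords and page position to a
    chunk's content embedding."""
    x0, y0, x1, y1 = bbox
    return list(feat) + [x0, y0, x1, y1, page / max(page_count, 1)]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def retrieve(query_vec, chunks, k=2):
    """Rank chunks (dicts with 'id' and 'vec') and return top-k chunk ids."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:k]]
```

The retrieved chunk ids would then be mapped back to their source pages for answer generation, as in the chunk-to-page step.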
Artists skillfully shape sheets of paper in 3D so that the shadows they cast form different images, a creative art form called Paper Shadow Art. Central to this art is finding a small set of developable surfaces whose shadow images match the desired images. However, it is challenging for users to imagine and generate developable surfaces for their desired images. To this end, we present a computational method for paper shadow art design. The key is to decompose the problem of generating piecewise developable surfaces for multiple images into subproblems, each handling a single image, and then merge the generated piecewise developable surfaces into the final result. Specifically, given an image, we optimize a height field to approach developability under two requirements: 1) for this image, there is no shadow deviation; 2) for the other images, the shadows (projected onto their planes) of the optimized height field fall within them. To better satisfy the second requirement, we develop an iterative algorithm that alternates in each iteration between optimizing the height fields and deforming the input images, both to reduce the deviations. We demonstrate the effectiveness of our method on various examples. Its practicality is further demonstrated by seven physical results fabricated from paper sheets.
The coarse-to-fine feature matching paradigm has proven highly effective in point cloud registration. This paradigm progressively propagates feature correspondences from the coarse level to the fine level through hierarchical feature extraction. However, it is limited by the low discriminability of coarse-level features due to insufficient modeling of global geometric structures, which results in unreliable initial correspondences. Furthermore, relying on single-level features leads to the irreversible loss of fine-grained information, especially in low-overlap scenarios. These limitations present significant challenges in maintaining global geometric consistency and result in a high incidence of feature mismatches. To address these limitations, we propose the HFA-Transformer, a novel Hierarchical Feature Aggregation Transformer framework with two key innovations: (1) a feature enhancement mechanism that jointly encodes spatial and channel-wise characteristics of point clouds, enriching the global feature representation; (2) a Hierarchical Feature Aggregation Module that integrates hierarchical features to refine coarse-level correspondence estimation. Extensive experiments conducted on both indoor and outdoor benchmarks validate the superior performance and robustness of the proposed HFA-Transformer.
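The coarse-to-fine paradigm itself can be sketched simply: match coarse-level features first, then match fine-level points only within matched coarse pairs. This illustrates the general paradigm (and why unreliable coarse matches poison the fine level), not the HFA-Transformer architecture.

```python
# Sketch of the coarse-to-fine matching paradigm: fine matches are only
# searched inside matched coarse pairs, so a bad coarse match cannot be
# recovered at the fine level.

def nn(feats_a, feats_b):
    """For each feature in a, the index of its L2-nearest neighbour in b."""
    def d2(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return [min(range(len(feats_b)), key=lambda j: d2(f, feats_b[j]))
            for f in feats_a]

def coarse_to_fine(coarse_a, coarse_b, fine_a, fine_b, assign_a, assign_b):
    """assign_a[k] = coarse index that fine point k of cloud A belongs to
    (likewise assign_b). Returns fine-level (i, j) correspondence pairs."""
    matches = []
    for i, j in enumerate(nn(coarse_a, coarse_b)):
        pa = [k for k, a in enumerate(assign_a) if a == i]
        pb = [k for k, b in enumerate(assign_b) if b == j]
        if not pa or not pb:
            continue
        sub = nn([fine_a[k] for k in pa], [fine_b[k] for k in pb])
        matches.extend((pa[u], pb[v]) for u, v in enumerate(sub))
    return matches
```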
As multi-source data sharing becomes increasingly prevalent in the digital economy, multi-source multi-client dynamic searchable symmetric encryption (MM-DSSE) has received significant attention. However, the complex key management of MM-DSSE exacerbates the cascading effect of key compromise risks. Existing MM-DSSE schemes satisfy only forward privacy and rely on the idealized assumption that keys are never compromised. We study the key compromise threat in MM-DSSE and formally define post-compromise security for MM-DSSE with respect to leakage functions. We introduce Mosaic, a framework for MM-DSSE that supports non-interactive key updates for data sources and clients. Mosaic ensures data security even in the event of key compromise at any client, data source, or management center. Additionally, we construct MosaicR, an instance of Mosaic that supports range search. Both Mosaic and MosaicR satisfy forward and type-II backward privacy. We conduct comprehensive experimental evaluations on real-world datasets. The results show that Mosaic and MosaicR provide strong security with competitive performance. Compared with Bamboo, the state-of-the-art single-user DSSE scheme with post-compromise security, Mosaic achieves a 79.21% improvement in total search efficiency. The index storage overhead of MosaicR is reduced by 49.98% compared with the range search scheme (RS)².
The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods designed for short texts often fail on long-form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large-scale long-form datasets, multi-agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset, and develop MAD-Fact, a debate-based multi-agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while Chinese-developed models excel on Chinese content. Our work provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs, guiding their safe deployment in sensitive domains.
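A fact importance hierarchy naturally induces an importance-weighted factuality score: claims are weighted by their significance, so getting a central claim wrong costs more than a peripheral one. The sketch below shows the aggregation idea only; the specific weights and scoring rule are assumptions, not the paper's metric.

```python
# Illustrative importance-weighted factuality score: each claim carries
# an importance weight and a verifier verdict; the score is the
# weighted fraction of supported claims.

def weighted_factuality(claims):
    """claims: list of (importance_weight, is_supported) pairs."""
    total = sum(w for w, _ in claims)
    if total == 0:
        return 0.0
    return sum(w for w, ok in claims if ok) / total
```

Under this rule, one unsupported high-weight claim lowers the score more than several unsupported low-weight ones, which is the behaviour a fact importance hierarchy is meant to capture.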
Honeywords are decoys stored alongside real passwords in credential databases. A real-world application of honeywords is honeyword-based authentication, which detects malicious login attempts by exploiting these deceptive fake passwords. Existing honeyword-based authentication systems face two key restrictions: 1) the authentication server is a single point of full trust (i.e., it must not be intruded upon or colluded with by attackers), and 2) the stored real passwords are vulnerable to tweaking attacks once attackers learn the users' passwords from other sources. To address these challenges, we introduce SecHive, a secure three-layer honeyword-based authentication system with a hash-query server. SecHive preserves real password security even when the authentication server is semi-honest rather than fully honest. Moreover, we design a new honeyword generation method, embedded in SecHive, to detect tweaking attacks effectively. Our extensive experimental results show that SecHive improves security over state-of-the-art honeyword-based authentication systems, achieving in particular at least a 7.39x improvement in the accuracy of detecting tweaking attacks.
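The classic honeyword mechanism (in the Juels-Rivest style) that systems like SecHive build on can be sketched briefly: the real password is shuffled among decoys, the index of the real one is kept by a separate checker, and a login with a decoy raises an alarm. SecHive's three-layer, semi-honest design goes well beyond this, but the core index check looks like:

```python
# Classic honeyword check (Juels-Rivest style), for illustration only:
# the secret index would live at a separate honeychecker, not with the
# sweetword list.
import random

def make_sweetwords(real, honeywords, rng):
    """Shuffle the real password among decoys; return the list and the
    secret index of the real password."""
    sweet = honeywords + [real]
    rng.shuffle(sweet)
    return sweet, sweet.index(real)

def check_login(submitted, sweetwords, real_index):
    """'ok' for the real password, 'alarm' on a honeyword hit (likely
    database breach), 'reject' otherwise."""
    if submitted not in sweetwords:
        return "reject"
    return "ok" if sweetwords.index(submitted) == real_index else "alarm"
```

An attacker who steals the credential database sees all sweetwords but not the index, so submitting any stolen decoy triggers the alarm, which is the detection signal honeyword systems rely on.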