To advance research into cognition-level visual understanding, i.e., making accurate inferences based on a thorough understanding of visual details, visual commonsense reasoning (VCR) has been proposed. Whereas traditional visual question answering only requires models to select correct answers, VCR requires models to select both the correct answers and the correct rationales. Recent research into human cognition has indicated that brain function, or cognition, can be considered as a global and dynamic integration of local neuron connectivity, which is helpful for solving specific cognition tasks. Inspired by this idea, we propose a directional connective network that achieves VCR by dynamically reorganizing visual neuron connectivity contextualized by the meaning of questions and answers, and by leveraging directional information to enhance reasoning ability. Specifically, we first develop a GraphVLAD module that captures visual neuron connectivity to fully model visual content correlations. Then, a contextualization process is proposed to fuse sentence representations with visual neuron representations. Finally, based on the output of contextualized connectivity, we propose directional connectivity, including a ReasonVLAD module, to infer answers and rationales. Experimental results on the VCR dataset and visualization analysis demonstrate the effectiveness of our method.
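For concreteness, below is a minimal numpy sketch of the kind of VLAD-style aggregation and connectivity the GraphVLAD module builds on: regional features are soft-assigned to cluster centers ("visual neurons"), and pairwise similarity between the aggregated neurons yields an adjacency for graph-style reasoning. Shapes, the cluster count, and the cosine-similarity adjacency are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def soft_vlad(features, centers):
    """features: (N, D) regional visual features; centers: (K, D) cluster
    ('visual neuron') centers. Returns a (K, D) descriptor of soft-assigned
    residuals, in the spirit of NetVLAD-style aggregation."""
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, K)
    d2 -= d2.min(axis=1, keepdims=True)                                # stabilize exp
    assign = np.exp(-d2)
    assign /= assign.sum(axis=1, keepdims=True)                        # soft assignment
    resid = features[:, None, :] - centers[None, :, :]                 # (N, K, D)
    return (assign[:, :, None] * resid).sum(axis=0)                    # (K, D)

def connectivity(vlad):
    # Cosine similarity between aggregated 'neurons': a (K, K) adjacency
    # usable for graph-style reasoning over visual content correlations.
    n = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-8)
    return n @ n.T

rng = np.random.default_rng(0)
feats, cents = rng.normal(size=(36, 64)), rng.normal(size=(8, 64))
print(connectivity(soft_vlad(feats, cents)).shape)   # (8, 8)
```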
Object detection is one of the most active research directions in computer vision; it has made impressive progress in academia and has many valuable industrial applications. However, mainstream detection methods still have two shortcomings: (1) even a model well trained on large amounts of data generally cannot be used across different kinds of scenes; (2) once deployed, a model cannot autonomously evolve with the unlabeled scene data it accumulates. To address these problems, and inspired by visual knowledge theory, we propose a novel scene-adaptive evolution algorithm for unsupervised video object detection that uses the concept of object groups to mitigate the impact of scene changes. We first extract a large number of object proposals from unlabeled data with a pre-trained detection model. Second, we build a visual knowledge dictionary of object concepts by clustering the proposals, where each cluster center represents an object prototype. Third, we examine the relations between clusters and the object information of different groups, and propose a graph-based group information propagation strategy to determine the category of an object concept, which effectively distinguishes positive from negative proposals. With these pseudo labels, we can easily fine-tune the pre-trained model. The effectiveness of the proposed method is verified through various experiments, and significant improvements are achieved.
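A rough sketch of the dictionary-building and propagation steps follows: proposal features are clustered into prototypes, and seed labels diffuse over a similarity graph between prototypes to produce pseudo labels. The feature source, cluster count, and the specific diffusion rule are assumptions for illustration, not the paper's exact strategy.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_prototype_dictionary(proposal_feats, n_concepts=16):
    # Each cluster center acts as an object prototype in the dictionary.
    km = KMeans(n_clusters=n_concepts, n_init=10, random_state=0).fit(proposal_feats)
    return km.cluster_centers_, km.labels_

def propagate_group_labels(prototypes, seed, alpha=0.8, steps=20):
    # seed: +1 (positive), -1 (negative), 0 (unknown) per prototype. Labels
    # diffuse over a row-normalized cosine-similarity graph while keeping a
    # pull toward the seeds, then are thresholded into pseudo labels.
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    w = np.clip(p @ p.T, 0.0, None)
    w /= w.sum(axis=1, keepdims=True)
    f = seed.astype(float)
    for _ in range(steps):
        f = alpha * (w @ f) + (1 - alpha) * seed
    return np.sign(f)   # pseudo labels usable for fine-tuning

rng = np.random.default_rng(1)
feats = rng.normal(size=(500, 128))          # stand-in proposal features
protos, assign = build_prototype_dictionary(feats)
seed = np.zeros(len(protos)); seed[0], seed[1] = 1, -1
print(propagate_group_labels(protos, seed))
```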
Three-dimensional (3D) reconstruction of shapes is an important research topic in the fields of computer vision, computer graphics, pattern recognition, and virtual reality. Existing 3D reconstruction methods usually suffer from two bottlenecks: (1) they involve multiple manually designed stages, which can lead to cumulative errors, and they can hardly learn semantic features of 3D shapes automatically; (2) they depend heavily on the content and quality of images, as well as on precisely calibrated cameras. As a result, it is difficult to improve the reconstruction accuracy of these methods. 3D reconstruction methods based on deep learning overcome both bottlenecks by automatically learning semantic features of 3D shapes from low-quality images using deep networks. However, while these methods have various architectures, in-depth analyses and comparisons of them have so far been unavailable. We present a comprehensive survey of 3D reconstruction methods based on deep learning. First, based on the architecture of the underlying deep learning model, we divide these methods into four types, namely recurrent neural network based, deep autoencoder based, generative adversarial network based, and convolutional neural network based methods, and carefully analyze the corresponding methodologies. Second, we investigate in detail four representative databases commonly used by the above methods. Third, we give a comprehensive comparison of 3D reconstruction methods based on deep learning, covering the results of different methods on the same database, the results of each method across different databases, and the robustness of each method with respect to the number of views. Finally, we discuss future directions for the development of 3D reconstruction methods based on deep learning.
Rule-based autonomous driving systems may suffer from increased complexity as large numbers of rules become intercoupled, so many researchers are exploring learning-based approaches. Reinforcement learning (RL) has been applied in designing autonomous driving systems because of its outstanding performance on a wide variety of sequential control problems. However, poor initial performance is a major challenge to the practical implementation of an RL-based autonomous driving system: RL training requires extensive data before the model achieves reasonable performance, making an RL-based model inapplicable in real-world settings, particularly when data are expensive. We propose an asynchronous supervised learning (ASL) method for the RL-based end-to-end autonomous driving model to address this problem of poor initial performance before training the model in real-world settings. Specifically, prior knowledge is introduced in the ASL pre-training stage by asynchronously executing multiple supervised learning processes in parallel on multiple driving demonstration datasets. After pre-training, the model is deployed on a real vehicle to be further trained by RL, so that it adapts to the real environment and continues to push beyond its performance limit. The pre-training method is evaluated on the race car simulator TORCS (The Open Racing Car Simulator) to verify that it reliably improves the initial performance and convergence speed of an end-to-end autonomous driving model in the RL training stage. In addition, a real-vehicle verification system is built to verify the feasibility of the proposed pre-training method in a real-vehicle deployment. Simulation results show that using demonstrations during a supervised pre-training stage yields significant improvements in initial performance and convergence speed in the RL training stage.
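The following toy sketch illustrates the ASL pre-training idea: several supervised learners run in parallel, each on its own demonstration set, and asynchronously apply lock-free updates to a shared policy. A linear policy and synthetic demonstrations stand in for the real end-to-end driving network; all names and sizes are illustrative assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(2)
true_w = rng.normal(size=8)            # "expert" behavior to imitate
demos = []
for _ in range(4):                     # four demonstration datasets
    x = rng.normal(size=(256, 8))      # states (e.g., sensor readings)
    demos.append((x, x @ true_w))      # paired with expert actions

w = np.zeros(8)                        # shared policy weights

def worker(data, steps=500, lr=0.01):
    # One supervised learner: SGD on its own demonstration set, applying
    # lock-free (Hogwild-style) asynchronous updates to the shared weights.
    x, y = data
    local_rng = np.random.default_rng()
    global w
    for _ in range(steps):
        i = local_rng.integers(len(x))
        grad = (x[i] @ w - y[i]) * x[i]   # per-sample squared-error gradient
        w = w - lr * grad

with ThreadPoolExecutor(max_workers=4) as ex:
    list(ex.map(worker, demos))

x0, y0 = demos[0]
print("pre-training MSE:", np.mean((x0 @ w - y0) ** 2))
```

After such a stage, the pre-trained weights would serve as the initialization for RL fine-tuning rather than starting from scratch.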
In an unmanned aerial vehicle ad-hoc network (UANET), sparse and rapidly moving unmanned aerial vehicles (UAVs)/nodes can dynamically change the UANET topology, which may degrade UANET service performance. In this study, for path planning in rapidly changing UAV swarms, we propose a dynamic value iteration network (DVIN) model, trained with the episodic Q-learning method on the connection information of UANETs, that generates a state value spread function and thereby enables UAVs/nodes to adapt to novel physical locations. We then evaluate the performance of the DVIN model against the non-dominated sorting genetic algorithm II (NSGA-II) and the exhaustive method. Simulation results demonstrate that the proposed model significantly reduces the decision-making time for UAV/node path planning while maintaining a high average success rate.
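As a minimal illustration of the value-spreading computation that a value iteration network unrolls, the sketch below propagates state values over a small grid toward a rewarded cell so that a node can greedily follow the values to plan its path. The grid size, reward placement, and 4-neighborhood are illustrative assumptions, not the DVIN architecture itself.

```python
import numpy as np

def neighbor_max(v):
    # Max over the 4-neighborhood, with -inf padding so values do not wrap.
    big = np.full((v.shape[0] + 2, v.shape[1] + 2), -np.inf)
    big[1:-1, 1:-1] = v
    return np.max([big[:-2, 1:-1], big[2:, 1:-1],
                   big[1:-1, :-2], big[1:-1, 2:]], axis=0)

def value_iteration(reward, gamma=0.9, iters=50):
    # Repeated Bellman backups spread the target's value across the grid.
    v = np.zeros_like(reward)
    for _ in range(iters):
        v = reward + gamma * neighbor_max(v)
    return v

reward = np.zeros((8, 8)); reward[6, 6] = 1.0   # target/relay cell
v = value_iteration(reward)
pos = (1, 1)
for _ in range(12):                             # greedily follow the values
    r, c = pos
    nbrs = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < 8 and 0 <= c + dc < 8]
    pos = max(nbrs, key=lambda p: v[p])
print("reached:", pos)                          # ends at the rewarded cell (6, 6)
```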
Extracting discriminative speaker-specific representations from speech signals and transforming them into fixed-length vectors are key steps in speaker identification and verification systems. In this study, we propose a latent discriminative representation learning method for speaker recognition; by this we mean that the learned representations are not only discriminative but also relevant. Specifically, we introduce an additional speaker embedding lookup table to explore the relevance between different utterances from the same speaker. Moreover, a reconstruction constraint intended to learn a linear mapping matrix is introduced to make the representations discriminative. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods on the Apollo dataset used in the Fearless Steps Challenge at INTERSPEECH 2019 and on the TIMIT dataset.
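A numpy sketch of the two ingredients named above follows: a speaker embedding lookup table that ties together utterances of the same speaker (relevance), and a linear-mapping reconstruction constraint (discriminability). The loss weights, dimensions, and the identity-style form of the reconstruction term are assumptions; the paper's exact constraint may differ.

```python
import numpy as np

rng = np.random.default_rng(3)
D, S, N = 32, 10, 200
reps = rng.normal(size=(N, D))                   # utterance representations
spk = rng.integers(0, S, size=N)                 # speaker id per utterance
table = rng.normal(size=(S, D))                  # speaker embedding lookup table
W = np.eye(D) + 0.01 * rng.normal(size=(D, D))   # linear mapping matrix

def losses(reps, spk, table, W):
    # Relevance term: utterances should sit near their speaker's table entry,
    # linking different utterances of the same speaker.
    relevance = np.mean((reps - table[spk]) ** 2)
    # Reconstruction term: the linear map W should reconstruct the
    # representation (one plausible instantiation of the constraint).
    reconstruction = np.mean((reps @ W.T - reps) ** 2)
    return relevance, reconstruction

print(losses(reps, spk, table, W))
```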
Gait recognition has significant potential for remote human identification, but it is easily influenced by identity-unrelated factors such as clothing, carrying conditions, and view angles. Many gait templates have been presented that can effectively represent gait features, each with its own advantages and each capturing different prominent information. In this paper, gait template fusion is proposed to improve on classical representative gait templates (such as the gait energy image), which capture incomplete information and are sensitive to changes in contour. We also present a partition method to reflect the different gait habits of different body parts of each pedestrian: the fused template is cropped into three parts (head, trunk, and leg regions) according to human body structure, and the three parts are then fed into a convolutional neural network to learn merged features. We present an extensive empirical evaluation on the CASIA-B dataset and compare the proposed method with existing ones. The results show good accuracy and robustness of the proposed method for gait recognition.
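The body-part partition step can be illustrated in a few lines: a fused template image is cropped row-wise into head, trunk, and leg strips before each strip is fed to the network. The split ratios below are illustrative assumptions, not the paper's calibrated proportions.

```python
import numpy as np

def partition_template(template, head_frac=0.2, trunk_frac=0.45):
    # Crop an (H, W) gait template into head / trunk / leg strips by row.
    h = template.shape[0]
    h1 = int(h * head_frac)
    h2 = int(h * (head_frac + trunk_frac))
    return template[:h1], template[h1:h2], template[h2:]

fused = np.random.rand(128, 88)              # stand-in for a fused gait template
head, trunk, legs = partition_template(fused)
print(head.shape, trunk.shape, legs.shape)   # (25, 88) (58, 88) (45, 88)
```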
We deal with event-triggered H∞ controller design for discrete-time piecewise-affine systems subject to actuator saturation. By considering saturation information, a novel event-triggered strategy is proposed to conserve communication resources. A linear matrix inequality based condition is derived from a piecewise Lyapunov function. This condition guarantees the stability of the closed-loop system with a certain H∞ performance index and reduces the number of transmitted signals. Numerical examples are given to show the efficiency of our method.
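For orientation, event-triggered schemes of this kind typically release a measurement to the controller only when the deviation from the last transmitted state becomes large. A common relative-threshold form is shown below; this is a generic illustration, not the specific saturation-aware condition derived in the paper.

```latex
% Transmit x(k) only when the error since the last transmission instant k_i
% exceeds a tunable fraction \sigma > 0 of the current state norm:
\| x(k) - x(k_i) \|_2^2 \;\geq\; \sigma \, \| x(k) \|_2^2
```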
In this study, we focus mainly on the problem of finding the minimum-length path of a fixed-wing unmanned aerial vehicle through a set of circular regions, a problem referred to as the Dubins traveling salesman problem with neighborhoods (DTSPN). Algorithms developed in the literature for solving DTSPN are either computationally demanding or generate low-quality solutions. To achieve a better trade-off between solution quality and computational cost, an efficient gradient-free descent method is designed. The core idea of the descent method is to decompose DTSPN into a series of subproblems, each of which consists of finding the minimum-length path of a Dubins vehicle from one configuration to another via an intermediate circular region. By analyzing the geometric properties of the subproblems, we use a bisection method to solve them. As a result, the descent method can efficiently address DTSPN by successively solving this series of subproblems. Finally, several numerical experiments are carried out to compare the descent method with several existing algorithms.
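A sketch of one such subproblem follows: choose the angle theta at which the path touches the intermediate circle so that the total length q0 -> waypoint -> q1 is minimized, using bisection on the sign of the derivative. Here dubins_length() is a placeholder (Euclidean distance stands in for a true Dubins shortest-path routine), and the heading choice and unimodality assumption are illustrative, not the paper's geometric analysis.

```python
import math

def dubins_length(q0, q1, rho):
    # Placeholder: Euclidean distance instead of the true Dubins length
    # (a real implementation would respect headings and turning radius rho).
    return math.hypot(q1[0] - q0[0], q1[1] - q0[1])

def subproblem_length(theta, q0, q1, center, r, rho):
    # Total length through the waypoint at angle theta on the circular region.
    wp = (center[0] + r * math.cos(theta), center[1] + r * math.sin(theta), q1[2])
    return dubins_length(q0, wp, rho) + dubins_length(wp, q1, rho)

def bisect_min(f, lo, hi, tol=1e-6, eps=1e-7):
    # Bisection on the sign of a finite-difference derivative; assumes the
    # derivative changes sign once (a locally unimodal interval).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid + eps) - f(mid - eps) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

q0, q1 = (0.0, 0.0, 0.0), (10.0, 2.0, 0.0)
f = lambda t: subproblem_length(t, q0, q1, (5.0, 4.0), 1.0, rho=1.0)
t_star = bisect_min(f, -math.pi, math.pi)
print(t_star, f(t_star))
```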
Network on chip (NoC) is an infrastructure that provides a communication platform for multiprocessor chips. The wormhole-switching method, which shares resources, is used to increase its efficiency; however, this sharing can lead to congestion. Dealing with congestion consumes more energy and correspondingly increases power consumption; consuming more power in turn generates more heat and increases thermal fluctuations, which shorten the life span of the infrastructure and, more importantly, degrade the network's performance. Given these complications, providing a method that controls congestion is a significant design challenge. In this paper, a fuzzy logic congestion control routing algorithm is presented to enhance NoC performance under congestion. To avoid congestion, the proposed algorithm employs, as selection parameters, the occupied input buffer and the total occupied buffers of the neighboring nodes, along with the maximum possible path diversity with minimal path length from immediate neighbors to the destination. To enhance the path selection function, the ability of fuzzy logic to handle uncertainty is exploited. As a result, the average delay, power consumption, and maximum delay are reduced by 14.88%, 7.98%, and 19.39%, respectively. Additionally, the proposed method enhances the throughput and the total number of packets received by 14.9% and 11.59%, respectively. To show its significance, the proposed algorithm is also examined under a transpose traffic pattern, where the average delay improves by 15.3%. The average delay is further reduced by 3.8% in TMPEG-4 (treble MPEG-4), 36.6% in QPIP (quadruplicate PIP), and 20.9% in TVOPD (treble VOPD).
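The sketch below shows the general shape of such a fuzzy selection function: each candidate output port is scored from the neighbor's free buffer share and its minimal-path diversity toward the destination, via triangular membership functions and a simple min (AND) rule. The membership breakpoints and two-input rule base are illustrative assumptions, not the paper's tuned rule set.

```python
def tri(x, a, b, c):
    # Triangular membership function with peak at b and support (a, c).
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def port_score(free_buffer, path_diversity):
    # Inputs normalized to [0, 1]: share of free buffer slots at the neighbor,
    # and share of minimal paths to the destination preserved via this port.
    low_congestion = tri(free_buffer, 0.4, 1.0, 1.6)     # "buffer mostly free"
    high_diversity = tri(path_diversity, 0.3, 1.0, 1.7)  # "many minimal paths"
    # Rule: prefer ports that are uncongested AND keep many minimal paths.
    return min(low_congestion, high_diversity)

candidates = {"east": (0.8, 0.9), "north": (0.5, 1.0), "west": (0.9, 0.2)}
best = max(candidates, key=lambda p: port_score(*candidates[p]))
print("selected port:", best)   # 'east' wins under these sample readings
```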
In this study, a titanium disulfide (TiS2)-polyvinyl alcohol (PVA) film-type saturable absorber (SA) with a modulation depth of 5.08% and a saturation intensity of 10.62 MW/cm² is synthesized by liquid-phase exfoliation and spin-coating methods. Because the TiS2-based SA has a strong nonlinear saturable absorption property, two types of optical soliton were observed in a mode-locked Er-doped fiber laser. When the pump power was raised to 67.3 mW, a conventional mode-locked pulse train with a repetition rate of 1.716 MHz and a pulse width of 6.57 ps was generated, and the output spectrum was centered at 1556.98 nm with a spectral width of 0.466 nm and obvious Kelly sidebands. Another type of mode-locked pulse train, with a maximum output power of 3.92 mW and a pulse energy of 2.28 nJ at a pump power of 517.2 mW, was achieved when the polarization controllers were adjusted. Given the excellent nonlinear saturable absorption characteristics of the TiS2-based SA, broad applications in ultrafast photonics are expected.
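As a consistency check on the quoted figures, the pulse energy of the second pulse train follows directly from its average output power and the cavity's fundamental repetition rate (assuming the 1.716 MHz rate carries over from the first regime, as the cavity length is unchanged):

```latex
E_p = \frac{P_{\mathrm{avg}}}{f_{\mathrm{rep}}}
    = \frac{3.92\ \mathrm{mW}}{1.716\ \mathrm{MHz}}
    \approx 2.28\ \mathrm{nJ}
```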