In this paper, we investigate tensor similarity and propose the T-Jordan canonical form and its properties. The concepts of the T-minimal polynomial and the T-characteristic polynomial are proposed. As a special case, we present the properties of two tensors that commute under the tensor T-product. We prove that the Cayley–Hamilton theorem also holds in the tensor case. We then focus on tensor decompositions: the T-polar, T-LU, T-QR, and T-Schur decompositions are obtained. When an F-square tensor is not invertible with respect to the T-product, we study the T-group inverse and the T-Drazin inverse, which can be viewed as extensions of their matrix counterparts. Expressions for the T-group and T-Drazin inverses are given in terms of the T-Jordan canonical form, and a polynomial form of the T-Drazin inverse is also proposed. In the last part, we give the T-core-nilpotent decomposition and show that the T-index and the T-Drazin inverse can be obtained by a limit process.
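For concreteness, the tensor T-product on which these decompositions and inverses are built can be computed facewise in the Fourier domain. The following NumPy sketch (function names are illustrative, not taken from the paper) shows the basic operation:

```python
import numpy as np

def t_product(A, B):
    """T-product of A (n1 x n2 x n3) and B (n2 x l x n3):
    FFT along the third mode, facewise matrix products, inverse FFT."""
    n1, n2, n3 = A.shape
    _, l, _ = B.shape
    Ah = np.fft.fft(A, axis=2)            # transform tubes to the Fourier domain
    Bh = np.fft.fft(B, axis=2)
    Ch = np.empty((n1, l, n3), dtype=complex)
    for k in range(n3):                   # ordinary matrix product on each frontal slice
        Ch[:, :, k] = Ah[:, :, k] @ Bh[:, :, k]
    return np.real(np.fft.ifft(Ch, axis=2))

# Example: the T-identity tensor has the identity matrix as its first frontal slice.
I = np.zeros((3, 3, 4)); I[:, :, 0] = np.eye(3)
A = np.random.rand(3, 3, 4)
assert np.allclose(t_product(A, I), A)
```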
Tensor robust principal component analysis has received substantial attention in various fields. Most existing methods, which normally rely on tensor nuclear norm minimization, incur a high computational cost because of the multiple singular value decompositions required at each iteration. To overcome this drawback, we propose a scalable and efficient method, named parallel active subspace decomposition, which factors the unfolding along each mode of the tensor into a column-wise orthonormal matrix (the active subspace) and a small matrix, in parallel. This transformation leads to a nonconvex optimization problem in which the scale of the nuclear norm minimization is generally much smaller than in the original problem. We solve the optimization problem by an alternating direction method of multipliers and show that the iterates converge within the given stopping criterion and that the limit point is close to the globally optimal solution within a prescribed bound. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods.
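As a rough illustration of the active-subspace idea only (not the authors' ADMM scheme), a mode unfolding can be factored into a column-wise orthonormal matrix times a much smaller matrix, so that any subsequent nuclear-norm minimization acts on the small factor. The sketch below uses a simple random-projection QR and an assumed target rank:

```python
import numpy as np

def active_subspace_factor(X_unfold, r):
    """Factor an n x m mode unfolding as Q @ S with Q (n x r) column-orthonormal.
    Nuclear-norm minimization can then be applied to the small r x m factor S."""
    G = np.random.randn(X_unfold.shape[1], r)   # random sketch of the column space
    Q, _ = np.linalg.qr(X_unfold @ G)           # n x r, orthonormal columns (active subspace)
    S = Q.T @ X_unfold                          # r x m small factor
    return Q, S

# Toy example: an exactly rank-5 unfolding is recovered (up to round-off).
X = np.random.rand(200, 5) @ np.random.rand(5, 300)
Q, S = active_subspace_factor(X, r=5)
print(np.linalg.norm(X - Q @ S) / np.linalg.norm(X))
```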
The celebrated Erdős–Ko–Rado theorem states that, given $n\geqslant 2k,$ every intersecting k-uniform hypergraph G on n vertices has at most $\binom{n-1}{k-1}$ edges. This paper presents spectral versions of the Erdős–Ko–Rado theorem: let G be an intersecting k-uniform hypergraph on n vertices with $n\geqslant 2k.$ Then sharp upper bounds are presented for the spectral radii of $\mathcal {A}_{\alpha }(G)$ and $\mathcal {Q}^{*}(G)$, where $\mathcal {A}_{\alpha }(G)=\alpha \mathcal {D}(G)+(1-\alpha ) \mathcal {A}(G)$ is a convex linear combination of the degree diagonal tensor $\mathcal {D}(G)$ and the adjacency tensor $\mathcal {A}(G)$ for $0\leqslant \alpha < 1,$ and $\mathcal {Q}^{*}(G)$ is the incidence $\mathcal {Q}$-tensor. Furthermore, when $n>2k,$ the extremal hypergraphs attaining the sharp upper bounds are characterized. The proof relies mainly on the Perron–Frobenius theorem for nonnegative tensors and on properties of the maximizing connected hypergraphs.
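To fix notation for the spectral radius, recall the standard H-eigenvalue convention for an order-k, dimension-n tensor (this is background, not a result of the paper): $\lambda$ is an eigenvalue of $\mathcal {A}_{\alpha }(G)$ with eigenvector $x$ if

$$\bigl(\mathcal {A}_{\alpha }(G)\,x^{k-1}\bigr)_{i}=\sum _{i_{2},\ldots ,i_{k}=1}^{n}\bigl(\mathcal {A}_{\alpha }(G)\bigr)_{i i_{2}\cdots i_{k}}x_{i_{2}}\cdots x_{i_{k}}=\lambda x_{i}^{k-1},\qquad i=1,\ldots ,n,$$

and the spectral radius is the largest modulus of such eigenvalues. For the classical $k=2$ case, the Erdős–Ko–Rado bound simply says that an intersecting graph on $n\geqslant 4$ vertices has at most $n-1$ edges, attained by a star.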
In this article, two new algorithms are presented that convert a given tensor train into either a Tucker decomposition with orthogonal matrix factors or a multi-scale entanglement renormalization ansatz (MERA). The Tucker core tensor is never explicitly computed but is stored as a tensor train instead, resulting in algorithms that are efficient in both computation and storage. Both the multilinear Tucker ranks and the MERA ranks are determined automatically by the algorithms for a given upper bound on the relative approximation error. In addition, an iterative algorithm with low computational complexity, based on solving an orthogonal Procrustes problem, is proposed for the first time to retrieve optimal rank-lowering disentangler tensors, which are a crucial component in the construction of a low-rank MERA. Numerical experiments demonstrate the effectiveness of the proposed algorithms together with the potential storage benefit of a low-rank MERA over a tensor train.
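One way to see the link between TT cores and orthogonal Tucker factors (a minimal sketch under simplified assumptions, not the rank-adaptive algorithms of the article) is that each mode's factor can be obtained from an SVD of the corresponding TT core, with the Tucker core kept in TT form by absorbing the transposed factor back into that core; here only numerically zero singular values are truncated, whereas the article controls the relative approximation error:

```python
import numpy as np

def tt_to_tucker_factors(cores, tol=1e-10):
    """Given TT cores G_i of shape (r_{i-1}, n_i, r_i), return orthonormal Tucker
    factors U_i and the TT cores of the Tucker core tensor (kept in TT format)."""
    factors, core_tt = [], []
    for G in cores:
        r0, n, r1 = G.shape
        M = G.transpose(1, 0, 2).reshape(n, r0 * r1)    # mode-n matricization of the core
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        rank = max(1, int(np.sum(s > tol * s[0])))       # drop numerically zero directions
        U = U[:, :rank]                                  # orthonormal Tucker factor for this mode
        factors.append(U)
        # Absorb U^T into the core so that the Tucker core stays in TT form.
        C = (U.T @ M).reshape(rank, r0, r1).transpose(1, 0, 2)
        core_tt.append(C)
    return factors, core_tt

cores = [np.random.rand(1, 6, 3), np.random.rand(3, 7, 4), np.random.rand(4, 5, 1)]
factors, core_tt = tt_to_tucker_factors(cores)
print([U.shape for U in factors])    # Tucker factors with orthonormal columns
```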
Overfitting frequently occurs in deep learning. In this paper, we propose a novel regularization method called drop-activation to reduce overfitting and improve generalization. The key idea is to randomly drop nonlinear activation functions by setting them to identity functions during training. During testing, we use a deterministic network with a new activation function that encodes the average effect of dropping activations randomly. Our theoretical analyses support the regularization effect of drop-activation as implicit parameter reduction and verify its compatibility with batch normalization (Ioffe and Szegedy in Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015). Experimental results on CIFAR10, CIFAR100, SVHN, EMNIST, and ImageNet show that drop-activation generally improves the performance of popular neural network architectures for the image classification task. Furthermore, as a regularizer, drop-activation can be used in harmony with standard training and regularization techniques such as batch normalization and AutoAugment (Cubuk et al. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 113–123, 2019). The code is available at https://github.com/LeungSamWai/Drop-Activation.
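A minimal PyTorch sketch of the idea for ReLU (the module name and the retain probability p = 0.95 are our illustrative choices; the authors' implementation is in the linked repository):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropActivation(nn.Module):
    """Randomly replaces ReLU by the identity during training; at test time uses the
    deterministic average p * relu(x) + (1 - p) * x, i.e., a leaky-ReLU-like function."""
    def __init__(self, p=0.95):          # p = probability of keeping the nonlinearity
        super().__init__()
        self.p = p

    def forward(self, x):
        if self.training:
            keep = torch.bernoulli(torch.full_like(x, self.p))  # elementwise 0/1 mask
            return keep * F.relu(x) + (1.0 - keep) * x
        return self.p * F.relu(x) + (1.0 - self.p) * x

# Usage: swap nn.ReLU() for DropActivation() inside a standard architecture.
layer = nn.Sequential(nn.Linear(128, 128), DropActivation(p=0.95))
```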
The task of using machine learning to approximate the mapping ${\mathbf {x}}\mapsto \sum\limits _{i=1}^d x_i^2$ with $x_i\in [-1,1]$ seems to be a trivial one. Given knowledge of the separable structure of the function, one can design a sparse network to represent the function very accurately, or even exactly. When such structural information is not available and only a dense neural network may be used, the optimization procedure that finds the sparse network embedded in the dense network, using a given number of samples of the function, is akin to finding a needle in a haystack. We demonstrate that the cost (measured by sample complexity) of finding the needle is directly related to the Barron norm of the function. While only a small number of samples is needed to train a sparse network, a dense network trained with the same number of samples exhibits a large test loss and a large generalization gap. To control the size of the generalization gap, we find that the use of explicit regularization becomes increasingly important as d increases. The numerically observed sample complexity with explicit regularization scales as $\mathcal {O}(d^{2.5})$, which is in fact better than the theoretically predicted sample complexity, which scales as $\mathcal {O}(d^{4})$. Without explicit regularization (i.e., with implicit regularization only), the numerically observed sample complexity is significantly higher, close to $\mathcal {O}(d^{4.5})$.
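The separable ("sparse") architecture alluded to above can be made concrete as follows; this is a hedged sketch in which the widths, the shared 1-D subnetwork, and the dense baseline are illustrative choices, not the paper's exact setup:

```python
import torch
import torch.nn as nn

d = 10

class SparseSumOfSquares(nn.Module):
    """Exploits the separable structure: one shared 1-D subnetwork approximates t -> t^2
    and is applied to each coordinate; the d outputs are summed."""
    def __init__(self, width=20):
        super().__init__()
        self.square = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))

    def forward(self, x):                                 # x: (batch, d)
        return self.square(x.unsqueeze(-1)).sum(dim=1)    # (batch, 1)

# Dense baseline that must discover the structure from samples alone.
dense = nn.Sequential(nn.Linear(d, 50), nn.ReLU(),
                      nn.Linear(50, 50), nn.ReLU(), nn.Linear(50, 1))

x = 2 * torch.rand(256, d) - 1                  # samples from [-1, 1]^d
y = (x ** 2).sum(dim=1, keepdim=True)           # target sum_i x_i^2
```

In such a sketch, explicit regularization for the dense baseline would correspond to, e.g., a nonzero weight-decay term in the optimizer.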
A novel correction algorithm is proposed for multi-class classification problems with corrupted training data. The algorithm is non-intrusive, in the sense that it post-processes a trained classification model by adding a correction procedure to the model prediction. The correction procedure can be coupled with any approximator, such as logistic regression, neural networks of various architectures, etc. When the training dataset is sufficiently large, we prove theoretically (in the limiting case) and show numerically that the corrected models deliver correct classification results as if there were no corruption in the training data. For datasets of finite size, the corrected models produce significantly better recovery results than models without the correction. All of the theoretical findings in the paper are verified by our numerical examples.
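As one generic instance of such a post-processing correction (a standard backward-type correction under a known label-transition matrix, shown only for illustration and not claimed to be the paper's exact procedure), the noisy posterior predicted by a trained model can be mapped back to an estimate of the clean posterior:

```python
import numpy as np

def corrected_prediction(noisy_probs, T):
    """noisy_probs: (batch, C) model estimates of P(observed label | x).
    T: (C, C) with T[i, j] = P(observed label j | true label i).
    Returns corrected class predictions via p_clean ~ (T^T)^{-1} p_noisy."""
    clean_scores = noisy_probs @ np.linalg.inv(T)   # row-vector form of (T^T)^{-1} p
    return clean_scores.argmax(axis=1)

# Toy example: 3 classes with 20% symmetric label corruption.
C, eps = 3, 0.2
T = (1 - eps) * np.eye(C) + eps / (C - 1) * (np.ones((C, C)) - np.eye(C))
noisy_probs = np.array([[0.5, 0.3, 0.2]])
print(corrected_prediction(noisy_probs, T))         # -> [0]
```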