Identification of Corrupted Data via k-Means Clustering for Function Approximation

Jun Hou, Yeonjong Shin, Dongbin Xiu

CSIAM Trans. Appl. Math. ›› 2021, Vol. 2 ›› Issue (1): 81-107. DOI: 10.4208/csiam-am.2020-0212

Abstract

In addition to measurement noise, real-world data are often corrupted by unexpected internal or external errors. Corruption errors can be much larger than standard noise and can negatively affect data processing results. In this paper, we propose a method for identifying corrupted data in the context of function approximation. The method is a two-step procedure consisting of an approximation stage and an identification stage. In the approximation stage, we apply a straightforward function approximation to the entire data set for preliminary processing. In the identification stage, a clustering algorithm is applied to the processed data to identify the potentially corrupted data entries. In particular, we find the k-means clustering algorithm to be highly effective. Our theoretical analysis reveals that, under sufficient conditions, the proposed method can exactly identify all corrupted data entries. Numerous examples are provided to verify our theoretical findings and demonstrate the effectiveness of the method.
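
Below is a minimal sketch of the two-stage idea described in the abstract, not the paper's actual implementation. It assumes, for illustration only, that the approximation stage is an ordinary least-squares polynomial fit to the entire data set and that the identification stage runs k-means with k = 2 on the absolute residuals, flagging the cluster with the larger center as corrupted; the synthetic data, polynomial degree, and choice of k are assumptions of this sketch, and the paper should be consulted for the precise formulation and the conditions guaranteeing exact identification.

```python
# A minimal sketch of the two-stage idea described in the abstract.
# Assumptions (illustrative, not from the paper): the approximation stage
# is an ordinary least-squares polynomial fit to all data, and the
# identification stage runs k-means (k = 2) on the absolute residuals,
# flagging the cluster with the larger center as corrupted.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 1D data: a smooth target plus small noise, with a few
# entries hit by much larger corruption errors.
n = 200
x = rng.uniform(-1.0, 1.0, n)
y = np.cos(np.pi * x) + 0.01 * rng.standard_normal(n)
corrupted = rng.choice(n, size=10, replace=False)
y[corrupted] += rng.uniform(1.0, 3.0, size=corrupted.size)

# Approximation stage: straightforward least-squares fit to the entire set.
coeffs = np.polynomial.polynomial.polyfit(x, y, 8)
residuals = np.abs(y - np.polynomial.polynomial.polyval(x, coeffs))

# Identification stage: cluster residual magnitudes with k-means (k = 2);
# the cluster whose center is larger is declared corrupted.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(residuals.reshape(-1, 1))
bad = int(np.argmax(km.cluster_centers_.ravel()))
flagged = np.flatnonzero(km.labels_ == bad)

print("true corrupted entries:", np.sort(corrupted))
print("flagged by k-means:    ", flagged)
```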

Keywords

Data corruption / function approximation / sparse approximation / k-means clustering.

Cite this article

Jun Hou, Yeonjong Shin, Dongbin Xiu. Identification of Corrupted Data via k-Means Clustering for Function Approximation. CSIAM Trans. Appl. Math., 2021, 2(1): 81-107. DOI: 10.4208/csiam-am.2020-0212


