RESEARCH ARTICLE

Bayesian Ying-Yang system, best harmony learning, and five action circling

  • Lei XU
  • Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China

Received date: 01 Mar 2010

Accepted date: 31 Mar 2010

Published date: 05 Sep 2010

Copyright

2010 Higher Education Press and Springer-Verlag Berlin Heidelberg

Abstract

First proposed in 1995 and systematically developed over the past decade, Bayesian Ying-Yang learning (footnote: “Ying” is spelled “Yin” in Chinese Pin Yin; to keep its original harmony with Yang, we have deliberately adopted the term “Ying-Yang” since 1995) is a statistical approach to an intelligent system featuring two pathways, via two complementary Bayesian representations of the joint distribution on the external observation X and its inner representation R, which can be understood from the perspective of the ancient Ying-Yang philosophy. We have q(X,R)=q(X|R)q(R) as Ying, which is primary, with its structure designed according to the tasks of the system, and p(X,R)=p(R|X)p(X) as Yang, which is secondary, with p(X) given by samples of X while the structure of p(R|X) is designed from Ying according to a Ying-Yang variety preservation principle, i.e., p(R|X) is designed as a functional with q(X|R) and q(R) as its arguments. We call this pair a Bayesian Ying-Yang (BYY) system. A Ying-Yang best harmony principle is proposed for learning all the unknowns in the system, with the help of an implementation featuring a five-action circling under the name of the A5 paradigm. Interestingly, it coincides with the famous ancient WuXing theory, which provides a general guide to keeping the A5 circling well balanced towards a Ying-Yang best harmony. This BYY learning provides not only a general framework that accommodates typical learning approaches from a unified perspective but also a new road that leads to improved model selection criteria, Ying-Yang alternating learning with automatic model selection, and a coordinated implementation of Ying-based model selection and Yang-based learning regularization.
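To make the two global principles concrete, the following is a minimal LaTeX sketch of the harmony measure and its matching counterpart in the forms commonly used in the BYY literature; the symbols H(p||q) and KL(p||q) and the plain integral notation are assumptions of this sketch, and the paper's actual functionals may carry extra regularization terms (e.g., from data smoothing):

    H(p\|q)  = \int p(R|X)\, p(X)\, \ln\big[\, q(X|R)\, q(R) \,\big]\, dX\, dR \;\to\; \max,
    KL(p\|q) = \int p(R|X)\, p(X)\, \ln \frac{p(R|X)\, p(X)}{q(X|R)\, q(R)}\, dX\, dR \;\to\; \min,
    KL(p\|q) = -H(p\|q) - H_p, \quad H_p = -\int p(R|X)\, p(X)\, \ln\big[\, p(R|X)\, p(X) \,\big]\, dX\, dR.

The third line shows how best harmony differs from matching: maximizing H(p||q) amounts to Ying-Yang matching plus minimizing the Yang entropy H_p, a least-complexity pressure that underlies automatic model selection.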

This paper aims to introduce BYY learning with a twofold purpose. On one hand, we introduce the fundamentals of BYY learning, including system design principles of least redundancy versus variety preservation, global learning principles of Ying-Yang harmony versus Ying-Yang matching, and local updating mechanisms of rival penalized competitive learning (RPCL) versus maximum a posteriori (MAP) competitive learning, as well as learning regularization by data smoothing and an induced bias cancelation (IBC) prior. We also introduce basic implementation techniques, including apex approximation, primal gradient flow, Ying-Yang alternation, and the Sheng-Ke-Cheng-Hui law. On the other hand, we provide a tutorial on learning algorithms for a number of typical learning tasks, including Gaussian mixture, factor analysis (FA) with independent Gaussian, binary, and non-Gaussian factors, local FA, temporal FA (TFA), hidden Markov model (HMM), hierarchical BYY, three-layer networks, mixture of experts, radial basis functions (RBFs), and subspace based functions (SBFs). This tutorial introduces BYY learning algorithms in comparison with typical algorithms, particularly against the benchmark of the expectation-maximization (EM) algorithm for maximum likelihood. These algorithms are summarized in a unified Ying-Yang alternation procedure whose major parts share the same expression, with differences characterized simply by a few options in some subroutines. Additionally, a new insight is provided on the ancient Chinese philosophy of Yin-Yang and WuXing from the perspective of information science and intelligent systems.
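As an illustration of the unified Ying-Yang alternation style just mentioned, the following is a minimal, hypothetical Python sketch for a Gaussian mixture; the hard (winner-take-all) Yang step, the pruning threshold eps, and the function names yang_step and ying_step are illustrative assumptions rather than the paper's exact algorithm:

    import numpy as np

    def yang_step(X, means, covs, weights, hard=True):
        # Yang step: infer the inner representation p(R|X).
        # hard=True: winner-take-all posterior (harmony-style);
        # hard=False: soft posterior, as in the EM benchmark.
        n, k = X.shape[0], len(weights)
        logp = np.empty((n, k))
        for j in range(k):
            d = X - means[j]
            inv = np.linalg.inv(covs[j])
            _, logdet = np.linalg.slogdet(covs[j])
            logp[:, j] = (np.log(weights[j]) - 0.5 * logdet
                          - 0.5 * np.einsum('ni,ij,nj->n', d, inv, d))
        if hard:
            post = np.zeros_like(logp)
            post[np.arange(n), logp.argmax(axis=1)] = 1.0
        else:
            logp -= logp.max(axis=1, keepdims=True)
            post = np.exp(logp)
            post /= post.sum(axis=1, keepdims=True)
        return post

    def ying_step(X, post, eps=1e-2):
        # Ying step: re-estimate q(X|R) and q(R); drop components whose
        # mixing weight falls below eps (automatic model selection flavor).
        n, dim = X.shape
        weights = post.sum(axis=0) / n
        keep = weights > eps
        post, weights = post[:, keep], weights[keep] / weights[keep].sum()
        means, covs = [], []
        for j in range(post.shape[1]):
            w = post[:, j]
            m = (w[:, None] * X).sum(axis=0) / w.sum()
            d = X - m
            c = (w[:, None] * d).T @ d / w.sum() + 1e-6 * np.eye(dim)
            means.append(m)
            covs.append(c)
        return np.array(means), np.array(covs), weights

Alternating yang_step and ying_step from a deliberately over-large initial number of components until the parameters stabilize mimics the structure described above: the Yang step fixes the inner representation, the Ying step re-estimates the generative side, and components starved of samples are discarded along the way. Setting hard=False and disabling the pruning recovers ordinary EM for maximum likelihood.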

Cite this article

Lei XU. Bayesian Ying-Yang system, best harmony learning, and five action circling[J]. Frontiers of Electrical and Electronic Engineering, 2010, 5(3): 281-328. DOI: 10.1007/s11460-010-0108-9

