School of Electronics and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
xjtu.cloud@gmail.com
Published: 2009-03-05
Abstract
Focusing on the problem that the ant colony algorithm stagnates easily and cannot fully search the solution space, a text clustering approach based on the fusion of the ant colony and genetic algorithms is proposed. The four parameters that influence the performance of the ant colony algorithm are encoded as chromosomes; the fitness function and the selection, crossover and mutation operators are then designed to find the optimal combination of parameters through a number of iterations, and the result is applied to text clustering. Simulation results on 3 test data sets show that, compared with the classical k-means clustering and the basic ant colony clustering algorithm, the proposed algorithm performs better, with the F-Measure value improved by 5.69%, 48.60% and 69.60%, respectively. It is therefore more suitable for processing larger datasets.
Yun ZHANG, Boqin FENG, Shouqiang MA, Lianmeng LIU.
Text clustering based on fusion of ant colony and genetic algorithms.
Front. Electr. Electron. Eng., 2009, 4(1): 15–19. DOI: 10.1007/s11460-009-0019-9
Today, text clustering has become an important research topic for addressing information overload in the field of information processing [1-3].
The ant colony algorithm (ACA) [4] is a novel simulated evolutionary algorithm that shows many promising characteristics [5], such as positive feedback, distributed computing and strong robustness. However, it has shortcomings: it stagnates easily, needs long computation time and cannot fully search the solution space. Moreover, its parameters are set randomly, without theoretical basis. The genetic algorithm (GA) is a global search algorithm that imitates the mechanisms of biological evolution and natural selection. It has the advantage of not easily falling into local optima; even when the defined fitness function is non-continuous, irregular or accompanied by noise, it can find the global optimum with high probability. Furthermore, it can easily be combined with other algorithms to achieve better solutions.
In this article, a text clustering algorithm based on the fusion of the ant colony algorithm and the genetic algorithm is proposed. It utilizes the genetic algorithm's strength on combinatorial optimization problems to obtain the optimal combination of the parameters needed by the ant colony algorithm, and then applies the result to a text clustering task.
Clustering models and measures
Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into groups (clusters), so that the data in the same group are similar among themselves, and collectively different from the data of other groups. Generally, the distance measure between objects can be used to determine the similarity between them; a formalized description of the clustering model is as follows:
Given a dataset $X = \{x_1, x_2, \dots, x_n\}$, for $1 \le i \le n$, $x_i = (x_{i1}, x_{i2}, \dots, x_{im})$ is an object in $X$, and for $1 \le l \le m$, $x_{il}$ is an attribute of $x_i$. According to the inner features of the objects, $X$ can be grouped into $k$ clusters $C = \{C_1, C_2, \dots, C_k\}$, where $C_1 \cup C_2 \cup \dots \cup C_k = X$ and $C_i \cap C_j = \emptyset$ for $i \ne j$; $C$ is called a clustering space, and $C_i$ is the $i$th cluster in the clustering space.
There have been several suggestions for a measure of the similarity between two clusterings. Such a measure can be used to compare how well different clustering algorithms perform on a set of data. Generally, the sum of the squared error (SSE) of each object to its cluster center is used as the evaluation function, that is,

$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} d(x_i, c_j)^2, \qquad (1)$$

where $n_j$ denotes the number of objects in the $j$th cluster and $d(x_i, c_j)$ is the distance between object $x_i$ and cluster center $c_j$. The smaller the SSE value, the better the clustering results.
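As a concrete illustration, Eq. (1) can be computed directly; the sketch below is a minimal example of ours (NumPy arrays and Euclidean distance are assumptions, not part of the original):

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared Euclidean distances from each object to its cluster center."""
    total = 0.0
    for x, j in zip(X, labels):
        total += np.linalg.norm(x - centers[j]) ** 2
    return total
```

For instance, three points on a line assigned to two centers give the sum of their individual squared distances.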
External criteria can also be used to evaluate text clustering. F-Measure [6], which combines precision and recall, is an index of system performance in the information retrieval field. The precision P and recall R of cluster j with respect to classification i are defined as

$$P(i,j) = \frac{N_{ij}}{N_j}, \qquad R(i,j) = \frac{N_{ij}}{N_i}, \qquad (2)$$

where $N_{ij}$ is the number of objects in cluster $j$ subordinated to classification $i$; $N_i$ is the number of all objects in classification $i$; $N_j$ is the number of all objects in cluster $j$. The formula of F-Measure is as follows:

$$F(i,j) = \frac{2\,P(i,j)\,R(i,j)}{P(i,j) + R(i,j)}. \qquad (3)$$
Thus, the evaluation function of the clustering results is defined as

$$F = \sum_{i} \frac{N_i}{N} \max_{j} \{F(i,j)\}, \qquad (4)$$

where $N$ is the total number of objects. A greater F-Measure means better clustering results.
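The F-Measure of Eqs. (2)-(4) can be computed from the class-by-cluster contingency counts; the following is a minimal sketch of ours (the nested-list representation of the counts is an assumption):

```python
def f_measure(counts):
    """counts[i][j] = N_ij, the number of objects of class i placed in cluster j."""
    n_classes = len(counts)
    n_clusters = len(counts[0])
    n_i = [sum(row) for row in counts]  # objects per class (N_i)
    n_j = [sum(counts[i][j] for i in range(n_classes))
           for j in range(n_clusters)]  # objects per cluster (N_j)
    n = sum(n_i)                        # total number of objects (N)
    total = 0.0
    for i in range(n_classes):
        best = 0.0
        for j in range(n_clusters):
            if counts[i][j] == 0:
                continue
            p = counts[i][j] / n_j[j]   # precision, Eq. (2)
            r = counts[i][j] / n_i[i]   # recall, Eq. (2)
            best = max(best, 2 * p * r / (p + r))  # F(i,j), Eq. (3)
        total += n_i[i] / n * best      # weighted best match, Eq. (4)
    return total
```

A perfect clustering (each class isolated in its own cluster) yields an F-Measure of 1.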
Fusions of ant colony and genetic algorithms
Clustering algorithm based on ant-foraging rules
After long-term observation, biologists found that a moving ant may deposit some pheromone (a particular chemical that ants can smell) on the ground while walking, thus marking the path it follows with a trail of this substance. By sensing pheromone trails, foragers can choose the shortest among the available paths from their nest to feeding sources and back. The collective behavior that emerges is a form of autocatalytic behavior: the more ants follow a trail, the more attractive that trail becomes to subsequent ants. The process is thus characterized by a positive feedback loop. The ant colony algorithm is a heuristic algorithm derived from the study of the behavior of real ant colonies.
By applying the ant foraging rules to text clustering, each datum can be viewed as an ant with different attributes, a cluster center can be viewed as a food source searched for by the ants, and the clustering analysis can be considered as the ant foraging process. In each search iteration, every ant decides its next position by computing a transition probability based on the intensity of the trail and heuristic information such as the visibility on each path. At time $t$, the transition probability from datum $x_i$ to cluster center $j$ is defined as

$$p_{ij}(t) = \frac{[\tau_{ij}(t)]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{s=1}^{K} [\tau_{is}(t)]^{\alpha}\,[\eta_{is}]^{\beta}}, \qquad (5)$$

where α is the trail heuristic factor that reflects the relative importance of the accumulated intensity of the trail when ants are walking; β is the visibility heuristic factor that reflects the relative importance of visibility when ants are walking; $\tau_{ij}(t)$ is the intensity of the trail on the path from datum $i$ to cluster center $j$ at time $t$. At the initial time the intensity of the trail is the same on each path, that is, $\tau_{ij}(0)$ is a constant; $\eta_{ij}$ is the visibility, an a priori heuristic factor used when ants are walking, taken inversely proportional to the distance $d(x_i, c_j)$.
After each iteration, the cluster centers are changed, and the trail intensity from each datum to each cluster center is adjusted by the following rule:

$$\tau_{ij}(t+1) = \rho\,\tau_{ij}(t) + \Delta\tau_{ij}, \qquad (6)$$

where ρ is a coefficient such that 1 − ρ represents the evaporation of the trail, and $\Delta\tau_{ij}$ is the quantity per unit of length of the trail substance (pheromone in real ants) laid on the path from datum $i$ to cluster center $j$ between time $t$ and $t+1$. $Q$ is a constant; a greater $Q$ means a faster trail accumulating velocity on the routes covered by ants, which affects the convergence rate of the algorithm to a certain extent. Each time an ant transfers from one cluster center to another, the cluster centers change, which starts the next clustering loop. The loop stops when the clustering results converge.
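Under these definitions, Eqs. (5) and (6) can be sketched as follows (a minimal illustration of ours, not the authors' code; the array shapes and the externally supplied Δτ values are assumptions):

```python
import numpy as np

def transition_probs(tau_i, eta_i, alpha, beta):
    """Eq. (5): p_ij for one datum x_i over all K cluster centers.
    tau_i[j] is the trail intensity to center j; eta_i[j] is the visibility."""
    w = (tau_i ** alpha) * (eta_i ** beta)
    return w / w.sum()  # normalize over all candidate centers

def update_trail(tau, rho, delta):
    """Eq. (6): the old trail evaporates by (1 - rho), new pheromone is added."""
    return rho * tau + delta
```

With uniform trail and visibility, the transition probabilities are uniform as well; increasing α or β sharpens the preference for strong trails or near centers, respectively.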
Fusion of ant colony and genetic algorithms
In the ant foraging based clustering algorithm, many parameters need to be initialized, and they greatly influence the performance of the algorithm. Thus, determining an optimal combination of these parameters to achieve optimal performance is a complex combinatorial optimization problem that has no theoretical basis until now; generally, the parameters are set by experience [4,7,8]. In Ref. [9], the ranges of these parameters when using the ant colony algorithm to solve the TSP are given as follows: 0 ≤ α ≤ 5, 0 ≤ β ≤ 5, 0.1 ≤ ρ ≤ 0.99, 10 ≤ Q ≤ 10000. However, there is no evidence that these ranges are also applicable to other applications. We carried out many experiments on text clustering, and the results were not satisfactory.
Thus, we propose a fusion of the ant colony algorithm and the genetic algorithm in this article. Based on the combinatorial optimization ability of the genetic algorithm, the four parameters used in the ant colony algorithm, that is, α, β, ρ and Q, are encoded as chromosomes in the genetic algorithm. The scope of each chromosome is set large enough. The fitness function and the selection, crossover and mutation operators are designed. Finally, the optimal combination of the parameters is found after repeated iterations, making the ant colony algorithm achieve optimal performance on the specific problem being solved. The fitness function should be designed according to the features of that problem; in this article, the evaluation function F-Measure is used as the fitness function, so each individual chromosome is judged by the quality of the clustering results it produces.
The pseudo-code of the ant colony and genetic clustering algorithm (ACGA) is shown as follows:
Step 1 Initialize parameters and set the scope of each chromosome.
Step 2 Generate the initial population.
For i=1 to PopSize do
    Generate population[i].chrom randomly
Step 3 Use the ant colony algorithm to obtain the fitness value.
For i=1 to PopSize do
{ //for every individual in the population
    (α,β,ρ,Q) = population[i].chrom;
    For j=1 to K do //select the initial centers
        centers[j] = dataset[j];
    While (N < maxNACA && the centers are still changing)
    {
        For every xi in X //for each datum xi in dataset X
        {
            For j=1 to K do
                compute pij by Eq. (5)
            Move xi to Cj, where pij = max{pis, 0 < s ≤ K}
        }
        For j=1 to K do //update centers
            update new centers[j]
        For every datum–center pair (i,j) do //update trail
            compute τij by Eq. (6)
    }
    population[i].fitness = F-Measure, computed by Eq. (4)
}
Step 4 Genetic operations.
Rank all the individuals in the current population by their fitness values, select new individuals by roulette wheel selection, and then generate the new population through single-point crossover and mutation operations.
Step 5 Repeat Steps 3 and 4 until the iteration number reaches maxNGA.
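The genetic operators of Step 4 can be sketched in Python as follows. This is a hedged illustration of ours, not the authors' implementation: the parameter bounds, the list-based chromosome representation and the helper names are assumptions.

```python
import random

# Hypothetical bounds for (alpha, beta, rho, Q); the paper only requires
# the scope of each chromosome to be "large enough".
BOUNDS = [(0.0, 100.0), (0.0, 100.0), (0.0, 1.0), (10.0, 1000.0)]

def roulette_select(population, fitnesses):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitnesses)
    r = random.uniform(0, total)
    acc = 0.0
    for chrom, fit in zip(population, fitnesses):
        acc += fit
        if acc >= r:
            return chrom
    return population[-1]

def single_point_crossover(a, b):
    """Swap the gene tails of two chromosomes after a random cut point."""
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(chrom, p_m):
    """With probability p_m per gene, resample it uniformly within its bounds."""
    return [random.uniform(*BOUNDS[k]) if random.random() < p_m else g
            for k, g in enumerate(chrom)]
```

Each generation would select parents by `roulette_select`, recombine them with `single_point_crossover`, mutate the offspring, and re-evaluate fitness by running the ACA with the decoded (α, β, ρ, Q).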
Experiments and results
Dataset
The dataset is extracted from the universal standard text collection Reuters-21578, in which the documents were collected from the Reuters newswire in 1987; 135 different categories have been assigned to the 21578 Reuters documents. We selected several categories from it and constructed 3 datasets, which differ in size and in the number of topics; they are described in Table 1. The vector space model is used here: each document is converted from its original format (i.e., a string of characters) to a word vector $d = (w_1, w_2, \dots, w_m)$, where $w_i$ is the weight of the $i$th word. An information gain (IG) based approach is used for feature selection, with the threshold set to 0.5, meaning that words whose IG value is lower than this threshold are discarded. The popular LTC weighting method is used. Finally, the dimensions of the word vectors in the three datasets are 145, 155 and 173, respectively. F-Measure and SSE are used to evaluate the performance of the k-means algorithm, the ant colony algorithm (ACA) and the ACGA.
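The LTC weighting step can be sketched as follows. This is our own minimal illustration, assuming the standard SMART "ltc" scheme (logarithmic term frequency times inverse document frequency, with cosine normalization); the dictionary-based representation is an assumption:

```python
import math

def ltc_weights(tf_doc, df, n_docs):
    """LTC weighting for one document.
    tf_doc: {term: raw count in the document}; df: {term: document frequency};
    n_docs: total number of documents in the collection."""
    # l*t: logarithmic term frequency times inverse document frequency
    w = {t: (1 + math.log(tf)) * math.log(n_docs / df[t])
         for t, tf in tf_doc.items() if tf > 0}
    # c: cosine normalization so the resulting vector has unit length
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm > 0 else w
```

After weighting, each document becomes a unit-length vector over the selected features, which is the representation the clustering algorithms operate on.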
Experimental results
Comparison of clustering results
The predefined cluster number K is set as the number of categories of each dataset, and TotalNum is the size of each dataset. The parameters of the ACA are set as follows: α=95, β=45, ρ=0.5, Q=400, and the maximum number of iterations maxNACA=30. The parameters of the ACGA are set as follows: the population size PopSize=30, the crossover probability pC=0.8, the mutation probability pM=0.8, and the maximum number of iterations maxNGA=50. To verify the sensitivity of the various algorithms to the initial cluster centers, two conditions are set for choosing them: initc=1, a poor choice, in which the initial centers are chosen from the same category; and initc=2, a better choice, in which the initial centers are chosen from different categories. The experimental results are illustrated in Tables 2 and 3 and Figs. 1 and 2.
As can be seen from Tables 2 and 3, the clustering performance of the ACGA and the ACA is better than that of the k-means algorithm. The F-Measure of the ACGA is improved by 5.69%, 48.60% and 69.60% compared with the k-means algorithm when the initial cluster centers are poorly chosen. This also indicates that the ACGA is more suitable than the k-means algorithm for dealing with larger datasets. In addition, comparing the results obtained with the different methods of selecting initial cluster centers, we can see that the ACA is sensitive to the initial cluster centers, as is the k-means algorithm: if the initial cluster centers are inappropriately selected, it is difficult to obtain good clustering results. The fusion algorithm of the ant colony and genetic algorithms reduces this sensitivity to improper selection of the initial cluster centers and can obtain better clustering results in either case.
It can also be seen from Fig. 3 that, though the k-means clustering algorithm can achieve a small SSE value, in most cases the ACGA obtains an SSE value smaller than that of k-means as well as the optimum F-Measure value. This further proves that the ACGA is an effective clustering method.
It must be pointed out that the ACGA has lower efficiency than the ACA and the k-means algorithm, because it needs to execute the ACA multiple times. However, this shortcoming is acceptable given the advantages the ACGA brings, such as reduced sensitivity to the initial cluster centers, automatic discovery of the globally optimal combination of the parameters in the ACA, and improved clustering results.
Influence of different scopes of parameters on ant colony algorithm
When the ACA is used to solve classical TSP problems, the scopes of the parameters are set as follows [9]: 0≤α≤5, 0≤β≤5, 0.1≤ρ≤0.99, 10≤Q≤10000. The scopes of the parameters in the ACGA are set as follows: 0≤α≤100, 0≤β≤100, 0≤ρ≤1, 10≤Q≤1000. The other parameters are set the same as in Sect. 4.2.1. Testing on the three datasets separately, the optimal F-Measure values found by the genetic algorithm within 50 generations are illustrated in Table 4.
It can be seen from Table 4 that the scopes of the parameters used when the ACA solves TSP problems are not suitable for all other problems, such as the text clustering problem. Compared with the ACA, the algorithm proposed in this article better addresses the problem of setting multiple parameter values.
Conclusions
The genetic algorithm (GA) is a self-adaptive probabilistic search method for solving optimization problems, which has been applied widely in science and engineering. The ant colony algorithm simulates the social behavior of ant colonies and has become a research hotspot in recent years. The ant colony-genetic fusion clustering algorithm proposed in this article combines the two, using the genetic algorithm's ability to solve combinatorial optimization problems to determine the optimal values of multiple parameters in the ant colony algorithm, and applies the result to the text clustering task to achieve better results.
Future research will focus on further improving the efficiency of the ant colony-genetic fusion algorithm and on studying the setting of the various parameters in the ant colony algorithm theoretically.
[1] Liu Y C, Wang X L, Xu Z M, Guan Y. A survey of document clustering. Journal of Chinese Information Processing, 2006, 20(3): 55–62 (in Chinese)
[2] Sasaki M, Shinnou H. Spam detection using text clustering. In: Proceedings of the 2005 International Conference on Cyberworlds (CW'05), Singapore. 2005, 316–319
[3] He F, Ding X Q. Combining text clustering and retrieval for corpus adaptation. Proceedings of SPIE. 2007, 6500: 65000P1–7
[4] Dorigo M, Blum C. Ant colony optimization theory: a survey. Theoretical Computer Science, 2005, 344(2-3): 243–278
[5] Zhu X L, Li J Z. An ant colony system-based optimization scheme of data mining. In: Proceedings of the 6th International Conference on Intelligent Systems Design and Applications (ISDA'06), Jinan, Shandong, China. 2006, 400–403
[6] van Rijsbergen C J. Information Retrieval. 2nd ed. London: Butterworths, 1979
[7] Wu C M, Chen Z, Jiang M. The research on initialization of ants system and configuration of parameters for different TSP problems in ant algorithm. Acta Electronica Sinica, 2006, 34(8): 1530–1533 (in Chinese)
[8] Huang Y Q, Liang C Y, Zhang X D. Parameter establishment of an ant system based on uniform design. Control and Decision, 2006, 21(1): 93–96 (in Chinese)
[9] Duan H B. Ant Algorithm: Theory and Its Applications. Beijing: Science Press, 2005 (in Chinese)
RIGHTS & PERMISSIONS
Higher Education Press and Springer-Verlag Berlin Heidelberg