Performance analysis of new word weighting procedures for opinion mining
G. R. BRINDHA, P. SWAMINATHAN, B. SANTHI
The proliferation of forums and blogs creates both challenges and opportunities for processing large amounts of information. The information shared on various topics often contains opinionated words, which are qualitative in nature. These qualitative words need statistical computation to convert them into useful quantitative data, and this data must be processed carefully because it expresses opinions. Opinion-bearing words differ in the significance of the meaning they convey. To turn the linguistic meaning of words into data and to enhance opinion mining analysis, we propose a novel weighting scheme, referred to as inferred word weighting (IWW). IWW is computed from the significance of the word in the document (SWD) and the significance of the word in the expression (SWE), with the aim of improving performance. Compared to existing methods, the proposed weighting methods give an analytic view and assign more appropriate weights to the words. In addition to the new weighting methods, we examine how including stop-words affects text classification performance; stop-words are generally removed in text processing. When stop-words are included with the proposed and existing weighting methods, two facts are observed: (1) classification performance is enhanced; (2) the difference in outcome between including and excluding stop-words is smaller for the proposed methods and larger for existing methods. The inferences drawn from these observations are discussed. Experimental results on benchmark data sets show the potential enhancement in terms of classification accuracy.
Inferred word weight / Opinion mining / Supervised classification / Support vector machine (SVM) / Machine learning
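The effect of stop-word removal on word weights can be illustrated with a minimal sketch. It does not implement IWW, SWD, or SWE, whose formulas are not given here; it uses plain tf-idf as a stand-in for an existing weighting method. The toy corpus, the stop-word list, and the function names are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop-word list (assumptions, not from the paper).
DOCS = [
    "the movie was not good",
    "the movie was good",
    "the plot was not bad at all",
]
STOP_WORDS = {"the", "was", "not", "at", "all"}


def tokenize(text, keep_stop_words):
    """Lowercase and split on whitespace, optionally dropping stop-words."""
    tokens = text.lower().split()
    if keep_stop_words:
        return tokens
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(docs, keep_stop_words):
    """Return one {term: weight} dict per document using plain tf-idf.

    tf is the relative term frequency in the document and
    idf is log(N / df), where df is the number of documents containing the term.
    """
    tokenized = [tokenize(d, keep_stop_words) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


with_stop = tfidf(DOCS, keep_stop_words=True)
without_stop = tfidf(DOCS, keep_stop_words=False)

# With stop-words removed, the first two reviews get identical weight vectors
# even though one says "not good" and the other says "good".
print(without_stop[0] == without_stop[1])
# With stop-words kept, "not" receives a non-zero weight in the first review.
print(with_stop[0]["not"])
```

In this toy case the negation "not" sits on the stop-word list, so removing stop-words makes two reviews with opposite polarity indistinguishable. This is one plausible reason, offered here only as an illustration, why keeping stop-words could help an opinion classifier.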