Stratified sampling for data mining on the deep web

Tantan LIU, Fan WANG, Gagan AGRAWAL

PDF(650 KB)
PDF(650 KB)
Front. Comput. Sci. ›› 2012, Vol. 6 ›› Issue (2) : 179-196. DOI: 10.1007/s11704-012-2859-3
RESEARCH ARTICLE

Stratified sampling for data mining on the deep web

Author information +
History +

Abstract

In recent years, the deep web has become extremely popular. Like any other data source, data mining on the deep web can produce important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, and therefore, data mining must be performed by sampling the datasets. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems, association mining and differential rulemining. These are proposed to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source, and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs, compared with a stratified sampling method that only considers estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.

Keywords

deep web / associate rule mining / stratified sampling

Cite this article

Download citation ▾
Tantan LIU, Fan WANG, Gagan AGRAWAL. Stratified sampling for data mining on the deep web. Front Comput Sci, 2012, 6(2): 179‒196 https://doi.org/10.1007/s11704-012-2859-3

References

[1]
Bergman M K. White paper: the deep web: surfacing hidden value. Journal of Electronic Publishing, 2001, 7(1)
CrossRef Google scholar
[2]
Braga D, Ceri S, Daniel F, Martinenghi D. Optimization of multidomain queries on the web. Proceedings of the VLDB Endowment, 2008, 1(1): 562-673
[3]
Cali A, Martinenghi D. Querying data under access limitations. In: Proceedings of the 24th IEEE International Conference on Data Engineering. 2008, 50-59
CrossRef Google scholar
[4]
Toivonen H. Sampling large databases for association rules. In: Proceedings of the 22nd International Conference on Very Large Data Bases. 1996, 134-145
[5]
Parthasarathy S. Efficient progressive sampling for association rules. In: Proceedings of the 2002 IEEE International Conference on Data Mining. 2002, 354-361
CrossRef Google scholar
[6]
Chen B, Haas P, Scheuermann P. A new two-phase sampling based algorithm for discovering association rules. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002, 462-468
CrossRef Google scholar
[7]
Boley M, Grosskreutz H. A randomized approach for approximating the number of frequent sets. In: Proceedings of the 8th IEEE International Conference on Data Mining. 2008, 43-52
CrossRef Google scholar
[8]
Campagna A, Pagh R. Finding associations and computing similarity via biased pair sampling. Knowledge and Information Systems (in Press)
[9]
Bar-Yossef Z, Gurevich M. Mining search engine query logs via suggestion sampling. Proceedings of the VLDB Endowment, 2008, 1(1): 54-65
[10]
Dasgupta A, Das G, Mannila H. A random walk approach to sampling hidden databases. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. 2007, 629-640
CrossRef Google scholar
[11]
Dasgupta A, Zhang N, Das G. Leveraging count information in sampling hidden databases. In: Proceedings of the 25th IEEE International Conference on Data Engineering. 2009, 329-340
CrossRef Google scholar
[12]
Liu B. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). New York: Springer-Verlag, 2006
[13]
Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases. 1994, 487-499
[14]
Zaki M J. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 2000, 12(3): 372-390
CrossRef Google scholar
[15]
Salam A, Khayal M S H. Mining top-k frequent patterns without minimum support threshold. Knowledge and Information Systems, 2012, 30(1): 57-86
CrossRef Google scholar
[16]
Gaya M C, Giráldez J I. Merging local patterns using an evolutionary approach. Knowledge and Information Systems, 2011, 29(1): 1-24
CrossRef Google scholar
[17]
Wang F, Agrawal G, Jin R, Piontkivska H. Snpminer: a domainspecific deep web mining tool. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering. 2007, 192-199
[18]
Cochran W. Sampling Techniques. 3rd ed. New York: Wiley and Sons, 1977
[19]
Press W H, Farrar G R. Recursive stratified sampling for multidimensional Monte Carlo integration. Computers in Physics, 1990, 4(2): 190-195
[20]
Caflisch R E. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, 1998, 7: 1-49
CrossRef Google scholar
[21]
Bay S D, Pazzani M J. Detecting group differences: mining contrast sets. Data Mining and Knowledge Discovery, 2001, 5(3): 213-246
CrossRef Google scholar
[22]
Loekito E, Bailey J. Mining influential attributes that capture class and group contrast behaviour. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management. 2008, 971-980
CrossRef Google scholar
[23]
Brin S, Motwani R, Silverstein C. Beyond market baskets: generalizing association rules to correlations. In: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data. 1997, 265-276
CrossRef Google scholar
[24]
Rubin D B. Matching to remove bias in observational studies. Biometrics, 1973, 29(1): 159-183
CrossRef Google scholar
[25]
Anderson D W, Kish L, Cornell R G. On stratification, grouping and matching. Scandinavian Journal of Statistics, 1980, 7(2): 61-66
[26]
Sprinthall R C. Basic Statistical Analysis. 7th ed. Boston: Allyn and Bacon, 2003
[27]
Pasquier N, Bastide Y, Taouil R, Lakhal L. Discovering frequent closed itemsets for association rules. In: Proceedings of the 7th International Conference on Database Theory. 1999, 398-416
[28]
Gunopulos D, Mannila H, Khardon R, Toivonen H. Data mining, hypergraph transversals, and machine learning (extended abstract). In: Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 1997, 209-216
CrossRef Google scholar
[29]
Bayardo R J. Efficiently mining long patterns from databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. 1998, 85-93
CrossRef Google scholar
[30]
Foreman E K. Survey Sampling Principles. New York: Marcel Dekker publishers, 1991
[31]
Dasgupta A, Jin X, Jewell B, Zhang N, Das G. Unbiased estimation of size and other aggregates over hidden web databases. In: Proceedings of the 2010 International Conference on Management of Data. 2010, 855-866
[32]
Jin R, Glimcher L, Jermaine C, Agrawal G. New sampling-based estimators for OLAP queries. In: Proceedings of the 22nd International Conference on Data Engineering. 2006
[33]
Afrati F N, Lekeas P V, Li C. Adaptive-sampling algorithms for answering aggregation queries on web sites. Data and Knowledge Engineering, 2008, 64(2): 462-490
CrossRef Google scholar
[34]
Joshi S, Jermaine C. Robust stratified sampling plans for low selectivity queries. In: Proceedings of the 24th International Conference on Data Engineering. 2008, 199-208
CrossRef Google scholar
[35]
Wu M, Jermaine C. Guessing the extreme values in a data set: a Bayesian method and its applications. VLDB Journal, 2009, 18(2): 571-597
CrossRef Google scholar

RIGHTS & PERMISSIONS

2014 Higher Education Press and Springer-Verlag Berlin Heidelberg
AI Summary AI Mindmap
PDF(650 KB)

Accesses

Citations

Detail

Sections
Recommended

/