Linking temporal records
Pei LI, Xin Luna DONG, Andrea MAURINO, Divesh SRIVASTAVA
Many data sets contain temporal records that span a long period of time; each record is associated with a time stamp and describes some aspects of a real-world entity at a particular time (e.g., author information in DBLP). In such cases, we often wish to identify records that describe the same entity over time, so that we can perform longitudinal data analysis. However, existing record linkage techniques ignore temporal information and therefore fall short on temporal data.
This article studies linking temporal records. First, we apply time decay to capture the effect of elapsed time on entity value evolution. Second, instead of comparing each pair of records locally, we propose clustering methods that consider the time order of the records and make global decisions. Experimental results show that our algorithms significantly outperform traditional linkage methods on various temporal data sets.
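The two ideas above can be illustrated with a minimal sketch. It is not the article's algorithm: the exponential decay form, the `half_life` parameter, the equal attribute weights, the `threshold` value, and the greedy one-pass clustering are all assumptions chosen for illustration, and the `Record` fields are hypothetical. The sketch only shows how a value disagreement can count for less as the time gap grows, and how records can be processed in time order rather than compared pairwise in isolation.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical schema: one evolving attribute (affiliation) plus a name.
    name: str
    affiliation: str
    year: int


def disagreement_weight(gap_years: float, half_life: float = 5.0) -> float:
    """Weight given to a value disagreement (assumed exponential decay).

    The weight is 1.0 at a gap of zero and halves every `half_life` years,
    reflecting that an entity's values are more likely to have changed
    after a long time.
    """
    return 0.5 ** (gap_years / half_life)


def decayed_similarity(a: Record, b: Record, half_life: float = 5.0) -> float:
    """Similarity in [0, 1] with a time-decayed penalty for differing affiliations."""
    gap = abs(a.year - b.year)
    name_score = 1.0 if a.name == b.name else 0.0
    if a.affiliation == b.affiliation:
        affiliation_score = 1.0
    else:
        # The penalty for disagreeing fades as the time gap grows.
        affiliation_score = 1.0 - disagreement_weight(gap, half_life)
    # Equal attribute weights are an assumption of this sketch.
    return 0.5 * name_score + 0.5 * affiliation_score


def link_in_time_order(records, threshold: float = 0.7):
    """Greedy pass: visit records by time stamp and attach each one to the
    cluster whose latest record is most similar, or start a new cluster."""
    clusters = []
    for rec in sorted(records, key=lambda r: r.year):
        best, best_score = None, threshold
        for cluster in clusters:
            score = decayed_similarity(cluster[-1], rec)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([rec])
        else:
            best.append(rec)
    return clusters


if __name__ == "__main__":
    data = [
        Record("A. Smith", "Univ X", 2000),
        Record("A. Smith", "Lab Y", 2001),
        Record("A. Smith", "Lab Y", 2010),
    ]
    for cluster in link_in_time_order(data):
        print([(r.affiliation, r.year) for r in cluster])
```

Under these assumed settings, two records with the same name but different affiliations score lower when they are one year apart than when they are ten years apart, because a change of affiliation is more plausible over a long gap.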
Keywords: temporal data, record linkage, data integration