CLUSTER BASED DUPLICATE DETECTION

A. Venkatesh Kumar; S. Vengataasalam

doi:10.3844/jcssp.2013.1514.1518

Research Article Open Access

A. Venkatesh Kumar¹ and S. Vengataasalam¹

¹ , India

Abstract

We propose a clustering technique for entropy-based text dissimilarity calculation in a de-duplication system. To improve the quality of grouping, we propose a Multi-Level Group Detection (MLGD) algorithm that uses the Alternative Decision Tree (ADT) technique to produce accurate groups of closely related objects. We propose two new algorithms. The first is Multi-Level Group Detection (MLGD) formation using an Alternative Decision Tree (ADTree), which splits a batch of records into self-sized clusters to reduce the volume of data for text comparisons. The second calculates the dissimilarity percentage using entropy and Information Gain (IG). We show experimentally that the proposed technique achieves higher average accuracy than existing traditional de-duplication systems. Further, our technique does not require any manual tuning for cluster formation or dissimilarity calculation for any kind of business data. In this study, we have presented an efficient new method for cluster formation using the ADTree algorithm for duplicate detection. The new method offers a more accurate dissimilarity measure for each cluster's data without manual intervention at the time of duplicate detection.
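
The abstract names entropy and Information Gain (IG) as the basis of the dissimilarity measure but does not state the formulas. As a minimal sketch, assuming the standard Shannon definitions (which may differ from the authors' exact formulation), the two quantities can be computed as follows. The function names `entropy` and `information_gain` and the sample labels are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels.

    Assumption: standard definition H = sum(p * log2(1 / p)); the paper's
    own formulation is not given in the abstract.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def information_gain(labels, groups):
    """Entropy reduction obtained by partitioning `labels` into `groups`.

    `groups` is a list of label lists that together contain every element
    of `labels` exactly once (for example, the clusters of a split).
    """
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted


# Hypothetical example: four records labelled as duplicate or unique.
labels = ["dup", "dup", "uniq", "uniq"]
perfect_split = [["dup", "dup"], ["uniq", "uniq"]]
mixed_split = [["dup", "uniq"], ["dup", "uniq"]]

gain_perfect = information_gain(labels, perfect_split)
gain_mixed = information_gain(labels, mixed_split)
```

In this toy case a split that separates the two labels completely recovers all of the original entropy, while a split that leaves each group evenly mixed gains nothing. A higher gain therefore indicates a grouping whose members are more homogeneous.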

Journal of Computer Science

Volume 9 No. 11, 2013, 1514-1518

DOI: https://doi.org/10.3844/jcssp.2013.1514.1518

Submitted On: 9 September 2013 Published On: 28 September 2013

How to Cite: Kumar, A. V. & Vengataasalam, S. (2013). CLUSTER BASED DUPLICATE DETECTION. Journal of Computer Science, 9(11), 1514-1518. https://doi.org/10.3844/jcssp.2013.1514.1518

Copyright: © 2013 A. Venkatesh Kumar and S. Vengataasalam. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Clustering Algorithm
Alternative Decision Tree Algorithm
Duplicate Detection
Efficient Method
Manual Intervention
Cluster Data
Similarity Measure
Clustering Formation