Dixendris d-doop - Data Deduplication System

Most data sources contain 3% to 5% duplication. Duplicate addresses can harm customer relationships and create unnecessary expenses.

Finding and removing duplicative information in large data sets requires a suitable software solution.

Dixendris provides d-doop, a high-speed, precise, effective and very affordable duplicate removal software solution. Our software is well suited to identifying duplication in all types of data (addresses, material lists, etc.), and also works for very large data sets.

Please contact us for a free consultation; we are happy to help you.

Contact

Dixendris AG
Binningerstrasse 15
CH-4051 Basel
Switzerland

Phone +41 61 272 25 15
www.dixendris.com
info@dixendris.com

Dixendris d-doop: the Data Deduplication System

The powerful Dixendris d-doop solution detects duplicate records in a wide range of data sets and provides you with the clean data and the duplicates as separate outputs.

» Overview

As a rule of thumb, every source contains between 3% and 5% duplication. Such duplication can be costly. For example, in a marketing campaign, duplicate data means extra contact costs (mailing or phone). For the consumer, repeated contact leads to frustration and annoyance.

Eliminating such duplication by hand costs staff time and money, and the effort multiplies in campaigns with multiple deliveries. In large data sources, such as a customer database, the cumbersome task of duplicate elimination requires a software solution.

The Dixendris d-doop solution identifies duplicates in a variety of sources. The powerful yet affordable d-doop system allows you to deduplicate, for example, 1,000,000 records in less than forty-five minutes on a standard PC! (see Run-time table)

» Example of Application

The Dixendris d-doop system is easy to configure and customize to your needs. You can either use it stand-alone or incorporate it into your existing workflow.

For example, you can combine newly acquired address data with your current database, eliminating duplication and producing separate reports.

» Deduplication Process

  • The deduplication process is designed to use the least memory possible whilst maximizing execution speed.
  • You can start with any number of different data sources (delimiter-separated files, extracted data, direct database connection via JDBC, etc.).
  • You can pull data together from different sources into one pool.
  • You enter similarity values that tell the system 1) which records to identify as duplicates, and 2) which to eliminate (a similarity value of 1.0 means a 100% match, and 0.75 a 75% match). The two values can differ if you wish.
  • The FAME system (Fingerprint Accelerated Matching Engine) automatically assigns a "digital fingerprint," which the system uses for internal comparison.
  • The data is then divided into rough clusters and compared. While not mandatory on smaller data sets, clustering dramatically reduces execution time, and is required on extremely large data sets where memory space is limited.
  • All potential duplicates are then identified.
  • The output shows:
    • Clean data - deduplicated.
    • The log of duplicate data, which contains a similarity value (a value of 1.0 implies a 100% match, and 0.75 a 75% match).
    • Duplicate report without the similarity value.
    • All this in one or separate files (user configurable).
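
The steps above can be illustrated with a minimal sketch. It is not the d-doop implementation and does not reproduce FAME: the zip-code cluster key, the use of Python's difflib for scoring, and the names report_at and remove_at are all assumptions made for illustration. It shows only how two separate similarity values can drive what is logged and what is eliminated, with comparisons restricted to records in the same rough cluster.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(record):
    """Join all fields into one lower-case string for comparison."""
    return " ".join(str(field).strip().lower() for field in record)


def cluster_key(record):
    """Hypothetical rough cluster: the zip code, assumed to be the last field."""
    return str(record[-1]).strip()


def similarity(a, b):
    """Similarity value between 0.0 and 1.0 (1.0 means identical strings)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def deduplicate(records, report_at=0.75, remove_at=0.90):
    """Return (clean, log). Records scoring >= report_at are logged,
    records scoring >= remove_at are also dropped from the clean output."""
    clusters = defaultdict(list)
    for record in records:
        clusters[cluster_key(record)].append(record)

    clean, log = [], []
    for members in clusters.values():
        kept = []
        for record in members:
            # Find the most similar record already kept in this cluster.
            best, best_score = None, 0.0
            for other in kept:
                score = similarity(record, other)
                if score > best_score:
                    best, best_score = other, score
            if best_score >= report_at:
                log.append((record, best, round(best_score, 2)))
            if best_score >= remove_at:
                continue  # eliminated as a duplicate
            kept.append(record)
        clean.extend(kept)
    return clean, log


records = [
    ("Anna", "Meier", "Hauptstrasse", "12", "Basel", "4051"),
    ("Anna", "Meier", "Hauptstrasse", "12", "Basel", "4051"),
    ("Ana", "Meier", "Hauptstrase", "12", "Basel", "4051"),
    ("Peter", "Huber", "Bahnhofplatz", "3", "Bern", "3011"),
]
clean, log = deduplicate(records)
for record, match, score in log:
    print(score, record, "~", match)
```

With report_at lower than remove_at, records that score between the two values are listed in the log but stay in the clean output, which leaves borderline cases for manual review.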

Dixendris d-doop runtime for deduplication:

The following table shows the approximate runtime for deduplicating address records. The following fields were used as the basis for deduplication, with a standard configuration: Salutation, First Name, Last Name, Street, Number, City and Zip.

Number of Addresses    Runtime
50,000                 1 minute
100,000                2 minutes
150,000                5 minutes
300,000                15 minutes
1,000,000              40 minutes
2,000,000              2.5 hours
3,000,000              4.5 hours

Reference system: standard laptop with Intel T2400, 1.83 GHz, 2.00 GB RAM.

The Dixendris d-doop system is not only efficient, but also very effective at identifying both exact and similar duplicates.

» Inside the Machine

There are two types of duplicates: fully identical and similar. The second type can result from spelling and transposition errors, and from acquiring and introducing third-party addresses. Dixendris d-doop uses a fuzzy search to find both fully identical and similar entries.

Finding similar entries requires a complex search and, in contrast to finding identical records, is very challenging for a computer system. The similarity measure must be as meaningful as possible so that each pair of records can be assigned a similarity percentage value.
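
To make the idea of a similarity value concrete, here is a small sketch using the ratio from Python's standard difflib module. This is a stand-in chosen for illustration; the measure that d-doop actually uses is not described here, and the function name similarity is an assumption.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a value between 0.0 (nothing in common) and 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Identical entries (ignoring case) score 1.0.
print(similarity("Basel", "basel"))

# A transposition error scores below 1.0 but well above an unrelated name.
print(similarity("Meier", "Meire"))
print(similarity("Meier", "Huber"))
```

A fuzzy search then treats any pair whose value reaches the configured threshold as a potential duplicate.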

When using a fuzzy search to deduplicate similar entries, search times grow very quickly with the number of records: comparing every record with every other record means the number of comparisons grows quadratically. Dixendris d-doop uses clever algorithms to optimize search speed, thereby efficiently processing your records within a useful timeframe.

While there are many systems capable of deduplicating a few thousand records within a few minutes, such systems require tens of hours or fail on data sets of over 300,000 records.

Dixendris d-doop is a high-performance deduplication system and, thanks to its unique FAME (Fingerprint Accelerated Matching Engine) technology, can deduplicate large data sets within relatively short periods of time (85,000,000 comparisons per second).
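
A rough calculation shows why both raw comparison speed and clustering matter. Comparing every record with every other record takes n(n-1)/2 comparisons. The sketch below converts that count into seconds at the quoted rate; the evenly sized clusters are an assumption for illustration only, since real data is rarely distributed so evenly, and the function names are hypothetical.

```python
RATE = 85_000_000  # comparisons per second, as quoted above


def all_pairs(n: int) -> int:
    """Number of comparisons when every record is compared with every other."""
    return n * (n - 1) // 2


def clustered_pairs(cluster_sizes) -> int:
    """Number of comparisons when records are only compared within a cluster."""
    return sum(all_pairs(size) for size in cluster_sizes)


def seconds(comparisons: int, rate: int = RATE) -> float:
    """Time needed for the given number of comparisons at the given rate."""
    return comparisons / rate


n = 1_000_000
naive = all_pairs(n)                      # 499,999,500,000 comparisons
clustered = clustered_pairs([1000] * 1000)  # assumed: 1,000 clusters of 1,000 records

print(f"all pairs: {naive:,} comparisons, {seconds(naive) / 60:.0f} minutes")
print(f"clustered: {clustered:,} comparisons, {seconds(clustered):.1f} seconds")
```

Under these assumptions, comparison time alone for 1,000,000 records drops from roughly 98 minutes to a few seconds once comparisons are limited to clusters. The figure ignores loading, fingerprinting and output, so it is not a prediction of total runtime.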

Thus it is truly possible to deduplicate large data sets within a reasonable and useful amount of time.

» Contact us today for a customized solution!

Dixendris AG
Binningerstrasse 15
CH-4051 Basel
Switzerland

Phone +41 61 272 25 15
www.dixendris.com
info@dixendris.com