In this blog, we will explore the most important matching methods and when to use them, followed by some best practices for combining matching methods in a matching rule or scenario.

Since all matching methods can be divided into two main groups, exact and fuzzy matching, that is where we’ll start.

Tip: note that different vendors use different names for the same thing (matching method and matching algorithm are the same).

For an exact match method to evaluate two fields as duplicate, they have to match…exactly. Some solutions offer variations on exact match, such as ‘Exact (Random Order)’, which means the individual words have to match exactly, but not necessarily in the same order.

Fuzzy matching will return a match when two fields are alike (similar). It’s like looking through almost closed eyelids: your vision becomes fuzzy and it’s hard to distinguish small differences between words.

Similarity scoring often involves a combination of different algorithms. One of the most used algorithms is based on the concept of ‘edit distance’. This is sometimes also called ‘Levenshtein distance’ after the Soviet mathematician Vladimir Levenshtein, who did extensive research on the subject.

Edit distance is the number of single character edits (insert, delete or change) needed to change one string into another. ‘Jon Doe’ and ‘John Doe’ have an edit distance of 1. In this case only the insertion of the letter ‘h’ is needed to make the two strings equal.

The goal of matching is to return similar results (with the same meaning). Purely using edit distance for this goal is not ideal, especially for shorter strings (names, words). Shorter strings often have entirely different meanings with one or two edits. The longer the string, the less the impact of an edit on the meaning.

To combat this problem, most deduplication solutions use a matching score based on multiple fields and a threshold to determine duplicate records. The matching score is generally calculated by dividing the found edit distance by the maximum edit distance of the two values and subtracting the result from 1. The process to calculate the maximum edit distance is too complex to show here. However, it is based on the length of the longest string. As a result, the score is much higher for longer strings with the same edit distance.

Setting a high threshold when using fuzzy matching makes sure you don’t get too many false positives.

Note: A different letter in the last name leads to a lower score.

Special matching methods

Almost all deduplication solutions offer more specialized matching methods. Most of them are based on either exact or fuzzy matching and include some additional logic.

When matching telephone numbers, you will get much better results if they are in the same format. A specialized phone number matching method will ignore spaces and dashes and standardize prefixes for a valid comparison. A matching method specific to company names may ignore legal entities (such as Inc., Ltd., LLC, etc.).

My advice is to always apply a special matching method when it is available for a field you want to include in your matching. These matching methods will give you fewer false positives when looking for duplicates.

Based on our years of experience building Duplicate Check and consulting clients, we share some best practices with you.

A scenario consists of a number of fields with corresponding matching methods and aims to find duplicates for a specific Object. You include fields that are (almost) unique for a single person, such as first name, last name, phone number, email address, birth date, social security number and so on.

[Figure: example of a scenario to find duplicates in the Lead object]

In a scenario, you combine different matching methods on different fields to evaluate if records are duplicate. In a lot of cases, you are comparing an empty field with a field containing a value. You can treat an empty field in three different ways: ignore it, score it as 0% (no match), or score it as 50%.

How to treat empty fields depends on the fill rate of the field you have included in your scenario. My advice would be to use a 0% match (no match) if you have a high fill rate. If your fill rate is low, go for ‘ignore’ or ‘score 50%’. If you go for a combined scenario approach, as outlined in the next paragraph, definitely go for ‘score 0%’. Since a scenario typically relies on 3 or 4 fields being combined for an evaluation, ignoring the field will lead to a scenario using 2 or 3 fields. This is too few and will lead to many false positives.

Do not build scenarios containing lots of fields! Go for an approach of using 3 or 4 fields for a scenario and using multiple scenarios for the same Object to find all duplicates instead.

First Name AND Last Name AND Email Address >90%
First Name AND Last Name AND Phone Number >90%
First Name AND Last Name AND Company Name >90%
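
To make the scoring and the multi-scenario approach from this post concrete, here is a minimal Python sketch. It is not how Duplicate Check or any other product works internally. It assumes that the maximum edit distance equals the length of the longer string (the post only says it is based on that length), that a scenario score is the plain average of its field scores, and that empty fields are treated as ‘score 0%’. The field names, `SCENARIOS` and `THRESHOLD` are hypothetical names chosen for illustration.

```python
from typing import Optional

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single character inserts, deletes or changes."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # change (or keep)
        previous = current
    return previous[-1]

def fuzzy_score(a: str, b: str) -> float:
    """1 minus (edit distance / maximum edit distance).

    Assumption: the maximum edit distance is the length of the longer string.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest

# Assumption: empty fields use the 'score 0%' treatment.
EMPTY_SCORE = 0.0

def field_score(a: Optional[str], b: Optional[str]) -> float:
    """Fuzzy score for one field, with empty values scored as EMPTY_SCORE."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    if not a or not b:
        return EMPTY_SCORE
    return fuzzy_score(a, b)

# Hypothetical field names mirroring the three scenarios listed in the post.
SCENARIOS = [
    ("first_name", "last_name", "email"),
    ("first_name", "last_name", "phone"),
    ("first_name", "last_name", "company"),
]
THRESHOLD = 0.90

def scenario_score(rec_a: dict, rec_b: dict, fields: tuple) -> float:
    """Assumption: a scenario score is the average of its field scores."""
    scores = [field_score(rec_a.get(f), rec_b.get(f)) for f in fields]
    return sum(scores) / len(scores)

def is_duplicate(rec_a: dict, rec_b: dict) -> bool:
    """Two records are duplicates if any single scenario exceeds the threshold."""
    return any(scenario_score(rec_a, rec_b, fields) > THRESHOLD
               for fields in SCENARIOS)

lead_a = {"first_name": "John", "last_name": "Doe",
          "email": "john.doe@example.com", "phone": "", "company": "Acme"}
lead_b = {"first_name": "John", "last_name": "Doe",
          "email": "john.doe@example.com", "phone": "555-0100", "company": ""}
lead_c = {"first_name": "Jane", "last_name": "Doe",
          "email": "jane@other.example", "phone": "", "company": "Other Corp"}

print(is_duplicate(lead_a, lead_b), is_duplicate(lead_a, lead_c))
```

In this sketch, `lead_a` and `lead_b` are caught by the email scenario even though their phone and company fields cannot be compared, which is the reason for running several small scenarios instead of one large one.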