The scoring method is useful when duplicate records are not exact copies of one another but match on a set of fields. An algorithm calculates a match score, and records that score at or above a threshold, generally 80%, are treated as duplicates. The threshold is the minimum score a comparison must reach for a record to be marked as a duplicate. Depending on the fuzzy matching method you choose, the duplicate management system sets a threshold automatically.
When a matching rule is activated, one or more match keys are applied to existing records. The matching rule looks only for duplicates among records with the same match key. If two records don’t share match keys, they aren’t considered duplicates and the matching algorithms aren’t applied to them. This indexing process improves performance and returns a better set of match candidates. Only standard matching rules use the scoring method.
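The match key step described above can be pictured as a grouping pass that runs before any comparison. The following is a minimal sketch, not the platform's actual key format: the `match_key` function, the field names, and the sample records are all hypothetical and chosen only to show that records in different groups are never compared.

```python
from collections import defaultdict
from itertools import combinations


def match_key(record):
    # Hypothetical key: first letter of the last name plus the normalized company name.
    return (record["last_name"][:1].lower(), record["company"].strip().lower())


def candidate_pairs(records):
    """Group records by match key and yield pairs only from within each group."""
    buckets = defaultdict(list)
    for record in records:
        buckets[match_key(record)].append(record)
    for group in buckets.values():
        # Records in different buckets are never compared.
        yield from combinations(group, 2)


records = [
    {"id": 1, "last_name": "Smith", "company": "Acme"},
    {"id": 2, "last_name": "Smyth", "company": "acme "},
    {"id": 3, "last_name": "Jones", "company": "Acme"},
]

pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records)]
print(pairs)  # [(1, 2)]
```

Only records 1 and 2 share a key, so the more expensive matching algorithms would run on one pair instead of three.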
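The scoring method determines how the scores from a field's matching methods are combined into one match score:
- Average: Uses the average match score.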
- Maximum: Uses the highest match score.
- Minimum: Uses the lowest match score.
- Weighted Average: Uses the weight of each matching method to determine the average match score.
The threshold determines the minimum match score a field needs to be considered a match. Each field is given a match score based on how closely it matches the same field in an existing record.
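To make the four scoring methods and the threshold concrete, here is a small sketch. It assumes scores on a 0 to 100 scale; the function names, the sample scores, and the weights are illustrative assumptions, since the source does not say how weights are assigned.

```python
def combine_scores(scores, method, weights=None):
    """Combine the per-method match scores (0-100) for a single field."""
    if method == "average":
        return sum(scores) / len(scores)
    if method == "maximum":
        return max(scores)
    if method == "minimum":
        return min(scores)
    if method == "weighted_average":
        # Each score counts in proportion to the weight of its matching method.
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    raise ValueError(f"unknown scoring method: {method}")


def field_matches(scores, method, threshold, weights=None):
    """A field counts as a match when its combined score reaches the threshold."""
    return combine_scores(scores, method, weights) >= threshold


scores = [90, 70]   # hypothetical scores from two matching methods
weights = [3, 1]    # hypothetical weights, for illustration only

print(combine_scores(scores, "average"))                    # 80.0
print(combine_scores(scores, "maximum"))                    # 90
print(combine_scores(scores, "minimum"))                    # 70
print(combine_scores(scores, "weighted_average", weights))  # 85.0
print(field_matches(scores, "minimum", threshold=80))       # False
```

With a threshold of 80, the same pair of scores passes under Average, Maximum, and Weighted Average but fails under Minimum, which is why the choice of scoring method matters.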
Edit Distance Algorithm
Edit distance measures how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. The algorithm determines the similarity between two strings from the number of deletions, insertions, and character replacements needed for that transformation.
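The description above corresponds to the classic Levenshtein distance, which can be sketched with a short dynamic-programming routine. The `similarity_score` helper is an assumption added for illustration: it scales the distance by the longer string's length to get a 0 to 100 score, and it is not necessarily the formula a duplicate management system uses.

```python
def edit_distance(a, b):
    """Minimum number of deletions, insertions, and replacements to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace char_a with char_b (or keep it)
            ))
        previous = current
    return previous[-1]


def similarity_score(a, b):
    """Assumed normalization: 100 for identical strings, 0 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100 * (longest - edit_distance(a, b)) / longest


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("Smith", "Smyth"))     # 1
print(similarity_score("Smith", "Smyth"))  # 80.0
```

Under this assumed normalization, "Smith" and "Smyth" differ by one replacement in five characters, so they score 80.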
Initials Algorithm
The Initials algorithm determines the similarity of two sets of initials in personal names. For example, the first name John and its initial J match and return a score of 100.