Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deduplication - add isEquivalentIdentity() method #1580

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

missinglink
Copy link
Member

@missinglink missinglink commented Nov 12, 2021

This PR adds a new method to helper/diffPlaces.js called isEquivalentIdentity().

Detecting Equal Identity is trivial, if two records have the same GID they are the same thing...
but there is also a case where record A has an equivalent identity to the parent of record B at the same layer.

An example is something like this, these two records are distinct:

A = { "source": "geonames", "source_id": 1, layer: "region" }
B = { "source": "whosonfirst", "source_id": 999, layer: "region" }

... however with this PR the following two records are considered equivalent:

A = { "source": "geonames", "source_id": 1, layer: "region",
  parent: {
    "region_id": ["999"],
    "region_source": ["whosonfirst"]
  }
}
B = { "source": "whosonfirst", "source_id": 999, layer: "region" }

This is particularly relevant for inter-source deduplication as, for example, a geonames 'locality' will have a WOF 'locality' in its parent hierarchy (from PIP). In that case we can dedupe the WOF record with the GN record containing the WOF record as a parent.

note: this is backwards compatible with indices prior to pelias/schema#459 since an empty or absent *_source will be considered to have the value 'whosonfirst'.

@@ -124,20 +124,69 @@ function isNameDifferent(item1, item2, requestLanguage){
// iterate over all the languages in item2, comparing them to the
// 'default' name of item1 and also against the language requested by the user.
for( let lang in names2 ){
if (!names2.hasOwnProperty(lang)) { continue; } // do not iterate over inherited properties
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these hasOwnProperty checks snuck in to the PR, they are not strictly related but considered good practice

@missinglink
Copy link
Member Author

missinglink commented Nov 12, 2021

this PR replaces #1554 which was much more complicated, this method is both easier to implement/test and also more accurate than string checking the various name fields.

@@ -174,6 +223,7 @@ function isAddressDifferent(item1, item2){
* Optionally provide $requestLanguage (req.clean.lang.iso6393) to improve name deduplication.
*/
function isDifferent(item1, item2, requestLanguage){
if( isEquivalentIdentity( item1, item2 ) ){ return false; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth pointing out here that isEquivalentIdentity is different (no pun intended) than all the other checks in the isDifferent function.

All the others are each sufficient to prove that items are different. isEquivalentIdentity, on the other hand, is sufficient to prove that the items are the same.

I'm honestly surprised it took us this long to have such a check but we might want to divide the functions within isDifferent into two appropriate sections with a comment or something to help us keep track of them.

@orangejulius
Copy link
Member

Cool, this looks pretty easy to read and understand. And it definitely looks like a sound approach.

We might want to add a test with some real world data to our deduplication tests to make sure we handle it, right? Also, do you have any compare links handy that show off queries that should be improved by this PR?

@orangejulius
Copy link
Member

Ah, bad news. The acceptance tests show a bunch of regressions from this PR, looks like we need to restrict the comparison to records at the same layer.

Otherwise, for example, a city called Asia in the continent of Asia can replace the record for the continent:

/v1/search?text=Asia
Screen Shot 2021-11-12 at 10 19 50 AM

@missinglink
Copy link
Member Author

hmm.. it should already be doing that, I'll have to pick this up again next week 🤔

https://github.com/pelias/api/pull/1580/files#diff-19d27b5d3dcbc72dfec62360afe0dbc154cc7ff03a30be725d3a8a6dee3df5afR181-R185

@missinglink
Copy link
Member Author

added via rebase

diff --git a/helper/diffPlaces.js b/helper/diffPlaces.js
index 840f0203..d11ef0ba 100644
--- a/helper/diffPlaces.js
+++ b/helper/diffPlaces.js
@@ -168,6 +168,9 @@ const parentGID = (item, layer) => {
  */
 function isEquivalentIdentity(item1, item2) {

+  // Both items must be on the same layer
+  if (item1.layer !== item2.layer) { return false; }
+
   // Generate a GID value for each item
   const gid1 = GID(item1);
   const gid2 = GID(item2);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants