Disambiguation removal logic truncates remaining search text #162

niravmehta · 2019-10-03T06:06:04Z

Tokenization has a regex that removes disambiguation markers. This can be helpful, but it currently truncates all remaining text after the first occurrence of any disambiguation character.

So:
"Portland (Oregon) USA" becomes "portland"
"Borivali (West), Mumbai, India" becomes "borivali"
etc

Should it remove only the disambiguation part and leave the rest of the text as it is? (I understand that's impossible where there is no "closing" disambiguation marker, e.g. just a simple "-".)
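A rough sketch of that first option, for illustration only. The function name `stripBracketedDisambiguation` is hypothetical and not part of Placeholder; it assumes only paired "(...)" and "[...]" markers are handled, and it does no lowercasing or tokenizing:

```javascript
// Hypothetical helper (not Placeholder's API): remove only text enclosed by
// paired markers, and keep everything before and after it.
function stripBracketedDisambiguation(input) {
  return input
    .replace(/\([^()]*\)|\[[^\[\]]*\]/g, ' ') // drop "(...)" and "[...]" spans
    .replace(/\s+/g, ' ')                      // collapse the gaps left behind
    .replace(/\s+,/g, ',')                     // tidy "Borivali , Mumbai"
    .trim();
}

console.log(stripBracketedDisambiguation('Portland (Oregon) USA'));
console.log(stripBracketedDisambiguation('Borivali (West), Mumbai, India'));
```

An unpaired marker such as a lone "-" is left untouched here, which is exactly the case that has no obvious answer.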

Or is it just easier to remove the marker characters only and leave the disambiguation text as it is?

For example, change the regex to
input.replace(/[-֊־‐‑﹣\(\)\[\]]/g, ' ');

With this:
"Portland (Oregon) USA" becomes "portland oregon usa"
"Borivali (West), Mumbai, India" becomes "borivali west mumbai india"

Referring to this:
d7de4c9#diff-b1c9f1b1a4d867ea6fd37744bd1b38e5

missinglink · 2019-10-03T10:55:02Z

I believe this is correct as-is.
The intention is to remove all parts of the text which aren't the 'subject'.

So in the case of "Borivali (West), Mumbai, India" we are only looking for 'Borivali'. The additional tokens which help localize it to Mumbai, India shouldn't be included in the index.

The associations to Mumbai & India should be made via their hierarchical links instead, so that we understand the parent-child relationship of these tokens.

Can you provide an example of a query which is currently failing due to this?

missinglink · 2019-10-03T10:59:00Z

The way we have it currently lets us show a clear hierarchy of the tokens:

Borivali neighbourhood 85933015
└ Mumbai locality 102030609
   └ Mumbai City MU county 890503073
      └ Maharashtra MH region 85672171
         └ India IND country 85632469
            └ Asia continent 102191569

niravmehta · 2019-10-04T05:01:07Z

Because of the truncation, searching for "Portland (Oregon) USA" also yields a match from Jamaica.

And searching for "Borivali (East), MH, India" yields Borivali West as the first match.

"3 Store, 311-318 High Holborn, London, WC1V 7BN, UK" returns no matches. On a modified instance where I removed the disambiguation regex, it does return a result.

Similarly, "1313 1/2 Railroad Ave Bellingham WA 98225-4729" returns no matches.

"St. Judes & St. Pauls C of E (Va) Primary School, 10 Kingsbury Road, London, N1 4AZ" returns a wrong result.

"〒100-8994, 東京都中央区八重洲一丁目 5番3号東京中央郵便局, Japan" returns no result.

There may be more examples. I took some of these from the Falsehoods post.

The main problem I see is that truncating at a disambiguation character removes all trailing address information - the lineage - which is crucial in determining the location.

niravmehta · 2019-10-04T05:04:47Z

BTW, what I did was replace these characters with a space.

text = text.replace(/[-֊־‐‑﹣\(\)\[\]]/g, ' ').trim();

My guess is that giving more tokens to Placeholder will allow it to perform a better match, and it seems to be working well so far.
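To make the difference concrete, here is a small side-by-side sketch. `truncateAtFirstMarker` is only an assumed stand-in for the current behaviour (the actual regex in the referenced commit is not reproduced here), and both function names are made up for this example. The marker characters are written as Unicode escapes and are assumed to match the ones in the regex above:

```javascript
// Same marker set as the regex above, written with escapes (assumed mapping).
const FIRST_MARKER = /[-\u058A\u05BE\u2010\u2011\uFE63()\[\]]/;
const ALL_MARKERS = /[-\u058A\u05BE\u2010\u2011\uFE63()\[\]]/g;

// Assumed stand-in for the current behaviour: cut the text at the first marker.
function truncateAtFirstMarker(text) {
  const cut = text.search(FIRST_MARKER);
  return (cut === -1 ? text : text.slice(0, cut)).trim();
}

// The change described above: replace each marker with a space, keep the rest.
function replaceMarkersWithSpace(text) {
  return text.replace(ALL_MARKERS, ' ').replace(/\s+/g, ' ').trim();
}

const samples = [
  'Portland (Oregon) USA',
  '1313 1/2 Railroad Ave Bellingham WA 98225-4729',
];
for (const s of samples) {
  console.log(truncateAtFirstMarker(s), '|', replaceMarkersWithSpace(s));
}
```

Neither function lowercases or tokenizes; that is assumed to happen elsewhere in the analysis.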

missinglink · 2019-10-04T17:23:50Z

Oh I see, we were talking about slightly different topics.

The original intention of the regex was to fix erroneous data at import-time.

It seems we are using the same analysis at query-time that we use at index-time, so maybe you're right: we might consider making them separate analyzers so they can have different functions.
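A minimal sketch of what separate analyzers could look like. All names here (`analyzers`, `index`, `query`, `normalize`) are hypothetical and do not reflect how Placeholder's analysis code is actually organised. It assumes index-time analysis keeps only the 'subject' before the first marker, while query-time analysis keeps every token:

```javascript
// Hypothetical structure, not Placeholder's real module layout.
const DISAMBIGUATION_FIRST = /[-\u058A\u05BE\u2010\u2011\uFE63()\[\]]/;
const DISAMBIGUATION_ALL = /[-\u058A\u05BE\u2010\u2011\uFE63()\[\]]/g;

// Shared step: lowercase, and treat commas and whitespace as separators.
function normalize(text) {
  return text.toLowerCase().replace(/[,\s]+/g, ' ').trim();
}

const analyzers = {
  // Index-time: clean up imported names by keeping only the subject.
  index(text) {
    const cut = text.search(DISAMBIGUATION_FIRST);
    return normalize(cut === -1 ? text : text.slice(0, cut));
  },
  // Query-time: keep all tokens so the lineage can still be matched.
  query(text) {
    return normalize(text.replace(DISAMBIGUATION_ALL, ' '));
  },
};

console.log(analyzers.index('Borivali (West), Mumbai, India'));
console.log(analyzers.query('Borivali (West), Mumbai, India'));
```

Keeping the two functions side by side lets the import-time fix stay as it is while queries retain their trailing tokens.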

Thanks for the examples, they are certainly helpful, although I don't expect us to be able to handle all the edge cases from that Falsehoods post because this library doesn't have any awareness of addresses.

niravmehta · 2019-10-05T05:08:19Z

Awesome.

And sure, I wouldn't expect Placeholder to handle different address oddities. Placeholder should stay focused on "last line parsing".
