Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disambiguation removal logic truncates remaining search text #162

Open
niravmehta opened this issue Oct 3, 2019 · 6 comments
Open

Disambiguation removal logic truncates remaining search text #162

niravmehta opened this issue Oct 3, 2019 · 6 comments

Comments

@niravmehta
Copy link

niravmehta commented Oct 3, 2019

Tokenization has a regex that removes disambiguation markers. This can be helpful, but it's currently truncating remaining text after the first occurrence of any disambiguation character.

So:
"Portland (Oregon) USA" becomes "portland"
"Borivali (West), Mumbai, India" becomes "borivali"
etc

Should it just remove the disambiguation part and leave the rest of the text as it is? (I understand that's going to be impossible where we don't have a "closing" disambiguation marker - e.g. just a simple "-")

Or is it just easier to remove the marker characters only and leave the disambiguation text as it is?

For example, change the regex to
input.replace(/[-֊־‐‑﹣\(\)\[\]]/g, ' ');

With this:
"Portland (Oregon) USA" becomes "portland oregon usa"
"Borivali (West), Mumbai, India" becomes "borivali west india"

Referring to this:
d7de4c9#diff-b1c9f1b1a4d867ea6fd37744bd1b38e5

@missinglink
Copy link
Member

I believe this is correct as-is.
The intention is to remove all parts of the text which aren't the 'subject'.

So in the case of "Borivali (West), Mumbai, India" we are only looking for 'Borivali', the additional tokens which help localize it to Mumbai India shouldn't be included in the index.

The associations to Mumbai & India should be made via their hierarchical links instead, so that we understand the parent-child relationship of these tokens.

Can you provide an example of a query which is currently failing due to this?

@missinglink
Copy link
Member

How we have it currently allows us to show a clear hierarchy of the tokens:

Borivali neighbourhood 85933015
└ Mumbai locality 102030609
   └ Mumbai City MU county 890503073
      └ Maharashtra MH region 85672171
         └ India IND country 85632469
            └ Asia continent 102191569

@niravmehta
Copy link
Author

Because of the truncation, searching for "Portland (Oregon) USA" yields match from Jamaica as well.

https___placeholder_synergy_io_demo_

And searching for "Borivali (East), MH, India" yields Borivali West as the first match.

https___placeholder_synergy_io_demo_

"3 Store, 311-318 High Holborn, London, WC1V 7BN, UK" returns no matches. Instead of returning the following (screenshot taken from a modified instance where I removed the disambiguation regex)

localhost_6100_demo__eng

Similarly, "1313 1/2 Railroad Ave Bellingham WA 98225-4729" returns no matches.

"St. Judes & St. Pauls C of E (Va) Primary School, 10 Kingsbury Road, London, N1 4AZ" returns a wrong result.

"〒100-8994, 東京都 中央区 八重洲一丁目 5番3号 東京中央郵便局, Japan" returns no result.

There may be some more examples. I took some here from Falsehoods.

The main problem I see is that truncating at a disambiguation character removes all trailing address information - the lineage - which is crucial in determining the location.

@niravmehta
Copy link
Author

BTW, what I did was replace these characters with a space.

text = text.replace(/[-֊־‐‑﹣\(\)\[\]]/g, ' ').trim();

My guess is that giving more tokens to Placeholder, will allow it to perform a better match. And it seems to be working well with it.

@missinglink
Copy link
Member

missinglink commented Oct 4, 2019

Oh I see we were talking about slightly different topics.

The original intention of the regex was to fix erroneous data at import-time.

It seems we are using the same analysis at query-time that we're using at index-time and so maybe you're right, we might consider making them separate analyzers so they can have different functions.

Thanks for the examples, they are certainly helpful, although I don't expect us to be able to handle all the edge cases from that Falsehoods post because this library doesn't have any awareness of addresses.

@niravmehta
Copy link
Author

Awesome.

And sure, I wouldn't expect Placeholder to handle different address oddities. Placeholder should stay focused on "last line parsing".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants