Finding urls having semantic similarity #10028

imhans33 · 2022-01-11T14:02:58Z


I am working on a project to find semantically similar URLs. For example:

Relevant URLs

https://www.laphil.com/events/performances/300/2018-10-14/la-fest-la-santa-cecilia
https://alabamasymphony.org/event/bohemian-serenade
https://mobilesymphony.org/event/beethoven-blue-jeans
https://www.flagstaffsymphony.org/event/masterworks-iv-amrhein-copland-bach-and-beethoven/
https://www.tucsonsymphony.org/event/duke-ellington-harlem/2022-02-20/
https://tickets.coloradosymphony.org/5176
https://my.bsomusic.org/overview/16895

Irrelevant URLs

https://mobilesymphony.org/online-concert-viewing
https://www.laphil.com/events/performances?Venue=Walt+Disney+Concert+Hall&Season=null
https://www.flagstaffsymphony.org/event/
https://www.tucsonsymphony.org/events/category/masterworks/
https://tickets.coloradosymphony.org/events
https://www.bsomusic.org/education-community/programs-for-schools/

These are just a few examples. Relevant URLs are pages with details about a single event; all other pages, such as "about us", "news", and even pages on the same site that list multiple events, are irrelevant. The model must predict whether a new URL that is not in the corpus is relevant. I have around 15,000 relevant URLs from different domains. I used them to build a corpus and checked spaCy similarity against it to find relevant URLs, but the results are not accurate. Is semantic similarity the best approach? I also thought of treating this as a binary classification problem, but I am not sure whether that would work. If I do treat it as binary classification, how large a dataset would be needed just for a head start (right now I have 15,000 relevant URLs in hand)?

Answered by polm


polm · 2022-01-12T10:28:51Z


Are you trying to find if literally the URLs alone are similar, or do you mean the contents of the URLs?

If you want to decide if two things are similar just by looking at the URLs, that's going to be impossible pretty often I think. Can a human even do that? Like, what can you do with this:

https://my.bsomusic.org/overview/16895
https://tickets.coloradosymphony.org/5176

I can tell you they're about music from the domain, but not more than that.

I can see how you can get real words out of URLs, but even if you preprocess things the criteria you outlined seem pretty unclear. Also note that spaCy is mainly designed with complete sentences or longer documents in mind, and while it can deal with other kinds of text, bags of keywords might be better handled with something else, though I'm not sure what.
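To illustrate what "getting real words out of URLs" could look like, here is a minimal sketch using only the Python standard library. The helper name `url_to_words` and the letters-only rule are assumptions made for illustration; nobody in the thread describes this exact method.

```python
import re
from typing import List
from urllib.parse import urlparse


def url_to_words(url: str) -> List[str]:
    """Return lowercase alphabetic tokens from a URL's path (illustrative helper)."""
    path = urlparse(url).path
    # Keep runs of letters only; numeric IDs and dates are dropped.
    return re.findall(r"[a-z]+", path.lower())


print(url_to_words("https://mobilesymphony.org/event/beethoven-blue-jeans"))
# ['event', 'beethoven', 'blue', 'jeans']
print(url_to_words("https://tickets.coloradosymphony.org/5176"))
# []
```

The second call shows the limitation raised above: an ID-only URL yields no words at all, so there is nothing left to compare.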

You might be able to train a classifier to recognize URLs of "about" pages or something with some accuracy, though I'm not sure it'd work better than a simple heuristic like looking for "about" in the URL.
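A simple heuristic of the kind mentioned above might look like the following sketch. The keyword set `NON_EVENT_SEGMENTS` and the function name `looks_irrelevant` are hypothetical and would need tuning for the sites actually being crawled.

```python
from urllib.parse import urlparse

# Hypothetical list of path segments that suggest a non-event page.
NON_EVENT_SEGMENTS = {"about", "about-us", "news", "contact", "category"}


def looks_irrelevant(url: str) -> bool:
    """Return True if any path segment matches a known non-event keyword."""
    segments = [s for s in urlparse(url).path.lower().split("/") if s]
    return any(s in NON_EVENT_SEGMENTS for s in segments)


print(looks_irrelevant("https://www.tucsonsymphony.org/events/category/masterworks/"))  # True
print(looks_irrelevant("https://alabamasymphony.org/event/bohemian-serenade"))  # False
print(looks_irrelevant("https://www.flagstaffsymphony.org/event/"))  # False
```

The last URL is one of the irrelevant examples from the question, yet the heuristic lets it through, which shows how far keyword matching on URLs alone can go.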

On the other hand, if you want to use the contents of the URLs, that's a more straightforward problem, though the hard part is probably the content extraction.

4 replies

imhans33 Jan 12, 2022
Author

I am looking at the contents of the page (the content for a single event). Initially I tried creating a small corpus from relevant webpage data. On comparison, I found that the similarity scores against the corpus for relevant and irrelevant pages were very close, and I couldn't draw a conclusion from such close values (since both the relevant and irrelevant pages are about music, they naturally show a lot of similarity). That's why I tried to find similarity from the URLs instead. Yes, for the two URLs you pointed out, even humans can't categorize them. So I thought of crawling all URLs from each domain and labeling them as valid or invalid, so that I could treat this as a classification problem. That is the path my thinking took.

polm Jan 13, 2022

I think working with URLs will just make this harder, not easier.

I would go back to working with content, and in spaCy the easiest way to set this up is to model it as a classification problem with labels of "relevant" or "not relevant". You might get better performance if you refine "not relevant" into more concrete categories like "about pages" etc.
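As a rough sketch of the data side of that setup, page text can be paired with a `cats` dictionary, which is the annotation format spaCy's text classifier reads. The label names, the helper `to_textcat_example`, and the page texts below are placeholders invented for illustration, not data from the thread.

```python
from typing import Dict, List, Tuple

# Hypothetical label names; "not_relevant" could later be split into
# finer categories such as "about_page" or "event_listing".
LABELS = ("relevant", "not_relevant")

Annotated = Tuple[str, Dict[str, Dict[str, float]]]


def to_textcat_example(text: str, label: str) -> Annotated:
    """Pair page text with a {"cats": {...}} annotation dict."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    cats = {name: 1.0 if name == label else 0.0 for name in LABELS}
    return text, {"cats": cats}


# Made-up page texts standing in for extracted page content.
pages: List[Tuple[str, str]] = [
    ("An evening of orchestral music. Saturday at 7:30 pm. Buy tickets.", "relevant"),
    ("Browse all upcoming concerts by season, venue, or category.", "not_relevant"),
]

train_data = [to_textcat_example(text, label) for text, label in pages]
for text, annotations in train_data:
    print(annotations["cats"], text[:40])

# With spaCy installed, each pair could be converted for training with
# Example.from_dict(nlp.make_doc(text), annotations) from spacy.training.
```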

The best way to approach this problem is probably to learn a model that predicts whether two things are related or not, but spaCy doesn't have a component for that at the moment.

imhans33 Jan 17, 2022
Author

Yes. In that case, working with whole web documents can also bring in a lot of unknown data when creating the corpus. So instead of working with the whole page, I am thinking of building a list of keywords from the webpage data so that the document length is reduced, and then treating it as a classification problem. Also, date and time are important information on my pages. So I have a question (I am unsure if this is the place to ask): how can dates and times be kept as valid information rather than removed during preprocessing (removing numbers, special characters, etc.)?

polm Jan 18, 2022

For spaCy you generally don't have to do any preprocessing as long as you have something like complete sentences, so you don't have to worry about that.

On the other hand, I'm not sure how dates will help with classifying pages. I guess the model can learn that "2022" might mean it's an upcoming event or something, but then you'll have to update that for next year... In any case doing classification of page content does seem like it should be easy to try even if it might not work well, so I'd encourage you to give that a shot first.
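If a specific year such as "2022" going stale does turn out to be a problem, one possible workaround is to replace four-digit years with a fixed placeholder, so the model sees that a year is present without memorizing which one. This is not something suggested in the thread; the function name `mask_years` and the placeholder token are assumptions, and the sketch below is only one way to try it.

```python
import re

# Matches standalone four-digit years from 1900 to 2099.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def mask_years(text: str, placeholder: str = "YEAR") -> str:
    """Replace four-digit years with a placeholder token (illustrative helper)."""
    return YEAR.sub(placeholder, text)


print(mask_years("Sunday, February 20, 2022 at 2:00 pm"))
# Sunday, February 20, YEAR at 2:00 pm
```

Day numbers and times are left untouched, so the rest of the date and time information stays in the text.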
