Finding urls having semantic similarity #10028
I am working on a project to find semantically similar URLs. As an example:

Relevant URLs: https://www.laphil.com/events/performances/300/2018-10-14/la-fest-la-santa-cecilia

Irrelevant URLs: https://mobilesymphony.org/online-concert-viewing

These are just a few. Relevant URLs are pages with details about a single event; all other pages, such as "about us", "news", and even pages on the same site that list multiple events, are irrelevant. The model must predict whether a new URL that is not in the corpus is relevant.

I have around 15,000 relevant URLs from different domains. I used them to build a corpus and checked spaCy similarity against it to find relevant URLs, but the results are not accurate. Is semantic similarity the best approach? I also thought of treating this as a binary classification problem, but I am not sure whether that would work. If I do treat it as binary classification, how large a dataset would I need just to get started (right now I have 15,000 relevant URLs in hand)?
Replies: 1 comment 4 replies
Are you trying to find out whether literally the URLs alone are similar, or do you mean the contents of the pages at those URLs? If you want to decide whether two things are similar just by looking at the URLs, I think that's going to be impossible pretty often. Can a human even do that? Like, what can you do with this:
I can tell you they're about music from the domain, but not more than that. I can see how you can get real words out of URLs, but even if you preprocess things, the criteria you outlined seem pretty unclear. Also note that spaCy is mainly designed with complete sentences or longer documents in mind, and while it can deal with other kinds of text, bags of keywords might be better handled with something else, though I'm not sure what. You might be able to train a classifier to recognize URLs of "about" pages or something with some accuracy, though I'm not sure it'd work better than a simple heuristic like looking for "about" in the URL. On the other hand, if you want to use the contents of the pages, that's a more straightforward problem, though the hard part is probably the content extraction.
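For what it's worth, here is a rough sketch of the "get real words out of URLs" step combined with the simple keyword heuristic. It uses only the Python standard library. The function names (`url_tokens`, `looks_irrelevant`) and the keyword set are made up for illustration and would need tuning on real data.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list (an assumption, not from the thread); tune it for your data.
IRRELEVANT_HINTS = {"about", "news", "contact", "blog"}

def url_tokens(url):
    """Split the path and query of a URL into lowercase word or number tokens."""
    parsed = urlparse(url)
    text = f"{parsed.path} {parsed.query}".lower()
    # Runs of letters or runs of digits; separators like "/" and "-" are dropped.
    return re.findall(r"[a-z]+|\d+", text)

def looks_irrelevant(url):
    """Return True if any token matches one of the hint keywords."""
    return any(tok in IRRELEVANT_HINTS for tok in url_tokens(url))

if __name__ == "__main__":
    for u in (
        "https://www.laphil.com/events/performances/300/2018-10-14/la-fest-la-santa-cecilia",
        "https://mobilesymphony.org/online-concert-viewing",
    ):
        print(url_tokens(u), looks_irrelevant(u))
```

Note that the second URL from the question contains none of the hint words, so a heuristic like this would not flag it. That is the kind of case where the URL alone may not carry enough signal.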