ReadabiliPy vs Readability.js #81
Comments
Hi @kjoshi! No, it's not meant to be identical. The original idea was that this would just be a Python wrapper around Readability.js, and you can still use it as that if you want to. However, we found that Readability.js sometimes produces HTML that doesn't strictly adhere to the standard (although it renders in browsers without issue). The downstream application that we're using this package for cares more about that aspect, so we focused on that. We are (were?) planning to work on making them feature-equivalent (if not completely identical), but we haven't got much budget for that at the moment. I think that Readability.js uses some complex heuristics to decide which part of the page to pull out as the main content element, and we haven't had a chance to look into that. If you're interested in doing so, you can try diving into the JavaScript to work out what it's doing... PS. Whereabouts are you working these days?
Ok, great, thanks for confirming. I had a quick look at the Readability.js code but it was a bit more complicated than I assumed it would be, and I don't have enough time to go through it in detail at the moment so I'm just going to stick with your ReadabiliPy wrapper for now. PS. I'm currently a Data Science Developer at Jisc - still based in Manchester
Apologies if this is a stupid question, since I've not had a proper read through the source of ReadabiliPy or Readability.js, but is the pure-Python implementation in ReadabiliPy intended to exactly reproduce the results of Readability.js?
In other words, should I get the exact same results when calling:
readabilipy.simple_json_from_html_string(html, use_readability=False)
and
readabilipy.simple_json_from_html_string(html, use_readability=True)
?
Because for certain articles I find that ReadabiliPy gives me extra HTML elements and text that I'm not at all interested in, for example:
whereas Readability.js manages to avoid extracting all of those links in the side bar:
Is there anything I can do to get ReadabiliPy to give me results more like Readability.js? I'd like to use ReadabiliPy inside an AWS Lambda function and would like to avoid using both Node and Python (if that's even possible in a single function..?)
Thanks
(Hi @jemrobinson - small world..!)
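To quantify how far the two modes diverge on a given page, something like the following sketch might help. It calls an extractor once with `use_readability=False` and once with `use_readability=True`, then scores the similarity of selected fields with the standard library's `difflib`. The helper name `compare_extractions` is hypothetical, and the result keys `"title"` and `"content"` are an assumption about the shape of the returned dictionary; adjust them to whatever your version actually returns.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable


def compare_extractions(
    html: str,
    extract: Callable[..., Dict],
    keys: Iterable[str] = ("title", "content"),  # assumed result keys
) -> Dict[str, float]:
    """Run `extract` in both modes and return a 0..1 similarity per key.

    `extract` is expected to accept (html, use_readability=<bool>) and
    return a dict, as simple_json_from_html_string is called in this issue.
    """
    pure_python = extract(html, use_readability=False)
    readability_js = extract(html, use_readability=True)

    report = {}
    for key in keys:
        # Missing or None values are treated as empty strings.
        a = str(pure_python.get(key) or "")
        b = str(readability_js.get(key) or "")
        report[key] = SequenceMatcher(None, a, b).ratio()
    return report


# Usage with the real library (needs readabilipy installed, and Node.js
# available for the use_readability=True call):
#   import readabilipy
#   print(compare_extractions(html, readabilipy.simple_json_from_html_string))
```

A ratio of 1.0 means the two modes agreed exactly for that field; lower values flag pages (such as the sidebar example above) where the pure-Python extraction keeps material that Readability.js drops.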