ReadabiliPy vs Readability.js #81

Open
kjoshi opened this issue Jul 31, 2019 · 2 comments
Labels
future Needs revisiting in the future

Comments

@kjoshi

kjoshi commented Jul 31, 2019

Apologies if this is a stupid question, since I haven't read through the source of ReadabiliPy or Readability.js properly, but is the pure-Python implementation in ReadabiliPy intended to reproduce the results of Readability.js exactly?

In other words, should I get the exact same results when calling:

readabilipy.simple_json_from_html_string(html, use_readability=False)
and
readabilipy.simple_json_from_html_string(html, use_readability=True)
?

I ask because, for certain articles, ReadabiliPy gives me extra HTML elements and text that I'm not at all interested in, for example:

> import requests
> from readabilipy import simple_json_from_html_string

> url = 'https://analytics.jiscinvolve.org/wp/2019/02/12/my-algorithmic-friend-by-andrew-cormack/'
> html = requests.get(url).text
> article = simple_json_from_html_string(html, use_readability=False)
> article['plain_text']
...
{'text': 'If you have comments on the draft Wellbeing Analytics Code of Practice, please...'}
...
{'text': 'Archives'},
 {'text': '* July 2019, * June 2019, * February 2019, * December 2018, * November 2018, ........'}
...

whereas Readability.js manages to avoid extracting all of those links in the side bar:

> article = simple_json_from_html_string(html, use_readability=True)
> article['plain_text']
...
{'text': 'If you have comments on the draft Wellbeing Analytics Code of Practice, please...'}
<end>
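
To see exactly which paragraphs differ between the two modes, the two `plain_text` lists can be diffed. This is only a minimal sketch, assuming each entry is a dict with a `'text'` key as in the output above; `extra_paragraphs` is a hypothetical helper and not part of ReadabiliPy, and the sample data is made up for illustration.

```python
def extra_paragraphs(pure_python, readability_js):
    """Return texts found in the pure-Python output but not in the Readability.js output.

    Both arguments are lists of dicts with a 'text' key (assumed shape of
    article['plain_text']). Order of the pure-Python output is preserved.
    """
    js_texts = {p["text"] for p in readability_js}
    return [p["text"] for p in pure_python if p["text"] not in js_texts]


# Made-up data shaped like the two `plain_text` lists.
py_out = [
    {"text": "Main article paragraph."},
    {"text": "Archives"},
    {"text": "* July 2019, * June 2019"},
]
js_out = [{"text": "Main article paragraph."}]

print(extra_paragraphs(py_out, js_out))
```

With real data, `py_out` and `js_out` would be the `article['plain_text']` values from the `use_readability=False` and `use_readability=True` calls respectively.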

Is there anything I can do to get ReadabiliPy to give me results more like Readability.js? I'd like to use ReadabiliPy inside an AWS Lambda function and would prefer to avoid using both Node and Python (if that's even possible in a single function).

Thanks

(Hi @jemrobinson - small world..!)

@jemrobinson
Member

jemrobinson commented Aug 1, 2019

Hi @kjoshi!

No, it's not meant to be identical.

The original idea was that this would just be a Python wrapper around Readability.js, and you can still use it that way if you want to. However, we found that Readability.js sometimes produces HTML that doesn't strictly adhere to the standard (although it renders in browsers without issue). The downstream application that we're using this package for cares more about standards compliance, so we focused on that.

We are (were?) planning to work on getting them to be feature equivalent (if not completely identical) but we haven't got much budget for that at the moment.

I think that Readability.js uses some complex heuristics to decide which part of the page to pull out as the main content element, and we haven't had a chance to look into that. If you're interested in doing so, you can try diving into the JavaScript to work out what it's doing...
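
For anyone picking this up: one signal Readability.js takes into account is link density, i.e. the share of an element's text that sits inside `<a>` tags, which tends to be high for sidebars such as an "Archives" list. The following is only a rough sketch of that single idea using the standard-library `html.parser`; it is not how Readability.js or ReadabiliPy implement it, and the `link_density` function and the example fragments are assumptions made for illustration.

```python
from html.parser import HTMLParser


class _LinkDensityParser(HTMLParser):
    """Counts text characters overall and text characters inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._anchor_depth = 0  # > 0 while inside one or more <a> elements

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())  # ignore surrounding whitespace
        self.total_chars += n
        if self._anchor_depth > 0:
            self.link_chars += n


def link_density(html_fragment):
    """Return the fraction of text characters that are inside links (0.0 to 1.0)."""
    parser = _LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


# Illustrative fragments: a sidebar-like list of links and an article-like paragraph.
sidebar = '<ul><li><a href="#">July 2019</a></li><li><a href="#">June 2019</a></li></ul>'
body = '<p>Some long article text with one <a href="#">link</a>.</p>'

print(link_density(sidebar), link_density(body))
```

A block whose density is close to 1 is a plausible candidate for removal, but any threshold would need tuning against real pages, and Readability.js combines this with other scoring rules.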

PS. Whereabouts are you working these days?

jemrobinson added the "future" (Needs revisiting in the future) label on Aug 1, 2019
@kjoshi
Author

kjoshi commented Aug 14, 2019

Ok, great, thanks for confirming.

I had a quick look at the Readability.js code, but it was more complicated than I expected, and I don't have enough time to go through it in detail at the moment. I'm just going to stick with your ReadabiliPy wrapper for now.

PS. I'm currently a Data Science Developer at Jisc - still based in Manchester
