Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HTML5Parser option #133

Closed
wants to merge 3 commits into from
Closed

Conversation

joaquingx
Copy link

@joaquingx joaquingx commented Jan 11, 2019

PR Solves Issue scrapy/scrapy#83.
As that issue pointed out, there's a lot of requests to add this option: 1 2 3 4 5 6

Based/Continue work of #54 & scrapy/scrapy#1043

Known issues with html5lib:
html5lib/html5lib-python#96 , for now I just raise a TypeError, is there another/better way to handle this?

I also made some benchmarks, and I obtain that html5lib is a waaaaay slower than normal html parser. It's 60 times slower when parse: https://gist.github.com/joaquingx/efaf1152beb4e4ea25d6a8afa061ebcf (I used https://gist.github.com/kmike/af647777cef39c3d01071905d176c006 as reference). Due that, should be a good idea to make some alert when user chooses html5 as option?

@joaquingx joaquingx changed the title Add HTML5Parser option [WIP] Add HTML5Parser option Jan 11, 2019
@codecov
Copy link

codecov bot commented Jan 11, 2019

Codecov Report

Merging #133 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     scrapy/scrapy#133      +/-   ##
==========================================
+ Coverage   99.63%   99.64%   +<.01%     
==========================================
  Files           5        5              
  Lines         271      278       +7     
  Branches       48       49       +1     
==========================================
+ Hits          270      277       +7     
  Misses          1        1
Impacted Files Coverage Δ
parsel/selector.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8fc608e...e2cbb25. Read the comment docs.

@joaquingx joaquingx force-pushed the add-html5-support branch 2 times, most recently from f3484d5 to b6030f6 Compare January 11, 2019 22:57
@@ -39,8 +44,12 @@ def create_root_node(text, parser_cls, base_url=None):
"""Create root node for text using given parser class.
"""
body = text.strip().replace('\x00', '').encode('utf8') or b'<html/>'
parser = parser_cls(recover=True, encoding='utf8')
root = etree.fromstring(body, parser=parser, base_url=base_url)
if parser_cls != html5parser.HTMLParser:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be legible if we define the option as == for the case of html5parser and then else for the other options

Copy link
Author

@joaquingx joaquingx Jan 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey!, thanks for review!, I also think that, i'll change it. changed

</body>
</html>'''
sel = self.sscls(text=body, type='html5')
res = sel.xpath('//div/text()').get()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the test confirm that it gets the div element? Or maybe I am not understanding the new functionality correctly.

Copy link
Author

@joaquingx joaquingx Jan 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right the testcase isn't clear. The difference that this test pretend to show is the difference in parsers view. In this particular case:

In [1]:         body = u'''<html> 
   ...:                     <head></head> 
   ...:                        <body> 
   ...:                         <li>one<div></li> 
   ...:                         <li>two</li> 
   ...:                        </body> 
   ...:                 </html>'''                                                                                                                                                                                  

In [2]: from parsel import Selector                                                                                                                                                                                 

In [3]: sel = Selector(body, 'html')                                                                                                                                                                                

In [4]: sel.get()                                                                                                                                                                                                   
Out[4]: '<html>\n            <head></head>\n               <body>\n                <li>one<div>\n                <li>two</li>\n               </div></li></body>\n        </html>'

In [5]: selfive = Selector(body, 'html5')                                                                                                                                                                           

In [6]: selfive.get()                                                                                                                                                                                               
Out[6]: '<html><head></head>\n               <body>\n                <li>one<div></div></li>\n                <li>two</li>\n               \n        </body></html>'

As you can see, div tag is filled/completed/closed in different way with html than html5parser, just as extra info, the way html5parser see is the way that browsers would see that html.

In [7]: sel.xpath('//div').get()                                                                                                                                                                                    
Out[7]: '<div>\n                <li>two</li>\n               </div>'

In [8]: selfive.xpath('//div').get()                                                                                                                                                                                
Out[8]: '<div></div>'

EDIT:

Already changed, thanks for the suggestion, test is much clear now.

def test_characters_gt_and_lt(self):
"""HTML5 parser tests: greater and less than symbols work as expected."""
lt_elem = '20 < 100'
gt_elem = '120 > 100'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are just defining these two cases, I would prefer if they could be one on its respective test. On the other hand I think we should be adding more cases, and defining them as a list which will also improve the tests below to not repeat code, but iterate that list checking all those cases

Copy link
Author

@joaquingx joaquingx Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, think that I improved that 👍

@joaquingx joaquingx changed the title [WIP] Add HTML5Parser option [MRG] Add HTML5Parser option Jan 18, 2019
@joaquingx
Copy link
Author

@eLRuLL @redapple @kmike can you give me some feedback here please 🙏

@@ -39,8 +44,15 @@ def create_root_node(text, parser_cls, base_url=None):
"""Create root node for text using given parser class.
"""
body = text.strip().replace('\x00', '').encode('utf8') or b'<html/>'
parser = parser_cls(recover=True, encoding='utf8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think parser_cls made sense here when both classes had the same API. Now that we are introducing a new class that does not have the same signature, maybe we should switch to a different approach.

For example, instead of a class, we could use a function that returns a valid root, and modify _ctgroup so that _parser values are such functions.

That way, the code here remains independent of the _parser value.

What do you think?

@redapple
Copy link
Contributor

redapple commented May 8, 2019 via email

@Gallaecio
Copy link
Member

@joaquingx I know feedback took a while… Do you wish to continue working on it?

@Gallaecio Gallaecio changed the title [MRG] Add HTML5Parser option Add HTML5Parser option Aug 8, 2019
@Gallaecio Gallaecio closed this Sep 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants