Add HTML5Parser option #133

joaquingx · 2019-01-11T22:50:22Z

PR Solves Issue scrapy/scrapy#83.
As that issue pointed out, there's a lot of requests to add this option: 1 2 3 4 5 6

Based/Continue work of #54 & scrapy/scrapy#1043

Known issues with html5lib:
html5lib/html5lib-python#96 , for now I just raise a TypeError, is there another/better way to handle this?

I also made some benchmarks, and I obtain that html5lib is a waaaaay slower than normal html parser. It's 60 times slower when parse: https://gist.github.com/joaquingx/efaf1152beb4e4ea25d6a8afa061ebcf (I used https://gist.github.com/kmike/af647777cef39c3d01071905d176c006 as reference). Due that, should be a good idea to make some alert when user chooses html5 as option?

codecov · 2019-01-11T22:53:25Z

Codecov Report

Merging #133 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     scrapy/scrapy#133      +/-   ##
==========================================
+ Coverage   99.63%   99.64%   +<.01%     
==========================================
  Files           5        5              
  Lines         271      278       +7     
  Branches       48       49       +1     
==========================================
+ Hits          270      277       +7     
  Misses          1        1

Impacted Files	Coverage Δ
parsel/selector.py	`100% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8fc608e...e2cbb25. Read the comment docs.

eLRuLL · 2019-01-12T02:01:49Z

parsel/selector.py

@@ -39,8 +44,12 @@ def create_root_node(text, parser_cls, base_url=None):
    """Create root node for text using given parser class.
    """
    body = text.strip().replace('\x00', '').encode('utf8') or b'<html/>'
-    parser = parser_cls(recover=True, encoding='utf8')
-    root = etree.fromstring(body, parser=parser, base_url=base_url)
+    if parser_cls != html5parser.HTMLParser:


I think it should be legible if we define the option as == for the case of html5parser and then else for the other options

Hey!, thanks for review!, I also think that, i'll change it. changed

eLRuLL · 2019-01-12T02:04:08Z

tests/test_selector.py

+                       </body>
+                </html>'''
+        sel = self.sscls(text=body, type='html5')
+        res = sel.xpath('//div/text()').get()


shouldn't the test confirm that it gets the div element? Or maybe I am not understanding the new functionality correctly.

Yes, you're right the testcase isn't clear. The difference that this test pretend to show is the difference in parsers view. In this particular case:

In [1]: body = u'''<html> ...: <head></head> ...: <body> ...: <li>one<div></li> ...: <li>two</li> ...: </body> ...: </html>''' In [2]: from parsel import Selector In [3]: sel = Selector(body, 'html') In [4]: sel.get() Out[4]: '<html>\n <head></head>\n <body>\n <li>one<div>\n <li>two</li>\n </div></li></body>\n </html>' In [5]: selfive = Selector(body, 'html5') In [6]: selfive.get() Out[6]: '<html><head></head>\n <body>\n <li>one<div></div></li>\n <li>two</li>\n \n </body></html>'

As you can see, div tag is filled/completed/closed in different way with html than html5parser, just as extra info, the way html5parser see is the way that browsers would see that html.

In [7]: sel.xpath('//div').get() Out[7]: '<div>\n <li>two</li>\n </div>' In [8]: selfive.xpath('//div').get() Out[8]: '<div></div>'

EDIT:

Already changed, thanks for the suggestion, test is much clear now.

eLRuLL · 2019-01-12T02:07:37Z

tests/test_selector.py

+    def test_characters_gt_and_lt(self):
+        """HTML5 parser tests: greater and less than symbols work as expected."""
+        lt_elem = '20 < 100'
+        gt_elem = '120 > 100'


if we are just defining these two cases, I would prefer if they could be one on its respective test. On the other hand I think we should be adding more cases, and defining them as a list which will also improve the tests below to not repeat code, but iterate that list checking all those cases

Thanks, think that I improved that 👍

joaquingx · 2019-01-18T21:01:37Z

@eLRuLL @redapple @kmike can you give me some feedback here please 🙏

Gallaecio · 2019-05-08T07:54:39Z

parsel/selector.py

@@ -39,8 +44,15 @@ def create_root_node(text, parser_cls, base_url=None):
    """Create root node for text using given parser class.
    """
    body = text.strip().replace('\x00', '').encode('utf8') or b'<html/>'
-    parser = parser_cls(recover=True, encoding='utf8')


I think parser_cls made sense here when both classes had the same API. Now that we are introducing a new class that does not have the same signature, maybe we should switch to a different approach.

For example, instead of a class, we could use a function that returns a valid root, and modify _ctgroup so that _parser values are such functions.

That way, the code here remains independent of the _parser value.

What do you think?

redapple · 2019-05-08T08:24:45Z

There is another option to get faster parsing times: https://html5-parser.readthedocs.io/en/latest/ It's not as straightforward to install as html5lib but it worked well for the few times I tried it

Gallaecio · 2019-08-08T09:27:39Z

@joaquingx I know feedback took a while… Do you wish to continue working on it?

joaquingx changed the title ~~Add HTML5Parser option~~ [WIP] Add HTML5Parser option Jan 11, 2019

joaquingx force-pushed the add-html5-support branch 2 times, most recently from f3484d5 to b6030f6 Compare January 11, 2019 22:57

Add HTML5Parser option and tests

6a9e309

joaquingx force-pushed the add-html5-support branch from b6030f6 to 6a9e309 Compare January 11, 2019 22:58

eLRuLL reviewed Jan 12, 2019

View reviewed changes

joaquingx added 2 commits January 18, 2019 14:04

Add unicode support and update tests.

a613595

Raise TypeError in Sequence Characters case and add test

e2cbb25

joaquingx changed the title ~~[WIP] Add HTML5Parser option~~ [MRG] Add HTML5Parser option Jan 18, 2019

Gallaecio mentioned this pull request May 8, 2019

Support "html5" type to use html5lib parser #83

Open

Gallaecio reviewed May 8, 2019

View reviewed changes

Gallaecio changed the title ~~[MRG] Add HTML5Parser option~~ Add HTML5Parser option Aug 8, 2019

Gallaecio closed this Sep 17, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add HTML5Parser option #133

Add HTML5Parser option #133

joaquingx commented Jan 11, 2019 •

edited

Loading

codecov bot commented Jan 11, 2019 •

edited

Loading

eLRuLL Jan 12, 2019

joaquingx Jan 12, 2019 •

edited

Loading

eLRuLL Jan 12, 2019

joaquingx Jan 12, 2019 •

edited

Loading

eLRuLL Jan 12, 2019

joaquingx Jan 18, 2019 •

edited

Loading

joaquingx commented Jan 18, 2019

Gallaecio May 8, 2019

redapple commented May 8, 2019 via email

Gallaecio commented Aug 8, 2019

Add HTML5Parser option #133

Add HTML5Parser option #133

Conversation

joaquingx commented Jan 11, 2019 • edited Loading

codecov bot commented Jan 11, 2019 • edited Loading

Codecov Report

eLRuLL Jan 12, 2019

Choose a reason for hiding this comment

joaquingx Jan 12, 2019 • edited Loading

Choose a reason for hiding this comment

eLRuLL Jan 12, 2019

Choose a reason for hiding this comment

joaquingx Jan 12, 2019 • edited Loading

Choose a reason for hiding this comment

eLRuLL Jan 12, 2019

Choose a reason for hiding this comment

joaquingx Jan 18, 2019 • edited Loading

Choose a reason for hiding this comment

joaquingx commented Jan 18, 2019

Gallaecio May 8, 2019

Choose a reason for hiding this comment

redapple commented May 8, 2019 via email

Gallaecio commented Aug 8, 2019

joaquingx commented Jan 11, 2019 •

edited

Loading

codecov bot commented Jan 11, 2019 •

edited

Loading

joaquingx Jan 12, 2019 •

edited

Loading

joaquingx Jan 12, 2019 •

edited

Loading

joaquingx Jan 18, 2019 •

edited

Loading