Add HTML5Parser option #133

Closed
wants to merge 3 commits into from
18 changes: 15 additions & 3 deletions parsel/selector.py
@@ -6,6 +6,7 @@

import six
from lxml import etree, html
from lxml.html import html5parser

from .utils import flatten, iflatten, extract_regex
from .csstranslator import HTMLTranslator, GenericTranslator
@@ -23,6 +24,10 @@ def __init__(self, *args, **kwargs):
'xml': {'_parser': SafeXMLParser,
'_csstranslator': GenericTranslator(),
'_tostring_method': 'xml'},
'html5': {'_parser': html5parser.HTMLParser,
'_csstranslator': HTMLTranslator(),
'_tostring_method': 'html',
},
}


@@ -39,8 +44,15 @@ def create_root_node(text, parser_cls, base_url=None):
"""Create root node for text using given parser class.
"""
body = text.strip().replace('\x00', '').encode('utf8') or b'<html/>'
parser = parser_cls(recover=True, encoding='utf8')
Member

I think parser_cls made sense here when both classes had the same API. Now that we are introducing a new class that does not have the same signature, maybe we should switch to a different approach.

For example, instead of a class, we could use a function that returns a valid root, and modify _ctgroup so that _parser values are such functions.

That way, the code here remains independent of the _parser value.

What do you think?
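
To make the suggestion concrete, here is a minimal sketch of the function-based approach, not a definitive implementation. The names `ROOT_FACTORIES`, `html_root`, `xml_root`, `html5_root`, and the `root_factory` parameter are hypothetical; `xml_root` uses a plain `etree.XMLParser` as a stand-in for parsel's `SafeXMLParser`; and the html5lib call is copied from this diff rather than independently verified.

```python
# Hypothetical sketch: each "_parser" entry becomes a function that takes the
# encoded body and returns a root node, so create_root_node never needs to
# know which parser is behind it. lxml is imported lazily inside each factory.

def html_root(body, base_url=None):
    from lxml import etree, html
    parser = html.HTMLParser(recover=True, encoding='utf8')
    root = etree.fromstring(body, parser=parser, base_url=base_url)
    if root is None:
        root = etree.fromstring(b'<html/>', parser=parser, base_url=base_url)
    return root


def xml_root(body, base_url=None):
    from lxml import etree
    # Stand-in for parsel's SafeXMLParser (assumption for this sketch).
    parser = etree.XMLParser(recover=True, encoding='utf8',
                             resolve_entities=False)
    root = etree.fromstring(body, parser=parser, base_url=base_url)
    if root is None:
        root = etree.fromstring(b'<html/>', parser=parser, base_url=base_url)
    return root


def html5_root(body, base_url=None):
    from lxml.html import html5parser  # requires html5lib
    # Call copied from the diff; base_url is not used by this code path.
    parser = html5parser.HTMLParser(namespaceHTMLElements=False)
    return parser.parse(body, useChardet=False,
                        override_encoding='utf8').getroot()


# Hypothetical replacement for the '_parser' values in _ctgroup.
ROOT_FACTORIES = {
    'html': html_root,
    'xml': xml_root,
    'html5': html5_root,
}


def create_root_node(text, root_factory, base_url=None):
    """Create root node for text using the given root factory."""
    body = text.strip().replace('\x00', '').encode('utf8') or b'<html/>'
    return root_factory(body, base_url=base_url)
```

With this shape, the `if parser_cls == html5parser.HTMLParser` branch disappears from `create_root_node`, and any parser-specific error handling can live inside the matching factory.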

root = etree.fromstring(body, parser=parser, base_url=base_url)
if parser_cls == html5parser.HTMLParser:
try:
parser = parser_cls(namespaceHTMLElements=False)
root = parser.parse(body, useChardet=False, override_encoding='utf8').getroot()
except ValueError:
raise TypeError('HTML5parser does not support control characters')
else:
parser = parser_cls(recover=True, encoding='utf8')
root = etree.fromstring(body, parser=parser, base_url=base_url)
if root is None:
root = etree.fromstring(b'<html/>', parser=parser, base_url=base_url)
return root
@@ -158,7 +170,7 @@ class Selector(object):

``text`` is a ``unicode`` object in Python 2 or a ``str`` object in Python 3

``type`` defines the selector type, it can be ``"html"``, ``"xml"`` or ``None`` (default).
``type`` defines the selector type, it can be ``"html"``, ``"xml"``, ``"html5"`` or ``None`` (default).
If ``type`` is ``None``, the selector defaults to ``"html"``.
"""

3 changes: 2 additions & 1 deletion setup.py
@@ -29,7 +29,8 @@ def has_environment_marker_platform_impl_support():
'w3lib>=1.19.0',
'lxml>=2.3',
'six>=1.5.2',
'cssselect>=0.9'
'cssselect>=0.9',
'html5lib',
]
extras_require = {}

4 changes: 4 additions & 0 deletions tests/html_parser.json
@@ -0,0 +1,4 @@
{
"html_parser": "html",
"html5_parser": "html5"
}
1 change: 1 addition & 0 deletions tests/requirements.txt
@@ -1,2 +1,3 @@
pytest
pytest-cov
ddt