[Feature Request] Add support for JMESPath #25

voith · 2016-01-27T06:29:14Z

Building a Selector based on JMESPath in parsel would make parsing JSON easier.
It would also help scrapy add methods like add_json and get_json to the ItemLoader. I got this idea from scrapy/scrapy#1005.
From what I understand, the Selector in parsel is built on lxml. How about using jmespath to build a JsonSelector?

I am not sure if this is the feature to have in this library as Parsel describes itself as a parser for XML/HTML. But adding this feature will add great value to this project.

PS: If the maintainers would like to have this feature in, then I'd like to contribute it myself.


eliasdorneles · 2016-01-27T12:35:38Z

That's an interesting idea.
We were just talking about perhaps adding a JsonResponse to Scrapy in scrapy/scrapy#1729

I'd be okay with adding a JsonSelector completely separate from the already existing Selector, and then providing a factory function selector_for(response_text) that would do something like:

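def selector_for(text):
    # JsonSelector and NotAJsonError are proposed names; they do not exist yet.
    try:
        return JsonSelector(text)
    except NotAJsonError:
        # Fall back to the existing HTML/XML selector.
        return Selector(text)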
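
A runnable sketch of the factory idea above, using only the standard library. JsonSelector, NotAJsonError and the Selector class here are hypothetical stand-ins that follow the names in the comment; none of them is parsel's actual API. Detection is assumed to mean "json.loads accepts the text".

```python
import json


class NotAJsonError(ValueError):
    """Hypothetical error raised when the input is not valid JSON."""


class JsonSelector:
    """Hypothetical stand-in that only stores the parsed JSON."""

    def __init__(self, text):
        try:
            self.data = json.loads(text)
        except json.JSONDecodeError as exc:
            raise NotAJsonError(str(exc)) from exc


class Selector:
    """Stand-in for the existing HTML/XML selector; it just keeps the text."""

    def __init__(self, text):
        self.text = text


def selector_for(text):
    # Try JSON first, then fall back to the HTML/XML selector.
    try:
        return JsonSelector(text)
    except NotAJsonError:
        return Selector(text)


json_sel = selector_for('{"status": "ok"}')
html_sel = selector_for("<p>hello</p>")
```

One consequence of this approach is that any text json.loads accepts, including a bare number such as "123", would be treated as JSON.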

@dangra @kmike what do you think, fellows?

voith · 2016-01-27T15:50:10Z

Alright I'll prototype this idea and see how it goes.

dangra · 2016-01-27T17:07:23Z

@eliasdorneles : do you recall XPathHtmlSelector, XPathXMLSelector, CSSHtmlSelector...?

I am not fond of using a different class for JMESPath; we already ditched that design in favour of a single class with different methods per selection type.

Off the top of my head, the main reason I recall is simpler nesting of selection methods: response.css('div').xpath('.//script').jmespath(...)

eliasdorneles · 2016-01-27T17:49:51Z

@dangra I see.
Hm, my thinking was that the input for both would be different (a selector supporting JMESPath wants JSON, not HTML/XML).

Do we have a use case for response.xpath().jmes() or response.jmes().xpath()?
I suppose it could be useful when one has escaped HTML inside an AJAX JSON response or JSON inside an HTML attribute -- are those the ones you have in mind?
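
To make the "JSON inside an HTML attribute" case concrete, here is a minimal standard-library sketch of the two manual steps a user performs without chaining: pull the attribute value out of the HTML, then parse it as JSON. The class name, the helper name and the data-user attribute are illustrative assumptions, not part of parsel.

```python
import json
from html.parser import HTMLParser


class DataAttrCollector(HTMLParser):
    """Collect the values of one attribute (for example a data-* one)."""

    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when absent.
        for name, value in attrs:
            if name == self.attr_name and value is not None:
                self.values.append(value)


def json_from_attribute(html, attr_name):
    """Step 1: extract the attribute values. Step 2: parse each as JSON."""
    collector = DataAttrCollector(attr_name)
    collector.feed(html)
    collector.close()
    return [json.loads(value) for value in collector.values]


html = '<div data-user=\'{"name": "A", "age": 18}\'></div>'
users = json_from_attribute(html, "data-user")
```

A chained selector API would fold these two steps into a single expression, which is the convenience being debated in this thread.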

kmike · 2016-01-27T18:38:20Z

There are useful use cases for chaining (e.g. processing data- attributes), but I don't think they are worth the extra complexity we may introduce to support them.

response.jmespath(...) or jmespath.search(...) covers most use cases nicely and is much easier to implement and understand.

eliasdorneles · 2016-01-27T18:45:29Z

@kmike curious how you're thinking about the implementation.
You mentioned response.jmespath -- we don't have response in Parsel, did you mean it as a method for the Selector class itself?

dangra · 2016-01-28T16:38:08Z

There are useful use cases for chaining (e.g. processing data- attributes), but I don't think they are worth the extra complexity we may introduce to support them.

This is exactly what Parsel provides: it moves the implementation complexity away from users.

Do we have a use case for response.xpath().jmes() or response.jmes().xpath()?
I suppose it could be useful when one has escaped HTML inside an AJAX JSON response or JSON inside an HTML attribute -- are those the ones you have in mind?

Both examples of chaining JSON and HTML are valid, and making chaining easy is part of Parsel's philosophy.

Digenis · 2016-01-28T16:57:31Z

This is exactly what Parsel provides: it moves the implementation complexity away from users.

Users who try to subclass the selectors will end up facing these complexities.
So far, the implementation has been inviting for subclassing.

dangra · 2016-01-28T20:51:51Z

@Digenis I don't think there is such complexity for users extending the Selector class. I can understand there was a bit when the CSS selection method was added, because behind the scenes it translates to XPath and reuses it. But for JMESPath this is going to be a completely new method; it doesn't interfere with existing methods at all.

I think we have two options:

1. Add a "trailing" selection method like .re(), named .jmespath(), which is a thin wrapper for jmespath.search().
   Pro: Simple to implement, and it can be chained after xpath()/css().
   Con: Chaining JSON->XPath is not possible (although I think that is the less common case).
2. Implement the full-fledged selection interface: the selection method returns a SelectorList() instance and extract() returns a list of unicode text.
   Pro: Chaining all the way is possible.
   Con: We may need a trailing method anyway to parse JSON.

dangra · 2016-01-28T21:23:21Z

Ok! I must admit Option 2 is complex because we are parsing the DOM in the Selector constructor.

but option 1 is still compelling, isn't it? :)

kmike · 2016-01-28T21:30:18Z

I agree that option (1) looks easy enough to implement, but has anyone had a real use case for it? If I understood @voith properly, he wanted to parse JSON using Parsel (no XML/HTML involved at all), not to query some JSON data extracted from XML/HTML element attributes.

voith · 2016-01-29T15:54:09Z

Yes, I opened this issue with the intention of being able to parse JSON with Parsel. It would be great to have chained parsing, but implementing jmespath under an XML/HTML selector sounds complex because the inputs are different.

dangra · 2016-01-29T19:41:10Z

We can delay the parsing of the DOM until the first selection method is called. That will trigger json, xml, html parsing on demand.
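
The on-demand parsing suggested above could look roughly like the sketch below. LazySelector and its two selection methods are hypothetical stand-ins built on the standard library (json and xml.etree.ElementTree instead of lxml); the only point illustrated is that the constructor stores the text and each parser runs at most once, on first use.

```python
import json
import xml.etree.ElementTree as ET
from functools import cached_property


class LazySelector:
    """Hypothetical stand-in, not parsel's Selector."""

    def __init__(self, text):
        # Nothing is parsed here; only the raw text is stored.
        self.text = text

    @cached_property
    def _json(self):
        # Runs on the first JSON selection and is cached afterwards.
        return json.loads(self.text)

    @cached_property
    def _tree(self):
        # Runs on the first XML selection and is cached afterwards.
        return ET.fromstring(self.text)

    def key(self, name):
        # Stand-in for a JSON selection method.
        return self._json[name]

    def find_text(self, path):
        # Stand-in for an XML selection method.
        return [element.text for element in self._tree.findall(path)]


json_selector = LazySelector('{"status": "ok"}')
parsed_before = "_json" in json_selector.__dict__
status = json_selector.key("status")
parsed_after = "_json" in json_selector.__dict__

xml_selector = LazySelector("<r><a>1</a><a>2</a></r>")
texts = xml_selector.find_text("a")
```

With this shape, a selector created from JSON text never attempts XML parsing unless an XML selection method is actually called on it.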

redapple · 2016-01-29T22:01:40Z

Offering .json()/.jmespath()/.jsonpath() for a Selector instantiated with a JSON string, with type="json"? why not.
Being able to chain JSON selectors? why not as well.

But I don't see a compelling use case for chaining .xpath()/.css() and .json()/.jmespath()/.jsonpath()

Internally, in the current parsel implementation, once the input is parsed, chaining navigates inside the same parsed document tree; it does not re-parse to build a new document.

Take, say, some HTML document containing comments which themselves contain HTML code,
think facebook's view-source:https://www.facebook.com/JustinBieber/

<code class="hidden_elem" id="u_0_15">
<!-- <div class="_5ay5"><div class="_4-u2 _4-u8"><div id="u_0_14"></div></div></div> -->
</code>

parsel does not support something like selector.css('code#u_0_15').xpath('string(comment())').xpath('//@id') for the previous example:

>>> selector = parsel.Selector(text=u'''<code class="hidden_elem" id="u_0_15">
... <!-- <div class="_5ay5"><div class="_4-u2 _4-u8"><div id="u_0_14"></div></div></div> -->
... </code>''')

>>> selector.css('#u_0_15').xpath('comment()').extract_first()
u'<!-- <div class="_5ay5"><div class="_4-u2 _4-u8"><div id="u_0_14"></div></div></div> -->'

>>> selector.css('#u_0_15').xpath('string(comment())').extract_first()
u' <div class="_5ay5"><div class="_4-u2 _4-u8"><div id="u_0_14"></div></div></div> '

>>> selector.css('#u_0_15').xpath('string(comment())').xpath('//@id')
[]

You still have to reinject into another selector to work on the embedded HTML:

>>> parsel.Selector(
...         selector.css('#u_0_15').xpath('string(comment())').extract_first()
...     ).xpath('//@id').extract()
[u'u_0_14']
>>>

EchoShoot · 2020-01-02T04:44:44Z

Finding JSON mixed with XML/HTML is not rare when crawling.

I have submitted pull request #181.
It implements a method named jpath that can be used like xpath and css, with chaining.

Here are some examples.

When there is JSON in HTML:

<div>
    <h1>Information</h1>
    <content>
        {
            "user": [
                {"name": "A", "age": 18},
                {"name": "B", "age": 32},
                {"name": "C", "age": 22},
                {"name": "D", "age": 25}
            ],
            "total": 4,
            "status": "ok"
        }
    </content>
</div>

Extract with this syntax:

>>> sel.xpath('//div/content').jpath('user[*].name').getall()
['A', 'B', 'C', 'D']

When there is HTML in JSON:

{
    "content": [
        {"name": "A", "value": "a"},
        {"name": {"age": 18}, "value": "b"},
        {"name": "C", "value": "c"},
        {"name": "<a>D</a>", "value": "<div>d</div>"}
    ],
    "html": [
        "<div><a>AAA<br>Test</a>aaa</div><div><a>BBB</a>bbb<b>BbB</b><div/>"
    ]
}

Extract with this syntax:

>>> sel.jpath('html').xpath('//div/a/text()').getall()
['AAA', 'Test', 'BBB']

By the way, the selector calls json.loads() internally,
which means it can be used in the normal way, e.g. Selector(text='{"A":"a"}').
It will also make it easier to implement response.jpath('...') in scrapy, rather than Selector(json=response.json()).

EchoShoot · 2020-01-03T01:29:30Z

Hey guys! I think we need to discuss which name is better: jsonpath, jpath, or jmespath?

In my opinion, jpath is relatively short and easy for developers from all over the world to remember. It may be confusing at first, but it would gradually become mainstream over time.
jsonpath is also a good name, but a bit long, which is not ideal for chained calls.
jmespath is difficult to remember, especially for someone whose first language is not English.

Gallaecio · 2020-01-03T12:02:32Z

JMESPath and JSONPath are different JSON query languages. If we use jpath, it is unclear which one we are using, and things can get worse if a new JSON query language is ever implemented with that name (JPath).

Moreover, just as we support 2 different HTML/XML query languages (CSS and XPath), at some point we may support multiple JSON query languages (e.g. JMESPath, JSONPath and jq); so I really believe that jpath is a bad choice in the long run.

Yesterday I found out that Parsel used to have a select method, probably back when only one of CSS and XPath was supported. Care to guess which one it used? :)

EchoShoot · 2020-01-03T14:54:49Z

You convinced me. I agree with you now and have decided to adopt jmespath. Thank you very much for your help. ^0^

xPi2 · 2020-07-28T10:10:40Z

How is this going?
I'm trying to implement this myself on top of the parsel selector in my own project, but I'm sure you know how to do it better.

Gallaecio · 2020-07-28T11:24:58Z

I’m not entirely against it, but given that we use selector.xpath() instead of selector.x(), I think jmespath is more coherent, and it is not that long.

Granitosaurus · 2020-11-16T10:27:05Z

Not to derail this, but I'd argue that implementing JSONPath[1] would actually be more fitting for parsel, as it is XPath-like. For example, JMESPath doesn't support recursive queries (like //node in XPath) while JSONPath does (as $..node); also, the whole protocol structure is much more similar to that of XPath.

Ideally it would be great to have both! More and more of the web is using JSON, and it would be great to have one good parser for both HTML and JSON.

1 - https://github.com/h2non/jsonpath-ng jsonpath implementation in Python
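
To illustrate what a recursive query means for JSON data, here is a hand-rolled sketch in plain Python. It is not JSONPath and uses no JSONPath library; the function name descend is hypothetical. It only mimics the intent of an expression such as $..name by collecting every value stored under a given key at any depth.

```python
def descend(node, key):
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(node, dict):
        for current_key, value in node.items():
            if current_key == key:
                yield value
            # Keep walking, because matches can be nested inside values.
            yield from descend(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from descend(item, key)


data = {
    "user": [
        {"name": "A", "meta": {"name": "nested"}},
        {"name": "B"},
    ],
    "name": "top",
}
names = list(descend(data, "name"))
# names == ["A", "nested", "B", "top"]
```

The caller does not need to know where in the document the key appears, which is the property the comment above finds missing in JMESPath.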

Gallaecio · 2021-02-21T16:12:00Z

I’ve added JMESPath support to a real-life project, and I must say @Granitosaurus you are completely right. The lack of the concept of parent nodes in JMESPath can be quite limiting, just as in CSS. It feels like JMESPath is to JSONpath what CSS is to XPath.

So, once this is fixed, I agree we should aim to extend support to JSONpath. Hopefully it won’t be too hard at that point.

EchoShoot · 2022-03-16T02:15:29Z

I think JMESPath should be supported first, because it has been actively maintained over the years and has plenty of resources and documentation, so many developers can find a way to get started. Then we can wait for a better and more robust JSON parser to appear. This doesn't conflict, just like CSS doesn't conflict with XPath; both are supported by parsel at the same time.

voith mentioned this issue Feb 3, 2016

Added jmespath to selector #27

Closed

redapple added the enhancement label Feb 9, 2016

voith mentioned this issue Jan 11, 2017

Feature Request: Adding "add_jmes" and "replace_jmes" method to ItemLoader scrapy/itemloaders#67

Closed

Gallaecio added the discuss label Sep 24, 2019

EchoShoot mentioned this issue Jul 28, 2020

Support JMESPath now #181

Merged

2 tasks

Gallaecio mentioned this issue Nov 17, 2020

Add JSONPath support #204

Open

wRAR closed this as completed in #181 Apr 11, 2023

barrio mentioned this issue Apr 30, 2024

Parsel import causes crash #294

Closed
