[Feature Request] Add support for JMESPath #25

Closed
voith opened this issue Jan 27, 2016 · 23 comments · Fixed by #181

Comments

@voith

voith commented Jan 27, 2016

Building a Selector based on JMESPath in Parsel would make parsing JSON easier.
This would also help Scrapy add methods like add_json and get_json to the ItemLoader. I got this idea from scrapy/scrapy#1005.
From what I understand, the Selector in Parsel is built on lxml. How about using jmespath to build a JsonSelector?

I am not sure whether this feature belongs in this library, as Parsel describes itself as a parser for XML/HTML, but it would add great value to the project.

PS: If the maintainers would like to have this feature in, then I'd like to contribute it myself.

@eliasdorneles
Member

That's an interesting idea.
We were just talking about perhaps adding a JsonResponse to Scrapy in scrapy/scrapy#1729

I'd be okay with adding a JsonSelector completely separate from the already existing Selector, and then providing a factory function selector_for(response_text) that would do something like:

# JsonSelector and NotAJsonError are proposed names, not existing Parsel APIs
def selector_for(text):
    try:
        return JsonSelector(text)
    except NotAJsonError:
        return Selector(text)
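
The try-JSON-first dispatch proposed above can be tried out with the standard library alone. This is only a sketch: `detect_and_load` is a hypothetical helper, and treating only objects and arrays as JSON is an assumption made here, not something the thread specifies.

```python
import json


def detect_and_load(text):
    """Return ("json", data) for a JSON object/array, else ("markup", text)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not valid JSON at all: leave it for the XML/HTML selector.
        return ("markup", text)
    if isinstance(data, (dict, list)):
        return ("json", data)
    # Bare scalars such as "42" parse as JSON but are more likely plain text.
    return ("markup", text)


print(detect_and_load('{"A": "a"}'))
print(detect_and_load("<div>hi</div>"))
```

A real factory would return selector objects instead of tuples; the tuple just keeps the sketch self-contained.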

@dangra @kmike what do you think, fellows?

@voith
Author

voith commented Jan 27, 2016

Alright I'll prototype this idea and see how it goes.

@dangra
Member

dangra commented Jan 27, 2016

@eliasdorneles : do you recall XPathHtmlSelector, XPathXMLSelector, CSSHtmlSelector...?

I am not fond of using a different class for JMESPath; we already ditched that approach in favour of a single class with different methods per selection type.

Off the top of my head, the main reason I recall is simpler nesting of selection methods: response.css('div').xpath('.//script').jmespath(...)

@eliasdorneles
Member

@dangra I see.
Hm, my thinking was that the input for both would be different (a selector supporting JMESPath wants JSON, not HTML/XML).

Do we have a use case for response.xpath().jmes() or response.jmes().xpath()?
I suppose it could be useful when one has escaped HTML inside an AJAX JSON response or JSON inside an HTML attribute -- are those the ones you have in mind?

@kmike
Member

kmike commented Jan 27, 2016

There are useful use cases for chaining (e.g. processing data- attributes), but I don't think they are worth the extra complexity we may introduce to support them.

response.jmespath(...) or jmespath.search(...) covers most use cases nicely and is much easier to implement and understand.

@eliasdorneles
Member

@kmike curious how you're thinking about the implementation.
You mentioned response.jmespath -- we don't have response in Parsel, did you mean it as a method for the Selector class itself?

@dangra
Member

dangra commented Jan 28, 2016

There are useful use cases for chaining (e.g. processing data- attributes), but I don't think they are worth the extra complexity we may introduce to support them.

This is exactly what Parsel provides: it moves the implementation complexity away from users.

Do we have a use case for response.xpath().jmes() or response.jmes().xpath()?
I suppose it could be useful when one has escaped HTML inside an AJAX JSON response or JSON inside an HTML attribute -- are those the ones you have in mind?

Both examples of chaining JSON and HTML are valid, and making chaining easy is part of Parsel's philosophy.

@Digenis
Member

Digenis commented Jan 28, 2016

This is exactly what Parsel provides: it moves the implementation complexity away from users.

Users who try to subclass the selectors will end up facing these complexities.
So far, the implementation has been inviting for subclassing.

@dangra
Member

dangra commented Jan 28, 2016

@Digenis I don't think there is such complexity for users extending the Selector class. I can understand there was a bit when the CSS selection method was added, because behind the scenes it translates to XPath and reuses it. But for JMESPath this is going to be a completely new method; it doesn't interfere with existing methods at all.

I think we have two options:

  1. Adding a "trailing" selection method like .re(), named .jmespath(), which is a thin wrapper around jmespath.search().
    Pro: Simple to implement, and can be chained after xpath()/css().
    Con: Chaining JSON->XPath is not possible (although it is the less common case, I think).
  2. Implementing the full-fledged selection interface: the selection method returns a SelectorList() instance and extract() returns a list of unicode text.
    Pro: Chaining all the way is possible.
    Con: We may need a trailing method anyway to parse JSON.
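
Option 1 amounts to a terminal step that takes already-extracted strings and returns plain Python values, so nothing can be chained after it. A rough sketch of that shape, using only the standard library: `query_json_strings` is a hypothetical name, and the `query` callable stands in for what the proposal would do with `jmespath.search(expression, data)`.

```python
import json
from typing import Any, Callable, Iterable, List


def query_json_strings(extracted: Iterable[str],
                       query: Callable[[Any], Any]) -> List[Any]:
    """Decode each extracted string as JSON and apply `query` to the result."""
    results = []
    for text in extracted:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip values that are not JSON, e.g. ordinary attributes
        results.append(query(data))
    return results


# Strings as they might come out of .xpath('//@data-item').extract()
attrs = ['{"id": 7, "tags": ["a", "b"]}', "not json", '{"id": 9, "tags": []}']
ids = query_json_strings(attrs, lambda d: d["id"])
print(ids)
```

Because the return value is a plain list rather than a SelectorList, this illustrates the stated drawback: there is no way to continue with .xpath() afterwards.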

@dangra
Member

dangra commented Jan 28, 2016

Ok! I must admit option 2 is complex because we are parsing the DOM in the Selector constructor

but option 1 is still compelling, isn't it? :)

@kmike
Member

kmike commented Jan 28, 2016

I agree that option (1) looks easy enough to implement, but has anyone had a real use case for it? If I understood @voith properly, he wanted to parse JSON using Parsel (no XML/HTML involved at all), not to query some JSON data extracted from XML/HTML element attributes.

@voith
Author

voith commented Jan 29, 2016

Yes, I opened this issue with the intention of being able to parse JSON with Parsel. It would be great to have chained parsing, but implementing jmespath under an XML/HTML selector sounds complex, as the inputs are different.

@dangra
Member

dangra commented Jan 29, 2016

We can delay the parsing of the DOM until the first selection method is called. That will trigger json, xml, html parsing on demand.
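
A minimal sketch of that on-demand parsing idea, assuming a hypothetical `LazyDocument` holder and using `xml.etree.ElementTree` as a stand-in for lxml. It is not how Parsel is implemented; it only shows that the constructor can stay cheap and that each format is parsed at most once.

```python
import json
import xml.etree.ElementTree as ET
from functools import cached_property


class LazyDocument:
    """Hypothetical holder that parses its text only when first queried."""

    def __init__(self, text):
        self.text = text
        self.parse_count = 0  # instrumentation for the example only

    @cached_property
    def as_json(self):
        # Runs on first access; the result is cached on the instance.
        self.parse_count += 1
        return json.loads(self.text)

    @cached_property
    def as_xml(self):
        self.parse_count += 1
        return ET.fromstring(self.text)


doc = LazyDocument('{"user": {"name": "A"}}')
name = doc.as_json["user"]["name"]   # first access triggers json.loads
again = doc.as_json["user"]["name"]  # served from the cache, no re-parse
print(name, doc.parse_count)
```

A JSON query method would read `as_json` and an XPath method would read `as_xml`, so a document is never parsed in a format nobody asked for.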

@redapple
Contributor

Offering .json()/.jmespath()/.jsonpath() for a Selector instantiated with a JSON string, with type="json"? why not.
Being able to chain JSON selectors? why not as well.

But I don't see a compelling use case for chaining .xpath()/.css() and .json()/.jmespath()/.jsonpath()

Internally, in the current parsel implementation, once the input is parsed, chaining navigates inside the same parsed document tree; it does not re-parse to build a new document.

Take, say, some HTML document containing comments which themselves contain HTML code,
think facebook's view-source:https://www.facebook.com/JustinBieber/

<code class="hidden_elem" id="u_0_15">
<!-- <div class="_5ay5"><div class="_4-u2 _4-u8"><div id="u_0_14"></div></div></div> -->
</code>

parsel does not support something like selector.css('code#u_0_15').xpath('string(comment())').xpath('//@id') for the previous example:

>>> selector = parsel.Selector(text=u'''<code class="hidden_elem" id="u_0_15">
... <!-- <div class="_5ay5"><div class="_4-u2 _4-u8"><div id="u_0_14"></div></div></div> -->
... </code>''')

>>> selector.css('#u_0_15').xpath('comment()').extract_first()
u'<!-- <div class="_5ay5"><div class="_4-u2 _4-u8"><div id="u_0_14"></div></div></div> -->'

>>> selector.css('#u_0_15').xpath('string(comment())').extract_first()
u' <div class="_5ay5"><div class="_4-u2 _4-u8"><div id="u_0_14"></div></div></div> '

>>> selector.css('#u_0_15').xpath('string(comment())').xpath('//@id')
[]

You still have to reinject into another selector to work on the embedded HTML:

>>> parsel.Selector(
...         selector.css('#u_0_15').xpath('string(comment())').extract_first()
...     ).xpath('//@id').extract()
[u'u_0_14']
>>> 

@EchoShoot
Contributor

EchoShoot commented Jan 2, 2020

Mixed JSON and XML/HTML content is not rare when crawling.

I have submitted pull request #181.
It implements a method named jpath that can be used like xpath and css, with chaining.

Here are some examples.

When there is JSON in HTML:

<div>
    <h1>Information</h1>
    <content>
            {
              "user": [
                        { "name": "A", "age": 18},
                        {"name": "B","age": 32},
                        {"name": "C","age": 22},
                        {"name": "D","age": 25}
              ],
              "total": 4,
              "status": "ok"
            }
    </content>
</div>
  • Extract with this syntax:
>>> sel.xpath('//div/content').jpath('user[*].name').getall()
['A', 'B', 'C', 'D']
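
For comparison, the same extraction can be done by hand without the proposed chaining, in two explicit steps: pull the text out of the element, then decode it as JSON. This sketch uses only the standard library (`xml.etree.ElementTree` works here because the sample happens to be well-formed XML); the variable names are illustrative.

```python
import json
import xml.etree.ElementTree as ET

html = """<div>
    <h1>Information</h1>
    <content>
        {"user": [{"name": "A", "age": 18}, {"name": "B", "age": 32},
                  {"name": "C", "age": 22}, {"name": "D", "age": 25}],
         "total": 4, "status": "ok"}
    </content>
</div>"""

root = ET.fromstring(html)            # step 1: parse the markup
raw = root.find("content").text       # text node holding the JSON payload
data = json.loads(raw)                # step 2: parse the embedded JSON
names = [user["name"] for user in data["user"]]
print(names)
```

The chained call collapses these two parsing steps into one expression.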

When there is HTML in JSON:

{
    "content": [
                        { "name": "A", "value": "a" },
                        {"name": {"age": 18}, "value": "b"},
                        {"name": "C", "value": "c"},
                        {"name": "<a>D</a>", "value": "<div>d</div>"}
                    ],
    "html": [
                  "<div><a>AAA<br>Test</a>aaa</div><div><a>BBB</a>bbb<b>BbB</b><div/>"
                 ]
}
  • Extract with this syntax:
>>> sel.jpath('html').xpath('//div/a/text()').getall()
['AAA', 'Test', 'BBB']
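
The reverse direction can also be spelled out by hand: decode the JSON first, then feed each HTML fragment to a markup parser. The sketch below uses the standard library's `html.parser` and a hypothetical `AnchorText` class. It collects every text chunk inside `<a>` elements, which matches `//div/a/text()` for this sample but is a simplification in general (text nested deeper inside `<a>` would also be collected).

```python
import json
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect text chunks that appear inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <a> elements are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


payload = '{"html": ["<div><a>AAA<br>Test</a>aaa</div><div><a>BBB</a>bbb<b>BbB</b><div/>"]}'
parser = AnchorText()
for fragment in json.loads(payload)["html"]:  # step 1: JSON, step 2: HTML
    parser.feed(fragment)
parser.close()
print(parser.chunks)
```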

By the way, the selector calls json.loads() internally,
which means it can be used in the usual way, e.g. Selector(text='{"A":"a"}').
This would also make it easier to implement response.jpath('...') in Scrapy, rather than Selector(json=response.json()).

@EchoShoot
Contributor

EchoShoot commented Jan 3, 2020

Hey guys! I think we need to discuss which name is better, jsonpath? Jpath? Jmespath?

  • In my opinion, jpath is relatively short and easy for developers from all over the world to remember. It may be confusing at first, but would gradually become mainstream over time.
  • jsonpath is also a good name, but a bit long, which is not conducive to chained calls.
  • jmespath is difficult to remember, especially for someone whose first language is not English.

@Gallaecio
Member

JMESPath and JSONPath are different JSON query languages. If we use jpath, it is unclear which one we are using, and things can get worse if a new JSON query language is ever implemented with that name (JPath).

Moreover, just as we support 2 different HTML/XML query languages (CSS and XPath), at some point we may support multiple JSON query languages (e.g. JMESPath, JSONPath and jq); so I really believe that jpath is a bad choice in the long run.

Yesterday I found out that Parsel used to have a select method, probably back when only one of CSS and XPath was supported. Care to guess which one it used? :)

@EchoShoot
Contributor

You convinced me. I agree with you now and have decided to adopt jmespath. Thank you very much for your help. ^0^

@xPi2

xPi2 commented Jul 28, 2020

How is this going?
I'm trying to implement this myself on top of the parsel Selector in my own project, but I'm sure you know how to do it better.

@EchoShoot EchoShoot mentioned this issue Jul 28, 2020
@Gallaecio
Member

I’m not entirely against it, but given that we use selector.xpath() instead of selector.x(), I think jmespath is more coherent, and it is not that long.

@Granitosaurus

Granitosaurus commented Nov 16, 2020

Not to derail this, but I'd argue that implementing JSONPath[1] would actually be more fitting for parsel, as it is XPath-like. For example, JMESPath doesn't support recursive queries (like //node in XPath) while JSONPath does (as $..node); also, the whole protocol structure is much more similar to that of XPath.

Ideally it would be great to have both! More and more of the web uses JSON, and it would be great to have one good parser for both HTML and JSON.
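
To make the recursive-query point concrete, here is what a `$..node`-style lookup does, written as a plain-Python sketch with no JSONPath library involved. `find_all` is a hypothetical helper, not part of any of the libraries mentioned.

```python
def find_all(node, key):
    """Yield every value stored under `key` at any depth, in document order."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending, including into matched values.
            yield from find_all(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_all(item, key)


doc = {
    "node": 1,
    "child": {"node": 2, "items": [{"node": 3}, {"other": {"node": 4}}]},
}
values = list(find_all(doc, "node"))
print(values)
```

Without recursive descent, a query has to name each level explicitly, which only works when the nesting depth is known in advance.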

1 - https://github.com/h2non/jsonpath-ng jsonpath implementation in Python

@Gallaecio
Member

I’ve added JMESPath support to a real-life project, and I must say @Granitosaurus you are completely right. The lack of the concept of parent nodes in JMESPath can be quite limiting, just as in CSS. It feels like JMESPath is to JSONpath what CSS is to XPath.

So, once this is fixed, I agree we should aim to extend support to JSONpath. Hopefully it won’t be too hard at that point.

@EchoShoot
Contributor

I think JMESPath should be supported first, because it has been actively maintained over the years and has plenty of resources and documentation, so many developers can find a way to get started. Then we can wait for a better and more robust JSON parser to appear. This doesn't conflict, just like CSS doesn't conflict with XPath: both are supported by parsel at the same time.

@wRAR wRAR closed this as completed in #181 Apr 11, 2023