Change capture_token -> capture (and removal of the old capture) #58

NelsonEloi · 2022-03-04T08:28:49Z

Good day,

I don't think capture_token offers the same functionality as the old capture, just with the trailing whitespace removed.

Let me explain: if you had an old rule with capture that included productions (via rules or other captures), you got a callback with the lexeme, plus a parameter for each production.

capture_token first forces you to turn that rule into a token, and as a result you lose the parameters for the productions in the callback and get only the lexeme.

You could probably get around that by removing the capture_token/token pair, placing a position at the front and back, and building the lexeme in the callback.
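The workaround described above (a position before and after the rule, with the lexeme rebuilt in the callback) can be sketched in plain C++. This is only a toy model under stated assumptions, not lexy code: `Reader`, `KeyValue`, `parse_word` and `parse_key_value` are hypothetical names invented for illustration, and a `key=value` rule stands in for whatever rule is being captured.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>

// Toy stand-in for a parser's reader: the input plus a cursor (hypothetical).
struct Reader {
    std::string_view input;
    std::size_t pos = 0;
};

// What the "callback" ends up with: the rebuilt lexeme AND the inner values.
struct KeyValue {
    std::string_view lexeme; // rebuilt from the two recorded positions
    std::string_view key;
    std::string_view value;
};

// Hypothetical inner rule: consumes [A-Za-z0-9_]* and returns what it consumed.
std::string_view parse_word(Reader& r) {
    const std::size_t begin = r.pos;
    while (r.pos < r.input.size()
           && (std::isalnum(static_cast<unsigned char>(r.input[r.pos]))
               || r.input[r.pos] == '_')) {
        ++r.pos;
    }
    return r.input.substr(begin, r.pos - begin);
}

// Models: position + (word '=' word) + position
std::optional<KeyValue> parse_key_value(Reader& r) {
    const std::size_t begin = r.pos; // first "position"
    const std::string_view key = parse_word(r);
    if (key.empty() || r.pos >= r.input.size() || r.input[r.pos] != '=')
        return std::nullopt;
    ++r.pos; // consume '='
    const std::string_view value = parse_word(r);
    if (value.empty())
        return std::nullopt;
    const std::size_t end = r.pos; // second "position"
    assert(begin <= end);
    // The inner results are kept, and the lexeme is the slice between positions.
    return KeyValue{r.input.substr(begin, end - begin), key, value};
}
```

The point of the sketch is that nothing is lost compared to wrapping the whole rule in a capture: the inner values still arrive separately, and the full text is just the slice between the two recorded positions.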

Error propagation also works differently.

If the design is to stay as is, I think you could have capture_token turn every rule into a token by default, since it's going to force you into using it anyway.

Thank you,
Nelson Eloi

foonathan · 2022-03-04T09:49:55Z

Maintainer

My issue with the original design was the following:

- dsl::capture(token) with automatic whitespace skipping is essentially always a bug, as it will also capture the trailing whitespace. I ran into that bug myself while implementing dsl::integer and dsl::identifier, which used it internally at some point.
- dsl::capture(rule) with automatic whitespace skipping is almost always wrong as well, for the same reason.
- dsl::capture(rule) without automatic whitespace skipping is fine, but I don't expect it to be common. I think the vast majority of use cases just want to capture a token.
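To make the trailing-whitespace problem concrete, here is a minimal sketch in plain C++. It does not use lexy; it only models the two behaviours, and every function name in it is made up for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical token matcher: length of an identifier at the start of `input`.
std::size_t match_identifier(std::string_view input) {
    std::size_t n = 0;
    while (n < input.size()
           && (std::isalnum(static_cast<unsigned char>(input[n])) || input[n] == '_')) {
        ++n;
    }
    return n;
}

// Models automatic whitespace skipping: returns the first non-space index >= pos.
std::size_t skip_whitespace(std::string_view input, std::size_t pos) {
    assert(pos <= input.size());
    while (pos < input.size() && std::isspace(static_cast<unsigned char>(input[pos])))
        ++pos;
    return pos;
}

// Modelled old behaviour: the capture spans everything consumed while parsing
// the token, and that includes the whitespace skipped right after it.
std::string_view capture_including_whitespace(std::string_view input) {
    const std::size_t end = skip_whitespace(input, match_identifier(input));
    return input.substr(0, end);
}

// Modelled new behaviour: the capture ends where the token ends; whitespace
// is skipped only after the lexeme has been recorded.
std::string_view capture_token_only(std::string_view input) {
    return input.substr(0, match_identifier(input));
}
```

For the input `foo bar`, the first function yields `foo` plus the trailing space, while the second yields just `foo`, which is what a caller asking for "the identifier" almost always expects.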

Based on that, I've decided to make dsl::capture(token) do what you want and actually capture the token. As you've said, if you want the old behavior of dsl::capture(rule) you can emulate it using dsl::position + rule + dsl::position. Since that is the less common use case, I think the additional boilerplate is fine.

> If the design is to stay as is, I think you could have capture_token turn every rule into a token by default, since it's going to force you into using it anyway.

I don't want to implicitly change the parsing behavior/error handling/values in a context where it's observable. For example, dsl::peek() internally wraps everything into dsl::token(), but that's fine, since it's unobservable anyway. For dsl::capture(), it makes a difference, as you said, so I don't want to make it implicit.

I could add dsl::capture_token() back as dsl::capture(dsl::token(...)), but I don't think those couple of characters you save are worth it.

That being said, what is your use case where you have dsl::capture(non-token-rule)?

NelsonEloi · 2022-03-04T10:31:39Z

Author

Hello Jonathan,

I'm currently developing a parser for a DSL that is, in essence, a subset of C++. One of the things it's going to support is an *enum*. I'm still undecided whether I'm going to support everything a struct can be in C++, but the DSL is designed to be both parsed by this parser and still be compiled by a C++ compiler.

As reference: https://en.cppreference.com/w/cpp/language/enum

In simple terms, an enum is:

enum-key enum-name enum-base(optional) { enumerator-list(optional) };

Meaning the following 3 would be valid:

enum name {};
enum struct name {};
enum class name {};

I would like to parse that and produce different objects for the AST that also contain the full string used in them. This is why I was using capture(rule): just to get the parsed string. The enum-key and enum-name would still come from the rule, and I would get the lexeme and those productions in the callback.

I could make 3 production rules, one for each enum-key type (enum struct, enum class and enum, which needs to be checked last), and instantiate the correct object type for the AST, but everything after this first check is essentially the same, so that is code bloat.

I tried, but was not successful (I might have to check this again), using a base production and passing it the enum-key rule; the result would be a variant.

The other solution was to do something like:

rule = capture( enum >> capture_token( token( if_( struct | class ) ) ) + id + { enumerator_rule } );

Then the callback would have the signature:

lexeme(full rule), lexeme(struct|class|empty), id, enumerator

In this callback I could string-compare the second lexeme with struct|class|empty to instantiate the right object type and then assign the rest of the parameters to it. I would still return a variant from that callback.

Maybe I'm missing something and there is a simpler solution to this problem.

Thank you very much,
Nelson Elói

1 reply

foonathan Mar 4, 2022
Maintainer

Ah, I see. I have three solutions.

1. Use a rule like enum >> opt(capture(struct) | capture(class)) + id + body. This would produce (nullopt, lexeme) for enum foo, and (lexeme, lexeme) for enum struct foo and enum class foo.
2. Use a separate production for the name part: enum >> (dsl::p<enum_struct_name> | dsl::p<enum_class_name> | dsl::else_ >> dsl::p<enum_plain_name>) + body. Each production will then parse its keyword plus the identifier, and you'll have the opportunity to have different callbacks for them.
3. Convince me to re-add a better form of dsl::label. This was a rule I had previously, but I wasn't quite happy with the design. It allowed you to label a branch with a tag value that will be produced when the branch is matched. That way you can do enum >> (label<struct_tag>(struct) | label<class_tag>(class) | label<plain_tag>) + id + body.
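As a rough illustration of what the callback for the first solution has to do, here is a sketch in plain C++. It is a simplified model, not lexy code: the absent keyword is represented with `std::optional` here purely as an assumption for the sketch (in the library it arrives as a nullopt value of its own type), and `PlainEnum`, `StructEnum`, `ClassEnum`, `EnumNode` and `make_enum` are hypothetical names.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>
#include <variant>

// Hypothetical AST node types, one per enum-key.
struct PlainEnum  { std::string name; };
struct StructEnum { std::string name; };
struct ClassEnum  { std::string name; };
using EnumNode = std::variant<PlainEnum, StructEnum, ClassEnum>;

// Models the callback for: enum >> opt(capture(struct) | capture(class)) + id + body
// `key` is empty for a plain `enum`, otherwise it holds the captured keyword.
// The body is left out to keep the sketch short.
EnumNode make_enum(std::optional<std::string_view> key, std::string_view name) {
    if (!key)
        return PlainEnum{std::string(name)};
    if (*key == "struct")
        return StructEnum{std::string(name)};
    assert(*key == "class"); // the rule only lets "struct" or "class" through
    return ClassEnum{std::string(name)};
}
```

The second solution moves this dispatch out of the callback and into the grammar (one production per keyword), and the label idea would replace the string comparison with a tag type.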

NelsonEloi · 2022-03-04T12:14:57Z

Author

Initially I had: capture( enum >> capture_token( token( if_( struct | class ) ) ) + id + body )
This gives you 1 callback with 4 parameters:

1. lexeme with the full rule
2. lexeme with struct, class or empty
3. lexeme with id
4. body

Your solutions:

1. Doesn't have the full rule lexeme; it would require adding a position on both sides and rebuilding the lexeme from them, and you'd need 2 callbacks (or 1 if you use a constexpr if and check for dsl::nullopt). Surrounding it with capture would lose the inner productions.
2. Still needs the positions at the start and end, and would require either 3 callbacks or 1 with an auto parameter (3 would still be generated).
3. I don't think it is required.

For 1 I would do: position >> enum >> capture( token( if_( struct | class ) ) ) + id + body + position
For 2: position >> enum >> ( dsl::p<enum_struct_name> | dsl::p<enum_class_name> | dsl::else_ >> dsl::p<enum_plain_name> ) + body + position

Both could backtrack while parsing struct | class, but there is nothing you can do about that.

I like the first for its single callback, with no extra code generated (well, you still have 3 ifs inside it).
I like the second for its cleaner design.

All in all, I don't think you need to change anything; you can get the old capture-with-productions functionality with small changes.

Just a note: if you have a rule with an error inside the token, it doesn't get propagated outside. You just get the missing token error and not the specialized one.

Thank you very much for a great library.
