Change capture_token -> capture (and removal of the old capture) #58
Replies: 3 comments 1 reply
-
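My issue with the original design was the following:
- dsl::capture(token) with automatic whitespace skipping is essentially always a bug, as it will also capture the trailing whitespace. I ran into that bug myself while implementing dsl::integer and dsl::identifier, which used it internally at some point.
- dsl::capture(rule) with automatic whitespace skipping is almost always wrong as well, for the same reason.
- dsl::capture(rule) without automatic whitespace skipping is fine, but I don't expect it to be common. I think the vast majority of use cases just want to capture a token.
Based on that, I've decided to make dsl::capture(token) do what you want and actually capture the token. As you've said, if you want the old behavior of dsl::capture(rule) you can emulate it using dsl::position + rule + dsl::position. Since that is the less common use case, I think the additional boilerplate is fine.
I don't want to implicitly change the parsing behavior/error handling/values in a context where it's observable. For example, dsl::peek() internally wraps everything into dsl::token(), but that's fine, since it's unobservable anyway. For dsl::capture(), it makes a difference, as you said, so I don't want to make it implicit.
I could add dsl::capture_token() back as dsl::capture(dsl::token(...)), but I don't think those couple of characters you save are worth it.
That being said, what is your use case where you have dsl::capture(non-token-rule)?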
-
Hello Jonathan,
I'm currently developing a parser for a DSL that is, in essence, a subset of
C++. One of the things it's going to support is an *enum*. I'm still
undecided if I'm going to support everything a struct can be in C++, but
the DSL is designed to be both parsed by this parser and still be compiled
by a C++ compiler.
For reference: https://en.cppreference.com/w/cpp/language/enum
In simple terms, an enum is
enum-key enum-name enum-base(optional)
{ enumerator-list(optional) };
Meaning the following 3 would be valid.
enum name
{};
enum struct name
{};
enum class name
{};
I would like to parse that and produce different objects for the AST that
also contain the full string used in them. This is why I was using
capture(rule): just to get the parsed string. The enum-key and enum-name
would still come from the rule, and I would get the lexeme and those
productions in the callback.
I could make 3 production rules, one for each enum-key type (enum struct,
enum class and enum, which needs to be checked last), and instantiate the
correct object type for the AST, but everything after this first check is
essentially the same, so that would be code bloat.
I tried, but was unsuccessful (might have to check this again), using a base
production and passing it the enum-key rule; the result would be a variant.
The other solution was to do something like:
rule = capture( enum >> capture_token( token( if_( struct | class ) ) ) +
id + { enumerator_rule } );
then the callback would have the signature:
lexeme(full rule), lexeme(struct|class|empty), id, enumerator
In this callback I could string-compare the second lexeme against
struct|class|empty to instantiate the right object type and then assign the
rest of the parameters to it. The callback would still return a
variant.
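A minimal sketch of such a callback body, written in plain C++17 without lexy so that it stands on its own. The node types PlainEnum, EnumStruct and EnumClass, the EnumNode alias and the make_enum function are hypothetical names chosen for illustration, and the enumerator list is left out to keep it short:

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <variant>

// Hypothetical AST node types, one per enum-key.
struct PlainEnum  { std::string source, name; };
struct EnumStruct { std::string source, name; };
struct EnumClass  { std::string source, name; };

using EnumNode = std::variant<PlainEnum, EnumStruct, EnumClass>;

// Mirrors the callback signature sketched above:
// lexeme(full rule), lexeme(struct|class|empty), id.
inline EnumNode make_enum(std::string_view full, std::string_view key,
                          std::string_view name)
{
    if (key == "struct")
        return EnumStruct{std::string(full), std::string(name)};
    if (key == "class")
        return EnumClass{std::string(full), std::string(name)};

    // Assumption: the grammar only lets "struct", "class" or nothing through.
    assert(key.empty());
    return PlainEnum{std::string(full), std::string(name)};
}
```

The three branches are the string comparisons mentioned above; the caller can then use std::visit or std::holds_alternative on the returned variant.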
Maybe I'm missing something and there is a simpler solution to this problem.
Thank you very much,
Nelson Elói
…On Fri, Mar 4, 2022 at 9:50 AM Jonathan Müller ***@***.***> wrote:
My issue with the original design was the following:
- dsl::capture(token) with automatic whitespace skipping is
essentially always a bug, as it will also capture the trailing whitespace.
I ran into that bug myself while implementing dsl::integer and
dsl::identifier, which used it internally at some point.
- dsl::capture(rule) with automatic whitespace skipping is almost
always wrong as well, for the same reason.
- dsl::capture(rule) without automatic whitespace skipping is fine,
but I don't expect it to be common. I think the vast majority of use cases
just want to capture a token.
Based on that, I've decided to make dsl::capture(token) do what you want
and actually capture the token. As you've said, if you want the old
behavior of dsl::capture(rule) you can emulate it using dsl::position +
rule + dsl::position. Since that is the less common use case, I think the
additional boilerplate is fine.
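To make the trailing-whitespace problem concrete, here is a small standalone sketch in plain C++17. It does not use lexy and is not how lexy is implemented; Cursor, parse_identifier, skip_whitespace, capture_with_skipping and capture_token_only are hypothetical names. It assumes that whitespace skipping runs as part of the captured rule, which is what makes the captured range too long:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical cursor over the input; not part of lexy.
struct Cursor
{
    std::string_view input;
    std::size_t      pos = 0;
};

// Consumes [A-Za-z_][A-Za-z0-9_]*; returns false if no identifier starts here.
inline bool parse_identifier(Cursor& cur)
{
    auto is_start = [](unsigned char c) { return std::isalpha(c) || c == '_'; };
    auto is_rest  = [](unsigned char c) { return std::isalnum(c) || c == '_'; };
    if (cur.pos >= cur.input.size() || !is_start(cur.input[cur.pos]))
        return false;
    ++cur.pos;
    while (cur.pos < cur.input.size() && is_rest(cur.input[cur.pos]))
        ++cur.pos;
    return true;
}

inline void skip_whitespace(Cursor& cur)
{
    assert(cur.pos <= cur.input.size());
    while (cur.pos < cur.input.size()
           && std::isspace(static_cast<unsigned char>(cur.input[cur.pos])))
        ++cur.pos;
}

// Problematic variant: the skipped whitespace ends up inside the captured range.
inline std::string_view capture_with_skipping(Cursor& cur)
{
    const auto begin = cur.pos;
    if (!parse_identifier(cur))
        return {};
    skip_whitespace(cur);
    return cur.input.substr(begin, cur.pos - begin);
}

// Token-only variant: remember the end of the token, then skip whitespace.
inline std::string_view capture_token_only(Cursor& cur)
{
    const auto begin = cur.pos;
    if (!parse_identifier(cur))
        return {};
    const auto end = cur.pos;
    skip_whitespace(cur);
    return cur.input.substr(begin, end - begin);
}
```

Both variants leave the cursor at the same place; they differ only in where the captured range ends.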
If the design is to stay as is, I think you could have capture_token
turn every rule into a token by default, since it's going to force you into
using it anyway.
I don't want to implicitly change the parsing behavior/error
handling/values in a context where it's observable. For example,
dsl::peek() internally wraps everything into dsl::token(), but that's
fine, since it's unobservable anyway. For dsl::capture(), it makes a
difference, as you said, so I don't want to make it implicit.
I could add dsl::capture_token() back as dsl::capture(dsl::token(...)),
but I don't think those couple of characters you save are worth it.
------------------------------
That being said, what is your use case where you have
dsl::capture(non-token-rule)?
-
Initially I had: capture( enum >> capture_token( token( if_( struct | class ) ) ) + id + body )
Your solutions:
For 1 I would do position >> enum >> capture( token( if_( struct | class ) ) ) + id + body + position
Both could backtrack while parsing struct | class, but there is nothing you can do about it. I like the first, for the single-callback option without any extra code generated (well, you still have 3 ifs inside it).
All in all, I don't think you need to change anything; you can get the old capture-with-productions functionality with small changes.
Just a note: if you have a rule with an error inside the token, it doesn't get propagated outside. You just get the missing-token error and not the specialized one.
Thank you very much for a great library.
-
Good day,
I don't think capture_token offers the same functionality as the old capture, just with the trailing spaces removed.
Let me explain: if you had an old rule with capture that included productions (via rules or other captures), you got a callback with the lexeme plus a parameter for each production.
capture_token first forces you to turn that rule into a token, and as a result you lose the callback parameters for the productions and get only the lexeme.
You could probably get around that by removing the capture_token/token pair, placing a position in front and behind, and building the lexeme in the callback.
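As a rough illustration of that workaround, the following standalone C++17 sketch records a position before and after a hand-written "enum [struct|class] name" rule and rebuilds the lexeme from the two positions, while still handing back the individual values. It does not use lexy; EnumHeader, parse_enum_header and the helpers in detail are hypothetical names, and read_word is simplified so that it also accepts words starting with a digit:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>

struct EnumHeader
{
    std::string_view lexeme; // rebuilt from the two recorded positions
    std::string_view key;    // "struct", "class" or empty
    std::string_view name;
};

namespace detail
{
inline void skip_ws(std::string_view in, std::size_t& pos)
{
    while (pos < in.size() && std::isspace(static_cast<unsigned char>(in[pos])))
        ++pos;
}

// Reads one [A-Za-z0-9_]+ word at pos; returns an empty view if there is none.
inline std::string_view read_word(std::string_view in, std::size_t& pos)
{
    const auto begin = pos;
    while (pos < in.size()
           && (std::isalnum(static_cast<unsigned char>(in[pos])) || in[pos] == '_'))
        ++pos;
    return in.substr(begin, pos - begin);
}
} // namespace detail

inline std::optional<EnumHeader> parse_enum_header(std::string_view in)
{
    std::size_t pos = 0;
    detail::skip_ws(in, pos);
    const auto begin = pos; // "position" in front of the rule

    if (detail::read_word(in, pos) != "enum")
        return std::nullopt;
    detail::skip_ws(in, pos);

    auto             word = detail::read_word(in, pos);
    std::string_view key;
    if (word == "struct" || word == "class")
    {
        key = word;
        detail::skip_ws(in, pos);
        word = detail::read_word(in, pos);
    }
    if (word.empty())
        return std::nullopt;

    const auto end = pos; // "position" behind the rule, before trailing whitespace
    assert(begin <= end);
    return EnumHeader{in.substr(begin, end - begin), key, word};
}
```

Because the end position is taken before any trailing whitespace is skipped, the rebuilt lexeme does not include it, and the key and name are still available separately.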
Error propagation also works differently.
If the design is to stay as is, I think you could have capture_token turn every rule into a token by default, since it's going to force you into using it anyway.
Thank you,
Nelson Eloi