Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream parser fails to read long strings ending in a single quote #340

Open
YourFin opened this issue Jun 14, 2024 · 3 comments
Open

Stream parser fails to read long strings ending in a single quote #340

YourFin opened this issue Jun 14, 2024 · 3 comments
Labels

Comments

@YourFin
Copy link

YourFin commented Jun 14, 2024

When using the ion command line tool (and by extension this library), my understanding is that a file containing:

'''x''''

should be equivalent to

"x'"

Instead, it reads ~as:

"x"
Syntax error: unexpected "'" before end of file

I'm submitting this from my work laptop (feel free to look me up internally), but I can supply the actual error message from my personal machine when I get home if that would be helpful. I also have not tested this with the non-steam parser; this came up as I was attempting to a validate an ion tree-sitter grammar.

The x isn't required to surface the bug - ''''''' also fails to read as "'", but the x makes the example string easier to read.


I suspect this is a result of greedily treating three single quotes in a long string as termination, despite that not necessarily being the case.

As far as I can tell, there isn't any reason that this process should be greedy. Should a bunch of single quotes show up in a row at the end of a long string, parsing them while allowing for this case is unambiguous (1).

In terms of streaming performance, I also cannot see a reason why allowing for such strings would cause a change in behavior. Long strings can be split across multiple sections, so a full string can't be emitted from a stream reader when one long-string section is terminated.

E.g.: if a process was reading a stream of data via an ion stream reader:

---- Packet 1, T1 ----
'''\
Some long string
'''
---- Packet 2, T2 = T1 + 10 seconds ----
'''\
Line 2 of a long string
'''
---- Packet 3, T3 = T2 + 10 seconds ----
123

The long string starting in Packet 1 could not be processed fully processed until T3 anyways, due to the possibility of another long string section.


(1):

'''
x'// Parser state: a superposition of:
// - "in the middle of a long string section, last two characters in the long string are 
//   single quotes"
//   String so far: "\nx'"
//   ^ Non ' will collapse to this branch
// - "In the middle of a long string section start token"
//   String so far: "\nx"
'''
x''// Parser state: a superposition of:
// - "in the middle of a long string section, last two characters in the long string are 
//   single quotes"
//   String so far: "\nx''"
//   ^ Non ' will collapse to this branch
// - "In the middle of a long string section start token"
//   String so far: "\nx"
'''
x'''// Parser state: Long string containing "\nx"
'''
x''''// Parser state: a superposition of:
// - "A long string section, the contents of which end with a single quote" 
//   String so far: "\nx'"
//   ^ Non ' will collapse to this branch
// - "One ' read in a ''' long string section start token"
//   String so far: "\nx"
'''
x'''''// Parser state: a superposition of:
// - "A long string section, the contents of which end with a single quote"
//   String so far: "\nx''"
//   ^ Non ' will collapse to this branch
// - "Two ' read in a ''' long string section start token"
//   String so far: "\nx"
'''
x''''''// Parser state: Just started a new long string section. String so far: "\nx"
'''
x'''''''// Parser state: a superposition of: 
// - Just started a new long string section. 
//   String so far: "\nx'"
//   ^ Non ' will collapse into this branch
// - One ' into a ''' long string end token. 
//   String so far: "\nx"
'''
x''''''''// Parser state: a superposition of: 
// - Just started a new long string section. 
//   String so far: "\nx''"
//   ^ Non ' will collapse into this branch
// - Two ' into a ''' long string end token. 
//   String so far: "\nx"
'''
x'''''''''// Parser state: Just started a new long string section. String so far: "\nx"

At which point we loop back to the first state should we see another '

@YourFin YourFin added the bug label Jun 14, 2024
@tgregg
Copy link
Contributor

tgregg commented Jun 17, 2024

I believe this behavior is actually correct, and all Ion parsing implementations that I've tried with this input (ion-c, ion-rust, ion-java) behave the same way.

Note that this does not prevent you from encoding the text x' using a long string. You simply need to escape the ' character that is intended to be part of the data, e.g. '''x\''''.

A similar thing happens when using short strings, e.g. """ produces an error, but "\"" produces the text ".

@YourFin
Copy link
Author

YourFin commented Jun 18, 2024

Ah, I forgot about quoted symbols.

If Ion did not allow for quoted symbols, I think there would be a stronger case for not tokenizing ‘’’ greedily. As it stands, though, that greatly complicates the “seven single quotes in a row” handling in the cases where a symbol is valid following a string, and the permissive interpretation would probably come with some nasty surprises.

Sorry for the spurious report! Glad to see I was wrong on this one.

Would you accept a contribution of seven single quotes in a row as bad test data? I couldn't find it there, and it would be nice to make this policy explicit. On the other hand, it would be a bit of an odd case since the "bad" test data would start with a valid string.

@tgregg
Copy link
Contributor

tgregg commented Jun 18, 2024

Yes, I think that would be a good test case to add to bad/.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants