Set %option 8bit in KOREScanner.l #949

Merged
merged 1 commit into from
Jan 12, 2024
@Scott-Guest Scott-Guest commented Jan 11, 2024

Fixes #948

Flex does not directly support Unicode, but it does support `%option 8bit`, under which every 8-bit byte in the input stream is treated as a separate character. To fix #948, rather than escaping in the frontend, we can simply accept any byte sequence inside comments (in practice always UTF-8), so the original source text passes through unmodified.

However, 8-bit mode also makes `.` and negated character classes such as `[^bar]` accept non-ASCII bytes, which we sometimes want to disallow (e.g., in string literals), so we need to update every such regex accordingly.
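
As a rough illustration of the two points above, here is a minimal, self-contained flex sketch. It is not the actual contents of `KOREScanner.l`: the `COMMENT` start condition, the C-style comment delimiters, the string-literal rule, and the printed labels are all assumptions chosen for the example. Comments accept any byte, while string literals spell out an explicit printable-ASCII range instead of relying on `.` or a negated class.

```lex
%option 8bit noyywrap

%{
#include <stdio.h>
%}

%x COMMENT

%%

"/*"              { BEGIN(COMMENT); }
<COMMENT>"*/"     { BEGIN(INITIAL); }
<COMMENT>.|\n     { /* any byte is accepted here, so UTF-8 text in comments passes through */ }

\"([\x20-\x21\x23-\x5B\x5D-\x7E]|\\[\x20-\x7E])*\"  {
                    /* printable ASCII only: a negated class like [^"\\] would
                       also match bytes >= 0x80 in an 8-bit scanner */
                    printf("STRING %s\n", yytext);
                  }

[ \t\r\n]+        { /* skip whitespace */ }
.                 { printf("UNEXPECTED byte 0x%02x\n", (unsigned char)yytext[0]); }

%%

int main(void) { return yylex(); }
```

With rules like these, a non-ASCII byte inside a comment is silently consumed, whereas the same byte inside a string literal stops the string rule from matching and falls through to the catch-all rule.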

@Scott-Guest Scott-Guest self-assigned this Jan 11, 2024
@Scott-Guest Scott-Guest marked this pull request as ready for review January 11, 2024 23:05
Baltoli commented Jan 12, 2024

Confirmed that this fixes the issue; good catch @Scott-Guest!

@rv-jenkins rv-jenkins merged commit d7fd6d2 into master Jan 12, 2024
7 checks passed
@rv-jenkins rv-jenkins deleted the unicode-comments branch January 12, 2024 09:42
Baltoli pushed a commit that referenced this pull request Jan 15, 2024
Successfully merging this pull request may close these issues.

Non-ASCII characters in source file path break tokenizer