Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of Unicode vs Byte strings #828

Merged
merged 33 commits into from
Sep 26, 2023
Merged

Conversation

Scott-Guest
Copy link
Contributor

@Scott-Guest Scott-Guest commented Aug 17, 2023

Part of runtimeverification/k#3344

Fixes the conflation between String and Bytes with regards to how escapes are handled. Specifically, now

  • \xHH is always treated as U+00HH at parse time / when stored in a KOREStringPattern.
  • Created a separate SortCategory::Bytes and BYTES_LAYOUT
  • The runtime representation for strings now uses the bit just before the length to mark whether it represents a UTF-8 encoded string or a byte string.
  • We correctly print Unicode escapes (\x, \u, \U) versus individual byte escapes (only \x) depending on the sort.

A few other minor fixes along the way:

  • Fixed some undefined behavior in sfprintf by ensuring va_lists are copied rather than re-used
  • Renamed set_len to init_with_len because it clears also all non-length bits (got bit by this 😄)
  • Re-ordered SortCategory to ensure MInt is last so that valueType.cat + valueType.bits uniquely identifies a ValueType

Note that this just fixes the infrastructural issues - a few String hook algorithms still need to be updated to give correct results with non-ASCII characters.

@Scott-Guest Scott-Guest self-assigned this Aug 17, 2023
@Scott-Guest Scott-Guest marked this pull request as ready for review August 23, 2023 20:11
Copy link
Collaborator

@dwightguth dwightguth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good but it's incomplete. I can tell you tried to update every place where we were switching on the sort category, but you didn't get them all. If you see my comment above, we are defining cmake variables corresponding to each sort category and then turning them into a macro in config/macros.h. You need to make sure you update every place where those macros occur also. I suspect if this PR were merged in its current form, it would trigger regressions in garbage collection. I would recommend trying to create a test that exercises that regression and adding it to the test suite. It should just be a matter of having a term of sort Bytes live in the configuration during a minor collection cycle.

cmake/RuntimeConfig.cmake Show resolved Hide resolved
@rv-jenkins rv-jenkins merged commit 4c9d010 into master Sep 26, 2023
6 checks passed
@rv-jenkins rv-jenkins deleted the unicode-strings branch September 26, 2023 14:26
@Scott-Guest Scott-Guest restored the unicode-strings branch September 29, 2023 12:53
Scott-Guest added a commit that referenced this pull request Sep 29, 2023
Scott-Guest added a commit that referenced this pull request Sep 30, 2023
The Unicode changes caused a regression in the C-semantics. Reverting
the PR until this is investigated.

---------

Co-authored-by: rv-jenkins <devops@runtimeverification.com>
Scott-Guest added a commit that referenced this pull request Sep 30, 2023
rv-jenkins pushed a commit that referenced this pull request Oct 2, 2023
Using the same `va_list` multiple times is UB, and we should call
`va_copy` before each use instead. We also need to call `va_end`.

(Pulling this out from the #828 because it was reverted).

---------

Co-authored-by: rv-jenkins <devops@runtimeverification.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants