Improve handling of Unicode vs Byte strings #828
…lized variable warning
…te string. Update all printers.
6d64c73 to 463cb95
The code looks good but it's incomplete. I can tell you tried to update every place where we were switching on the sort category, but you didn't get them all. If you see my comment above, we are defining CMake variables corresponding to each sort category and then turning them into a macro in `config/macros.h`. You need to make sure you also update every place where those macros occur. I suspect that if this PR were merged in its current form, it would trigger regressions in garbage collection. I would recommend creating a test that exercises that regression and adding it to the test suite. It should just be a matter of having a term of sort Bytes live in the configuration during a minor collection cycle.
…set_is_bytes after this init
This reverts commit 4c9d010.
Using the same `va_list` multiple times is UB, and we should call `va_copy` before each use instead. We also need to call `va_end`. (Pulling this out from #828 because it was reverted.)

Co-authored-by: rv-jenkins <devops@runtimeverification.com>
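The `va_copy` / `va_end` discipline described above can be sketched as follows. This is a minimal illustration, not the backend's `sfprintf`: the helper name `format_twice` is hypothetical, and it simply needs to traverse the same variadic arguments twice (once to measure, once to write).

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not the backend's sfprintf): formats into a
// std::string, which requires two passes over the variadic arguments.
std::string format_twice(const char *fmt, ...) {
  assert(fmt != nullptr);

  va_list args;
  va_start(args, fmt);

  // First pass: measure the output. Work on a copy so `args` stays untouched.
  va_list measure;
  va_copy(measure, args);
  int len = std::vsnprintf(nullptr, 0, fmt, measure);
  va_end(measure); // every va_copy needs a matching va_end

  if (len < 0) {
    va_end(args);
    return std::string();
  }

  std::string out(static_cast<std::size_t>(len), '\0');

  // Second pass: write the output, again through a fresh copy.
  va_list emit;
  va_copy(emit, args);
  std::vsnprintf(&out[0], out.size() + 1, fmt, emit);
  va_end(emit);

  va_end(args); // and every va_start needs one too
  return out;
}
```

Passing `args` directly to both `vsnprintf` calls would be the re-use that the commit message calls out: after the first call the state of `args` is indeterminate.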
Part of runtimeverification/k#3344
Fixes the conflation between String and Bytes with regard to how escapes are handled. Specifically, now:

- `\xHH` is always treated as `U+00HH` at parse time / when stored in a `KOREStringPattern`.
- […] `char[]` of bytes when the runtime token is constructed
- `KOREStringPattern::contents` is always a UTF-8 encoded string and aligns with how Pyk represents Bytes (resolving "Python binding for `StringPattern.contents` is broken" #824)
- […] `SortCategory::Bytes` and `BYTES_LAYOUT`
- […] (`\x`, `\u`, `\U`) versus individual byte escapes (only `\x`) depending on the sort.

A few other minor fixes along the way:

- […] `sfprintf` by ensuring `va_list`s are copied rather than re-used
- […] `set_len` to `init_with_len` because it also clears all non-length bits (got bit by this 😄)
- […] `SortCategory` to ensure `MInt` is last so that `valueType.cat + valueType.bits` uniquely identifies a `ValueType`
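To see why putting `MInt` last makes `valueType.cat + valueType.bits` unique, here is a small sketch. The enum below is a hypothetical, abbreviated stand-in for the real `SortCategory`, and it assumes that `bits` is zero for every category except `MInt`; `make_value_type` and `value_type_key` are illustrative names only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, shortened category list. The only property that matters
// for this sketch is that MInt is the final enumerator.
enum class SortCategory : std::uint64_t {
  Map = 0, List, Int, Bool, Symbol, Bytes, MInt
};

struct ValueType {
  SortCategory cat;
  std::uint64_t bits; // assumed to be 0 unless cat == MInt
};

// Builds a ValueType while checking the assumption stated above.
ValueType make_value_type(SortCategory cat, std::uint64_t bits = 0) {
  assert(cat == SortCategory::MInt || bits == 0);
  return ValueType{cat, bits};
}

// Every non-MInt type maps to its own enumerator value (below MInt), and
// every MInt width maps to MInt + bits (at or above MInt), so no two
// distinct ValueTypes share a key.
std::uint64_t value_type_key(ValueType v) {
  return static_cast<std::uint64_t>(v.cat) + v.bits;
}
```

If `MInt` sat in the middle of the enum instead, `MInt` with `bits == 1` would produce the same key as the category that follows it.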
Note that this just fixes the infrastructural issues - a few String hook algorithms still need to be updated to give correct results with non-ASCII characters.
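The escape handling described in this PR can be illustrated with a short sketch: `\xHH` becomes the code point `U+00HH`, the pattern contents hold its UTF-8 encoding, and a Bytes value recovers one byte per code point. The function names are hypothetical, and the step that maps code points back to bytes is an assumption about how a Bytes token could be built from the UTF-8 contents, not a copy of the backend's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of `cp` to `out` (no surrogate validation).
void append_utf8(std::string &out, std::uint32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical conversion for Bytes: the input is UTF-8 whose code points
// are all <= U+00FF, and each code point becomes exactly one output byte.
std::string bytes_from_utf8(const std::string &utf8) {
  std::string bytes;
  for (std::size_t i = 0; i < utf8.size(); ++i) {
    unsigned char lead = static_cast<unsigned char>(utf8[i]);
    if (lead < 0x80) {
      bytes += static_cast<char>(lead); // ASCII is a single byte either way
    } else {
      // U+0080..U+00FF are encoded as 0xC2/0xC3 plus one continuation byte.
      assert((lead == 0xC2 || lead == 0xC3) && i + 1 < utf8.size());
      unsigned char cont = static_cast<unsigned char>(utf8[++i]);
      bytes += static_cast<char>(((lead & 0x1F) << 6) | (cont & 0x3F));
    }
  }
  return bytes;
}
```

Under these assumptions, `\xFF` is stored as the two bytes `0xC3 0xBF` in the pattern contents, and only becomes the single byte `0xFF` when the value has sort Bytes.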