Improve handling of Unicode vs Byte strings #828

Scott-Guest · 2023-08-17T15:05:53Z

Fixes the conflation between String and Bytes with regard to how escapes are handled. Specifically:

- \xHH is now always treated as U+00HH at parse time and when stored in a KOREStringPattern.
  - For Bytes, we later produce the actual char[] of bytes when the runtime token is constructed.
  - This ensures KOREStringPattern::contents is always a UTF-8 encoded string and aligns with how Pyk represents Bytes (resolving "Python binding for StringPattern.contents is broken" #824).
- Created a separate SortCategory::Bytes and BYTES_LAYOUT.
- The runtime representation for strings now uses the bit just before the length to mark whether it represents a UTF-8 encoded string or a byte string.
- Unicode escapes (\x, \u, \U) or individual byte escapes (only \x) are now printed correctly, depending on the sort.
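The Bytes decoding step described in the list above can be sketched as follows. This is only an illustration under stated assumptions: `latin1_utf8_to_bytes` is a hypothetical name, not the backend's `bytesStringPatternToBytes`, and the error handling is invented for the example. It assumes the input is UTF-8 in which every code point is at most U+00FF, as produced by parsing \xHH as U+00HH.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration: turns a UTF-8 string whose code
// points are all <= U+00FF into raw bytes, one byte per code point.
std::string latin1_utf8_to_bytes(std::string const &utf8) {
  std::string bytes;
  for (std::size_t i = 0; i < utf8.size(); ++i) {
    auto lead = static_cast<unsigned char>(utf8[i]);
    if (lead < 0x80) {
      // U+0000..U+007F: single-byte encoding, the byte is the code point.
      bytes.push_back(static_cast<char>(lead));
    } else if (lead == 0xC2 || lead == 0xC3) {
      // U+0080..U+00FF: two-byte encoding 110000xx 10yyyyyy.
      if (i + 1 >= utf8.size()) {
        throw std::invalid_argument("truncated UTF-8 sequence");
      }
      auto cont = static_cast<unsigned char>(utf8[++i]);
      if ((cont & 0xC0) != 0x80) {
        throw std::invalid_argument("invalid UTF-8 continuation byte");
      }
      bytes.push_back(static_cast<char>(((lead & 0x03) << 6) | (cont & 0x3F)));
    } else {
      // Anything else encodes a code point above U+00FF (or is malformed).
      throw std::invalid_argument("code point does not fit in one byte");
    }
  }
  // Each code point shrinks to exactly one byte, so the result never grows.
  assert(bytes.size() <= utf8.size());
  return bytes;
}
```

For example, the two-byte UTF-8 sequence C3 BF (U+00FF) becomes the single byte FF, while ASCII characters pass through unchanged.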

A few other minor fixes along the way:

- Fixed some undefined behavior in sfprintf by ensuring va_lists are copied rather than re-used.
- Renamed set_len to init_with_len because it also clears all non-length bits (got bit by this 😄).
- Re-ordered SortCategory to ensure MInt is last so that valueType.cat + valueType.bits uniquely identifies a ValueType.
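The interaction between the length field, the is_bytes bit, and init_with_len can be sketched roughly as below. The bit positions, the `header` struct, and the `length_of` accessor are assumptions made for illustration; the real masks are defined in the backend's runtime headers and may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: low 40 bits hold the length, the next bit marks Bytes.
constexpr std::uint64_t LENGTH_MASK = (std::uint64_t{1} << 40) - 1;
constexpr std::uint64_t IS_BYTES_BIT = std::uint64_t{1} << 40;

struct header {
  std::uint64_t hdr = 0;
};

// Overwrites the whole header, so every non-length bit is cleared.
// This is why flag setters have to run after it, not before.
inline void init_with_len(header &h, std::uint64_t len) {
  assert(len <= LENGTH_MASK);
  h.hdr = len;
}

inline void set_is_bytes(header &h, bool is_bytes) {
  if (is_bytes) {
    h.hdr |= IS_BYTES_BIT;
  } else {
    // Bitwise NOT keeps all other bits. Logical `!IS_BYTES_BIT` would
    // evaluate to 0 and wipe the entire header, including the length.
    h.hdr &= ~IS_BYTES_BIT;
  }
}

inline bool is_bytes(header const &h) {
  return (h.hdr & IS_BYTES_BIT) != 0;
}

inline std::uint64_t length_of(header const &h) {
  return h.hdr & LENGTH_MASK;
}
```

Under this sketch, calling set_is_bytes before init_with_len silently loses the flag, which matches the motivation given for the rename.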

Note that this only fixes the infrastructural issues; a few String hook algorithms still need to be updated to give correct results with non-ASCII characters.
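The SortCategory reordering listed among the minor fixes can be illustrated with a small sketch. The enumerator list here is a hypothetical subset (only Bytes and MInt are named in this PR), and the sketch assumes bits is 0 for every category except MInt; `key` and `mint` are invented helper names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of categories; only MInt being last matters here.
enum class SortCategory : std::uint64_t { Int, Bool, Symbol, Bytes, MInt };

struct ValueType {
  SortCategory cat;
  std::uint64_t bits; // assumed to be 0 unless cat == MInt
};

// With MInt last, every non-MInt type maps to a value below MInt and
// every MInt width maps to a distinct value at or above it.
constexpr std::uint64_t key(ValueType t) {
  return static_cast<std::uint64_t>(t.cat) + t.bits;
}

// Convenience constructor for machine integers of a given width.
inline ValueType mint(std::uint64_t bits) {
  assert(bits > 0);
  return ValueType{SortCategory::MInt, bits};
}

static_assert(key(ValueType{SortCategory::Bytes, 0}) == 3,
              "non-MInt categories keep their enum value");
static_assert(key(ValueType{SortCategory::MInt, 1}) == 5,
              "MInt keys start above every other category");
```

If MInt were declared earlier, say with value 1, then an MInt with bits == 2 would produce the key 3 and collide with the category whose enum value is 3.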

dwightguth

The code looks good but it's incomplete. I can tell you tried to update every place where we were switching on the sort category, but you didn't get them all. If you see my comment above, we are defining cmake variables corresponding to each sort category and then turning them into macros in config/macros.h. You need to make sure you also update every place where those macros occur. I suspect that if this PR were merged in its current form, it would trigger regressions in garbage collection. I would recommend trying to create a test that exercises that regression and adding it to the test suite. It should just be a matter of having a term of sort Bytes live in the configuration during a minor collection cycle.

cmake/RuntimeConfig.cmake

This reverts commit 4c9d010.

The Unicode changes caused a regression in the C-semantics. Reverting the PR until this is investigated.

Co-authored-by: rv-jenkins <devops@runtimeverification.com>

Using the same `va_list` multiple times is UB; we should call `va_copy` before each use instead, and we also need to call `va_end` on each copy. (Pulling this out from #828 because it was reverted.)

Co-authored-by: rv-jenkins <devops@runtimeverification.com>
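The `va_copy` / `va_end` pattern described above looks roughly like this. It is a minimal sketch, not the backend's sfprintf: `vformat` and `format` are hypothetical helpers that consume the argument list twice, once to measure and once to write.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats into a std::string. `args` is never passed
// to vsnprintf directly; each pass works on its own copy.
std::string vformat(char const *fmt, va_list args) {
  va_list measure;
  va_copy(measure, args);
  int const n = std::vsnprintf(nullptr, 0, fmt, measure);
  va_end(measure); // every va_copy needs a matching va_end
  if (n < 0) {
    return std::string();
  }

  std::string out(static_cast<std::size_t>(n), '\0');
  va_list emit;
  va_copy(emit, args);
  int const written = std::vsnprintf(&out[0], out.size() + 1, fmt, emit);
  va_end(emit);
  assert(written == n);
  (void)written; // silence unused warning when NDEBUG is defined
  return out;
}

// Variadic front end: owns the va_start / va_end pair for `args`.
std::string format(char const *fmt, ...) {
  va_list args;
  va_start(args, fmt);
  std::string out = vformat(fmt, args);
  va_end(args);
  return out;
}
```

Passing `args` itself to both vsnprintf calls would leave it in an indeterminate state after the first call, which is the undefined behavior the commit message refers to.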

Scott-Guest added 13 commits July 25, 2023 14:45

Always parse \xHH as U+00HH, then decode when needed for Bytes domain values

dd6d275

Merge remote-tracking branch 'origin/master' into unicode-strings

5b1934f

Fix bytesStringPatternToBytes -> kllvm::bytesStringPatternToBytes

588d624

bytesStringPatternToBytes(): Add dummy assignment to silence uninitialized variable warning

a85c868

Refactor bytesStringPatternToBytes

adc4d4b

Merge remote-tracking branch 'origin/master' into unicode-strings

df8bdc6

Add a bit to the string representation to indicate whether it is a byte string. Update all printers.

77c77ee

Add dummy return to silence warning

a3c88e7

Set IS_BYTES bit in bytes2string and string2bytes

6e31d8e

Rename missed usage of bytes2string to allocStringCopy

acfae41

escapeString: Correct lengths passed to snprintf

cf08f4f

emitGetToken: Fix type CurrentBlock -> CaseBlock for Bytes case

d64ef7a

Make bytesStringPatternToBytes extern C

26e4722

Scott-Guest self-assigned this Aug 17, 2023

Scott-Guest added 6 commits August 17, 2023 14:11

Refactor KOREScanner UTF-8 conversion to use UTF8EncodingType

5dbcb97

Convert Bytes back to UTF-8 encoded version when serializing

0670ca0

sfprintf: Correctly va_copy and va_end to avoid undefined behavior

f53aea9

Convert to UTF-8 encoded representation of Bytes when parsing decision tree

f1b34e3

Update test-unicode's output to use a Unicode escape

11270b0

Create Bytes SortCategory

463cb95

Scott-Guest force-pushed the unicode-strings branch from 6d64c73 to 463cb95 August 23, 2023 04:05

Added test-unicode-strings

0165962

Scott-Guest requested a review from dwightguth August 23, 2023 20:11

Scott-Guest marked this pull request as ready for review August 23, 2023 20:11

Merge branch 'master' into unicode-strings

14e71e1

dwightguth requested changes Aug 24, 2023

cmake/RuntimeConfig.cmake

Scott-Guest added 4 commits September 5, 2023 12:28

Merge remote-tracking branch 'origin/master' into unicode-strings

86e1e0c

Add BYTES_LAYOUT

a27d3f2

Add missing calls to set_is_bytes in hooks

92df03a

Set is_bytes bit in hook_BYTES_empty

88ee65d

Scott-Guest added 7 commits September 7, 2023 11:39

Set is_bytes to false in STRING hooks which delegate to BYTES

650a5a5

Correct ! to ~ in set_is_bytes

c78ce03

Update set_len to not unintentionally clear is_bytes bit

c9a4b1a

Fix set_len to also clear original NOT_YOUNG_OBJECT_BIT

11749cd

Restore set_len behavior, rename to init_with_len, move all calls to set_is_bytes after this init

591f80b

Fix json.cpp formatting, caused by bug in clang-format-14

3decc94

Reorder SortCategory so that MInt is last

44cf5fc

Scott-Guest requested a review from dwightguth September 19, 2023 14:27

dwightguth approved these changes Sep 25, 2023

Merge branch 'master' into unicode-strings

06cf769

Scott-Guest added the automerge label Sep 26, 2023

rv-jenkins merged commit 4c9d010 into master Sep 26, 2023
6 checks passed

rv-jenkins deleted the unicode-strings branch September 26, 2023 14:26

Scott-Guest mentioned this pull request Sep 28, 2023

Unicode support is inconsistent due to conflating Unicode strings and byte strings runtimeverification/k#3344

Open

1 task

Scott-Guest restored the unicode-strings branch September 29, 2023 12:53

Scott-Guest added a commit that referenced this pull request Sep 29, 2023

Revert "Improve handling of Unicode vs Byte strings (#828)"

6092863

This reverts commit 4c9d010.

Scott-Guest added a commit that referenced this pull request Sep 30, 2023

Revert "Revert "Improve handling of Unicode vs Byte strings (#828)" (#845)"

8be9c8d

This reverts commit a2577f4.

Scott-Guest mentioned this pull request Sep 30, 2023

Add calls to va_copy and va_end to fix UB in sfprintf #850

Merged

Scott-Guest mentioned this pull request Jan 11, 2024

Infinite loop for Bytes2String #847

Closed
