Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate character encoding #18

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

rotu
Copy link

@rotu rotu commented Apr 3, 2024

Where the schema expects a declared character encoding, validate that it is one of the character encodings recognized by the compiler. When using a smart editor, this makes it easier to author a .ksy file.

image

I took this list from the recognized encoding list in Kaitai Struct compiler.

"windows-1258",
"IBM437",
"IBM866"
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to keep this list in sync with https://github.com/kaitai-io/kaitai_struct_compiler/blob/master/shared/src/main/scala/io/kaitai/struct/EncodingList.scala

I wonder if there is any easy way out here? I can imagine:

  • some tool which will copy ksc's EncodingList.scala list to here automatically
  • some tool which will copy this file to EncodingList.scala
  • getting ksc to read ksy_schema.json as source of truth (but then we'll need to keep a list of aliases somewhere?)

Copy link
Member

@generalmimon generalmimon Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • some tool which will copy ksc's EncodingList.scala list to here automatically

I think this is the way to go (but I don't think it has to be "automatically"). Practically speaking, I'm thinking of adding a CI job to kaitai_struct_compiler that compares EncodingList.scala with the latest ksy_schema.json. If it finds any inconsistencies, it can do one of two things:

  1. just fail the CI job and thus prompt a human operator to look into the CI log, which would contain the details about the failed comparison. The human would then go here to https://github.com/kaitai-io/ksy_schema and update the list manually.

    This should be simple to implement and would probably do the job good enough. Yes, the updating would still need to be done manually, but I don't think the encoding list in KSC will be changing so rapidly that this becomes annoying.

  2. have @kaitai-bot open a PR with the update in https://github.com/kaitai-io/ksy_schema/pulls and wait for the human operator to merge it. This would perhaps be more convenient if we were changing the encoding list in KSC often, but as I already mentioned, I don't think we will, so this is probably an overkill. It adds a bunch of implementation challenges on top of the variant 1 - for example, editing the JSON file so that the overall formatting stays intact, opening a PR via some API, etc.

    To be honest, I don't think we need this, it seems overengineered. I suppose that there will be like less than 10 updates of the encoding list in the next 5 years, and we can certainly handle that much manually.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to keep this list in sync

The same could be said for the list of attributes (e.g. the valid attribute, implemented by the compiler but missing from the spec kaitai-io/kaitai_struct#944).

Practically speaking, I'm thinking of adding a CI job to kaitai_struct_compiler that compares EncodingList.scala with the latest ksy_schema.json

I think it would be ideal to add a CI job in https://github.com/kaitai-io/kaitai_struct_formats which checks that *.ksy files validate against the schema.

@GreyCat GreyCat self-requested a review April 4, 2024 11:50
Copy link
Member

@GreyCat GreyCat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me, if anything, that's a great first step towards user-friendly canonicalization.

@generalmimon
Copy link
Member

generalmimon commented Apr 4, 2024

@rotu On the one hand, naming Git branches after animals is kind of funny (I wonder what tool does that?), but on the other hand, sensible branch names that at least slightly indicate the content are better. It's not that different from the default naming scheme that GitHub uses for edits created via the Web GUI (patch-1, patch-2 etc.), but we can probably agree that this is not exactly best practice either.

Speaking of which, please try to write at least a brief explanation in the pull request description. In general, every change deserves it.

@rotu
Copy link
Author

rotu commented Apr 4, 2024

@rotu On the one hand, naming Git branches after animals is kind of funny (I wonder what tool does that?), but on the other hand, sensible branch names that at least slightly indicate the content are better. It's not that different from the default naming scheme that GitHub uses for edits created via the Web GUI (patch-1, patch-2 etc.), but we can probably agree that this is not exactly best practice either.

This is the default behavior of Visual Studio Code. The branch names should not be conveying meaning - please see the PR for canonical description.

Speaking of which, please try to write at least a brief explanation in the pull request description. In general, every change deserves it.

I thought the title of the PR was sufficient but I'll elaborate.

@generalmimon
Copy link
Member

generalmimon commented Apr 4, 2024

The branch names should not be conveying meaning - please see the PR for canonical description.

Why shouldn't they?

I thought the title of the PR was sufficient but I'll elaborate.

It's not just about what you did (commit / PR titles and code are probably indeed sufficient for that), but why and how you did it are often important questions. For example, here we had to guess that you took the list from EncodingList.scala, which you didn't mention. In kaitai-io/kaitai_struct#1101, @GreyCat had to ask you why you're making that change.

And I'm not talking just about this PR, but in general - most of the time, leaving PR descriptions empty is a bad habit.

@rotu
Copy link
Author

rotu commented Apr 4, 2024

The branch names should not be conveying meaning - please see the PR for canonical description.

Why shouldn't they?

Here's why I don't like naming my branches: (1) branch names are immutable, even if the scope of a PR changes or you named the branch poorly to start (2) it creates a conflicting source of truth for the intent of a branch.

I thought the title of the PR was sufficient but I'll elaborate.

It's not just about what you did (commit / PR titles and code are probably indeed sufficient for that), but why and how you did it are often important questions. For example, here we had to guess that you took the list from EncodingList.scala, which you didn't mention). In kaitai-io/kaitai_struct#1101, @GreyCat had to ask you why you're making that change.

And I'm not talking just about this PR, but in general - most of the time, leaving PR descriptions empty is a bad habit.

Good points!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants