Special character "=" not allowed in file path but necessary for parquet data with partitions #10933

Open
kuriwaki opened this issue Oct 17, 2024 · 0 comments
Labels
Type: Bug a defect

Comments

@kuriwaki
Member

What happens?

  • A partitioned Parquet dataset splits its data into subsets and stores each subset in a subdirectory named as a key-value pair joined by "=", as in state=MA. However, Dataverse does not appear to allow "=" in a subdirectory name. Moreover, it silently converts "=" in subdirectory paths into periods ("."). So when an unsuspecting user downloads the dataset with its directory hierarchy, the result is a corrupt Parquet dataset that cannot be read (manually renaming the subdirectories after download makes it readable again).
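
The manual renaming mentioned above could be scripted roughly as follows. This is only a sketch: the function name `restore_partition_dirs` and the `partition_keys` argument are hypothetical, and it assumes that the first "." in an affected directory name is the one that replaced "=". Only directories whose name starts with a known partition key are touched.

```python
import os
from pathlib import Path


def restore_partition_dirs(root, partition_keys):
    """Rename directories such as 'state.MA' back to 'state=MA' under root.

    partition_keys: names of the partition columns the caller expects.
    Returns the number of directories renamed.
    """
    renamed = 0
    # topdown=False visits the deepest directories first, so renaming a
    # parent never invalidates a child path that is still to be processed.
    for current, dirnames, _ in os.walk(root, topdown=False):
        for name in dirnames:
            key, sep, value = name.partition(".")
            # Assumption: the first "." stands in for the original "=".
            if sep and value and key in partition_keys:
                source = Path(current) / name
                source.rename(source.with_name(f"{key}={value}"))
                renamed += 1
    return renamed
```

For example, `restore_partition_dirs("downloaded_dataset", {"state"})` would rename `downloaded_dataset/state.MA` to `downloaded_dataset/state=MA` and leave directories with other names alone.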

What steps does it take to reproduce the issue? / Which page(s) does it occur on?

When does this issue occur?

  • When a partitioned Parquet dataset is uploaded to Dataverse. One way to avoid this is to double-zip the dataset before uploading (Support uploading of archives (ZIP, other). #8029 (comment)) so that the partitions are not unzipped. An example appears in my dataverse. However, that is cumbersome to do manually, and it also defeats some of the purpose of partitioning.
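
The double-zip workaround could be automated along these lines. This is a sketch, not a tested recipe for Dataverse: it assumes Dataverse unpacks only the outer archive on upload, as the linked comment suggests, and the function name `double_zip` is hypothetical. The inner archive is built in memory, which is only reasonable for datasets that fit comfortably in RAM.

```python
import io
import zipfile
from pathlib import Path


def double_zip(dataset_dir, outer_zip_path):
    """Wrap dataset_dir in an inner zip, then wrap that zip in an outer zip."""
    dataset_dir = Path(dataset_dir)
    inner_buffer = io.BytesIO()
    with zipfile.ZipFile(inner_buffer, "w", zipfile.ZIP_DEFLATED) as inner:
        for path in sorted(dataset_dir.rglob("*")):
            if path.is_file():
                # Keep the top-level folder name and the "key=value"
                # partition directories inside the inner archive.
                arcname = path.relative_to(dataset_dir.parent).as_posix()
                inner.write(path, arcname)
    # The inner zip is already compressed, so the outer one just stores it.
    with zipfile.ZipFile(outer_zip_path, "w", zipfile.ZIP_STORED) as outer:
        outer.writestr(dataset_dir.name + ".zip", inner_buffer.getvalue())
    return Path(outer_zip_path)
```

Calling `double_zip("my_dataset", "my_dataset_upload.zip")` would produce an outer archive whose only member is `my_dataset.zip`, so the "=" directory names stay hidden inside the inner archive.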

What did you expect to happen?

  • Dataverse should retain the "=" in a file path, or at least warn the user that it changed the path.

Which version of Dataverse are you using?

  • Latest Harvard Dataverse and Demo Dataverse

Screenshots:

See one in #9897 (comment)

Are you thinking about creating a pull request for this issue?
I cannot

kuriwaki added the Type: Bug a defect label Oct 17, 2024
kuriwaki changed the title from 'Special character "=" necessary for parquet format not allowed' to 'Special character "=" not allowed in file path but necessary for parquet data with partitions' Oct 17, 2024