What happens?
Partitioned Parquet datasets split the data into subsets and store each subset in a subdirectory whose name is a key-value pair joined by "=", as in state="MA". However, Dataverse does not appear to allow "=" in a subdirectory name. Moreover, it silently converts "=" in subdirectory paths into periods ("."). So when an unknowing user downloads the Parquet dataset with its directory hierarchy, the result is a corrupt dataset that cannot be read (manually renaming the subdirectories after download makes it readable again).
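The manual renaming mentioned above can be scripted. The following is only a sketch of a post-download repair, not anything Dataverse provides: the function name `restore_partition_dirs` is hypothetical, and it assumes you already know the partition keys (for example `state`) and that each affected directory is named `key.value`, with the first period standing in for the original "=".

```python
import os


def restore_partition_dirs(root, partition_keys):
    """Rename directories such as 'state.MA' back to 'state=MA' under root.

    partition_keys is the set of partition column names you expect
    (an assumption: the caller must supply these).
    Returns a list of (old_path, new_path) pairs that were renamed.
    """
    renamed = []
    # Walk bottom-up so nested partition directories are renamed
    # before the parent directories that contain them.
    for dirpath, dirnames, _ in os.walk(root, topdown=False):
        for name in dirnames:
            # Split on the first period only, so values that contain
            # periods (e.g. 'price.1.5') keep the rest intact.
            key, sep, value = name.partition(".")
            if sep and value and key in partition_keys:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dirpath, f"{key}={value}")
                os.rename(src, dst)
                renamed.append((src, dst))
    return renamed
```

For example, `restore_partition_dirs("downloaded_dataset", {"state"})` would rename `downloaded_dataset/state.MA` to `downloaded_dataset/state=MA`; directories whose prefix is not a listed key are left untouched.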
What steps does it take to reproduce the issue? / Which page(s) does it occur on?
See also the attached zip file that I tried to upload, as it was before unzipping: mwe_parq.zip
When does this issue occur?
When a partitioned Parquet dataset is uploaded to Dataverse. One way to avoid this is to double-zip the dataset before uploading (see Support uploading of archives (ZIP, other). #8029 (comment)) so that the partitions are not unzipped. An example appears in my dataverse. However, that is cumbersome to do manually, and it also defeats some of the purpose of partitioning.
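For anyone who wants to automate the double-zip workaround, here is a minimal sketch using only the Python standard library. The function name `double_zip` and the file names are hypothetical, and it relies on the behavior described above, namely that only the outer archive is unpacked on upload so the inner zip (with its "=" directory names) is stored intact.

```python
import os
import zipfile


def double_zip(dataset_dir, outer_zip_path):
    """Zip dataset_dir into an inner archive, then wrap that archive in an outer zip.

    outer_zip_path should be located outside dataset_dir.
    """
    dataset_dir = os.path.abspath(dataset_dir)
    parent = os.path.dirname(dataset_dir)
    inner_name = os.path.basename(dataset_dir) + ".zip"
    inner_path = outer_zip_path + ".inner.tmp"  # temporary file, removed below

    # Inner archive: keeps the partition hierarchy, e.g. ds/state=MA/part-0.parquet
    with zipfile.ZipFile(inner_path, "w", zipfile.ZIP_DEFLATED) as inner:
        for dirpath, _, filenames in os.walk(dataset_dir):
            for filename in filenames:
                full = os.path.join(dirpath, filename)
                inner.write(full, os.path.relpath(full, parent))

    # Outer archive: contains only the inner zip, stored without recompression.
    with zipfile.ZipFile(outer_zip_path, "w", zipfile.ZIP_STORED) as outer:
        outer.write(inner_path, inner_name)

    os.remove(inner_path)
    return outer_zip_path
```

Calling `double_zip("ds", "ds_upload.zip")` produces `ds_upload.zip` containing a single entry `ds.zip`; users then have to unzip the downloaded inner archive themselves, which is the cumbersome part noted above.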
What did you expect to happen?
Dataverse should retain the "=" in a file path, or at least warn that it changed the file path.
kuriwaki changed the title from "Special character "=" necessary for parquet format not allowed" to "Special character "=" not allowed in file path but necessary for parquet data with partitions" on Oct 17, 2024
Which version of Dataverse are you using?
Screenshots:
See one in #9897 (comment)
Are you thinking about creating a pull request for this issue?
I cannot