Currently, this project does not have a stable way to detect a CSV file's configuration (its dialect). An example is raised in #1719, where the utility fails to detect the configuration of the given files.
Details
At the moment, @jqnatividad has begun digging into the problem, saying:
"Perhaps, we can tag-team on qsv-sniffer to make its CSV schema inferencing more reliable?"
He also pointed out:
"Aligning qsv-sniffer's behavior with python's csv sniffer is the way to go!"
The work planned so far is outlined in jqnatividad/qsv-sniffer#14. Currently, all tasks are under study and none has been completed.
New path
Here I discuss a new approach to implementing dialect detection in qsv using simple building blocks:
Regexes: determine the data types of fields.
The currently implemented parser: load the data.
Table Uniformity measure: detect the table with the best structure.
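To illustrate the first building block, here is a minimal sketch of regex-based field typing. The pattern set and the names `TYPE_PATTERNS` and `detect_type` are hypothetical, chosen only for illustration; they are not taken from qsv, qsv-sniffer, or CSVsniffer, and a real implementation would likely need a broader set of patterns.

```python
import re

# Hypothetical pattern set for illustration only; not the patterns
# used by qsv, qsv-sniffer, or CSVsniffer.
TYPE_PATTERNS = {
    "integer": re.compile(r"[+-]?\d+"),
    "float": re.compile(r"[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "boolean": re.compile(r"(?:true|false)", re.IGNORECASE),
}


def detect_type(field: str) -> str:
    """Return the name of the first pattern that matches the whole field."""
    value = field.strip()
    if not value:
        return "empty"
    for name, pattern in TYPE_PATTERNS.items():
        # fullmatch ensures the entire field fits the pattern,
        # not just a prefix of it.
        if pattern.fullmatch(value):
            return name
    return "text"


if __name__ == "__main__":
    for sample in ["42", "3.14", "2024-10-25", "True", "hello", ""]:
        print(repr(sample), "->", detect_type(sample))
```

The idea is that a wrong dialect tends to produce fields that fall back to "text" (for example `1;2;3` parsed as a single field), while the right one yields fields with recognizable types.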
With this approach, dialect detection is at least as reliable as CleverCSV's and can produce results with greater certainty. The process is as follows:
In the first phase, potential dialects are built from combinations of field/column separator, quotation mark, and record delimiter characters. At this stage the user can provide a custom delimiter list, which gives the tool some flexibility.
With each potential dialect, we attempt to parse the CSV file and use the data to construct a temporary table.
The table is scored using the Table Uniformity measure. Each score is saved in a collection, using the dialect as the key.
The dialect that produces the table with the highest score is then selected as the detected one.
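The steps above can be sketched roughly as follows. This is only an outline under stated assumptions: `uniformity_score` is a deliberately simplified stand-in (row-width consistency weighted by column count) and is not the actual Table Uniformity measure from the research; all function names are hypothetical; and the record delimiter is left to Python's `csv` module rather than enumerated as a candidate.

```python
import csv
import io
from collections import Counter
from itertools import product


def candidate_dialects(delimiters=(",", ";", "\t", "|"), quotechars=('"', "'")):
    """Phase 1: build potential dialects; callers may pass custom delimiters."""
    return list(product(delimiters, quotechars))


def parse(text, delimiter, quotechar):
    """Phase 2: parse the text into a temporary table (list of rows)."""
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, quotechar=quotechar)
    return [row for row in reader if row]


def uniformity_score(rows):
    """Simplified stand-in score, NOT the real Table Uniformity measure.

    Rewards tables whose rows share the same width and have 2+ columns.
    """
    if not rows:
        return 0.0
    widths = [len(row) for row in rows]
    modal_width, count = Counter(widths).most_common(1)[0]
    if modal_width < 2:
        return 0.0
    consistency = count / len(widths)
    return consistency * (modal_width - 1) / modal_width


def detect_dialect(text, delimiters=(",", ";", "\t", "|")):
    """Phases 3 and 4: score every candidate, keep the best one."""
    scores = {}
    for delimiter, quotechar in candidate_dialects(delimiters):
        try:
            rows = parse(text, delimiter, quotechar)
        except csv.Error:
            continue  # this candidate cannot parse the data at all
        scores[(delimiter, quotechar)] = uniformity_score(rows)
    if not scores:
        raise ValueError("no candidate dialect could parse the input")
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    sample = "name;age;city\nAna;30;Lima\nLuis;25;Quito\n"
    best, scores = detect_dialect(sample)
    print("detected:", best)
```

A real implementation would replace `uniformity_score` with the measure described in the research and could fold the regex-based type information into the score, so that ties between candidates with equally regular tables are broken by how well their fields are typed.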
A Python implementation of this exact approach is described in a GitHub repository. The evaluation of this method gives the following results:
| Tool | F1 score |
| --- | --- |
| CSVsniffer | 0.9260 |
| CleverCSV | 0.8425 |
| csv.Sniffer | 0.8049 |
These results highlight one point: on the research datasets, the presented approach outperforms both csv.Sniffer and CleverCSV.
Hoping this can help this wonderful project!
Edit:
Code snippet will be presented in the discussion.
You can use the paper just to implement some of the logic if porting the Python code gets confusing. In other words, treat the research as a backup reference when diving into the implementation.
Discussed in #2246
Originally posted by ws-garcia October 25, 2024