
CSV dialect detection: implementation without third party libraries #2247

Open
ws-garcia opened this issue Oct 25, 2024 Discussed in #2246 · 2 comments
Labels: enhancement (New feature or request. Once marked with this label, it's in the backlog.)
ws-garcia commented Oct 25, 2024

Discussed in #2246

Originally posted by ws-garcia October 25, 2024

Problem overview

Currently, this project does not have a reliable way to detect a CSV file's dialect (its delimiter, quoting, and record-terminator configuration). An example of this is raised in #1719, where the utility fails to detect the configuration for the given files.

Details

At the moment, @jqnatividad has begun digging into the problem, asking:

Perhaps, we can tag-team on qsv-sniffer to make its CSV schema inferencing more reliable?

He also pointed out:

Aligning qsv-sniffer's behavior with python's csv sniffer is the way to go!

The work path so far is outlined in jqnatividad/qsv-sniffer#14. Currently, all tasks are under study but not completed.

New path

In this issue I will discuss a new approach to implementing dialect detection in qsv using three simple building blocks:

  • Regexes: determine field data types.
  • Currently implemented parser: load the data.
  • Table Uniformity measure: detect the table with the best structure.
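The first building block can be sketched in Python. The type names and regexes below are illustrative assumptions, not the actual patterns used by qsv or the referenced implementation:

```python
import re

# Hypothetical type patterns, ordered from most to least specific.
# A real detector would cover more types (currency, ISO timestamps, booleans, ...).
TYPE_PATTERNS = [
    ("integer", re.compile(r"^-?\d+$")),
    ("float", re.compile(r"^-?\d+\.\d+$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
]

def infer_type(field: str) -> str:
    """Return the first matching data type for a field, else 'text'."""
    field = field.strip()
    if not field:
        return "empty"
    for name, pattern in TYPE_PATTERNS:
        if pattern.match(field):
            return name
    return "text"
```

Per-column type consistency derived from a function like this is what feeds the uniformity scoring described below.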

With this approach, dialect detection is as reliable as CleverCSV's, and it can obtain results with greater certainty. The process is as follows:

  • In the first phase, potential dialects are built from combinations of field/column separator, quotation mark, and record delimiter characters. At this stage the user can provide a custom delimiter list, giving the tool a level of flexibility.
  • With each potential dialect, we attempt to parse the CSV file and use the data to construct a temporary table.
  • Each table is scored using the Table Uniformity measure. Each score is saved in a collection keyed by its dialect.
  • The dialect that produces the table with the highest score is then selected as the desired one.
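The four steps above can be sketched in Python using the standard-library `csv` parser. The `uniformity_score` here is a deliberately simplified stand-in for the Table Uniformity measure from the paper (the real measure also weighs the field data types inferred by the regexes):

```python
import csv
import io
from itertools import product

# Candidate dialect components; in the proposal the user may
# supply a custom delimiter list.
DELIMITERS = [",", ";", "\t", "|"]
QUOTECHARS = ['"', "'"]

def uniformity_score(rows):
    """Toy stand-in for Table Uniformity: reward tables whose rows
    share the same multi-column width."""
    if not rows:
        return 0.0
    widths = [len(r) for r in rows]
    most_common = max(set(widths), key=widths.count)
    if most_common < 2:
        return 0.0  # a one-column table carries no separator evidence
    return widths.count(most_common) / len(widths) * most_common

def detect_dialect(sample: str):
    """Parse the sample with every candidate dialect; return the
    (delimiter, quotechar) pair whose table scores highest."""
    scores = {}
    for delim, quote in product(DELIMITERS, QUOTECHARS):
        try:
            rows = list(csv.reader(io.StringIO(sample),
                                   delimiter=delim, quotechar=quote))
        except csv.Error:
            continue  # this candidate cannot parse the sample
        scores[(delim, quote)] = uniformity_score(rows)
    return max(scores, key=scores.get)
```

For example, `detect_dialect("a;b;c\n1;2;3\n")` picks `;` because only that delimiter yields a consistent multi-column table; every other candidate collapses the sample into single-column rows with a score of zero.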

A Python implementation of this exact approach is described in a GitHub repository. Evaluating these methods gives:

| Tool        | F1 score |
| ----------- | -------- |
| CSVsniffer  | 0.9260   |
| CleverCSV   | 0.8425   |
| csv.Sniffer | 0.8049   |

This highlights one point: the presented approach clearly outperforms csv.Sniffer, and also CleverCSV, on the research datasets.

Hoping this can help this wonderful project!

Edit:

A code snippet will be presented in the discussion.

@jqnatividad added the enhancement label on Oct 25, 2024
jqnatividad (Owner) commented:
Thanks @ws-garcia !

This is very timely as I was dreading taking on the csv-sniffer python port, thus the lack of activity.

Your step-by-step "new path" breakdown is certainly easier to digest than the paper :)

Will be sure to loop you in as we mark progress...

@jqnatividad jqnatividad self-assigned this Oct 25, 2024
ws-garcia (Author) commented Oct 25, 2024

If porting the Python code proves confusing, you can use the paper to implement the logic instead. In other words, treat the research as a backup reference for diving into the implementation.
