🔨 Dataset Additions: CSV data, custom validation set, dataset filtering and splitting support #2239

Open · wants to merge 39 commits into main

Conversation

@samet-akcay (Contributor) commented Aug 8, 2024

📝 Description

This PR introduces the following changes:

### CSV Data Support

  • With this PR, users can provide a CSV file describing their custom dataset (a hypothetical file layout is sketched after this example). For example:
        >>> data_module = CSV(
        ...     name="custom_format_dataset",
        ...     csv_path="path/to/sample_dataset.csv",
        ...     task=TaskType.CLASSIFICATION,
        ...     sep=";",
        ...     extension=[".jpg", ".png"],
        ... )
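    The expected file layout is not spelled out in this description; purely as a hypothetical illustration (the column names below are invented, and the actual schema is defined by the csv.py implementation added in this PR), a file consumed as above might look like:

        image_path;label;split
        images/0001.jpg;normal;train
        images/0002.png;normal;train
        images/0101.jpg;abnormal;test
        images/0102.png;normal;test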

### New Splitting Mechanism via SplitMode

  • The existing splitting mechanisms behind TestSplitMode and ValSplitMode contain duplication and are confusing to use. Instead, we introduce SplitMode to standardise the splitting mechanism across all subsets.
  • This PR also provides a backward compatibility layer that maps the old splitting keys to the new ones (a sketch of the resolver follows the examples below).
from enum import Enum


class SplitMode(str, Enum):
    SYNTHETIC = "synthetic"
    PREDEFINED = "predefined"
    AUTO = "auto"
        >>> resolve_split_mode(TestSplitMode.NONE)  # Legacy input (deprecated)
        DeprecationWarning: The split mode TestSplitMode.NONE is deprecated. Use 'SplitMode.AUTO' instead.
        SplitMode.AUTO

        >>> resolve_split_mode(TestSplitMode.SYNTHETIC)  # Legacy input (deprecated)
        DeprecationWarning: The split mode TestSplitMode.SYNTHETIC is deprecated. Use 'SplitMode.SYNTHETIC' instead.
        SplitMode.SYNTHETIC

        >>> resolve_split_mode(ValSplitMode.FROM_TRAIN)  # Legacy input (deprecated)
        DeprecationWarning: The split mode ValSplitMode.FROM_TRAIN is deprecated. Use 'SplitMode.AUTO' instead.
        SplitMode.AUTO

        >>> resolve_split_mode(SplitMode.PREDEFINED)  # Current input (preferred)
        SplitMode.PREDEFINED
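    As a rough illustration of that compatibility layer (a sketch only: the function body below is assumed, with the legacy-to-new mapping mirroring the review discussion further down), the resolver could look like:

        import warnings
        from enum import Enum


        def resolve_split_mode(mode: str | Enum) -> SplitMode:
            """Map a legacy TestSplitMode/ValSplitMode value to SplitMode (sketch)."""
            if isinstance(mode, SplitMode):
                return mode  # already the new enum; nothing to resolve
            # Assumed mapping; mirrors the mapping quoted in the review below.
            legacy_mapping = {
                "none": SplitMode.AUTO,
                "from_dir": SplitMode.PREDEFINED,
                "synthetic": SplitMode.SYNTHETIC,
                "same_as_test": SplitMode.AUTO,
                "from_train": SplitMode.AUTO,
                "from_test": SplitMode.AUTO,
            }
            value = mode.value if isinstance(mode, Enum) else mode
            new_mode = legacy_mapping[value.lower()]
            warnings.warn(
                f"The split mode {mode!s} is deprecated. Use '{new_mode!s}' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            return new_mode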

### Dataset Filtering

  • We propose a new dataset filter object to make filtering datasets easy (a minimal sketch of such a filter follows the examples). For example:
            # Apply filters via the apply method:
            >>> dataset.filter.apply("normal")  # label
            >>> dataset.filter.apply(100)       # count
            >>> dataset.filter.apply(0.5)       # ratio
            >>> dataset.filter.apply({"label": "normal", "count": 100})  # multiple filters
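    The sketch below shows how such a filter object could dispatch on the type of its argument (the class name DatasetFilter matches the tests quoted later in this thread; the "label" column name and the dispatch details are assumptions):

        import pandas as pd


        class DatasetFilter:
            """Filter a samples DataFrame by label, count, ratio, or a dict of these (sketch)."""

            def __init__(self, samples: pd.DataFrame) -> None:
                self.samples = samples

            def apply(self, by: str | int | float | dict, seed: int | None = None) -> pd.DataFrame:
                if isinstance(by, dict):  # multiple criteria, applied sequentially
                    result = self.samples
                    for criterion in by.values():
                        result = DatasetFilter(result).apply(criterion, seed=seed)
                    return result
                if isinstance(by, str):  # label filter, e.g. "normal"
                    return self.samples[self.samples["label"] == by]
                if isinstance(by, int):  # count filter: random subset of N samples
                    return self.samples.sample(n=by, random_state=seed)
                if isinstance(by, float):  # ratio filter: random fraction of samples
                    return self.samples.sample(frac=by, random_state=seed)
                msg = f"Unsupported filter criterion: {by!r}"
                raise TypeError(msg)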

### Dataset Splitting

  • In addition to dataset filtering, this PR introduces dataset splitting (a sketch of the ratio-based variant follows the examples):
            # Create subsets based on label values:
            >>> normal_dataset, abnormal_dataset = dataset.create_subset("label")

            # Create subsets based on specific sample indices:
            >>> train_set, val_set, test_set = dataset.create_subset([[0, 2, 3], [1, 4], [5]])

            # Create subsets based on specific split ratios:
            >>> train_set, val_set, test_set = dataset.create_subset([0.6, 0.2, 0.2], seed=42)

            # Create a subset based on the number of samples:
            >>> dataset.create_subset(100)

            # Create a subset based on custom criteria:
            >>> dataset.create_subset({"label": "normal", "count": 100})
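    As a sketch of the core logic behind the ratio-based variant (the helper below is invented for illustration; the PR's actual create_subset also covers labels, index lists, counts, and criteria dicts):

        import math
        import random


        def split_by_ratios(indices: list[int], ratios: list[float], seed: int | None = None) -> list[list[int]]:
            """Shuffle indices, then cut them into consecutive chunks sized by ratio (sketch)."""
            if not math.isclose(sum(ratios), 1.0):
                msg = f"Split ratios must sum to 1, got {sum(ratios)}"
                raise ValueError(msg)
            shuffled = list(indices)
            random.Random(seed).shuffle(shuffled)
            splits, start = [], 0
            for ratio in ratios[:-1]:
                stop = start + round(ratio * len(shuffled))
                splits.append(shuffled[start:stop])
                start = stop
            splits.append(shuffled[start:])  # remainder absorbs rounding error
            return splits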

✨ Changes

Select what type of change your PR is:

  • 🐞 Bug fix (non-breaking change which fixes an issue)
  • 🔨 Refactor (non-breaking change which refactors the code base)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update
  • 🔒 Security update

✅ Checklist

Before you submit your pull request, please make sure you have completed the following steps:

  • 📋 I have summarized my changes in the CHANGELOG and followed the guidelines for my type of change (skip for minor changes, documentation updates, and test enhancements).
  • 📚 I have made the necessary updates to the documentation (if applicable).
  • 🧪 I have written tests that support my changes and prove that my fix is effective or my feature works (if applicable).

For more information about code review checklists, see the Code Review Checklist.

@samet-akcay changed the title from "🔨 Dataset refactor: Add CSV Data support and custom validation set support" to "🔨 Dataset Additions: CSV data, custom validation set, dataset filtering and splitting support" on Aug 8, 2024
@djdameln (Contributor) left a comment

Thanks for the huge effort.

Here's an initial (partial) round of review.

In general, the logic might still be a bit hard to follow for newcomers. This is inevitable because of the many scenarios that we need to cover, but at least we should make sure that it is thoroughly covered in the documentation.

Comment on lines +129 to +137
@property
def category(self) -> str:
    """Get the category of the datamodule."""
    return self._category

@category.setter
def category(self, category: str) -> None:
    """Set the category of the datamodule."""
    self._category = category
Contributor:

Are these used anywhere? It might be a bit confusing because not all datasets consist of multiple categories.

@samet-akcay (Author):

I think this part is used for saving the images to the filesystem. Maybe we could address this in another PR, as the scope would expand.

Comment on lines +152 to +159
mapping = {
    "none": SplitMode.AUTO,
    "from_dir": SplitMode.PREDEFINED,
    "synthetic": SplitMode.SYNTHETIC,
    "same_as_test": SplitMode.AUTO,
    "from_train": SplitMode.AUTO,
    "from_test": SplitMode.AUTO,
}
Contributor:

This mapping will not lead to the exact same behaviour between the new version and legacy versions. Not sure how big of an issue this is, but it's something to be aware of. Maybe we could include it in the warning message.


# Check validation set
if hasattr(self, "val_data") and not (self.val_data.has_normal and self.val_data.has_anomalous):
    msg = "Validation set should contain both normal and abnormal images."
Contributor:

This may be too strict. Some users may not have access to abnormal images at training time, but may still benefit from running a validation sequence on normal images for adaptive thresholding. (The adaptive threshold value in this case will default to the highest anomaly score predicted over the normal validation images, which turns out to be a not-too-bad estimate in absence of anomalous samples).
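One way to relax the check along these lines (a sketch of the suggestion, not code from the PR; the exact messages and behaviour are assumptions):

# Sketch: require normal validation images, but only warn when anomalous
# ones are missing, since the adaptive threshold can fall back to the
# highest anomaly score observed over the normal validation images.
if hasattr(self, "val_data"):
    if not self.val_data.has_normal:
        msg = "Validation set should contain normal images."
        raise ValueError(msg)
    if not self.val_data.has_anomalous:
        logger.warning("Validation set has no anomalous images; the adaptive threshold will be estimated from normal images only.")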


# Check test set
if hasattr(self, "test_data") and not (self.test_data.has_normal and self.test_data.has_anomalous):
    msg = "Test set should contain both normal and abnormal images."
Contributor:

This may also be too strict. In some papers the pixel-level performance is reported over only the anomalous images of the test set. While this may not be the best practice, I think we should support it for those users that want to use this approach.

src/anomalib/data/base/datamodule.py (outdated comment; resolved)
    )
elif self.val_split_mode == SplitMode.SYNTHETIC:
    logger.info("Generating synthetic val set.")
    self.val_data = SyntheticAnomalyDataset.from_dataset(self.train_data)
Contributor:

I think we need to split the dataset first. Otherwise the training set and the validation set will consist of the same images.
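Concretely, the suggestion amounts to something like the following (a sketch; the random_split helper and the 50/50 ratio are assumptions based on anomalib's existing split utilities):

elif self.val_split_mode == SplitMode.SYNTHETIC:
    logger.info("Generating synthetic val set.")
    # Split first so train and val images stay mutually exclusive, then
    # synthesize anomalies only from the held-out portion.
    self.train_data, normal_val_data = random_split(self.train_data, [0.5, 0.5])
    self.val_data = SyntheticAnomalyDataset.from_dataset(normal_val_data)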

    )
elif self.test_split_mode == SplitMode.SYNTHETIC:
    logger.info("Generating synthetic test set.")
    self.test_data = SyntheticAnomalyDataset.from_dataset(self.train_data)
Contributor:

Same as above. We need to split the train set first, to ensure that the train and test sets are mutually exclusive.


codecov bot commented Aug 22, 2024

Codecov Report

Attention: Patch coverage is 79.12525% with 105 lines in your changes missing coverage. Please review.

Project coverage is 80.74%. Comparing base (2bd2842) to head (c575d6a).
Report is 5 commits behind head on main.

Files                                    Patch %   Missing
src/anomalib/data/base/datamodule.py     53.76%    43 ⚠️
src/anomalib/data/image/csv.py           72.94%    23 ⚠️
src/anomalib/data/utils/filter.py        86.81%    12 ⚠️
src/anomalib/data/utils/split.py         90.47%    12 ⚠️
src/anomalib/data/base/dataset.py        77.27%    10 ⚠️
src/anomalib/data/image/folder.py        76.19%     5 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2239      +/-   ##
==========================================
- Coverage   80.90%   80.74%   -0.16%     
==========================================
  Files         248      250       +2     
  Lines       10859    11232     +373     
==========================================
+ Hits         8785     9069     +284     
- Misses       2074     2163      +89     


@ashwinvaidya17 (Collaborator) left a comment

Thanks for the massive effort. At the risk of adding more work, I do have a few comments.

# Avenue dataset does not provide a validation set.
# Auto behaviour is to clone the test set as validation set.
if self.val_split_mode == SplitMode.AUTO:
    self.val_data = self.test_data.clone()
Collaborator:

We should probably inform users about the selection. Maybe logger.info("Using testing data for validation")

    assert all(filtered_samples.iloc[i]["image_path"] == f"image_{indices[i]}.jpg" for i in range(len(indices)))


def test_filter_by_ratio(sample_classification_dataframe: pd.DataFrame) -> None:
Collaborator:

Should we also test edge cases like ratio = 0 and 1?
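For example (a sketch; the fixture and DatasetFilter usage mirror the quoted tests, and the expected boundary behaviour is an assumption):

def test_filter_by_ratio_edge_cases(sample_classification_dataframe: pd.DataFrame) -> None:
    """Ratio 0.0 should select nothing; ratio 1.0 should keep every sample."""
    dataset_filter = DatasetFilter(sample_classification_dataframe)
    assert dataset_filter.apply(by=0.0, seed=42).empty
    assert len(dataset_filter.apply(by=1.0, seed=42)) == len(sample_classification_dataframe)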

"""Test filtering by count."""
dataset_filter = DatasetFilter(sample_segmentation_dataframe)
count = 50
filtered_samples = dataset_filter.apply(by=count, seed=42)
Collaborator:

Also, should we test the label-aware filter by count?

    return copy.deepcopy(self)

# Alias for copy method
clone = copy
Collaborator:

What's the advantage of defining this here?
