From dcd8b8c22018bb97deb7a000b3df9f6430737e06 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Thu, 24 Oct 2024 17:23:20 +0200
Subject: [PATCH] Update datasets-download-stats.md (#1469)

* Update datasets-download-stats.md

* same title for models

* Apply suggestions from code review

Co-authored-by: Julien Chaumond

---------

Co-authored-by: Julien Chaumond
---
 docs/hub/datasets-download-stats.md | 6 +++---
 docs/hub/models-download-stats.md | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/hub/datasets-download-stats.md b/docs/hub/datasets-download-stats.md
index 3f47b74b1..20d5a3693 100644
--- a/docs/hub/datasets-download-stats.md
+++ b/docs/hub/datasets-download-stats.md
@@ -1,10 +1,10 @@
 # Datasets Download Stats
 
-## How are download stats generated for datasets?
+## How are downloads counted for datasets?
 
-Counting the number of downloads for datasets is not a trivial task, as a single dataset repository might contain multiple files, from multiple subsets and splits (e.g. train/validation/test) and sometimes with many files in a single split. To solve this issue and avoid counting one person's download multiple times, we treat all files downloaded by a user within a 5-minute window as a single dataset download. This counting happens automatically on our servers when files are downloaded (through GET or HEAD requests), with no need to collect any user information or make additional calls.
+Counting the number of downloads for datasets is not a trivial task, as a single dataset repository might contain multiple files, from multiple subsets and splits (e.g. train/validation/test) and sometimes with many files in a single split. To solve this issue and avoid counting one person's download multiple times, we treat all files downloaded by a user (based on their IP address) within a 5-minute window as a single dataset download. This counting happens automatically on our servers when files are downloaded (through GET or HEAD requests), with no need to collect any user information or make additional calls.
 
-## Before Setpember 2024
+## Before September 2024
 
 The Hub used to provide download stats only for the datasets loadable via the `datasets` library. To determine the number of downloads, the Hub previously counted every time `load_dataset` was called in Python, excluding Hugging Face's CI tooling on GitHub. No information was sent from the user, and no additional calls were made for this. The count was done server-side as we served files for downloads. This means that:
 
diff --git a/docs/hub/models-download-stats.md b/docs/hub/models-download-stats.md
index fe1ed08e5..83c760f45 100644
--- a/docs/hub/models-download-stats.md
+++ b/docs/hub/models-download-stats.md
@@ -1,6 +1,6 @@
 # Models Download Stats
 
-## How are download stats generated for models?
+## How are downloads counted for models?
 
 Counting the number of downloads for models is not a trivial task, as a single model repository might contain multiple files, including multiple model weight files (e.g., with sharded models) and different formats depending on the library (GGUF, PyTorch, TensorFlow, etc.). To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting. No information is sent from the user, and no additional calls are made for this. The count is done server-side as the Hub serves files for downloads.
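
Note on the datasets change: the per-IP, 5-minute grouping described in the updated paragraph can be sketched roughly as below. This is only an illustration, not the Hub's server-side implementation. The function name `count_downloads`, the `(ip, dataset, timestamp)` tuple shape, and the choice to open a fresh window at the first request after the previous window has expired are all assumptions; the docs do not say whether the window is fixed or sliding.

```python
from datetime import datetime, timedelta

# Window length taken from the docs text; everything else here is hypothetical.
WINDOW = timedelta(minutes=5)


def count_downloads(requests):
    """Count dataset downloads from (ip, dataset, timestamp) tuples.

    All file requests from one IP for one dataset that fall within
    WINDOW of the window's first request are counted as one download.
    """
    window_start = {}  # (ip, dataset) -> timestamp that opened the current window
    counts = {}        # dataset -> number of counted downloads
    for ip, dataset, ts in sorted(requests, key=lambda r: r[2]):
        key = (ip, dataset)
        start = window_start.get(key)
        if start is None or ts - start >= WINDOW:
            # First request, or the previous window expired: count a new download.
            window_start[key] = ts
            counts[dataset] = counts.get(dataset, 0) + 1
    return counts


t0 = datetime(2024, 10, 24, 12, 0, 0)
requests = [
    ("203.0.113.7", "user/dataset", t0),                         # opens a window
    ("203.0.113.7", "user/dataset", t0 + timedelta(minutes=1)),  # same window
    ("203.0.113.7", "user/dataset", t0 + timedelta(minutes=6)),  # new window
    ("198.51.100.2", "user/dataset", t0 + timedelta(minutes=2)), # different IP
]
print(count_downloads(requests))  # {'user/dataset': 3}
```

In this sketch the four file requests yield three counted downloads: two windows for the first IP and one for the second.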