
Conversation

@Wauplin (Contributor) commented Mar 22, 2024

Should fix the current regression in file download speed. When downloading an entire file, let's rely on hf_hub_download, which is faster, has a retry mechanism and can be sped up with hf_transfer (on demand).

import time
from tempfile import TemporaryDirectory
from datasets import Dataset

repo_id = "Open-Orca/OpenOrca"
filename = "3_5M-GPT3_5-Augmented.parquet"

parquet = f"hf://datasets/{repo_id}/{filename}"

t0 = time.time()
with TemporaryDirectory() as temp_dir:
    Dataset.from_parquet(parquet, cache_dir=temp_dir)
t1 = time.time()
print(t1 - t0)

=> on my local setup, speed went from ~15MB/s to ~45MB/s (on wifi, without hf_transfer), which is exactly the same speed as a normal hf_hub_download.
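The core idea of the PR (delegating whole-file reads to the fast download path, keeping ranged HTTP reads for partial reads) can be sketched roughly as follows. This is an illustrative sketch only, not the actual diff: `fast_download` and `ranged_read` are hypothetical stand-ins for the `hf_hub_download`-style path and the HTTP range-request path, stubbed here so the snippet runs without network access.

```python
# Illustrative sketch of the dispatch logic; NOT the actual PR code.
# `fast_download` and `ranged_read` are hypothetical stand-ins.

def read_file(path, length=None, position=0):
    """Whole-file reads go through the fast download path (retries,
    optional hf_transfer); partial reads fall back to ranged requests."""
    if position == 0 and (length is None or length == -1):
        return fast_download(path)
    return ranged_read(path, position, length)

# Stub backends so the sketch is runnable offline.
_FILES = {"repo/data.parquet": b"PAR1...parquet bytes...PAR1"}

def fast_download(path):
    # stands in for hf_hub_download + reading the cached file
    return _FILES[path]

def ranged_read(path, start, length):
    # stands in for an HTTP Range request
    return _FILES[path][start:start + length]
```

The point of the dispatch is that a full-file read can use a download path with retries and concurrency, while small seek-and-read access patterns keep using ranged requests.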

cc @lhoestq @julien-c (related to https://twitter.com/ashvardanian/status/1769964480086024203)

For the review, it's best to select "Hide whitespace" on GitHub.


TODO:

  • add proper test in hfh
  • test in datasets? (cc @lhoestq)
  • handle the case where callback.tqdm doesn't exist

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@codecov-commenter commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 51.06383%, with 23 lines in your changes missing coverage. Please review.

Project coverage is 82.64%. Comparing base (3252e27) to head (1df891d).

Files                                   Patch %   Lines
src/huggingface_hub/file_download.py    63.63%    12 Missing ⚠️
src/huggingface_hub/hf_file_system.py   21.42%    11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
- Coverage   82.70%   82.64%   -0.07%     
==========================================
  Files         103      103              
  Lines        9628     9644      +16     
==========================================
+ Hits         7963     7970       +7     
- Misses       1665     1674       +9     

☔ View full report in Codecov by Sentry.

@lhoestq (Member) left a comment:
Looks great!

I added suggestions to handle the case where the callback has no tqdm.

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@mariosasko (Contributor) left a comment:

Cool!

@Wauplin Wauplin requested review from mariosasko and lhoestq March 22, 2024 16:32

# Custom implementation of `get_file` to use `http_get`.
resolve_remote_path = self.resolve_path(rpath, revision=kwargs.get("revision"))
expected_size = self.info(rpath)["size"]
@mariosasko (Contributor) commented:

We also need to pass the kwargs' `revision` to the `self.info` call (or replace `rpath` with `resolve_remote_path.unresolve()`).
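The suggested fix could look roughly like this. This is a minimal runnable sketch with stubbed metadata, not the actual `HfFileSystem` code: the `FileSystemSketch` class and its `_INFO` table are hypothetical stand-ins for Hub API calls.

```python
# Hypothetical sketch: thread the caller's `revision` kwarg through to
# the info() call so size lookup and path resolution agree on the revision.

class FileSystemSketch:
    # Stub metadata keyed by (path, revision), standing in for Hub calls.
    _INFO = {
        ("data.parquet", "main"): {"size": 1024},
        ("data.parquet", "refs/convert/parquet"): {"size": 2048},
    }

    def info(self, path, revision="main"):
        return self._INFO[(path, revision)]

    def expected_size(self, rpath, **kwargs):
        # Before the fix, this ignored kwargs["revision"] and always hit
        # the default branch; now the revision is forwarded.
        return self.info(rpath, revision=kwargs.get("revision", "main"))["size"]
```

With the revision forwarded, `expected_size("data.parquet", revision="refs/convert/parquet")` returns the size for the requested revision rather than the default branch.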

@lhoestq (Member) left a comment:
LGTM after mario's comment is fixed :)

@Wauplin (Contributor, Author) commented Mar 22, 2024

While I was at it, I also reviewed how files are read when reading the whole file at once (with read_bytes, read_text, cat_file, or read) so that we also use get_file, which downloads to a temporary file. On POSIX, that means "in memory", which is good; otherwise we download to the local drive and read from there. That's less optimal because of the "save to disk, read from disk" step, but overall it should still be more efficient than the current solution.

I ran this script that tests several ways of downloading a file with fsspec:

# branch fast-download-in-hf-file-system
import time
from tempfile import NamedTemporaryFile, TemporaryDirectory, TemporaryFile

from datasets import Dataset

from huggingface_hub import HfFileSystem, hf_hub_download


# 3GB
repo_id = "Open-Orca/OpenOrca"
filename = "3_5M-GPT3_5-Augmented.parquet"

# 50MB
repo_id = "bigcode/the-stack-v2"
filename = "data/1C_Enterprise/train-00000-of-00001.parquet"

# 440MB
repo_id = "HuggingFaceM4/WebSight"
filename = "data/train-00000-of-00071-eb722b04b83e13b7.parquet"

parquet = f"hf://datasets/{repo_id}/{filename}"

def timeit(title):
    # Note: runs the decorated function immediately (at decoration time)
    # and prints its duration, rather than returning a wrapped function.
    def decorator(fn):
        t0 = time.time()
        result = fn()
        t1 = time.time()
        print(title, t1 - t0)
        return result
    return decorator


@timeit("Dataset.from_parquet")
def _():
    with TemporaryDirectory() as temp_dir:
        Dataset.from_parquet(parquet, cache_dir=temp_dir)

@timeit("fs.get_file (TemporaryFile)")
def _():
    with TemporaryFile() as temp_file:
        HfFileSystem().get_file(parquet, temp_file)

@timeit("fs.get_file (NamedTemporaryFile)")
def _():
    with NamedTemporaryFile() as temp_file:
        HfFileSystem().get_file(parquet, temp_file.name)

@timeit("hf_hub_download")
def _():
    with TemporaryDirectory() as temp_dir: 
        hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset", cache_dir=temp_dir)

@timeit("fs.open(parquet).read()")
def _():
    HfFileSystem().open(parquet).read()

@timeit("fs.read_bytes(parquet)")
def _():
    HfFileSystem().read_bytes(parquet)

All of them run at roughly the same speed (except Dataset.from_parquet, which is slower because it also loads the dataset).

Downloading data: 100%|████| 438M/438M [00:12<00:00, 35.5MB/s]
Generating train split: 11592 examples [00:00, 22892.31 examples/s]
Dataset.from_parquet 15.710246324539185
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:12<00:00, 35.0MB/s]
fs.get_file (TemporaryFile) 13.357447147369385
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:09<00:00, 47.6MB/s]
fs.get_file (NamedTemporaryFile) 10.060921669006348
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:09<00:00, 45.6MB/s]
hf_hub_download 10.187127828598022
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:09<00:00, 44.1MB/s]
fs.open(parquet).read() 10.543320655822754
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:10<00:00, 41.7MB/s]
fs.read_bytes(parquet) 11.385062456130981

Unfortunately I still have tests to fix on Windows. I'll try to get this fixed soon (I need to start a machine on AWS), and then we can merge for the next release on Monday.

Comment on lines 740 to 753
if self.mode == "rb" and (length is None or length == -1):
    # Open a temporary file to store the downloaded content
    if HF_HUB_ENABLE_HF_TRANSFER:
        with tempfile.TemporaryDirectory() as tmp_dir:
            # if hf_transfer, we want to provide a real temporary file so hf_transfer can write concurrently to it
            tmp_path = os.path.join(tmp_dir, "tmp_file")
            self.fs.get_file(self.resolved_path.unresolve(), tmp_path)
            with open(tmp_path, "rb") as f:
                return f.read()
    else:
        # otherwise, we don't care where the file is stored (e.g. in memory or on disk)
        with tempfile.TemporaryFile() as tmp_file:
            self.fs.get_file(self.resolved_path.unresolve(), tmp_file)
            return tmp_file.read()
@mariosasko (Contributor) commented:
I'm not a big fan of this code, so maybe hf_transfer should support writing the fetched bytes to in-memory buffers

@Wauplin (Contributor, Author) replied:
AFAIK it's not possible to run hf_transfer in-memory. The current implementation here is not the best but still works and is purely based on public methods.

@mariosasko (Contributor) replied:
AFAIK it's not possible to run hf_transfer in-memory

Indeed, but then let's open an issue in the hf_transfer repo? Besides the use case here, hf_transfer is meant for power users, who (we can assume) have plenty of RAM, so supporting in-memory downloads makes sense from that point of view too.

@Wauplin (Contributor, Author) replied:
@mariosasko @lhoestq I'll make a release of hfh soon (today?) with this PR. We can discuss/implement an in-memory download for hf_transfer but that will not be ready in time. So for this release, would you prefer:

  1. have hf_transfer download to a tmp file, then load it in memory and delete the tmp file (current implementation). Pro: hf_transfer keeps working. Con: more I/O operations.
  2. disable hf_transfer when using fs.read(). Pro: no extra I/O ops. Con: no hf_transfer.
  3. wait for an hf_transfer PR before merging and releasing this one. Pro: no in-between solution. Con: delays the get_file fix.

@lhoestq (Member) commented Mar 25, 2024:

First, note that .read() can be called multiple times, e.g.

header = f.read(5)
full_data = header + f.read()

Then I think it's ok not to have full speed for .read(); big files are generally downloaded to disk, and .read() is generally used for smaller files (readme, json, images, audio...). I would also expect it to work in memory.

Finally, fsspec files have the notion of block_size and cache, which you are bypassing in the current implementation.

I think it's fine not to override .read() for now, especially since the original issue is about fast download to disk in get_file.

@lhoestq (Member) commented Mar 25, 2024:

If you really want to ship a speed improvement in .read() for today's release, maybe you can check whether self.loc == 0 and use HfFileSystemStreamFile.read()? This wouldn't use the disk or hf_transfer, but it would at least unblock the current speed issue.
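This suggestion amounts to a position check at the top of read(). A rough, self-contained sketch follows; the `ReaderSketch` class and its stub backends are hypothetical, and the real code would delegate to `HfFileSystemStreamFile.read()` and fsspec's block-cached reads instead.

```python
# Hedged sketch of the suggestion: a fresh handle (position 0) reading the
# whole file takes the fast sequential-stream path; any other read pattern
# falls back to positioned (block-cached) reads.

class ReaderSketch:
    def __init__(self, data):
        self._data = data   # stands in for the remote file contents
        self.loc = 0        # current read position, as in fsspec file objects

    def _stream_read(self):
        # stands in for HfFileSystemStreamFile.read(): one sequential pass
        self.loc = len(self._data)
        return self._data

    def _block_read(self, length):
        # stands in for fsspec's block_size/cache-based positioned read
        out = self._data[self.loc:self.loc + length]
        self.loc += len(out)
        return out

    def read(self, length=-1):
        if self.loc == 0 and length in (None, -1):
            return self._stream_read()
        if length in (None, -1):
            length = len(self._data) - self.loc
        return self._block_read(length)
```

The design trade-off is that only the common "open and read everything" pattern gets the fast path, while seek-then-read callers keep the existing block-cache semantics untouched.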

@Wauplin (Contributor, Author) replied:

Updated in ae9e2f7.

@Wauplin Wauplin requested review from lhoestq and mariosasko March 25, 2024 13:01
@lhoestq (Member) left a comment:

LGTM

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@Wauplin Wauplin merged commit 59be8f2 into main Mar 25, 2024
@Wauplin Wauplin deleted the fast-download-in-hf-file-system branch March 25, 2024 13:36