
Conversation

@Wauplin (Contributor) commented Mar 22, 2024

Should fix the current regression in file download speed. When downloading an entire file, let's rely on hf_hub_download, which is faster, has a retry mechanism and can be sped up with hf_transfer (on demand).

import time
from tempfile import TemporaryDirectory
from datasets import Dataset

repo_id = "Open-Orca/OpenOrca"
filename = "3_5M-GPT3_5-Augmented.parquet"

parquet = f"hf://datasets/{repo_id}/{filename}"

t0 = time.time()
with TemporaryDirectory() as temp_dir:
    Dataset.from_parquet(parquet, cache_dir=temp_dir)
t1 = time.time()
print(t1 - t0)

=> on my local setup, speed went from ~15MB/s to ~45MB/s (on wifi, without hf_transfer), which is exactly the same speed as a normal hf_hub_download.
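The core idea of the PR (delegating whole-file reads to the fast download path, keeping ranged HTTP reads for partial reads) can be sketched roughly as follows. This is an illustrative sketch only, not the actual diff: `fast_download` and `ranged_read` are hypothetical stand-ins for the `hf_hub_download`-style path and the HTTP range-request path, stubbed here so the snippet runs without network access.

```python
# Illustrative sketch of the dispatch logic; NOT the actual PR code.
# `fast_download` and `ranged_read` are hypothetical stand-ins.

def read_file(path, length=None, position=0):
    """Whole-file reads go through the fast download path (retries,
    optional hf_transfer); partial reads fall back to ranged requests."""
    if position == 0 and (length is None or length == -1):
        return fast_download(path)
    return ranged_read(path, position, length)

# Stub backends so the sketch is runnable offline.
_FILES = {"repo/data.parquet": b"PAR1...parquet bytes...PAR1"}

def fast_download(path):
    # stands in for hf_hub_download + reading the cached file
    return _FILES[path]

def ranged_read(path, start, length):
    # stands in for an HTTP Range request
    return _FILES[path][start:start + length]
```

The point of the dispatch is that a full-file read can use a download path with retries and concurrency, while small seek-and-read access patterns keep using ranged requests.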

cc @lhoestq @julien-c (related to https://twitter.com/ashvardanian/status/1769964480086024203)

For the review, it's best to select "Hide whitespace" on GitHub.


TODO:

  • add proper test in hfh
  • test in datasets? (cc @lhoestq)
  • handle the case where callback.tqdm doesn't exist

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@codecov-commenter commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 51.06383%, with 23 lines in your changes missing coverage. Please review.

Project coverage is 82.64%. Comparing base (3252e27) to head (1df891d).

Files                                   Patch %   Lines
src/huggingface_hub/file_download.py    63.63%    12 Missing ⚠️
src/huggingface_hub/hf_file_system.py   21.42%    11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
- Coverage   82.70%   82.64%   -0.07%     
==========================================
  Files         103      103              
  Lines        9628     9644      +16     
==========================================
+ Hits         7963     7970       +7     
- Misses       1665     1674       +9     

☔ View full report in Codecov by Sentry.

@lhoestq (Member) left a comment:
Looks great!

I added suggestions to handle the case where the callback has no tqdm.

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@mariosasko (Contributor) left a comment:

Cool!

@Wauplin Wauplin requested review from mariosasko and lhoestq March 22, 2024 16:32

# Custom implementation of `get_file` to use `http_get`.
resolve_remote_path = self.resolve_path(rpath, revision=kwargs.get("revision"))
expected_size = self.info(rpath)["size"]
@mariosasko (Contributor) commented:

We also need to pass the kwargs' `revision` to the `self.info` call (or replace `rpath` with `resolve_remote_path.unresolve()`).
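The suggested fix could look roughly like this. This is a minimal runnable sketch with stubbed metadata, not the actual `HfFileSystem` code: the `FileSystemSketch` class and its `_INFO` table are hypothetical stand-ins for Hub API calls.

```python
# Hypothetical sketch: thread the caller's `revision` kwarg through to
# the info() call so size lookup and path resolution agree on the revision.

class FileSystemSketch:
    # Stub metadata keyed by (path, revision), standing in for Hub calls.
    _INFO = {
        ("data.parquet", "main"): {"size": 1024},
        ("data.parquet", "refs/convert/parquet"): {"size": 2048},
    }

    def info(self, path, revision="main"):
        return self._INFO[(path, revision)]

    def expected_size(self, rpath, **kwargs):
        # Before the fix, this ignored kwargs["revision"] and always hit
        # the default branch; now the revision is forwarded.
        return self.info(rpath, revision=kwargs.get("revision", "main"))["size"]
```

With the revision forwarded, `expected_size("data.parquet", revision="refs/convert/parquet")` returns the size for the requested revision rather than the default branch.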

@lhoestq (Member) left a comment:
LGTM after mario's comment is fixed :)

@Wauplin (Contributor, Author) commented Mar 22, 2024

While I was at it, I also reviewed how files are read when reading the whole file at once (with read_bytes, read_text, cat_file, or read) so that we also use get_file, which downloads to a temporary file. On POSIX, that means "in memory", which is good; otherwise we download to the local drive and read from there. That's less optimal because of the "save to disk, read from disk" step, but overall it should still be more efficient than the current solution.

I ran this script that tests several ways of downloading a file with fsspec:

# branch fast-download-in-hf-file-system
import time
from tempfile import NamedTemporaryFile, TemporaryDirectory, TemporaryFile

from datasets import Dataset

from huggingface_hub import HfFileSystem, hf_hub_download


# 3GB
repo_id = "Open-Orca/OpenOrca"
filename = "3_5M-GPT3_5-Augmented.parquet"

# 50MB
repo_id = "bigcode/the-stack-v2"
filename = "data/1C_Enterprise/train-00000-of-00001.parquet"

# 440MB
repo_id = "HuggingFaceM4/WebSight"
filename = "data/train-00000-of-00071-eb722b04b83e13b7.parquet"

parquet = f"hf://datasets/{repo_id}/{filename}"

def timeit(title):
    # Note: runs the decorated function immediately (at decoration time)
    # and prints its duration, rather than returning a wrapped function.
    def decorator(fn):
        t0 = time.time()
        result = fn()
        t1 = time.time()
        print(title, t1 - t0)
        return result
    return decorator


@timeit("Dataset.from_parquet")
def _():
    with TemporaryDirectory() as temp_dir:
        Dataset.from_parquet(parquet, cache_dir=temp_dir)

@timeit("fs.get_file (TemporaryFile)")
def _():
    with TemporaryFile() as temp_file:
        HfFileSystem().get_file(parquet, temp_file)

@timeit("fs.get_file (NamedTemporaryFile)")
def _():
    with NamedTemporaryFile() as temp_file:
        HfFileSystem().get_file(parquet, temp_file.name)

@timeit("hf_hub_download")
def _():
    with TemporaryDirectory() as temp_dir: 
        hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset", cache_dir=temp_dir)

@timeit("fs.open(parquet).read()")
def _():
    HfFileSystem().open(parquet).read()

@timeit("fs.read_bytes(parquet)")
def _():
    HfFileSystem().read_bytes(parquet)

All of them run at roughly the same speed (except Dataset.from_parquet, which is slower because it also loads the dataset).

Downloading data: 100%|████| 438M/438M [00:12<00:00, 35.5MB/s]
Generating train split: 11592 examples [00:00, 22892.31 examples/s]
Dataset.from_parquet 15.710246324539185
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:12<00:00, 35.0MB/s]
fs.get_file (TemporaryFile) 13.357447147369385
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:09<00:00, 47.6MB/s]
fs.get_file (NamedTemporaryFile) 10.060921669006348
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:09<00:00, 45.6MB/s]
hf_hub_download 10.187127828598022
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:09<00:00, 44.1MB/s]
fs.open(parquet).read() 10.543320655822754
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:10<00:00, 41.7MB/s]
fs.read_bytes(parquet) 11.385062456130981

Unfortunately I still have tests to fix on Windows. I'll try to get this fixed soon (I need to start a machine on AWS), and then we can merge for the next release on Monday.

Comment on lines 740 to 753
if self.mode == "rb" and (length is None or length == -1):
    # Open a temporary file to store the downloaded content
    if HF_HUB_ENABLE_HF_TRANSFER:
        with tempfile.TemporaryDirectory() as tmp_dir:
            # if hf_transfer, we want to provide a real temporary file so hf_transfer can write concurrently to it
            tmp_path = os.path.join(tmp_dir, "tmp_file")
            self.fs.get_file(self.resolved_path.unresolve(), tmp_path)
            with open(tmp_path, "rb") as f:
                return f.read()
    else:
        # otherwise, we don't care where the file is stored (e.g. in memory or on disk)
        with tempfile.TemporaryFile() as tmp_file:
            self.fs.get_file(self.resolved_path.unresolve(), tmp_file)
            return tmp_file.read()
@mariosasko (Contributor) commented:
I'm not a big fan of this code, so maybe hf_transfer should support writing the fetched bytes to in-memory buffers

@Wauplin (Contributor, Author) replied:
AFAIK it's not possible to run hf_transfer in-memory. The current implementation here is not the best but still works and is purely based on public methods.

@mariosasko (Contributor) replied:
AFAIK it's not possible to run hf_transfer in-memory

Indeed, but then let's open an issue in the hf_transfer repo? Besides the use case here, hf_transfer is meant for power users, who (we can assume) have plenty of RAM, so supporting in-memory downloads makes sense from that point of view too.

@Wauplin (Contributor, Author) replied:
@mariosasko @lhoestq I'll make a release of hfh soon (today?) with this PR. We can discuss/implement an in-memory download for hf_transfer but that will not be ready in time. So for this release, would you prefer:

  1. have hf_transfer download to a tmp file, then load it in memory and delete the tmp file (current implementation). Pro: hf_transfer keeps working. Con: more I/O operations.
  2. disable hf_transfer when using fs.read(). Pro: no extra I/O ops. Con: no hf_transfer.
  3. wait for an hf_transfer PR before merging and releasing this one. Pro: no in-between solution. Con: delays the get_file fix.

@lhoestq (Member) commented Mar 25, 2024:

First, note that .read() can be called multiple times, e.g.

header = f.read(5)
full_data = header + f.read()

Then I think it's ok not to have full speed for .read(); big files are generally downloaded to disk, and .read() is generally used for smaller files (readme, json, images, audio...). I would also expect it to work in memory.

Finally, fsspec files have the notion of block_size and cache, which you are bypassing in the current implementation.

I think it's fine not to override .read() for now, especially since the original issue is about fast download to disk in get_file.

@lhoestq (Member) commented Mar 25, 2024:

If you really want to ship a speed improvement in .read() for today's release, maybe you can check whether self.loc == 0 and use HfFileSystemStreamFile.read()? This wouldn't use the disk or hf_transfer, but it would at least unblock the current speed issue.
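This suggestion amounts to a position check at the top of read(). A rough, self-contained sketch follows; the `ReaderSketch` class and its stub backends are hypothetical, and the real code would delegate to `HfFileSystemStreamFile.read()` and fsspec's block-cached reads instead.

```python
# Hedged sketch of the suggestion: a fresh handle (position 0) reading the
# whole file takes the fast sequential-stream path; any other read pattern
# falls back to positioned (block-cached) reads.

class ReaderSketch:
    def __init__(self, data):
        self._data = data   # stands in for the remote file contents
        self.loc = 0        # current read position, as in fsspec file objects

    def _stream_read(self):
        # stands in for HfFileSystemStreamFile.read(): one sequential pass
        self.loc = len(self._data)
        return self._data

    def _block_read(self, length):
        # stands in for fsspec's block_size/cache-based positioned read
        out = self._data[self.loc:self.loc + length]
        self.loc += len(out)
        return out

    def read(self, length=-1):
        if self.loc == 0 and length in (None, -1):
            return self._stream_read()
        if length in (None, -1):
            length = len(self._data) - self.loc
        return self._block_read(length)
```

The design trade-off is that only the common "open and read everything" pattern gets the fast path, while seek-then-read callers keep the existing block-cache semantics untouched.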

@Wauplin (Contributor, Author) replied:

Updated in ae9e2f7.

@Wauplin Wauplin requested review from lhoestq and mariosasko March 25, 2024 13:01
@lhoestq (Member) left a comment:

LGTM

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@Wauplin Wauplin merged commit 59be8f2 into main Mar 25, 2024
@Wauplin Wauplin deleted the fast-download-in-hf-file-system branch March 25, 2024 13:36