Fast download in hf file system #2143
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
- Coverage   82.70%   82.64%   -0.07%
==========================================
  Files         103      103
  Lines        9628     9644      +16
==========================================
+ Hits         7963     7970       +7
- Misses       1665     1674       +9
```

☔ View full report in Codecov by Sentry.
Looks great! I added suggestions to handle the case where the callback has no `tqdm`.
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
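For context, a minimal sketch of what such a guard could look like, assuming an fsspec-style callback (the helper name and the `close()` call are illustrative, not the exact suggested change):

```python
from fsspec.callbacks import Callback


def close_progress_bar(callback: Callback) -> None:
    # Not every callback is a TqdmCallback, so don't assume `callback.tqdm` exists;
    # fall back to None and skip the progress-bar handling entirely.
    tqdm_bar = getattr(callback, "tqdm", None)
    if tqdm_bar is not None:
        tqdm_bar.close()
```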
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
```python
# Custom implementation of `get_file` to use `http_get`.
resolve_remote_path = self.resolve_path(rpath, revision=kwargs.get("revision"))
expected_size = self.info(rpath)["size"]
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to pass the `revision` kwarg to the `self.info` call (or replace `rpath` with `resolve_remote_path.unresolve()`).
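A minimal sketch of the two suggested variants, assuming the `get_file` context shown above:

```python
# Variant 1: forward the revision so `info` queries the same revision as `resolve_path`.
expected_size = self.info(rpath, revision=kwargs.get("revision"))["size"]

# Variant 2: query the already-resolved path, which pins the revision in the path itself.
expected_size = self.info(resolve_remote_path.unresolve())["size"]
```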
LGTM after Mario's comment is fixed :)
While I was at it, I also reviewed how files are read when reading the whole file at once (with either `fs.open(...).read()` or `fs.read_bytes(...)`). I ran this script that tests several ways of downloading a file:

```python
# branch fast-download-in-hf-file-system
import time
from tempfile import NamedTemporaryFile, TemporaryDirectory, TemporaryFile

from datasets import Dataset
from huggingface_hub import HfFileSystem, hf_hub_download

# 3GB
repo_id = "Open-Orca/OpenOrca"
filename = "3_5M-GPT3_5-Augmented.parquet"

# 50MB
repo_id = "bigcode/the-stack-v2"
filename = "data/1C_Enterprise/train-00000-of-00001.parquet"

# 440MB
repo_id = "HuggingFaceM4/WebSight"
filename = "data/train-00000-of-00071-eb722b04b83e13b7.parquet"

parquet = f"hf://datasets/{repo_id}/{filename}"


def timeit(title):
    # Runs the decorated function immediately and prints its wall-clock duration.
    def decorator(fn):
        t0 = time.time()
        result = fn()
        t1 = time.time()
        print(title, t1 - t0)
        return result

    return decorator


@timeit("Dataset.from_parquet")
def _():
    with TemporaryDirectory() as temp_dir:
        Dataset.from_parquet(parquet, cache_dir=temp_dir)


@timeit("fs.get_file (TemporaryFile)")
def _():
    with TemporaryFile() as temp_file:
        HfFileSystem().get_file(parquet, temp_file)


@timeit("fs.get_file (NamedTemporaryFile)")
def _():
    with NamedTemporaryFile() as temp_file:
        HfFileSystem().get_file(parquet, temp_file.name)


@timeit("hf_hub_download")
def _():
    with TemporaryDirectory() as temp_dir:
        hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset", cache_dir=temp_dir)


@timeit("fs.open(parquet).read()")
def _():
    HfFileSystem().open(parquet).read()


@timeit("fs.read_bytes(parquet)")
def _():
    HfFileSystem().read_bytes(parquet)
```

All of them work properly with the same speed (except `Dataset.from_parquet`, which also spends time generating the dataset):

```
Downloading data: 100%|████| 438M/438M [00:12<00:00, 35.5MB/s]
Generating train split: 11592 examples [00:00, 22892.31 examples/s]
Dataset.from_parquet 15.710246324539185
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:12<00:00, 35.0MB/s]
fs.get_file (TemporaryFile) 13.357447147369385
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:09<00:00, 47.6MB/s]
fs.get_file (NamedTemporaryFile) 10.060921669006348
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:09<00:00, 45.6MB/s]
hf_hub_download 10.187127828598022
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:09<00:00, 44.1MB/s]
fs.open(parquet).read() 10.543320655822754
(…)-00000-of-00071-eb722b04b83e13b7.parquet: 100%|████| 438M/438M [00:10<00:00, 41.7MB/s]
fs.read_bytes(parquet) 11.385062456130981
```

Unfortunately I still have tests to fix on Windows. I'll try to get this fixed soon (need to start a machine on AWS) and then we can merge for the next release on Monday.
```python
if self.mode == "rb" and (length is None or length == -1):
    # Open a temporary file to store the downloaded content
    if HF_HUB_ENABLE_HF_TRANSFER:
        with tempfile.TemporaryDirectory() as tmp_dir:
            # if hf_transfer, we want to provide a real temporary file so hf_transfer can write concurrently to it
            tmp_path = os.path.join(tmp_dir, "tmp_file")
            self.fs.get_file(self.resolved_path.unresolve(), tmp_path)
            with open(tmp_path, "rb") as f:
                return f.read()
    else:
        # otherwise, we don't care where the file is stored (e.g. in memory or on disk)
        with tempfile.TemporaryFile() as tmp_file:
            self.fs.get_file(self.resolved_path.unresolve(), tmp_file)
            return tmp_file.read()
```
I'm not a big fan of this code, so maybe `hf_transfer` should support writing the fetched bytes to in-memory buffers.
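To illustrate the wish (purely hypothetical: `hf_transfer` has no such API today, and `download_to_buffer` is an invented name), the temp-dir branch above could then collapse to something like:

```python
import io

# Hypothetical hf_transfer API that writes the parallel-fetched chunks straight
# into an in-memory buffer instead of a file path on disk.
buf = io.BytesIO()
hf_transfer.download_to_buffer(url, buf)  # hypothetical, does not exist yet
data = buf.getvalue()
```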
AFAIK it's not possible to run `hf_transfer` in-memory. The current implementation here is not the best, but it still works and is purely based on public methods.
> AFAIK it's not possible to run hf_transfer in-memory

Indeed, but let's then open an issue in the `hf_transfer` repo? Besides the use case here, `hf_transfer` is meant for power users, and they have a lot of RAM (we can assume), so supporting in-memory downloads makes sense from that point of view, too.
@mariosasko @lhoestq I'll make a release of `hfh` soon (today?) with this PR. We can discuss/implement an in-memory download for `hf_transfer`, but that will not be ready in time. So for this release, would you prefer:

- to have `hf_transfer` download to a tmp file, then load it in memory and delete the tmp file (current implementation). Pro: allows `hf_transfer` to work. Con: more I/O operations.
- to disable `hf_transfer` when using `fs.read()` (sketched below). Pro: no extra I/O ops. Con: no `hf_transfer`.
- to wait for an `hf_transfer` PR before merging and releasing this one. Pro: no in-between solution. Con: will delay the `get_file` fix.
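A minimal sketch of the second option, assuming the `read()` branch shown earlier (the `seek(0)` is added here so the sketch reads back what was just written):

```python
# Ignore HF_HUB_ENABLE_HF_TRANSFER for full-file reads: always buffer through an
# anonymous temporary file (which may never touch disk, depending on the OS).
with tempfile.TemporaryFile() as tmp_file:
    self.fs.get_file(self.resolved_path.unresolve(), tmp_file)
    tmp_file.seek(0)  # rewind: get_file leaves the write cursor at the end
    return tmp_file.read()
```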
First, note that `.read()` can be called multiple times, e.g.

```python
header = f.read(5)
full_data = header + f.read()
```

Then I think it's OK to not have the full speed for `.read()`: big files are generally downloaded to disk, and `.read()` is generally used for smaller files (readme, json, images, audio...). I would also expect it to work in memory.

Finally, `fsspec` files have the notion of `block_size` and cache, which you are bypassing in the current implementation.

I think it's fine to not override `.read()` for now, especially since the original issue is about fast download to disk in `get_file`.
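For reference, a small illustration of the `block_size`/cache behavior being bypassed (the path is illustrative; passing `block_size` to `open` follows standard fsspec conventions):

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
# Buffered fsspec file: data is fetched in block_size chunks and cached, so a
# series of small read() calls does not trigger one HTTP request each.
with fs.open("datasets/user/repo/data.parquet", "rb", block_size=8 * 1024 * 1024) as f:
    header = f.read(5)  # served from the first cached block
```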
If you really want to ship a speed improvement in `.read()` for today's release, maybe you can check if `self.loc == 0` and use `HfFileSystemStreamFile.read()`? This wouldn't use the disk / `hf_transfer`, but it would at least unblock the current speed issue.
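A rough sketch of that suggestion (assuming it lives in `HfFileSystemFile.read`, and that opening with `block_size=0` yields the streaming `HfFileSystemStreamFile`, as elsewhere in `huggingface_hub`):

```python
def read(self, length: int = -1):
    if self.mode == "rb" and (length is None or length == -1) and self.loc == 0:
        # Nothing has been consumed yet: stream the whole file in one HTTP GET
        # instead of fsspec's block-by-block reads (no disk, no hf_transfer).
        with self.fs.open(self.path, "rb", block_size=0) as f:
            return f.read()
    return super().read(length)
```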
Updated in ae9e2f7.
LGTM
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Should fix the current regression in file download speed. When downloading an entire file, let's rely on `hf_hub_download`, which is faster, has a retry mechanism, and can be sped up with `hf_transfer` (on-demand).

=> On my local setup, speed went from ~15MB/s to 45MB/s (on wifi, no `hf_transfer`), which is strictly the same speed as a normal `hf_hub_download`.

cc @lhoestq @julien-c (related to https://twitter.com/ashvardanian/status/1769964480086024203)
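For context, the code path this PR speeds up is a plain whole-file download through the filesystem API (repo and file names here are illustrative):

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
# Whole-file downloads now hit the fast path, matching hf_hub_download's speed
# per the numbers above.
fs.get_file("datasets/username/my-dataset/data/train.parquet", "train.parquet")
```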
For the review, it's best to select "Hide whitespace" on GitHub.

TODO:

- `callback.tqdm` doesn't exist?