Thoughts about multi-file transfer APIs #471

@arogozhnikov

Hi Kyle,
very interesting project!

I ran a minimal non-async test for downloads (from R2, which is S3-compatible).

                  obstore    s3fs
20 x 100 bytes    2.85 s     2.65 s
20 x 1e6 bytes    4.12 s     3.36 s

In this regime obstore was somewhat slower than s3fs. I initially thought this could be because s3fs authenticates with the service only once, but it seems there is a difference in download speed as well. As a side comment, the overhead for uploading a single small file is large in both cases.

So, why wouldn't I just use async?

Well, there are a couple of points with async:

  1. Inconvenience: one needs to know whether the code is already running inside an event loop, and create one otherwise. Every piece of code has to deal with this. That's on Python, but it should still be mentioned.
  2. Load with async is unpredictable, and that's a blocker. My common use case is to upload/download multiple files at once: sometimes a couple of large files, sometimes many small files, sometimes many large files.
    Running many files in parallel would eat up CPU and saturate the outbound channel (this has happened several times), so there should be some global switch for this operation to use e.g. no more than 12 cores (as in your defaults). But as I understand it (correct me if I'm wrong here), every call of obstore operates independently and spawns additional threads in the case of a multipart upload. So the total number of threads can reach 12 x n_files currently being uploaded.
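
To make both points concrete, here is a minimal sketch of what user code has to do today, using only the standard library. The names `gather_bounded` and `run_sync` are hypothetical helpers, not part of obstore. Note that the semaphore only caps how many transfers are in flight at once; it cannot limit whatever parallelism each individual call uses internally.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def gather_bounded(
    factories: Iterable[Callable[[], Awaitable[T]]],
    max_in_flight: int = 12,
) -> List[T]:
    """Await the coroutines produced by `factories`, at most `max_in_flight` at a time.

    Results are returned in the same order as `factories`.
    """
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(factory: Callable[[], Awaitable[T]]) -> T:
        # Each transfer waits here until one of the slots is free.
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in factories))


def run_sync(coro):
    """Run `coro` from synchronous code (point 1: the caller must check for a loop)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so it is safe to start one.
        return asyncio.run(coro)
    # A loop is already running (e.g. in a notebook); the caller must `await` instead.
    coro.close()
    raise RuntimeError("already inside an event loop; use 'await' instead")
```

Each factory would be something like `lambda: some_async_download(path)`, where `some_async_download` stands in for any async transfer call.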

So, my wish would be to have something like

put_many(tuples_of_src_tgt, ...)
get_many(tuples_of_src_tgt, ...)

These would automatically allocate a pool of threads (12, or whatever the user asks for) and use them optimally until all objects/chunks are uploaded or downloaded.
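
A rough sketch of the behaviour I mean, written with the standard library only. `transfer_many` and `transfer_one` are hypothetical names for illustration, not existing obstore functions. This version shares one pool across files; scheduling individual chunks of multipart transfers in the same pool would need support inside the library.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple, TypeVar

R = TypeVar("R")
Pair = Tuple[str, str]


def transfer_many(
    pairs: Iterable[Pair],
    transfer_one: Callable[[str, str], R],
    max_workers: int = 12,
) -> Dict[Pair, R]:
    """Call `transfer_one(src, tgt)` for every pair using one shared thread pool.

    Returns a dict mapping each (src, tgt) pair to its result. If any transfer
    raised, that exception is re-raised when its result is collected.
    """
    pairs = list(pairs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front; the pool never runs more than
        # `max_workers` transfers at the same time.
        futures = {pair: pool.submit(transfer_one, *pair) for pair in pairs}
        return {pair: future.result() for pair, future in futures.items()}
```

Here `transfer_one` would wrap a blocking download or upload of a single object, so `get_many` and `put_many` could be thin wrappers around the same pool.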

LMK if that's too much to ask 😆
