We've talked about a use case where archives keep datasets internally but put up an S3 API facade for remote access with AXS. E.g., imagine the data physically resides at IPAC and MAST but is being analyzed at TACC. The question then is whether accesses to the datasets can be transparently cached where AXS is running, for faster repeated access.
Option 1: Spark seems to have recently added support for caching remote datasets through the Delta cache. It's not clear to me whether this is broadly available or a Databricks-only thing; this should be the first thing to investigate.
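For reference, here's a minimal sketch of what this might look like if it is the Databricks Delta cache (a.k.a. disk cache). The `spark.databricks.io.cache.enabled` setting is a Databricks runtime knob, and the bucket/path names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("axs-delta-cache").getOrCreate()

# Databricks-only, as far as I can tell: enable the Delta/disk cache,
# which transparently keeps copies of remote Parquet data on the
# workers' local SSDs.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Hypothetical S3 facade exposed by the archive.
df = spark.read.parquet("s3a://archive-facade/catalogs/ztf/")

df.count()  # first read pulls from S3 and populates the local cache
df.count()  # repeated reads should now be served from the workers' disks
```

If this mechanism really is Databricks-only, it wouldn't help AXS deployments at TACC, which is exactly what needs investigating.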
Option 2: Another way to do this may be to have AXS access the files through a caching layer. I looked at S3 caching options and found there are many. One example is s3fs-fuse
(and see the list of more projects at the bottom of the s3fs-fuse README); a rough sketch of this mount-based approach is below.
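Concretely, the idea would be to mount the archive's S3 facade with s3fs-fuse using its `use_cache` option, then point AXS/Spark at the mount point. This is only a sketch under assumed names (the bucket, mount point, cache directory, and catalog path are all hypothetical):

```python
# Assumed setup, done outside of Spark: mount the archive's S3 facade
# with a local file cache, e.g.
#   s3fs archive-facade /mnt/archive -o use_cache=/scratch/s3fs-cache
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("axs-s3fs-cache").getOrCreate()

# Read through the FUSE mount; s3fs keeps fetched objects under
# /scratch/s3fs-cache, so repeated scans are served from local disk
# instead of going back over the WAN.
df = spark.read.parquet("/mnt/archive/catalogs/ztf/")
df.count()  # first pass downloads and caches; later passes stay local
```

One caveat with this approach: on a multi-node deployment, every executor node would need the mount, and the s3fs cache is per-node, so each worker pays the first-read cost independently.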
Opening this issue so we don't forget about this use case.
(@dennyglee, @zecevicp, any thoughts/ideas/comments?)