Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copying files between filesystems via GenericFileSystem broken for filesystems requiring init parameters #1167

Open
Metamess opened this issue Jan 26, 2023 · 11 comments

Comments

@Metamess
Copy link
Contributor

Metamess commented Jan 26, 2023

In our project, we use the GenericFileSystem's _cp_file function to (asynchronously) copy files from a remote SSH filesystem to our local filesystem (as suggested in this comment).

Before the release of 2023.1.0, this function accepted the parameters fs and fs2, which could be used to pass in existing filesystem instances between which the copy operation should take place. Since 2023.1.0 however, these parameters have been removed (silently and without reference in the changelog, though perhaps the 'experimental' note in the docs means changes won't get a notice?).

With the removal of these parameters, this function now only uses filesystem instances as resolved by the _resolve_fs function based on the url parameter. But, since the resolve function returns the result of a call to registry.filesystem() with only the inferred protocol as parameter (and no **storage_options), this fundamentally breaks for filesystems that require some parameter for their __init__ function, like the SFTPFileSystem (the default implementation for the SSH protocol, and which requires a host parameter).

This issue can easily be reproduced by calling the _cp_file function on a GenericFileSystem where one of the url parameters starts with "ssh://" as protocol, or any other protocol whose FileSystem implementation requires additional parameters.

Of course, if there is another workflow to get the _cp_file function working for pre-defined filesystems that is the 'intended' way, I would very much like to hear it and this issue should be closed.

@martindurant
Copy link
Member

I am sorry that the parameters you were using disappeared.

The functionality can still be achieved, though, with the following intended workflow:

# create the instances as you did before
fsspec.generic._generic_fs[protocol1] = fsspec.filesystem(...) fsspec.generic._generic_fs[protocol2] = fsspec.filesystem(...)
fs = fsspec.filesystem("generic", default_method="generic")
fs.cp(...)

The "generic" method means look up instances in the in-module dict rather than call the typical fsspec machinery for creating a new instance.

Indeed, this is very experimental, and your feedback is important to get the process smooth. Perhaps an official "set_generic_instance" function would be helpful. Also, the "generic" method has no fallback if the instance isn't in the dict.

The other default_methods of interest are:

  • "current": follows the usual way of identifying the class from the protocol and uses whatever instance of that class was most recently made
  • "options" (not yet implemented) a whole {protocol: kwargs} dict to pass around and create instances for the found protocols when needed.

@martindurant
Copy link
Member

Honestly, I was not aware of anyone using genericFS yet, and I'm glad you are! Do you have an interesting use case you can share?

@Metamess
Copy link
Contributor Author

Thank you for the suggestion! As the dict key would only be the protocol used, that would mean you would not be able to copy files between two (for example) SSH filesystems, correct? Or (as is one of our use cases) copy from N>1 remote machines over SSH to the local filesystem (unless you alter the dict before every call, but that would really be a work-around).

Our current use case it this: We currently use the GenericFileSystem exclusively to facilitate the downloading of specific batches of files (multiple async function calls, called in bulk using asyncio.gather) from remote machines over SSH (if the files exist using _exists), and copying them to the local machine (a docker container). We use the _cp_file function specifically because it is the async variant. (Our IDE's constantly lament our 'accessing of protected members', but that's another story, and perhaps we're also missing some intended usage there)

The 'source' (SFTPFileSystems) filesystems (and the 'destination' (LocalFileSystem)) are created once at the start of the process and provided with data such as hostnames, usernames and private keys. After that we only need to provide the internal paths to the function performing the copies.

I hope that helps, and of course thank you for all the work put into fsspec!

@martindurant
Copy link
Member

I had not anticipated that this could be used for different instances of the same type of filesystem! I suppose you could still make it work by adding instances to the _generic_fs dict as I mention, but give the two FSs different protocols, like sftp1://, sftps2://. This dict is not used by the general fsspec lookup mechanism, so you wouldn't be breaking anything.

I am surprised that you want to use this for the async functionality, though, since the SFTP implementation is not itself async, so I don't think you'll see any benefit. Are you using https://github.com/fsspec/sshfs , which is based on asyncssh rather than paramiko?

@ryaminal
Copy link

ryaminal commented Apr 16, 2024

i am using GenericFileSystem to leverage the rsync method. because i don't want to write it myself.

Not certain if there is a better way, but i found this to work:

import fsspec
import fsspec.generic

fsspec.url_to_fs("sftp://host:12345")

fsspec.generic.rsync(
    "sftp:///path/to/fake_sftp/",
    "gs://le_bucket/fake_sftp/",
    inst_kwargs={"default_method": "current"},
)

note: this method does not work if i use sshfs. no error output, just some debug messages like:

DEBUG:fsspec.generic:ignoring cp exception for sftp:///path/to/fake_sftp/blah.txt: [Errno 2] No such file or directory
DEBUG:fsspec.generic:ignoring cp exception for sftp:///path/to/fake_sftp/blah/ha.txt: [Errno 2] No such file or directory
DEBUG:fsspec.generic:ignoring cp exception for sftp:///path/to/fake_sftp/foo.txt: [Errno 2] No such file or directory

so, that's weird

generic

generic, with gs, required me to do something like:

fsspec.generic._generic_fs["sftp"] = fsspec.filesystem("sftp", host="host", port=12345, username="meh")
fsspec.generic._generic_fs["gcs"] = fsspec.generic._generic_fs["gs"] = fsspec.filesystem("gs")

generic_fs = fsspec.filesystem("generic", default_method="generic")
generic_fs.rsync("sftp:///path/to/fake_sftp", "gs://le_bucket/ake_sftp")

might get more cumbersome for other filesystems.

@ryaminal
Copy link

ah, just saw #1398. slightly different issues, but very similar.

@martindurant
Copy link
Member

Using "current" or setting keys in _generic_fs are totally reasonable ways to specify options for rsync. You can make the FS instance(s) using the constructor or fsspec.filesystem rather than url_to_filesystem, if that is easier.

@ryaminal
Copy link

ryaminal commented Apr 16, 2024

The largest problem i'm facing now is the if not fs.isdir(source) check in fsspec.generic.rsync.
if i remove that check it all works. it doesn't work, at least the diff doesn't, as expected.
i think someone else pointed to this but it appears to be how the _strip_protocol is working.
I'll try find a fix, but if someone else already knows of one or has more questions or thoughts, let me know.

@ryaminal
Copy link

ryaminal commented Apr 17, 2024

Two issues i think as i dive in to this

1. isdir issue

ah, perhaps we are missing something like path = self._strip_protocol(path) in sshfs's implementation of _info?
here
the isdir issue i mentioned above is likely from sshfs and i think can be fixed. i'll open a PR over in that repo.

2. copy fails: file not found

for the second problem, i think this might be a problem in fsspec.generic.GenericFileSystem._copy() here.
i think we need to _strip_protocol the paths here.

also, why isn't fsspec.generic.GenericFileSystem._cp_file being used? i would very much like a method that doesn't create a temp file. does a call to _cp_file need to be added to _copy?

probably best if i move my stuff to a new/different ticket as the original issue of being able to pass init parameters in is "working"...

@martindurant
Copy link
Member

Please link the sshfs issue or PR is there is one, and I agree we can close this given the discussion in #1578

@ryaminal
Copy link

@martindurant , here is the SSHFS fix for isdir(issue 1 above)

And here is a proposed fix for issue 2 above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants