[Documentation] Serializing Pipeline unclear #13642

Open
DomHudson opened this issue Sep 30, 2024 · 2 comments

Comments

@DomHudson
Contributor

Summary

This page says that, to serialize a pipeline, you use the following methods:

config = nlp.config
bytes_data = nlp.to_bytes()

and that you must take care of storing both and then loading them from disk.

However, it also appears that:

nlp.to_disk('directory_name')

coupled with:

spacy.load('directory_name')

works, and this is a lot simpler. The code executes, and I can successfully call the loaded nlp object on text.

Questions

  1. Does this approach actually work identically?

    1. If so, can we update the documentation? The nlp.config and to_bytes seem like implementation details rather than the API for serializing?
    2. I didn't see a mention on this page that you can load the persisted pipeline from disk with spacy.load. Should this be added?
  2. If this approach doesn't work, I think we should call this out and build a function/method that handles loading and saving to disk with a single call - this seems better than having to write your own disk persistence for the config and bytes object. What do you think?

Thanks!

Which page or section is this issue related to?

https://spacy.io/usage/saving-loading

@honnibal
Member

It's true that the docs shouldn't really lead with the to_bytes() example, since it's usually less useful than nlp.to_disk().

The different serialization functions do different things, for different contexts. The main thing to keep in mind is that initialization and deserialization of data are handled in different steps, so that you can do one without the other. The spacy.load() function does both: it uses the config to initialize the nlp object, and then loads in the data. The nlp.from_disk() and nlp.from_bytes() functions only load in data, trusting that you've set up the nlp object correctly beforehand. The nlp.to_bytes() and nlp.to_disk() functions give you the data that you could later load in with from_{disk/bytes}.
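The two steps described above can be sketched side by side. This is a minimal illustration, not the documented recommendation: it assumes spaCy v3, and it uses a blank English pipeline with a sentencizer (an arbitrary stand-in, so no trained model needs to be downloaded). The variable names are hypothetical.

```python
import tempfile

import spacy

# Assumption: a blank English pipeline with a rule-based sentencizer stands in
# for a trained pipeline, so the sketch runs without downloading a model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
text = "This is one sentence. This is another."

# Route 1: to_disk() writes the pipeline directory (including its config);
# spacy.load() initializes the nlp object from that config and loads the data.
with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)
    nlp_from_disk = spacy.load(tmp_dir)

# Route 2: from_bytes() only loads data into an existing nlp object, so the
# object is first initialized from the config, then the bytes are loaded.
config = nlp.config
bytes_data = nlp.to_bytes()
lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
nlp_from_bytes = lang_cls.from_config(config)
nlp_from_bytes.from_bytes(bytes_data)

# Both restored pipelines should have the same components as the original.
for restored in (nlp_from_disk, nlp_from_bytes):
    print(restored.pipe_names, [sent.text for sent in restored(text).sents])
```

In route 1 the caller never touches the config directly, which is why it is the shorter option when the pipeline only needs to be persisted to and restored from disk.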

Sometimes your model will need custom code in order to be loaded. For this you can make your model a Python package, and then spacy.load() can take an entry-point that will resolve to your package. This is what we do for the built-in models: there's a package called e.g. en_core_web_sm, and that's where it loads the model from.

@DomHudson
Contributor Author

Hi @honnibal,

Thank you for your response!

If I just want to persist and load a model from disk, is this code accurate?

import spacy
nlp = spacy.load('en_core_web_lg')

nlp.to_disk('/path/to/directory')
nlp = spacy.load('/path/to/directory')

Thank you!
