@mart-r mart-r commented Dec 22, 2025

Until now, if there was no Internet access, a failed download of the fallback spacy model would raise an exception and halt the entire process. Most models should come with their own spacy model anyway (if that's what they use), so the fallback model shouldn't be needed most of the time.

So this PR allows the subprocess for the spacy model download to fail if there's a network issue. This should make it easier to use the library in scenarios where this method is called. This normally happens if/when a model is created from scratch and no on-disk model is provided. But it can also affect converting models from v1 to v2 format.
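As a rough illustration of the idea (not the code in this PR), a tolerant download could look something like the sketch below. The function name `try_download_spacy_model` and its `command` parameter are hypothetical and exist only for this example; the only real interface assumed is spacy's `python -m spacy download <name>` CLI. The sketch also does not try to tell network failures apart from other failures, it simply treats any failed subprocess as non-fatal.

```python
import logging
import subprocess
import sys
from typing import Optional, Sequence

logger = logging.getLogger(__name__)


def try_download_spacy_model(
    model_name: str,
    command: Optional[Sequence[str]] = None,
) -> bool:
    """Attempt to download a spacy model; return False instead of raising.

    `command` is a hypothetical override (handy for testing offline);
    by default the spacy download CLI is invoked.
    """
    if command is None:
        # `python -m spacy download <name>` is spacy's documented CLI.
        command = [sys.executable, "-m", "spacy", "download", model_name]
    try:
        # check=True turns a non-zero exit code into CalledProcessError.
        subprocess.run(command, check=True, capture_output=True)
    except (subprocess.CalledProcessError, OSError) as err:
        # Log and carry on: the caller may still have a model on disk.
        logger.warning(
            "Could not download spacy model %r: %s", model_name, err)
        return False
    return True
```

A caller would then check the return value and only fail later if no usable spacy model turns out to be available at all.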

The other thing this PR does is fix the renaming of 'odd' spacy models (e.g. spacy_model and unsupported ones like en_core_sci_*). This renaming was the main reason a download of the en_core_web_md model was attempted at conversion time. Now, we use the path to the previously saved model instead and change the name of the model later down the road.

There was another nuance that I had to address. Some older CDBs included a config. And if the config is converted along with the CDB, it normally makes more sense to fix the spacy model name right there and then. Only if we're in a full-model conversion scenario does it make sense to delay that (there's no way to guarantee the absolute path of the spacy model when converting a CDB on its own). This omission originally caused downstream workflows (medcat-service) to fail.
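To make the two cases concrete, here is a minimal sketch of the decision described above. Everything in it is an assumption made for illustration: the helper names, the `SUPPORTED_PREFIXES` rule for what counts as an 'odd' name, the on-disk layout, and the choice of en_core_web_md as the name a standalone CDB config gets fixed to. The actual conversion code may differ on all of these.

```python
import os
from typing import Optional

# Assumption: names with this prefix are accepted as-is; anything else
# (e.g. "spacy_model", "en_core_sci_*") is treated as 'odd'.
SUPPORTED_PREFIXES = ("en_core_web_",)
# Assumption: the name an odd model is fixed to when no path is known.
DEFAULT_MODEL = "en_core_web_md"


def is_odd_name(name: str) -> bool:
    """Return True for spacy model names that need special handling."""
    return not name.startswith(SUPPORTED_PREFIXES)


def resolve_spacy_model(name: str,
                        saved_model_dir: Optional[str] = None) -> str:
    """Pick what the converted config should point at.

    saved_model_dir is the (hypothetical) folder of the model pack being
    converted; None means a CDB is being converted on its own.
    """
    if not is_odd_name(name):
        return name
    if saved_model_dir is not None:
        # Full-model conversion: use the copy already on disk and
        # leave the rename for a later step.
        return os.path.join(saved_model_dir, name)
    # CDB-only conversion: no reliable path exists, so fix the name now.
    return DEFAULT_MODEL
```

The point of the split is that only the full-model path has a saved spacy model to load, so only that path can avoid touching the network entirely.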

So TLDR:

  • This PR is designed to fix issues with converting v1 models in Internet-constrained environments
    • Previously a spacy model download may have been initiated that could fail
    • Now the model is loaded off disk instead (as would have probably been expected)

…downloading of models instead of using the one off disk