Glad to finally share this release! I'll paste in something I shared on Twitter, about one of the features I think some will find surprising (at least, in the sense I might have been skeptical about it if it had been pitched to me a year ago). The twitter thread is here: https://twitter.com/honnibal/status/1316792607205470209
This release was SO much work! Glad to finally have it out.
:beer:
The big impact for users will definitely be the transformer models and config system, but I want to talk about a feature I wasn't expecting to build until a few months ago: the new workflow system, spaCy Projects. :thread:
spaCy Projects was inspired by @DVCorg, and has an easy integration for DVC users. But it's also standalone: you can write a single YML file and spaCy will get your data, trigger your processing steps, and use a remote cache. It even generates readmes: https://t.co/uRcfJWZlsQ?amp=1
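For a concrete picture of what such a YML file looks like, here's a hypothetical sketch of a `project.yml` (the URLs, script paths, and command names are placeholders, not from a real project; check the spaCy projects docs for the exact schema):

```yaml
# Hypothetical project.yml sketch -- asset URL and scripts are placeholders.
vars:
  lang: "en"

assets:
  - dest: "assets/train.jsonl"
    url: "https://example.com/train.jsonl"   # fetched by `spacy project assets`

commands:
  - name: "preprocess"
    script:
      - "python scripts/convert.py assets/train.jsonl corpus/train.spacy"
    deps:
      - "assets/train.jsonl"
    outputs:
      - "corpus/train.spacy"
  - name: "train"
    script:
      - "python -m spacy train configs/config.cfg --output training/"
    deps:
      - "corpus/train.spacy"

workflows:
  all:
    - preprocess
    - train
```

Running `spacy project run all` would then execute the steps in order, skipping ones whose inputs haven't changed.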
The requirements that spaCy Projects handles are more general than spaCy. So why put it in the library? The case against it -- simple minimalism -- is pretty clear. The case for it is a bit more subtle.
We built spaCy Projects for the design-space it opens up for the rest of the library. If we can assume a workflow system (by giving it to you!), it's much easier to let things happen in separate steps. We can embrace approaches that would otherwise be awkward.
A good example is the way we've streamlined the training utilities. spaCy used to have this awkward class, GoldParse. That's gone now: the training annotations are represented with the Doc class, and the training data is read in from the DocBin format -- which is about 100x smaller.
Why didn't we ditch GoldParse sooner? Well, there's definitely a downside to reading data from a binary format. It adds an extra conversion step. But with the projects system, that's no longer such a big deal. We now have a conventional way to describe such multi-step workflows.
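As a rough sketch of the DocBin round-trip described above (the `DocBin` API is real spaCy; the data here is toy):

```python
import spacy
from spacy.tokens import DocBin

# Toy example of the DocBin round-trip: pack Doc objects into the
# compact binary format, serialize, and read them back.
nlp = spacy.blank("en")
docs = [nlp("spaCy v3 is out"), nlp("GoldParse is gone")]

db = DocBin(docs=docs)
data = db.to_bytes()  # compact binary payload (db.to_disk would write a file)

restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
```

The same format serves both training corpora and cached intermediate outputs in a projects workflow.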
I think the impact will be especially big for the ecosystem. It's extremely standard in NLP to have a bunch of steps to run to build counts, lists, models etc. Previously every spaCy Universe project had to build and describe their own conventions. Now we have a standard.
I’m constantly blown away by how high-quality spaCy is with such a small team. Everything: from API design, to documentation, to education, to the tool ecosystem.
Congrats on the (preview) release and thank you for the great software. Exciting to see SpaCy pull transformers into the mix!
I’d definitely second this. I’m no NLP expert by any means but SpaCy makes it so easy to hit the ground running and start doing interesting things with NLP.
I was very surprised to see how small the team was, but it looks like they have some great talent. A little bit off topic, but the FastAPI creator works at Explosion. Using FastAPI has been the most enjoyable Python web development experience for me.
Ok so I'll break it down for you:
Spacy doesn't have autocompletion, meaning that when you use Spacy or any other data structure from spacy, you won't get any autocompletion that lets you discover the API you're using.
Not in any IDE I've tried, even on the simplest examples.
Was it too hard to understand, or is the reality too ridiculous to even consider being true?
A friend of mine had the same problem, so I think everybody has it. If you do get autocompletion when using Spacy, then say so; you would add value to the thread by constraining the scope of this existential bug for the Python NLP industry.
Python libraries don't implement autocompletion; language servers and completion engines do. I suggest you configure your editor correctly, because spacy works just fine with VS Code, vim, or what have you.
There's a lot that can go wrong with getting Python autocomplete working, and it's possible you are seeing something different. But I would suggest it is environmental rather than something Spacy is doing.
You are likely getting downvoted because your complaint doesn't make any sense.
Autocompletion isn't a standard feature of any major NLP library, and blaming it on Python (currently the most popular language for NLP) isn't a good argument.
> Autocompletion isn't a standard feature of any major NLP library, and blaming it on Python (currently the most popular language for NLP) isn't a good argument.
I install a package (spacy)
I don't get any autocompletion of the API from the main spacy object or any other subsequent spacy data structure.
Indeed, autocompletion is usually handled on the language side, but here it's a combination of Python's weaknesses and the laziness of the spacy developers: by making everything dynamic, they broke the autocompletion that normally works in Python.
Therefore it's not production-ready; what else is there to say?
I tried both pycharm and vscode
> lazyness of spacy developers by making everything dynamic they broke the autocompletion that is normally working in python.
Actually it's the opposite of "making everything dynamic". spaCy is mostly implemented in Cython, which is a language for writing C extensions for Python. So things are statically typed and memory-managed, which sometimes doesn't work well with Python's `inspect` machinery. We set whatever compilation flags we can to make sure the annotations are passed through, though. It's definitely not a question of laziness... You could say we worked extremely hard to break things in this particular way.
In v3 we've been able to finally ditch Python 2 support, which means we're able to embrace type annotations. Our machine learning library, https://thinc.ai, is extensively type-annotated, and you can actually get type hints and type errors when composing layers incorrectly.
Personally I use vim and don't use autocompletion; I find it makes the typing laggy and I've just never really wanted it. We do have developers on the team who use autocompletion though, and I think spaCy actually supports it better than you suggest. I'm not sure whether you had a specific problem, but maybe see whether it works if you install spacy-nightly.
Btw, I do think you've come across as a bit of a jerk in this thread. You might want to read over your messages and think about that. Like, I (and I think most others) genuinely didn't get what you meant in your first few messages; I thought you were saying spaCy didn't let you _build_ autocomplete.
I love spaCy, but I was recently trying to embed documents with it and it seems that it uses word-vector averaging for that. Is it possible to do sentence or paragraph embedding like Sentence-BERT?
Document similarity is a tricky thing to get into the main API though, which is organised around the idea of sending a Doc through a sequence of steps, each of which adds or updates annotations. The signature for document similarity is fundamentally different. In the end we decided not to try to shoehorn it in. Not everything needs to be one function call. So document relation predictions (including document similarity) wouldn't be a standard pipeline component.
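To make the word-vector-averaging baseline mentioned above concrete, here's a numpy-only sketch (not spaCy's actual implementation, though `Doc.vector` behaves similarly by default):

```python
import numpy as np

def doc_vector(word_vectors: np.ndarray) -> np.ndarray:
    # Average the word vectors: the classic bag-of-vectors document
    # representation (roughly what Doc.vector gives you without a
    # transformer in the pipeline).
    return word_vectors.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "word vectors" for two short, mostly-overlapping documents.
doc1 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
doc2 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.1]])
sim = cosine(doc_vector(doc1), doc_vector(doc2))
```

Averaging ignores word order entirely, which is exactly why trained sentence encoders like Sentence-BERT often do better on similarity tasks.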
I generally ignore products whose front page talks as if the person reading already knows who they are and what they do. I have no idea what this is. Clicking on the front page, it says: "spaCy v3.0 is going to be a huge release! It features new transformer-based pipelines that get spaCy’s accuracy right up to the current state-of-the-art"
What the heck does that mean? What is spaCy, first of all?
This just tells me they're not to be taken seriously.
Tone down the entitlement and the attitude, and think about what the linked article is exactly. It's a bunch of release notes for the 3.0 version.
If I click back to the actual homepage[0], it says:
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.
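For anyone landing here with no context, a minimal usage sketch of what that description means in practice (using a blank pipeline so no model download is needed; real pipelines are loaded with `spacy.load`):

```python
import spacy

# Build a blank English pipeline and tokenize a sentence.
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup.")
tokens = [t.text for t in doc]
```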