SpaCy v3.0 Nightly (explosion.ai)
224 points by binarymax on Oct 15, 2020 | 29 comments


Glad to finally share this release! I'll paste in something I shared on Twitter, about one of the features I think some will find surprising (at least, in the sense I might have been skeptical about it if it had been pitched to me a year ago). The twitter thread is here: https://twitter.com/honnibal/status/1316792607205470209

This release was SO much work! Glad to finally have it out. :beer:

The big impact for users will definitely be the transformer models and config system, but I want to talk about a feature I wasn't expecting to build until a few months ago: the new workflow system, spaCy Projects. :thread:

spaCy Projects was inspired by @DVCorg, and has an easy integration for DVC users. But it's also standalone: you can write a single YAML file and spaCy will get your data, trigger your processing steps, and use a remote cache. It even generates readmes: https://t.co/uRcfJWZlsQ?amp=1

The requirements that spaCy Projects handles are more general than spaCy. So why put it in the library? The case against it -- simple minimalism -- is pretty clear. The case for it is a bit more subtle.

We built spaCy Projects for the design-space it opens up for the rest of the library. If we can assume a workflow system (by giving it to you!), it's much easier to let things happen in separate steps. We can embrace approaches that would otherwise be awkward.

A good example is the way we've streamlined the training utilities. spaCy used to have this awkward class, GoldParse. That's gone now: the training annotations are represented with the Doc class, and the training data is read in from DocBin format -- which is about 100x smaller.

Why didn't we ditch GoldParse sooner? Well, there's definitely a downside to reading data from a binary format. It adds an extra conversion step. But with the projects system, that's no longer such a big deal. We now have a conventional way to describe such multi-step workflows.

I think the impact will be especially big for the ecosystem. It's extremely standard in NLP to have a bunch of steps to run to build counts, lists, models etc. Previously every spaCy Universe project had to build and describe their own conventions. Now we have a standard.


I’m constantly blown away at how high quality SpaCy is with such a small team. Everything: from API design, to documentation, to education, to tool ecosystem.

Congrats on the (preview) release and thank you for the great software. Exciting to see SpaCy pull transformers into the mix!


I’d definitely second this. I’m no NLP expert by any means but SpaCy makes it so easy to hit the ground running and start doing interesting things with NLP.

I was very surprised to see how small the team was, but it looks like they have some great talent. A little bit off topic, but the FastAPI creator works at Explosion. Using FastAPI has been one of the most enjoyable Python web development experiences for me.


The documentation is accessible and the examples are informative. Most approachable NLP toolset I have seen.


I wish the team was a little bit less hostile to redistributors :/


High quality what? Spacy doesn't have autocompletion, and it's the 21st century. Only a Python package could pull off such a feat.


Now we're all playing "what is (s)he talking about?"

I'm gonna go for: "spyder"

Final answer


Ok so I'll break it down for you: "Spacy doesn't have autocompletion" means that when you use the main spacy object or any other data structure from spacy, you won't get any autocompletion that lets you discover the API you're using. Not in any IDE I've tried, and not even on the simplest examples.

Was it too hard to understand, or is the reality too ridiculous to even think about it being true? A friend of mine had the same problem, so I think everybody has it. If you do have autocompletion when using Spacy, then say so: you would add intellectual value to the thread by constraining the scope of this existential bug for the Python NLP industry.


> Spacy doesn't have autocompletion

Python libraries don't implement autocompletion; language servers or completion engines do. I suggest you configure your editor correctly, because spacy works just fine with VS Code, vim, or what have you.


I was sure I've had autocomplete work for me, so I just tested with this nightly release and VS Code, and it works fine for me.

Here's a screenshot: https://imgur.com/a/9bcgPkz

There's a lot that can go wrong with getting Python autocomplete working, and it's possible you are seeing something different. But I would suggest it is environmental rather than something Spacy is doing.


You can downvote but this is the cold hard truth


You are likely getting downvoted because your complaint doesn't make any sense.

Autocompletion isn't a standard feature of any major NLP library, and blaming it on Python (currently the most popular language for NLP) isn't a good argument.


> Autocompletion isn't a standard feature of any major NLP library, and blaming it on Python (currently the most popular language for NLP) isn't a good argument.

I install a package (spacy) and I don't get any autocompletion of the API from the main spacy object or from any subsequent spacy data structure. Indeed, autocompletion is usually on the language side, but here it's a combination of the weaknesses of python and of the laziness of spacy developers: by making everything dynamic they broke the autocompletion that is normally working in python. Therefore it's not production ready. What else is there to say? I tried both PyCharm and VS Code.


> laziness of spacy developers by making everything dynamic they broke the autocompletion that is normally working in python.

Actually it's the opposite of "making everything dynamic". spaCy is mostly implemented in Cython, which is a language for writing C extensions for Python. So things are statically typed and memory managed, which sometimes doesn't work well with inspect. We set whatever compilation flags we can to get the annotations passed through. It's definitely not a question of laziness, though... You could say we worked extremely hard to break things in this particular way.

In v3 we've been able to finally ditch Python 2 support, which means we're able to embrace type annotations. Our machine learning library, https://thinc.ai, is extensively type-annotated and actually you can get type hints and type errors for composing layers incorrectly.

Personally I use vim and don't use autocompletion: I find it makes the typing laggy and I've just never really wanted it. We do have developers on the team who use autocompletion though, and I think spaCy actually supports it better than you suggest. I'm not sure whether you had a specific problem, but maybe see whether it works if you install spacy-nightly.

Btw I do think you've come across as a weird jerk in this thread though? You might want to read over your messages and think about that. Like, I (and I think most others) genuinely didn't get what you meant in your first few messages; I thought you were saying spaCy didn't let you _build_ autocomplete.


fwiw, I love spacy but the autocomplete has never worked for me in PyCharm either.

Many autocomplete implementations seem to struggle with anything that isn’t “pure python”; pyspark is notorious for it too.

...but, the type annotations do work! That’s definitely the right way to fix it.

Can’t wait~


I mentioned this elsewhere in the thread, but here is an image of it working with VS Code: https://imgur.com/a/9bcgPkz


I'm always impressed by SpaCy's web design, it's so distinctive and the documentation is a pleasure to navigate and read.


Ines Montani is the woman behind the awesome web design. I worked with her previously - she is amazing!


Spacy is so well built and the NLP capability is on point. I'm super proud of you guys! Great Job.


This is a huge release! Weights and Biases support, transformers, Ray support -- it just never stops!


SpaCy is the Ruby on Rails of NLP, clearly marking a before and after.


I love Spacy, but I was recently trying to embed documents with it and it seems that it uses word-vector averaging for that. Is it possible to do sentence or paragraph embedding like sentenceBert?
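
For anyone unfamiliar with the term, word-vector averaging amounts to roughly the following. This is only a minimal sketch with made-up toy vectors and hypothetical helper names (`TOY_VECTORS`, `doc_vector`, `cosine`), not spaCy's actual code. It shows why the approach loses word order: two documents with the same words in a different order get the same vector.

```python
from math import sqrt

# Toy 3-dimensional word vectors; the values are invented for illustration.
TOY_VECTORS = {
    "cats": [1.0, 0.0, 0.0],
    "purr": [0.0, 1.0, 0.0],
    "dogs": [1.0, 0.0, 1.0],
    "bark": [0.0, 1.0, 1.0],
}


def doc_vector(tokens):
    """Average the vectors of all tokens that have a vector."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return [0.0, 0.0, 0.0]
    # zip(*known) groups the values dimension by dimension.
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


v1 = doc_vector(["cats", "purr"])
v2 = doc_vector(["purr", "cats"])  # same words, different order
v3 = doc_vector(["dogs", "bark"])
print(v1 == v2, cosine(v1, v3))
```

A sentence encoder in the style of sentenceBert would instead produce a vector that depends on the whole sequence, which is the difference I'm asking about.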


Yes, that's the sort of thing that's now much better. See here for discussion about using transformers in v3: https://nightly.spacy.io/usage/embeddings-transformers

Document similarity is a tricky thing to get into the main API though, which is organised around the idea of sending a Doc through a sequence of steps, each of which adds or updates annotations. The signature for document similarity is fundamentally different. In the end we decided to not try to shoe-horn it in. Not everything needs to be one function call. So document relation predictions (including document similarity) wouldn't be a standard pipeline component.
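
To make the signature mismatch concrete, here's a rough plain-Python sketch. Every name in it (`ToyDoc`, `tokenizer`, `token_counter`, `run_pipeline`, `similarity`) is hypothetical and none of it is spaCy's API. Pipeline components all map one doc to one doc, while a similarity function needs two docs and returns a number, so it doesn't slot into the same loop.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyDoc:
    """Stand-in for a document that accumulates annotations."""
    text: str
    annotations: Dict[str, object] = field(default_factory=dict)


# Every pipeline component has the same shape: doc in, doc out.
Component = Callable[[ToyDoc], ToyDoc]


def tokenizer(doc: ToyDoc) -> ToyDoc:
    doc.annotations["tokens"] = doc.text.split()
    return doc


def token_counter(doc: ToyDoc) -> ToyDoc:
    # Relies on the annotation added by the previous step.
    doc.annotations["n_tokens"] = len(doc.annotations["tokens"])
    return doc


def run_pipeline(text: str, components: List[Component]) -> ToyDoc:
    doc = ToyDoc(text)
    for component in components:
        doc = component(doc)
    return doc


def similarity(a: ToyDoc, b: ToyDoc) -> float:
    """Two docs in, a float out: a different signature from a Component.

    Uses Jaccard overlap of tokens purely as a placeholder metric.
    """
    ta, tb = set(a.annotations["tokens"]), set(b.annotations["tokens"])
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


pipeline = [tokenizer, token_counter]
d1 = run_pipeline("spaCy is fast", pipeline)
d2 = run_pipeline("spaCy is fun", pipeline)
print(d1.annotations["n_tokens"], similarity(d1, d2))
```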


Really awesome to see that they added support for parallel training using Ray!


Please mention a brief description of what it is, in the title line. @dang


I generally ignore products whose front page talks as if the person reading already knows who they are and what they do. I have no idea what this is. Clicking on the front page, it says, "spaCy v3.0 is going to be a huge release! It features new transformer-based pipelines that get spaCy’s accuracy right up to the current state-of-the-art"

What the heck does that mean? What is spaCy, first of all?

This just tells me they're not to be taken seriously.


Tone down the entitlement and the attitude, and think about what the linked article is exactly. It's a bunch of release notes for the 3.0 version.

If I click back to the actual homepage[0], it says:

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

[0] https://explosion.ai/software#spacy


Not sure what front page you are looking at. First thing it says on spacy.io is

Industrial-Strength Natural Language Processing in Python

Followed by some decent descriptions and code examples.


"spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning."

https://explosion.ai/software#spacy



