There are different models for each language pair. Currently there are only pre-...

yorwba · on Feb 6, 2021

Thanks for the explanation. Pivoting through English isn't ideal, but I'm just glad someone is working on this at all.

Thinking about it a bit more, it's a bit weird that the failure mode of a weak model would be to regurgitate the input unchanged. I'd rather have expected random Chinese gibberish in that case. Doesn't that mean the model has seen at least a few cases where English sentences were left untranslated in the training data?

I wanted to download the training data to check, but the instructions here https://github.com/argosopentech/onmt-models#download-data say to use OPUS-Wikipedia, which has no en-zh pairs, so the Chinese data must be from some other source.

pjfin123 · on Feb 6, 2021

Pivoting through English isn't inherent to Argos Translate; you could train a French-German model or whatever you want. I've just been focusing on training models to add new languages. The ideal strategy is to have models that know multiple languages.

Quoting a previous HN comment:

I think cloud translation is still pretty valuable in a lot of cases, since the model for one single direction translation is ~100MB. In addition to having more language options without a large download, cloud translations let you use more specialized models, for example French to Spanish. I just have a model to and from English for each language, and any other translations have to "pivot" through English. For cloud translations you can also use one model with multiple input and output languages, which gives you better quality translation between languages that don't have as much data available and lets you support direct translation between a large number of languages. Here's a talk where Google explains how they do this for Google Translate: https://youtu.be/nR74lBO5M3s?t=1682. You could do this locally, but it would have its own set of challenges for getting the right model for the languages you want to translate.
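To make "pivoting" concrete, here is a minimal sketch of the routing logic, not Argos Translate's actual code. It assumes each single-direction model can be treated as a plain text-to-text function; `translate_with_pivot`, the `models` dictionary, and the toy lookup tables are all hypothetical names made up for illustration.

```python
from typing import Callable, Dict, Tuple

# A "model" here is any function from text to text (a hypothetical stand-in
# for a real single-direction translation model).
Translator = Callable[[str], str]


def translate_with_pivot(
    text: str,
    src: str,
    tgt: str,
    models: Dict[Tuple[str, str], Translator],
    pivot: str = "en",
) -> str:
    """Use a direct src->tgt model if one exists, otherwise go src->pivot->tgt."""
    if src == tgt:
        return text
    direct = models.get((src, tgt))
    if direct is not None:
        return direct(text)
    to_pivot = models.get((src, pivot))
    from_pivot = models.get((pivot, tgt))
    if to_pivot is None or from_pivot is None:
        raise KeyError(f"no model path from {src!r} to {tgt!r} via {pivot!r}")
    # Two hops: the second model only ever sees the first model's output.
    return from_pivot(to_pivot(text))


# Toy lookup tables standing in for trained models.
fr_en = {"bonjour": "hello"}
en_de = {"hello": "hallo"}
models = {
    ("fr", "en"): lambda s: fr_en.get(s, s),
    ("en", "de"): lambda s: en_de.get(s, s),
}
print(translate_with_pivot("bonjour", "fr", "de", models))
```

With N languages this needs only 2(N-1) single-direction models instead of N(N-1), at the cost that any mistake in the first hop is passed on to the second.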

> Thinking about it a bit more, it's a bit weird that the failure mode of a weak model would be to regurgitate the input unchanged. I'd rather have expected random Chinese gibberish in that case. Doesn't that mean the model has seen at least a few cases where English sentences were left untranslated in the training data?

This was added last week; it's just not live on libretranslate.com yet:

https://github.com/uav4geo/LibreTranslate/issues/33

The training scripts are just an example for English-Spanish; Opus (http://opus.nlpl.eu/) has data for English-Chinese.

yorwba · on Feb 6, 2021

> https://github.com/uav4geo/LibreTranslate/issues/33

That issue is about emojis, which confused me a bit, but I see that the linked commit is about replacing <unk> by the corresponding source token. https://github.com/argosopentech/argos-translate/commit/6a0f...

That's definitely an improvement, but what I was actually wondering about was why the model was copying the input verbatim in the first place.
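To illustrate the difference: `<unk>` replacement only patches individual tokens the decoder could not produce, so it would not by itself account for a whole sentence being copied. Below is a rough sketch of the general technique, not the code from the linked commit. It assumes the decoder exposes a per-target-token attention distribution over source tokens and picks the source token with the highest weight; the function name `replace_unk` and the toy attention matrix are hypothetical.

```python
from typing import List, Sequence


def replace_unk(
    target_tokens: Sequence[str],
    source_tokens: Sequence[str],
    attention: Sequence[Sequence[float]],
    unk: str = "<unk>",
) -> List[str]:
    """Replace each unk in the output with the most-attended source token.

    attention[i][j] is assumed to be the weight target token i puts on
    source token j.
    """
    out = []
    for tok, weights in zip(target_tokens, attention):
        if tok == unk and len(weights) == len(source_tokens) and weights:
            # Index of the source token with the largest attention weight.
            best = max(range(len(weights)), key=lambda j: weights[j])
            out.append(source_tokens[best])
        else:
            out.append(tok)
    return out


# Toy example: the emoji is out of vocabulary, so the decoder emits <unk>.
source = ["I", "like", "🍕"]
target = ["我", "喜欢", "<unk>"]
attention = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]
print(replace_unk(target, source, attention))
```

Tokens other than `<unk>` pass through untouched, which is why this helps with emojis and names but leaves the verbatim-copy question open.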

Looking through the en-zh data in Opus, it looks like a bit of a mixed bag. This sample from OpenSubtitles v1 contains an ad for MyDVDrip in the Chinese version only: http://opus.nlpl.eu/OpenSubtitles/v1/en-zh_sample.html Filtering the data with some heuristics might be a good idea.
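As a sketch of what such heuristics could look like (the rules and thresholds are my own guesses, not something tested against the Opus data): drop pairs where the Chinese side is empty, identical to the English side, contains no CJK characters, carries a URL the English side lacks, or is wildly different in length. The function name `keep_pair`, the `max_ratio` default, and the example sentences are hypothetical.

```python
import re

# Basic CJK Unified Ideographs block; enough for a rough check.
CJK = re.compile(r"[\u4e00-\u9fff]")
URL = re.compile(r"https?://|www\.", re.IGNORECASE)


def keep_pair(en: str, zh: str, max_ratio: float = 4.0) -> bool:
    """Return True if an (English, Chinese) sentence pair looks plausible."""
    en, zh = en.strip(), zh.strip()
    if not en or not zh:
        return False
    if en == zh:
        return False  # untranslated line copied to the target side
    if not CJK.search(zh):
        return False  # "Chinese" side has no Chinese characters
    if URL.search(zh) and not URL.search(en):
        return False  # likely an inserted ad or subtitle credit
    # Compare English word count against Chinese character count.
    en_len = len(en.split())
    zh_len = len(CJK.findall(zh))
    ratio = max(en_len, zh_len) / max(1, min(en_len, zh_len))
    return ratio <= max_ratio


pairs = [
    ("Hello, how are you?", "你好，你好吗？"),
    ("Hello", "Hello"),
    ("Thanks for watching", "更多字幕请访问 www.example.com"),
]
clean = [p for p in pairs if keep_pair(*p)]
print(clean)
```

The identical-text and no-CJK checks would also remove any untranslated English lines on the Chinese side, which is the kind of example that could teach a model to copy its input.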

pjfin123 · on Feb 6, 2021

You're right, there's surprisingly little data available for English-Chinese. I'm not sure why; it seems like there would be a lot of demand for translating between them.

As for the en-zh model copying its input, this is a known issue: https://github.com/argosopentech/argos-translate/issues/4

yorwba · on Feb 7, 2021

There's a lot of demand, but the people most likely to satisfy that demand for free are Chinese fan translators, and they're more likely to upload their work to Chinese sites where Western dataset collectors are unlikely to find them...