> Cambridge research clearly shows that most compilers can be tricked with Unicode into processing code in a different way than a reader would expect it to be processed.
Unless I misunderstand the premise, this is not right. The compiler is not "tricked" into doing anything different - it interprets the code the same way it always did. It's like saying the "rm" command "can be tricked into" deleting important files. The rm tool doesn't know which files are important to you, and the compiler doesn't - and shouldn't - know what you consider to be "correct" code. It will happily compile any code that is syntactically valid - if there are strings inside that look weird to you, that doesn't matter to the compiler.
The entity that can be "tricked" here is the reviewer of the code - who, indeed, might well be tricked into accepting code that does something different than they think it does (though it would require a very clever attacker for the code to both do something nefarious with Unicode and still look innocent and not weird to the reviewer). Fortunately, this is quite easy to fix - just don't accept any patches whose source code contains any non-ASCII outside a small set of localization resources (proper code would keep localizable resources outside the code anyway, tbh) and no Unicode would ever trick you.
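The gate described here can be sketched as a tiny pre-merge check. A minimal sketch, assuming (purely as an example) a hypothetical `locales/` directory is the allowlisted home for localization resources:

```python
def scan_for_non_ascii(files, allowed_prefixes=("locales/",)):
    """Flag non-ASCII characters in source files outside an allowlist.

    `files` is a list of (path, text) pairs; `allowed_prefixes` (here a
    hypothetical `locales/` directory) marks localization resources that
    are exempt from the check.
    """
    findings = []
    for path, text in files:
        if path.startswith(allowed_prefixes):
            continue  # localization resources may legitimately contain Unicode
        for line_no, line in enumerate(text.splitlines(), start=1):
            for ch in line:
                if ord(ch) > 127:
                    findings.append((path, line_no, ch))
    return findings

# The bidi override U+202E hiding in main.c is flagged;
# the legitimately localized string in locales/ is not.
files = [
    ("src/main.c", 'int x; // \u202e evil\n'),
    ("locales/de.po", 'msgstr "Gr\u00f6\u00dfe"\n'),
]
print(scan_for_non_ascii(files))
```

A real CI job would of course walk the repository and diff against the patch, but the policy itself is this simple.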
> Fortunately, this is quite easy to fix - just don't accept any patches whose source code contains any non-ASCII outside a small set of localization resources
There are plenty of projects out there written by people who aren't English speakers who depend on the Unicode capabilities of languages to write code that is actually readable to them. Turning that off is far from a solution.
Does anyone actually do that in production code?
I myself am not a native English speaker and use Unicode when writing in my mother tongue, but in 20+ years of programming I've never seen anyone use non-ASCII chars in professionally written code. Of course, you use your language in localization files, and perhaps in comments occasionally - especially in TODOs that aren't meant to be permanent - but not in the actual code, e.g. for variable or function names.
I'd actually consider it a bad idea, as it significantly limits who can maintain that code in the future.
It's a very western / Anglosphere attitude, and I think you underestimate how much code is produced in e.g. China and Japan nowadays, with comments in their native language.
How would you name a FooBarWicket if you don't speak a word of English?
I mean don't get me wrong, ideally everybody writes code in perfect English and sticks to a set of ~50 ascii characters, but it's not an ideal world and you have to keep other languages and cultures in mind.
I would argue that even if you decide that you are using some other language and not English, there is only a well-defined subset of Unicode characters that should ever be allowed in the codebase. Bidi override control characters are clearly not among them, whichever language you choose.
> Bidi override control characters are clearly not among them, whichever language you choose.
Not sure how you would write a comment in an RTL human language in the middle of LTR code without it. Lots of people learn RTL languages well before writing any code.
What compilers can do is to process those characters and assign them semantic value that makes the code equivalent to what is expected to be rendered.
Now, bidi overrides in identifier names are a nightmare I’d prefer to avoid.
You do not actually need the bidi override control character to put a comment in an RTL language in the middle of LTR code.
You only need it if you are doing this and the default Unicode algorithm for guessing LTR/RTL boundaries gets it wrong, so you need to override it with an explicit bidi control. I'm not even sure how feasible that is in the editor/IDE environments that developers with this use case might actually use.
I am genuinely curious how often these sorts of situations come up in actual development.
> What compilers can do is to process those characters and assign them semantic value that makes the code equivalent to what is expected to be rendered.
I don't understand what you mean or how that's even possible, for the kinds of attacks discussed in OP.
Btw here's proof. Here is ltr text and rtl עִברִית text عربي
interspersed with no bidi override control characters to be found.
Unicode can handle this; it has a heuristic algorithm for it. Note how, if you try to select the text character by character, your selection does funny things at the RTL-to-LTR boundaries, because the byte order doesn't match the order on the screen. It really is handling the directionality changes, with the letters entered in logical order across the changes - there is no funny entry or reordering going on. This is plain old normal Unicode handling interspersed directionality changes just fine, with no bidi overrides.
It just sometimes gets it wrong for the intent of the author. Especially when there are characters at the boundaries that are themselves not strongly associated as rtl or ltr (like ordinary "western arabic numerals" or punctuation). That's what the bidi override control char is for.
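The heuristic described above is driven by per-character bidi classes, which Python exposes via the standard `unicodedata` module. A quick way to see which characters are "strong" and which are "weak":

```python
import unicodedata

def bidi_classes(s):
    """Map each character to its Unicode bidirectional class.

    "L" is strong left-to-right, "R"/"AL" are strong right-to-left;
    digits ("EN") and most punctuation are weak or neutral, which is
    exactly where the heuristic can guess the author's intent wrong.
    """
    return [unicodedata.bidirectional(ch) for ch in s]

# Latin 'a' is strongly LTR, Hebrew alef is strongly RTL, the digit is weak:
print(bidi_classes("a\u05d01"))  # → ['L', 'R', 'EN']
```

The weak "EN" class in the middle of an RTL run is the textbook case where an explicit bidi control is needed to pin down the intended order.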
The same way as you write a comment in a LTR human language in the middle of RTL code - you don't. You stick to either LTR or RTL. This is code, not prose.
> there is only a well-defined subset of Unicode characters that should ever be allowed in the codebase
It's not even remotely well-defined, and probably never will be. Also, as long as we keep adding to Unicode, you will need to keep your whitelist of code points updated.
You can however find a well-defined subset of characters that can be allowed.
In either case you'd be essentially excluding entire languages.
>> There is only ... that should ever be allowed...
What I am saying is that if someone decides to code in a non-English language (which is completely reasonable), they should define a subset of Unicode characters that is acceptable. Additionally, the allowed characters should not permit tricks like these.
As for excluding entire languages... well, yes. This is already the case today. But OTOH it's not like understanding what "if" means gives you any special advantage in programming.
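A sketch of what such a policy check could look like, assuming (purely as an example) the team picked ASCII plus the Hebrew block U+0590-U+05FF; the `allowed` helper is made up for illustration:

```python
import unicodedata

# Hypothetical policy: ASCII plus one explicitly chosen block (here the
# Hebrew block, as an example assumption), with all control and format
# characters rejected unconditionally. The bidi overrides fall in the
# Unicode "Cf" (format) category, so they can never sneak through.
ALLOWED_RANGES = ((0x0590, 0x05FF),)

def allowed(ch):
    if unicodedata.category(ch) in ("Cc", "Cf"):
        return False  # controls and format chars, incl. bidi overrides
    if ord(ch) < 128:
        return True   # printable ASCII
    return any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES)

print(allowed("\u05d0"))  # Hebrew alef: inside the chosen block
print(allowed("\u202e"))  # RLO bidi override: category "Cf", rejected
```

The key design point is that the reject list is category-based rather than enumerated, so new Unicode versions don't silently widen the hole.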
Well, what you call an Anglosphere attitude is a reality of learning in a majority of non-English-speaking countries: there are simply not enough resources for learning in your own language.
China is huge so I can see how it could work for them, but I still have to admit it's very hard for me to imagine someone becoming, say, a competent web dev without picking up at least some basic English along the way, so they can handle at least the documentation and stay in the loop on new tech coming out all the time. It's not anything new as a concept, nor do I see it as damaging to local cultures in any way - back in my university days I taught myself some Russian so that I could read their physics and chemistry books, which were excellent and way cheaper and easier for me to get than those from the West. One day I'll have no problem learning some Chinese if (or more likely when?) they become the reference source of knowledge.
> China is huge so I can see how it could work for them, but I still have to admit it's very hard for me to imagine someone becoming, say, a competent web dev without picking up at least some basic English along the way,
Having worked with some large software teams in China my experience was that most people could speak a bit of English (but generally didn't want to) and were nowhere near at the level needed to actually design and write software in English.
If we forced them to do everything in English, quality was terrible and everything took ages, but if we let them write in Mandarin things were much better.
> it's very hard for me to imagine someone becoming, say, a competent web dev without picking up at least some basic English along the way, so they can handle at least the documentation and stay in the loop on new tech coming out all the time.
Why would they need to learn English to do those things? I'm sure there are Chinese-language tech news sites, and Chinese-language documentation.
When you code for yourself, write what you want. If you write to collaborate then use English/ASCII.
Imagine international aviation if they allowed the same BS that people in IT allow and now even try to promote - everyone talking their own language and not understanding each other - we would have planes colliding and crashing all over the place.
Agreed, but I'm still curious (and don't know the answer) how often someone actually needs to put a "Bidi override" in a comment... if I were a language designer I'd be tempted to just say they aren't allowed in comments or identifiers or anywhere but string literals/data, and have the compiler/interpreter just reject it.
(I have used a bidi override before myself, for non-malicious purposes!)
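The "reject bidi controls everywhere but string literals" rule can be sketched using Python's own tokenizer as a stand-in for a compiler front end; the function name `bidi_outside_strings` is made up for illustration:

```python
import io
import tokenize

# The nine bidi control characters discussed in the Trojan Source paper:
# LRE, RLE, PDF, LRO, RLO and the isolate controls LRI, RLI, FSI, PDI.
BIDI = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

def bidi_outside_strings(source):
    """Return (row, col) of bidi controls found anywhere except string literals."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING and BIDI & set(tok.string):
            hits.append(tok.start)
    return hits

# The control inside the string literal passes; the one in the comment is flagged.
src = 's = "\u202eok"  # \u202e hidden\n'
print(bidi_outside_strings(src))
```

A real compiler would likely also restrict what's allowed inside literals (e.g. require escape sequences, as rustc now suggests), but even this coarse split kills the comment- and identifier-based variants of the attack.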
I'm not sure what they are being posted as examples of there. Can a bidi char in a string literal be successfully used in the sort of attack in the OP, a "Trojan Source" attack?
If so, that is devious!
It's not clear to me whether those examples show that, though. They show bidi characters being highlighted in a string literal, right?
My hypothesis was that such could not be part of a "trojan source" attack... but this stuff is confusing and I could have it wrong?
> How would you name a FooBarWicket if you don't speak a word of English?
How would you learn how to make a FooBarWicket without knowing a word of English? Any programming language's control constructs are almost by definition English.
I still wonder though, just how much production non-comment source code is not written in the ASCII character set.
The libraries of most programming languages (developed in the west) are in ASCII - frameworks and middleware too. Have people in countries like Japan and China actually translated all of that code - renaming functions, classes, and variable names to their native tongue in Unicode - or do they just learn the English names (they are all nouns/pronouns and at most simple phrases so translation should not be too difficult; they don’t have to understand English grammar).
Microsoft translated all the commands in Excel's scripting language into local languages, making it totally impossible to use for anyone. You can't even google it, because the help is so split up between different languages.
> Does anyone actually do that in production code?
Would you accept teaching code as production code? Specifically, if you were teaching programming to young non-English speakers, wouldn't you let them use words from their native tongue for variables and such?
> I'd actually consider it a bad idea, as it significantly limits who can maintain that code in the future.
Wouldn't you say that solely using Roman letters in code imposes a similar limit? In countries where those letters are seldom used, they are about as foreign as Greek letters are in Western countries, so only those accustomed to them would be able to handle the code (as has been the case until perhaps the last decade).
I can attest that it happens, even in (natural) languages that use Latin scripts. Sure, "just use en.US-ASCII" is a mitigation, and most (Euroamerican) code follows this; the bug extends to string literals however ("they don't end where you see them // this is actually not part of the string; return;"), so a different approach is needed.
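One cheap reviewer-side defence against that string-literal variant is to re-render suspect lines with escape sequences, so the invisible controls become visible. A sketch with a contrived `access` line (made up for illustration, in the spirit of the attack):

```python
# Contrived line in the spirit of the attack: the RLO control (U+202E)
# hides inside the literal and reorders how the rest of the line renders,
# while the byte order - what the compiler sees - is unchanged.
line = 'access = "user\u202e"  check_admin()'

# Re-render with escapes: the control character can no longer hide.
print(ascii(line))
```

`ascii()` escapes every non-ASCII character, so the `\u202e` shows up in plain sight; diff tools and code hosts that highlight such characters apply the same idea.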
Professionally made GUI software needs Unicode even when English localized, for typography.
Proper quotes, proper dashes (ASCII doesn't have a dash character, it only has minus), non-breakable space, soft hyphen, € character, Greek letters like π and μ, etc.
Internationalization is not limited to putting strings into a resource table. It also needs a non-trivial amount of code. Printing numbers into strings is code, not data. Yet if you want the numbers to look good, like "600 μm" or "6×10⁻⁴ meters", you're going to have Unicode in the code, not in the resources.
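A minimal sketch of the point, in Python; the helper name `pretty_sci` is made up, but notice that the Unicode lives in code, not in a resource table:

```python
# Superscript digits and the superscript minus, so an exponent like -4
# renders as the typographically correct "⁻⁴" rather than "e-04".
SUPERSCRIPTS = str.maketrans("0123456789-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁻")

def pretty_sci(mantissa, exponent, unit):
    """Format a value in scientific notation, e.g. 6×10⁻⁴ meters."""
    return f"{mantissa}\u00d710{str(exponent).translate(SUPERSCRIPTS)} {unit}"

print(pretty_sci(6, -4, "meters"))  # → 6×10⁻⁴ meters
print("600 \u00b5m")                # micro sign, not an ASCII 'u'
```

The multiplication sign (U+00D7), the superscript run, and the micro sign are all non-ASCII characters embedded directly in the formatting logic.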
Another thing: not every piece of software needs i18n. It depends on the market. I've yet to see a C++ compiler that localizes its output messages.
I've definitely seen it done, in both code I was adjacent to and code I was pulling from outside. I have vivid memories of stumbling on a lib doing seemingly what I needed but with all comments in Chinese and variables/funcs in Pinyin.
Can you give an example? I've never seen a project (outside domains like APL, etc.) that seriously relied on any Unicode capabilities in the code itself (again, I am not talking about localized strings). My native language is not English, and I've worked with people all over Europe, China, India, Japan, Israel, etc. - there are a lot of exciting i18n/l10n problems, but I have never seen much that a compiler would need to be concerned with.
> The rm tool doesn't know which files are important to you, and the compiler doesn't - and shouldn't - know what you consider to be "correct" code.
This is actually no longer true. Many rm implementations today prevent you from recursively deleting the root directory unless you explicitly specify `--no-preserve-root`. Similarly, a lot of compilers warn you or outright stop if they detect code that is very likely buggy - the Rust compiler's warning about these control characters is just the latest example.
Of course, in theory, each tool should do its job and the user should be the one who knows what's right. In practice, though, these heuristics catch bugs-to-be 95% of the time (at least in my experience) and are easily disabled otherwise, so they are good to have.
I couldn't care less about my root directory. The only things I care about are the motherboard firmware and the /home directory, and nothing prevents `rm` from deleting those.
The `--one-file-system` or `--preserve-root=all` flags are more useful than `--preserve-root`, but they're not defaults. (For a good reason: compatibility.)
You argue away your own fix.
The proposed fix is as if rm were limited to files outside of /sys - plenty of projects depend on the standardized behavior.