
As brazzy said, there's no such thing as extended ASCII. There's just a huge number of ASCII-compatible eight-bit encodings. The original IBM (and DOS) character set, hardwired into ROM, is the one you're thinking of, and went by various names such as "Personal Computer, MS-DOS United States, MS-DOS Latin US, OEM United States, DOS Extended ASCII (United States), PC-ASCII" [1].

DOS 3.3, in 1987, was the first version to support localized character sets, via a system of "code pages". You'd select an encoding/"character set" that suited your language in AUTOEXEC.BAT – or just use the default 437 if you were a US user and never had to worry about these things. For me, the most relevant code page was 850, aka "OEM Multilingual Latin 1" (not at all the same as ISO/IEC 8859-1, which is also known as "Latin 1").
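To make the difference concrete, here's a minimal sketch that decodes one byte under the three encodings mentioned above. The codec names are the ones Python's standard library ships; the byte 0xB5 and the helper name `decode_under` are just my picks for illustration.

```python
# Decode the same byte under three ASCII-compatible eight-bit encodings.
# 0xB5 is an arbitrary example byte from the upper half of the range.
RAW = bytes([0xB5])


def decode_under(raw: bytes, codecs=("cp437", "cp850", "latin-1")) -> dict:
    """Return {codec name: decoded text} for the same raw bytes."""
    return {name: raw.decode(name) for name in codecs}


if __name__ == "__main__":
    for name, text in decode_under(RAW).items():
        print(f"{name:8} 0x{RAW[0]:02X} -> U+{ord(text):04X} {text}")
```

Bytes below 0x80 come out identical under all three, which is what "ASCII-compatible" means here; only the upper half differs.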

Why the apparently arbitrary numbers, I'm not sure, but Claude and ChatGPT both claim the codes were simply drawn from a more general-purpose sequence of product numbers used at IBM at the time.

This application, like other similar ones, uses Unicode box drawing characters that now all reside comfortably out of the eight-bit range.
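A quick way to see this, as a rough sketch: the characters in `BOX` are just a sample I chose (not necessarily the ones this application uses), and `describe` is a hypothetical helper.

```python
# A handful of light box-drawing characters, chosen for illustration only.
BOX = "┌─┐│└┘"


def describe(ch: str) -> tuple:
    """Return (Unicode code point, CP437 byte value, UTF-8 length in bytes)."""
    return ord(ch), ch.encode("cp437")[0], len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in BOX:
        code_point, cp437_byte, utf8_len = describe(ch)
        print(f"{ch} U+{code_point:04X}  cp437=0x{cp437_byte:02X}  utf-8 bytes={utf8_len}")
```

Each of these fits in a single byte under code page 437 but has a Unicode code point well above 0xFF, so it takes several bytes in UTF-8.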

[1] https://www.aivosto.com/articles/charsets-codepages-dos.html




> Why the apparently arbitrary numbers, I'm not sure, but Claude and ChatGPT both claim the codes were simply drawn from a more general-purpose sequence of product numbers used at IBM at the time.

Claude and ChatGPT are (probably) wrong. Wikipedia has three citations for the following statement:

> Originally, the code page numbers referred to the page numbers in the IBM standard character set manual

The numbers are so high because code pages were assigned to EBCDIC first.


Yeah, I later found that quote on Wikipedia too. Though I'm not sure the cited source is especially reliable either; it may just be folklore ("Oh, 'code page' refers to actual deadtree pages"). All the IBM documentation I could find showed big gaps in the sequence of code pages.

But I just now found the list at [1]; I don't know why I didn't notice it before. It's certainly comprehensive! Some real detective work must have gone into compiling that list. The gaps are much smaller, though they still exist, e.g. from 40 to 251. The 300s are rather sparse, there are only a few 4xx codes, and then there's a jump from 500 to 8xx (with some 7xx assigned later, I think).

In any case, I agree that the LLMs seem to have hallucinated the "more general sequence" part. The code page IDs, or more formally CCSIDs, were always a specific set of 16-bit ID numbers. Why exactly the various gaps exist is probably lost to history by now, if there ever were any particular reasons.

[1] https://en.wikipedia.org/wiki/Code_page



