Replies

But it’s worth pointing out that, without the BOM, UTF-8 isn’t necessarily obvious from the byte data alone.

UTF-8 doesn't need a BOM! It's just bytes! The beauty of it is, you can just send bytes, and the result is perfectly fine ASCII, as long as you don't send anything outside 7-bit ASCII (IOW, confine yourself to straight quotes and hyphens). But if you want curly quotes and em-dashes, and umlauts and Cyrillic and Chinese and Korean, it all fits into UTF-8. The non-ASCII characters are just sequences of two to four bytes, every one of them in the 80-F4 range. And today's browsers handle UTF-8 just fine, as long as they are not second-guessed by a broken server!
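To make that concrete, here is a minimal Python sketch. The helper name `high_bytes` and the sample strings are made up for illustration; the only real API used is `str.encode`. It shows that plain ASCII text comes out byte-for-byte identical under UTF-8, while curly quotes and an em-dash turn into bytes at 0x80 or above.

```python
def high_bytes(text: str) -> list[int]:
    """Return the UTF-8 bytes of `text` that fall outside 7-bit ASCII."""
    return [b for b in text.encode("utf-8") if b >= 0x80]


# Hypothetical sample strings, not taken from any real page.
plain = 'He said "hi" - twice'
fancy = "He said \u201chi\u201d \u2014 twice"  # curly quotes and an em-dash

# Pure ASCII text encodes to the very same bytes under ASCII and UTF-8.
assert plain.encode("utf-8") == plain.encode("ascii")

print(high_bytes(plain))                    # nothing outside 7-bit ASCII
print([hex(b) for b in high_bytes(fancy)])  # every byte here is >= 0x80
```

Note that none of the high bytes can be mistaken for an ASCII character, which is why an ASCII-only consumer mangles the fancy characters but leaves the rest of the text alone.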

FR pages are all declared to be UTF-8 in their response headers. No need for a BOM (sticking one on UTF-8 is a stupid MS habit). Characters are eight bits, except when they are not. That's the beauty of UTF-8! A million characters of US ASCII is a megabyte. A million characters of Cyrillic is two megabytes (hey! they lost the Cold War!). A million curly quotes and em-dashes is three megabytes of FR teeth gnashing, LOL!
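The size arithmetic is easy to check. A rough Python sketch, where `utf8_size` and the sample characters are my own picks for illustration:

```python
def utf8_size(text: str) -> int:
    """Number of bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# One representative character per script (illustrative choices).
samples = {
    "ASCII letter": "A",
    "Cyrillic letter": "\u0416",
    "curly quote": "\u201c",
    "em-dash": "\u2014",
    "Chinese character": "\u4e2d",
    "Korean syllable": "\ud55c",
}

for label, ch in samples.items():
    print(f"{label}: {utf8_size(ch)} byte(s)")

# Scale it up: the same character count, very different byte counts.
million = 1_000_000
print(utf8_size("A" * million), utf8_size("\u0416" * million), utf8_size("\u201c" * million))
```

Common Chinese and Korean characters land in the three-byte group along with the curly quotes; only characters beyond the Basic Multilingual Plane (most emoji, for instance) need four bytes.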

You’re right, I lapsed. The BOM is there to designate byte order, which UTF-8 doesn't have. Duh...

Still, how does one know whether the stream is ANSI or UTF-8 until things get weird? At least with UTF-16 and up, reading the stream as ANSI dies after the first character.
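One common heuristic, sketched here in Python, is to try a strict UTF-8 decode and see whether it fails. This is only a guess, not a guarantee: the function name `guess_encoding` is made up, and I am assuming "ANSI" means Windows-1252 (`cp1252`), which may not match your data. A failed decode proves the bytes are not UTF-8; a successful one is strong evidence but not proof, since some short ANSI byte runs happen to be valid UTF-8.

```python
def guess_encoding(data: bytes) -> str:
    """Heuristic guess: 'ascii', 'utf-8', or the assumed fallback 'cp1252'."""
    if all(b < 0x80 for b in data):
        return "ascii"   # 7-bit only: identical bytes under ASCII, ANSI and UTF-8
    try:
        data.decode("utf-8")   # strict mode raises on any malformed sequence
    except UnicodeDecodeError:
        return "cp1252"  # assumption: anything that is not UTF-8 is Windows-1252
    return "utf-8"


word = "caf\u00e9"  # "cafe" with an acute accent on the e
print(guess_encoding(b"plain text"))
print(guess_encoding(word.encode("utf-8")))
print(guess_encoding(word.encode("cp1252")))
```

The check works as well as it does because UTF-8 multi-byte sequences follow a strict lead-byte and continuation-byte pattern, and accented ANSI text rarely matches that pattern by accident.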