Free Republic
Browse · Search
News/Activism
Topics · Post Article

To: Gene Eric
But it’s worth pointing out that UTF-8 isn’t necessarily obvious within the byte data excluding the BOM.

UTF-8 doesn't need a BOM! It's just bytes! The beauty of it is, you can just send bytes, and the result will be perfectly fine ASCII, as long as you don't send anything outside 7-bit ASCII (IOW, confine yourself to straight quotes and hyphens). But if you want curly quotes and em-dashes, and umlauts and Cyrillic and Chinese and Korean, it all fits into UTF-8. Those characters are just sequences of two to four bytes, every one of them with the high bit set (80-F4; valid UTF-8 never uses C0, C1, or F5-FF). And today's browsers handle UTF-8 just fine, as long as they are not second-guessed by a broken server!
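
A quick way to check the claim for yourself, as a minimal Python sketch. The helper name utf8_hex and the sample characters are only illustrative picks, not anything from FR's software:

```python
# Pure 7-bit ASCII text produces identical bytes whether encoded as ASCII or UTF-8.
plain = 'straight "quotes" and hyphens - nothing fancy'
assert plain.encode("utf-8") == plain.encode("ascii")


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


# Characters written as escapes so the source file itself stays pure ASCII.
samples = {
    "u-umlaut": "\u00fc",
    "Cyrillic Zhe": "\u0416",
    "left curly quote": "\u201c",
    "em dash": "\u2014",
    "Chinese 'middle'": "\u4e2d",
    "Korean 'han'": "\ud55c",
}

for name, ch in samples.items():
    print(f"{name}: {utf8_hex(ch)}")

# Every byte of every non-ASCII character has the high bit set.
assert all(b >= 0x80 for ch in samples.values() for b in ch.encode("utf-8"))
```

The umlaut and the Cyrillic letter come out as two bytes each; the curly quote, the dash, and the Chinese and Korean characters take three.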

FR pages are all declared to be UTF-8 in their response headers. No need for a BOM (stupid MS concept). Characters are eight bits, except when they are not. That's the beauty of UTF-8! A million characters of US ASCII is a megabyte. A million characters of Cyrillic is two megabytes (hey! they lost the Cold War!). A million curly quotes and em-dashes is three megabytes of FR teeth gnashing, LOL!
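
The size arithmetic is easy to confirm. A small sketch, where utf8_size is just a hypothetical helper name and "a megabyte" is taken loosely as a million bytes:

```python
def utf8_size(text: str) -> int:
    """Number of bytes text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


n = 1_000_000  # one million characters of each kind

print(utf8_size("a" * n))        # ASCII letter: 1 byte each -> 1000000
print(utf8_size("\u0416" * n))   # Cyrillic letter: 2 bytes each -> 2000000
print(utf8_size("\u201c" * n))   # left curly quote: 3 bytes each -> 3000000
print(utf8_size("\u2014" * n))   # em dash: 3 bytes each -> 3000000
```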

55 posted on 01/19/2016 2:36:19 AM PST by cynwoody


To: cynwoody

You’re right, I lapsed. The BOM helps to designate byte order — duh...

Still, how does one know whether the stream is ANSI or UTF-8 until things get weird? At least with UTF-16+, the ANSI stream dies after the first character.
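
One common heuristic for that question, sketched in Python: try a strict UTF-8 decode and see whether it fails. This is only a guess-maker, not a guarantee. The function name guess_encoding is made up for illustration, and "ANSI" is assumed here to mean a legacy 8-bit code page such as Windows-1252:

```python
def guess_encoding(data: bytes) -> str:
    """Rough guess at how data is encoded, based only on its bytes."""
    # An explicit UTF-8 signature (BOM) settles it when present.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8 (with BOM)"
    # Pure 7-bit data reads the same as ASCII, UTF-8, or most 8-bit code pages.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    # Strict decoding rejects stray high bytes, so legacy 8-bit text
    # with accented or "smart" characters usually fails here.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "not utf-8 (possibly a legacy 8-bit code page)"


quoted = "\u201chi\u201d"  # "hi" wrapped in curly quotes
print(guess_encoding(b"hello"))
print(guess_encoding(quoted.encode("utf-8")))
print(guess_encoding(quoted.encode("cp1252")))
```

A short legacy-encoded string can still happen to be valid UTF-8, so a successful decode is strong evidence rather than proof. That is why a declared charset in the response headers beats sniffing.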


56 posted on 01/19/2016 2:41:44 AM PST by Gene Eric (Don't be a statist!)


FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson