Replies

On the front page, straight-up curl. I also tried wget. Same - it’s UTF-8 all the way.

I’m not seeing entities at all.

No, wait, you have nailed this. Some dumbass thing is marking up UTF-8 to entities. I see entities when I view the thread. The code trying to make safe HTML is messing up because it doesn’t understand UTF-8.

Nice work, sir or madam.

That’s the diagnosis.

The code trying to make safe HTML is messing up because it doesn’t understand UTF-8.

That's as good a guess as any, as to how the bug got introduced.

Once you understand the source of the problem, it's not too hard to make client-side fixes (see the links in #15).

To display a clean page, you need to scan the JavaScript representation of the page's text for cases where the UTF-8 is getting byte-by-byte entified. Solution: a regular expression that finds the entifications and feeds them to a function that reverses them back to the original code point. Apply it to every text snippet in the article, and update any snippet that underwent a change. Decrudified! Done!
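For the curious, here is a minimal sketch of that decoding step. It is not the code from the extensions in #15. It assumes the crud shows up as decimal numeric entities, one per UTF-8 byte (so "é" arrives as "&#195;&#169;"); named entities such as "&Atilde;" would need an extra lookup table. The names decrudify and BYTE_ENTITY_RUN are made up for illustration.

```javascript
// Hypothetical helper, not taken from the extensions linked in #15.
// Assumption: each UTF-8 byte was emitted as a decimal numeric entity
// in the range 128-255, e.g. "é" (bytes C3 A9) became "&#195;&#169;".

// Matches one or more consecutive entities whose value is 128..255.
const BYTE_ENTITY_RUN = /(?:&#(?:12[89]|1[3-9]\d|2[0-4]\d|25[0-5]);)+/g;

function decrudify(text) {
  // fatal: true makes decode() throw on byte runs that are not valid UTF-8.
  const decoder = new TextDecoder("utf-8", { fatal: true });
  return text.replace(BYTE_ENTITY_RUN, (run) => {
    // Pull the numeric values out of the run and treat them as raw bytes.
    const bytes = Uint8Array.from(run.match(/\d+/g), Number);
    try {
      return decoder.decode(bytes);
    } catch {
      // Not a valid UTF-8 sequence (e.g. a lone "&#233;"): leave it alone.
      return run;
    }
  });
}
```

In a page script you would run this over each text snippet and write the result back only when it differs from the input. One caveat: a run of genuine Latin-1 entities that happens to form valid UTF-8 would also be converted, so treat this as a heuristic.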

To post clean input, you need to entify any characters outside 7-bit ASCII that may be present. And, on the Preview side, you need to do the opposite, so that the user is not buried in strange entities while posting, say, Uncle Volodya's full name (Влади́мир Влади́мирович Пу́тин) in the original Cyrillic.
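Roughly, the posting side could look like the sketch below. Again, this is an illustration and not the extensions' actual code: it assumes decimal numeric entities are an acceptable encoding for the site, and the function names entitize and deentitize are invented here.

```javascript
// Hypothetical helpers, not taken from the extensions linked in #15.

// Before posting: replace every code point above 7-bit ASCII with a
// decimal numeric entity. Array.from iterates by code point, so
// characters outside the BMP are not split into surrogate halves.
function entitize(text) {
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    return cp > 0x7f ? `&#${cp};` : ch;
  }).join("");
}

// For the Preview: turn those entities back into characters. Only values
// above 127 are decoded, so markup-significant entities such as "&#60;"
// stay escaped.
function deentitize(text) {
  return text.replace(/&#(\d+);/g, (match, digits) => {
    const cp = Number(digits);
    return cp > 0x7f && cp <= 0x10ffff ? String.fromCodePoint(cp) : match;
  });
}
```

With these, entitize("café") gives "caf&#233;", and deentitize reverses it for display.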

The extensions at the links in #15 take care of all that.