It's slippery. E.g., View Page Source will give different results from View Selection Source. The actual page, downloaded using a non-browser such as wget, may show entities, whereas the likes of View Selection Source or cut and paste into your favorite hex dumper will show clean UTF-8.
If I copy from the browser window, and paste through xxd, I see UTF-8. But, if I look at the actual HTML, I see entities. That is the key to the problem.
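For anyone following along, here is a minimal Python sketch of why the two views disagree. The `raw_html` string is a made-up stand-in for what the server sends, not the actual page. The browser decodes entities before you copy text out of the window, so a hex dump of the clipboard shows UTF-8 bytes even though the markup itself contains entities.

```python
import html

# Hypothetical fragment of the markup as served (what wget or View Page Source shows).
raw_html = "caf&#233;"

# What the browser renders, and therefore what copy/paste or
# View Selection Source hands you: the entities are already decoded.
rendered = html.unescape(raw_html)

# Hex of the pasted text, roughly what you would see after piping it through xxd.
pasted_hex = rendered.encode("utf-8").hex()

print(raw_html)    # entities still visible
print(rendered)    # decoded character
print(pasted_hex)  # UTF-8 bytes of the decoded text
```

So looking at the pasted text can never tell you whether the served HTML used entities; only the raw download can.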
On the front page, with straight-up curl. I also tried wget. Same result: it’s UTF-8 all the way.
I’m not seeing entities at all.
No, wait, you have nailed this. Some dumbass thing is marking up UTF-8 to entities. I see entities when I view the thread. The code trying to make safe HTML is messing up because it doesn’t understand UTF-8.
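A rough Python sketch of one way a filter like that could go wrong. This is a guess at the failure mode, not the site's actual code, and both function names are hypothetical. It only illustrates the non-ASCII handling, not the rest of HTML sanitizing.

```python
def naive_escape(data: bytes) -> str:
    """Escape every non-ASCII *byte* on its own, as a filter unaware of UTF-8 might."""
    return "".join(chr(b) if b < 0x80 else f"&#{b};" for b in data)


def utf8_aware_escape(data: bytes) -> str:
    """Decode UTF-8 first, then escape each non-ASCII *character* as one entity."""
    text = data.decode("utf-8")
    return "".join(ch if ord(ch) < 0x80 else f"&#{ord(ch)};" for ch in text)


# A curly apostrophe (U+2019) is three bytes in UTF-8: E2 80 99.
sample = "it\u2019s".encode("utf-8")

print(naive_escape(sample))       # three entities, one per byte: garbage when rendered
print(utf8_aware_escape(sample))  # a single entity for the whole character
```

The byte-wise version turns one character into three unrelated entities, which is the kind of mess you get when the escaping step assumes one byte per character.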
Nice work, sir or madam.
That’s the diagnosis.
And the filter code only runs when comments are displayed, which makes sense, because on the main page everything is pre-filtered. For comments, you want to run a filter again in case some commenter is trying to be fancy with JS or something.
That’s why and what it is.
I feel like one of Dr House’s interns. Great analysis.