Free Republic
Browse · Search
General/Chat
Topics · Post Article

What's wrong with "smart quotes" and possibly how to fix it (Vanity)
me

Posted on 11/19/2015 10:34:21 PM PST by some tech guy

So, I'm still seeing smart quotes get mangled. There was a thread a little while ago where I said what I thought the problem was, but it's still happening.

Here's the problem: All the web-tier stuff is using UTF-8 (which is a great idea), but *something*, probably the database tier, is using CP-1252.

I suspect either:

a) Database table for posts on SQL Server is set to collate and/or store the windows-1252/CP-1252 character set.

or

b) something between the web level and the DB level is doing character set interpretation in an incorrect way. There's 1252 in there somewhere, and that's what's breaking the quotes.
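
To make the suspected failure mode concrete, here is a minimal Python sketch of what happens when UTF-8 bytes get decoded as CP-1252 somewhere along the chain. The sample character is illustrative only; nothing here is taken from FR's actual code.

```python
# Hypothetical illustration: a left curly quote stored as UTF-8,
# then wrongly decoded as CP-1252 by some layer in between.
original = "\u201c"                      # LEFT DOUBLE QUOTATION MARK
utf8_bytes = original.encode("utf-8")    # three bytes: e2 80 9c
misread = utf8_bytes.decode("cp1252")    # each byte becomes its own character

print(utf8_bytes.hex(" "))
print(misread)
```

One character goes in and three unrelated characters come out, which is the usual signature of this kind of mismatch.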


TOPICS: Computers/Internet; Focus Software
KEYWORDS:
To: some tech guy
The declared character encoding of FR pages is UTF-8.

Anything in the chain that conflicts with that, such as Windoze-1252 or ISO-8859-1 (a.k.a. Latin-1), can be expected to cause problems.

This is a recent problem. It started around the time FR was having problems getting banned by Google's malware filter, apparently due to an FR page that linked to a malware site. I would think the debugging process would start with the age-old question, What did I change last?

UTF-8 rules! The web should be UTF-8 end to end!

21 posted on 11/19/2015 11:34:31 PM PST by cynwoody
[ Post Reply | Private Reply | To 17 | View Replies]

To: cynwoody

Oh yeah! UTF-8 always and often!

I checked out your markup theory. On the main page, the smart quotes related to this thread:

http://www.freerepublic.com/focus/f-news/3362771/posts

are *not* &# or &something markup. They’re straight UTF-8, and work.

[redacted]-imac:~ [redacted]$ hexdump freep2.txt
0000000 e2 80 9d 0a
0000004
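
For reference, a short Python sketch of how those four bytes decode. It uses only the byte values shown in the hexdump above.

```python
import unicodedata

dumped = bytes.fromhex("e2 80 9d 0a")   # bytes shown in the hexdump
text = dumped.decode("utf-8")           # raises if the bytes are not valid UTF-8

for ch in text:
    # Control characters such as the trailing newline have no Unicode name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<control>"))
```

The first three bytes are a single character, U+201D (RIGHT DOUBLE QUOTATION MARK), and the fourth is the newline.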

But when you click the link, all the quotes are messed up.

I’ve a soft spot for 8859-1, but once you go UTF-8, you never go back.


22 posted on 11/19/2015 11:37:57 PM PST by some tech guy (Stop trying to help, Obama)
[ Post Reply | Private Reply | To 21 | View Replies]

To: some tech guy
The bit I find particularly interesting is that the smart quotes work on the front page, but break on the thread view.

I sometimes get a google warning when I click on threads, but never on the front page. I very frequently get the warning when I click on links to individual posts.

This started at about the same time as the smart quote problem. I assume they are caused by the same glitch?

23 posted on 11/19/2015 11:45:22 PM PST by Ken H
[ Post Reply | Private Reply | To 17 | View Replies]

To: Ken H

What’s the warning? It may be unrelated, but it also might give a hint as to the issue.


24 posted on 11/19/2015 11:47:03 PM PST by some tech guy (Stop trying to help, Obama)
[ Post Reply | Private Reply | To 23 | View Replies]

To: some tech guy
Warning: Something's Not Right Here!

www.freerepublic.com contains malware. Your computer might catch a virus if you visit this site.

Google has found malicious software may be installed onto your computer if you proceed. If you've visited this site in the past or you trust this site, it's possible that it has just recently been compromised by a hacker. You should not proceed, and perhaps try again tomorrow or go somewhere else.

We have already notified www.freerepublic.com that we found malware on the site. For more about the problems found on www.freerepublic.com, visit the Google Safe Browsing diagnostic page.

If you understand that visiting this site may harm your computer, proceed anyway.

25 posted on 11/19/2015 11:51:52 PM PST by Ken H
[ Post Reply | Private Reply | To 24 | View Replies]

To: some tech guy
Were you viewing the actual content of the HTML page or your browser's rendition of it?

It's slippery. E.g., View Page Source will give different results from View Selection Source. The actual page, downloaded using a non-browser such as wget, may show entities, whereas the likes of View Selection Source or cut and paste into your favorite hex dumper will show clean UTF-8.

If I copy from the browser window, and paste through xxd, I see UTF-8. But, if I look at the actual HTML, I see entities. That is the key to the problem.
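
One way to take the browser out of the picture is to classify the raw bytes of the downloaded page directly. A rough Python sketch follows; the two sample byte strings are made up for illustration, and `classify` is a hypothetical helper, not something from FR or the userscripts.

```python
import re

# Decimal character references in the 128-255 range, e.g. &#226;
HIGH_BYTE_ENTITY = re.compile(rb"&#(?:12[89]|1[3-9]\d|2[0-4]\d|25[0-5]);")

def classify(raw: bytes) -> str:
    """Report how non-ASCII text appears in raw (undecoded) page bytes."""
    if HIGH_BYTE_ENTITY.search(raw):
        return "entities"
    if any(b >= 0x80 for b in raw):
        return "raw non-ASCII bytes"
    return "plain ASCII"

# Sample inputs (assumed, not captured from the site).
front_page = "He said \u201chi\u201d".encode("utf-8")
thread_view = b"He said &#226;&#128;&#156;hi&#226;&#128;&#157;"

print(classify(front_page))
print(classify(thread_view))
```

Feeding it the output of wget or curl, rather than text copied out of a browser window, avoids the decoding step that hides the entities.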

26 posted on 11/19/2015 11:58:40 PM PST by cynwoody
[ Post Reply | Private Reply | To 22 | View Replies]

To: Ken H

Ah, our FRiend cynwoody knew about this:

“This is a recent problem. It started around the time FR was having problems getting banned by Google’s malware filter, apparently due to an FR page that linked to a malware site.”

Something changed and broke smart quotes.


27 posted on 11/19/2015 11:58:46 PM PST by some tech guy (Stop trying to help, Obama)
[ Post Reply | Private Reply | To 25 | View Replies]

To: some tech guy

Here’s a link to some of the threads. Looks like it began around Sep 27 => http://www.freerepublic.com/focus/search?s=+malware&ok=Search&q=quick&m=any&o=time&SX=564ea2876c724978a516af590ae170c44b88e569


28 posted on 11/20/2015 12:02:33 AM PST by Ken H
[ Post Reply | Private Reply | To 27 | View Replies]

To: cynwoody

On the front page, straight-up curl. I also tried wget. Same - it’s UTF-8 all the way.

I’m not seeing entities at all.

No, wait, you have nailed this. Some dumbass thing is marking up UTF-8 to entities. I see entities when I view the thread. The code trying to make safe HTML is messing up because it doesn’t understand UTF-8.

Nice work, sir or madam.

That’s the diagnosis.
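
A sketch of the kind of bug being described, in Python. `entify_bytewise` is a hypothetical stand-in for whatever the real filter does (its code isn't visible from outside), and it models only the non-ASCII handling. `html.escape` from the standard library is shown next to it as an example of an escaper that leaves non-ASCII text alone.

```python
import html

def entify_bytewise(text: str) -> str:
    """Hypothetical buggy filter: treats UTF-8 text as a string of single bytes."""
    out = []
    for b in text.encode("utf-8"):
        # ASCII passes through; every other byte becomes its own entity.
        out.append(chr(b) if b < 0x80 else f"&#{b};")
    return "".join(out)

post = "\u201cquoted\u201d <b>"

buggy = entify_bytewise(post)   # each curly quote turns into three entities
safe = html.escape(post)        # only &, <, > and straight quotes are escaped

print(buggy)
print(safe)
```

The point of the contrast is that making HTML safe only requires escaping a handful of ASCII characters, so a filter has no need to touch the bytes of a curly quote at all.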


29 posted on 11/20/2015 12:05:00 AM PST by some tech guy (Stop trying to help, Obama)
[ Post Reply | Private Reply | To 26 | View Replies]

To: cynwoody

and the filter code is only operating when comments are displayed, which makes sense, because on the main page everything is pre-filtered. For comments, you want to run a filter again in case some commenter is trying to be fancy with JS or something.

That’s why and what it is.

I feel like one of Dr House’s interns. Great analysis.


30 posted on 11/20/2015 12:17:54 AM PST by some tech guy (Stop trying to help, Obama)
[ Post Reply | Private Reply | To 26 | View Replies]

To: some tech guy
"The code trying to make safe HTML is messing up because it doesn’t understand UTF-8."

That's as good a guess as any, as to how the bug got introduced.

Once you understand the source of the problem, it's not too hard to make client side fixes (see the links in #15).

To display a clean page, you need to scan the JavaScript representation of the page's text for cases where the UTF-8 is getting byte-by-byte entified. Solution: a regular expression that finds the entifications and feeds them to a function that reverses them back to the original code point. Apply it to every text snippet in the article, and update any snippet that underwent a change. Decrudified! Done!
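
The idea can be sketched in Python as follows. This is not the userscript's code: it is a simplified illustration that works on raw markup (where the entities are still visible) rather than on the browser's JavaScript text, and the names `ENTITY_RUN` and `decrudify` are made up here.

```python
import re

# A run of two or more decimal references in the 128-255 range,
# i.e. candidate bytes of one or more multi-byte UTF-8 sequences.
ENTITY_RUN = re.compile(r"(?:&#(?:12[89]|1[3-9]\d|2[0-4]\d|25[0-5]);){2,}")

def _reverse(match: re.Match) -> str:
    """Turn a run of byte-valued entities back into the original characters."""
    raw = bytes(int(n) for n in re.findall(r"&#(\d+);", match.group(0)))
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return match.group(0)   # not valid UTF-8, so leave it untouched

def decrudify(markup: str) -> str:
    return ENTITY_RUN.sub(_reverse, markup)

print(decrudify("&#226;&#128;&#156;hi&#226;&#128;&#157;"))
print(decrudify("caf&#233;"))
```

The sketch is deliberately conservative: a lone entity such as `&#233;` is never touched, and a run that does not decode as valid UTF-8 is left exactly as it was.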

To post clean input, you need to entitize any characters outside 7-bit ASCII that may be present. And, on the Preview side, you need to do the opposite, so that the user is not buried in strange entities while posting, say, Uncle Volodya's full name (Влади́мир Влади́мирович Пу́тин) in the original Cyrillic.

The extensions at the links in #15 take care of all that.
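
For anyone curious what the posting-side workaround amounts to, here is a minimal Python sketch. It assumes the entities are decimal references to whole code points (not to individual bytes); `entitize` and `deentitize` are hypothetical names, not functions from the extensions.

```python
import re

def entitize(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference to its code point."""
    return "".join(ch if ord(ch) < 0x80 else f"&#{ord(ch)};" for ch in text)

def deentitize(text: str) -> str:
    """Inverse, for preview: turn decimal references back into characters."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

name = "Влади́мир"
wire = entitize(name)          # pure ASCII, safe to send through the filter

print(wire)
print(deentitize(wire) == name)
```

Because the wire form is pure ASCII, a filter that mishandles multi-byte UTF-8 has nothing left to mangle. Note that this simple `deentitize` also decodes references to ASCII characters, so it is only suitable for display, not for sanitizing.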

31 posted on 11/20/2015 12:22:50 AM PST by cynwoody
[ Post Reply | Private Reply | To 29 | View Replies]

To: cynwoody

Given that I’m server- and client-side, I’ll give my opinion (which is no more or less valid than yours)

Yes, you can fix it with JS, but that doesn’t address the root cause. I was *wrong* about the root cause and I don’t mind admitting that. The code and the db are solid.

There’s just a little bit of code which gets executed when displaying the thread with comments that screws everything up. People have reported that pasting text into the post box works, and that it’s fine on preview. It’s stored in the DB without problem. It shows on the main page just fine. UTF-8 FTW.

Something is filtering text on the display thread view. And that something only understands ASCII. That’s my bet. A $5 bet, if you’re interested ;)

Nobody should have to cleanse their input, everything should just work. That’s how it should be. In my opinion.


32 posted on 11/20/2015 12:30:38 AM PST by some tech guy (Stop trying to help, Obama)
[ Post Reply | Private Reply | To 31 | View Replies]

To: some tech guy

OK, quick test:

閪曬

I’m wrong again, that broke on preview.


33 posted on 11/20/2015 12:35:40 AM PST by some tech guy (Stop trying to help, Obama)
[ Post Reply | Private Reply | To 32 | View Replies]

To: some tech guy
Another useful clue is that, if you summon up old threads (prior to last October or whenever), there is no problem with the content. That indicates the bug is in whatever now processes new user input.

E.g., post straight quotes surrounding something. The server converts your non-HTML post to HTML, adding a <br> here or there, and converting your left and right straight quotes to curly quotes. Your post looks fine. You probably don't even notice what happened to your quotes.

Now somebody quotes your post. That involves selecting your now curly quotes and pasting them into some sort of reply. The bug intervenes and entitizes the three bytes of the UTF-8 representation of your opening and closing curlies. The result is a mess.
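
Here is a small Python sketch of why the result looks the way it does. The `stored` string is an assumed example of a byte-by-byte entitized quote, not something captured from FR. `html.unescape` is documented as following the HTML5 rules for character references, so it is used here as a rough stand-in for what a browser displays.

```python
import html

intended = "\u201cHello\u201d"   # what the poster meant: curly quotes around Hello

# Assumed stored form: each of the three UTF-8 bytes of every curly quote
# has been turned into its own decimal entity.
stored = "&#226;&#128;&#156;Hello&#226;&#128;&#157;"

shown = html.unescape(stored)    # each entity becomes a separate character

print(shown)
print(shown == intended)
```

Each entity is decoded on its own, so one curly quote becomes three separate characters. HTML5 also maps numeric references in the 128-159 range through a Windows-1252 table, which would explain why the garbage looks so much like a CP-1252 problem.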

"Nobody should have to cleanse their input, everything should just work. That’s how it should be. In my opinion."

Absolutely. UTF-8 end to end!

34 posted on 11/20/2015 12:46:25 AM PST by cynwoody
[ Post Reply | Private Reply | To 32 | View Replies]

To: cynwoody

閪曬


35 posted on 11/20/2015 12:48:38 AM PST by some tech guy (Stop trying to help, Obama)
[ Post Reply | Private Reply | To 34 | View Replies]

To: cynwoody

It’s not the preview (I just posted that previous comment without previewing).

Main-post filtering I think is different to comment-post filtering. *And* there’s some interaction that happens when you *view* a thread, rather than just reading the precis on the main page.

Heh, we’re nerding out on this so hard! If you’re anything like me, you’d just like to see it fixed. I love FR and the random characters just irritate me.

I think we got somewhere though.


36 posted on 11/20/2015 12:53:09 AM PST by some tech guy (Stop trying to help, Obama)
[ Post Reply | Private Reply | To 34 | View Replies]

To: some tech guy
Here's how that looks to me (screen shot, userscript installed):

I don't speak Chinese. What does it mean?

37 posted on 11/20/2015 12:54:21 AM PST by cynwoody
[ Post Reply | Private Reply | To 33 | View Replies]

To: cynwoody

I’m using the latest Firefox and that doesn’t work at all. Looks like your script does the business.

That’s something a bit rude in Cantonese.

So maybe it’s an easy fix on the FR side?


38 posted on 11/20/2015 12:56:14 AM PST by some tech guy (Stop trying to help, Obama)
[ Post Reply | Private Reply | To 37 | View Replies]

To: some tech guy
"So maybe it’s an easy fix on the FR side?"

As I posted, it starts with, what did we last change?

I'm guessing it's a fairly trivial error. It just needs to be addressed.

But it's nonetheless interesting that it can also be unraveled from the client side.

39 posted on 11/20/2015 1:03:41 AM PST by cynwoody
[ Post Reply | Private Reply | To 38 | View Replies]

To: some tech guy

Is it possible that there is a problem using magic quotes or similar on the code that is storing in utf8 and then some sort of caching that is causing the conversion outside of the database? I seem to remember WordPress having a similar issue when they tried an update in the 2.0s that was security related.


40 posted on 11/20/2015 1:35:59 AM PST by willyd (I for one welcome our NSA overlords)
[ Post Reply | Private Reply | To 22 | View Replies]


Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.

FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson