Posted on 10/31/2011 11:58:08 AM PDT by Danae
On Friday, October 21, 2011, this column exposed the scrubbing of Supreme Court cases from legal research website Justia.com. On the following Monday, October 24th, Justia founder and CEO Tim Stanley gave a very short response to Declan McCullagh at cnet.com about this scandal. (CNET is a tech-heavy website aimed more at developers than at the legal community.)
There Stanley asserted that citations in the 25 relevant cases (and more) were mangled due to a coding error. The code in question uses Regular Expressions, Regex for short. A regex is essentially a pattern-based filter: it will match, include, or exclude specific characters from a result. A result is what you see in an internet browser. Raw data is filtered through regex code and placed into its correct positions on a webpage in a template format.
The code error Stanley attributes the missing data to is a ".*" instead of a "\s".
"In this case, Stanley said, what happened is that Justia's programmers typed in ".*" (which matches any character) when creating a regex. It's now an "\s" (which matches only spaces),". - Declan McCullagh
This column investigates Tim Stanley's statements to CNET to assess their plausibility, by consulting a professional familiar with regex. Dr. David Hansen, PhD, a current university professor of computer science, explains what those two bits of code do.
(Excerpt) Read more at examiner.com ...
Anyway, dynamically generated code has to be rendered into a web page template. Rendering is more or less directing what data goes to which area; it filters what goes where. It makes some sense to believe that the “scrubbing” took place in that rendering layer; that way they would never need to change their database, just change the filter(s).
I have to ask, which Regex explanation are you wanting to hear? Everything I know of Regex says Stanley was..... incorrect.
So...
Can regex go renegade and do things it wasn’t told to do?
Any UNIX programmer worth his/her salt would NEVER make such a mistake.
In terms of missing text: when presenting a whole page of information located by regex, a failed pattern would invariably return NOTHING (the pattern didn't match), not a corrupted page. Nor would one usually want to use regex to return the opinion at all. One would locate the page with a search and a use of regex, then simply present the raw text from the DB using an ordinary SQL or Linq2SQL conduit. There would be no need for regex to return the actual record, and it would actually be cumbersome and VERY PRONE TO ERRORS to accomplish it that way. Regex finds, SQL returns, would be the sensible model, and the only model I would recommend.
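The "regex finds, SQL returns" model described above can be sketched in a few lines of Python with SQLite. The table name, columns, and citation pattern here are all assumptions for illustration:

```python
import re
import sqlite3

# Toy in-memory database standing in for the opinion store (schema is assumed)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE opinions (citation TEXT PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO opinions VALUES (?, ?)",
             ("88 U.S. 162", "Minor v. Happersett ... full opinion text ..."))

request = "show me 88 U.S. 162 please"

# The regex only LOCATES the citation in the request...
m = re.search(r"\d+ U\.S\. \d+", request)
if m:
    # ...while plain SQL returns the stored opinion text untouched.
    row = conn.execute("SELECT body FROM opinions WHERE citation = ?",
                       (m.group(0),)).fetchone()
    print(row[0])
```

The point of the split is that the regex never touches the opinion text itself, so a bad pattern yields a failed lookup rather than a silently corrupted page.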
A VANISHINGLY small chance of the missing-text explanation; ZERO chance of the inserting-text explanation.
Unless I am given the regex code outright, which I can analyze, and see if it is some sort of DB conduit -- which I doubt -- I am forced to call this a confirmed LIE.
To tutstar: Not really. If it goes at all renegade, it would usually be to return no results at all.
please see above and add your concur, or disagree.
First off, Stanley *did* say why they removed the history from Wayback. It was in the last two paragraphs of the CNET article.
As for why CNET doesn’t include Stanley giving the explanation I did, I can certainly speculate as to a few possibilities consistent with what we know:
- He didn’t know. He asked his programmers what happened, but didn’t grill them on the WHY behind the changes. If it was the WND article that got CNET’s attention (which seems probable), that was only a day before the CNET article. It’s not like he had time to undertake an extensive investigation.
- He didn’t say because CNET didn’t ask.
- He didn’t say because he didn’t think CNET would care. CNET is a tech website, and its primary interest in the story was tech-related. It’s not like this was a story at a legal website.
- He *did* say, but CNET didn’t include it. Again, not CNET’s primary interest in the story.
- For what it’s worth, he also did pretty much say “Nothing sinister - just a batch program update gone wrong.” He just didn’t expand on that extensively.
He did expand on it a little, because one thing he *did* say that’s perfectly consistent with my analysis is “It was just the U.S. Supreme Court cases, not the state, federal appellate and district court cases.” Changing reporter citations would indeed only apply to Supreme Court cases.
One thing that’s handy about my analysis is that it should operate awfully well as a scientific hypothesis. Based on the evidence we’ve seen from selected portions of five cases, I’ve made a prediction about what we could expect to see in the rest of those cases, as well as in the 20 other cases that Leo and Danae haven’t even NAMED yet. That’s quite the testable data set. Once they publish those screenshots (and if they were thorough researchers, they ought to have full grabs of the entire decisions, not just selected samples), we’ll see if my hypothesis checks out. If we consistently find pre-1875 citations affected in the same way (name and old citation replaced by hyperlinked new citation) but post-1875 citations unaffected, my hypothesis is validated. On the other hand, my hypothesis could easily be defeated if they post full screengrabs showing Minor and Slaughterhouse being affected, but other pre-1875 cases in the same decisions NOT being affected.
If it was happening to ALL pre-1875 cases in those 25 decisions, then that certainly deflates the argument that they were singling out Minor and Slaughterhouse. On the other hand, if their screengrabs show Minor and Slaughterhouse being affected but other pre-1875 cases NOT being affected, then my hypothesis has a problem.
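For concreteness, the kind of batch rewrite this hypothesis describes (old nominative-reporter citation swapped for the modern "U.S." citation) could look something like the following sketch. The mapping table and the pattern are assumptions for illustration, not Justia's actual code:

```python
import re

# Assumed mapping from an old Wallace-reporter citation to its modern form
OLD_TO_NEW = {"21 Wall. 162": "88 U.S. 162"}

def modernize(text):
    # "Wall." (Wallace's Reports) covers pre-1875 volumes; post-1875
    # citations already use "U.S." numbering and are left untouched.
    return re.sub(r"\d+ Wall\. \d+",
                  lambda m: OLD_TO_NEW.get(m.group(0), m.group(0)),
                  text)

s = "See Minor v. Happersett, 21 Wall. 162; cf. 100 U.S. 1."
print(modernize(s))  # -> "See Minor v. Happersett, 88 U.S. 162; cf. 100 U.S. 1."
```

A rewrite of this shape would, by construction, affect every pre-1875 reporter citation it encounters and none of the later ones, which is exactly the testable prediction stated above.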
Meanwhile, as we wait for those screengrabs to appear, you might be able to use your coding experience to draw some conclusions of your own. The CNET article includes the code that went wrong:
http://i.i.com.com/cnwk.1d/i/tim/2011/10/24/justia.png
Now I don’t understand a lick of that, but I do see that the difference in the two codes immediately follows some code that includes ‘volume’ and ‘U’ and ‘S’ and ‘page.’
What would that change result in? And would it be consistent with my hypothesis?
Thank you!!
“More to come”.
Keep ‘em on their toes.
Just checking in to the thread - any SPs? Or still crickets/wind whistling through the trees?
I notice on the Examiner site there is only one comment - the idiot Squeeky. Either really brain dead, or as I think, pretending stoopid.
Interesting post on your part. One mistake my post made was to make an assumption about the underlying codebase. I assumed a dotnet model; Perl might be so dramatically different as to render my analysis moot. In dotnet, you would index and find with regex and deliver with a SQL or ODBC conduit (unless you loved way-overcomplicating things and making them prone to fail). I am still unconvinced that parsing would happen in the regex; the code snippet does not show delivery of the found data. Of course, I am not a Perl guru. Perhaps John Robinson, who runs Free Republic on Perl, can add something to this discussion, as he is a Perl guru.
>Regex can get pretty tricky if you are not careful. I always try to test the results with an online regex tester and a series of test data, of both desired includes and desired excludes. I'm often surprised by what sneaks through in both cases.....
I generally hate regex; there’s always some odd case that totally breaks it, and a regex will break utterly if the structure changes and it’s anything more complex than “remove the whitespace” or “remove everything that’s not a digit.” Then there’s always the possibility that you’ll suddenly need (due to the client saying something) balanced parentheses...
I think I’d rather learn SNOBOL than spend the effort to gain any sort of ‘mastery’ over them.
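The balanced-parentheses point above is a classic: a plain regex can't track arbitrary nesting depth. A minimal illustration, with a hypothetical pattern:

```python
import re

# A naive "grab what's inside the parens" regex: it matches one level only.
flat = re.compile(r"\(([^()]*)\)")

print(flat.search("f(a, b)").group(1))     # "a, b" -- fine on flat input
print(flat.search("f(g(x), b)").group(1))  # "x" -- nesting silently mishandled
```

Handling arbitrarily nested parentheses needs either a recursive-regex extension (Perl/PCRE's `(?R)`) or a real parser; classic regular expressions simply cannot count nesting depth.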
That's for sure. The explanation for such selective disappearance of text, combined with the appearance of new text, combined with the timing, makes these Justia people look like absolute clowns.
The excuse is absolutely ludicrous, and I can't believe anyone could have said it with a straight face.
So, at least now we know that:
Someone should set up an alternate website, one which can actually be trusted.
sfl
Just getting home from trick-or-treat. Tired head. Tired feet. Much research to do ... Back later for discussion. Good points to consider and test.
Hey Laz, that’s great information, but the big question is: would you still hit it?
Think about this for a moment... Stanley says that what happened was due to “errors”. He is calling this deliberate act an error. So when he says he blocked the Wayback Machine because of “errors,” he means he blocked access to the Wayback Machine because of the deliberately corrupted cases.
It really is just as simple as that.
If nothing else, people who depended on that resource for schoolwork, research, or legal cases may have relied on missing or otherwise incorrect case texts, and they need to know whether their work was affected.
This was just wrong. If it was an accident, then say so and show why. Own up to it and come completely clean, admit how many cases were ACTUALLY corrupted. That’s honorable at least.
Vick is on the clock I think. LOLZ
Well heck, can’t they even get time and a half for overtime?
LOL Yeah, trickle up poverty at work. No OT pay... :\ Everyone has to sacrifice... lolz