Posted on 10/31/2011 11:58:08 AM PDT by Danae
On Friday, October 21, 2011, this column exposed the scrubbing of Supreme Court cases from the legal research website Justia.com. On the following Monday, October 24, Justia founder and CEO Tim Stanley gave a very short response to Declan McCullagh at cnet.com about this scandal. (CNET is a tech-heavy website aimed at developers more than the legal community.)
There Stanley asserted that citations in the 25 relevant cases (and more) were mangled due to a coding error. The code in question is called Regular Expressions, Regex for short. This code is essentially a filter. It is simple in that it will include or exclude specific characters from a result. A result would be what you see on an internet browser. Pure data is filtered through Regex code and put into its correct positions on a webpage in a template format.
The code error to which Stanley attributes the missing data is a .* instead of a \s.
"In this case, Stanley said, what happened is that Justia's programmers typed in ".*" (which matches any character) when creating a regex. It's now an "\s" (which matches only spaces)." - Declan McCullagh
This column investigates the plausibility of Tim Stanley's statements to CNET by consulting a professional familiar with Regex. Dr. David Hansen, a current university professor of computer science, explains what those two bits of code do.
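To see what those two bits of code do in practice, here is a minimal Python sketch. The sample text and both patterns are made up for illustration and are not Justia's actual code or data. In Python's re module, \s matches a single whitespace character (space, tab, newline), while .* matches any run of characters other than a newline, as many as it can.

```python
import re

# Hypothetical sample text; not taken from Justia's data.
text = "Minor v. Happersett, 21 Wall. 162; see also 100 U.S. 303"

# Pattern A: "\s" allows exactly one whitespace character after the volume number.
narrow = re.compile(r"(\d+)\sU\.S\.\s(\d+)")

# Pattern B: ".*" allows any run of characters after the first number it finds.
wide = re.compile(r"(\d+).*U\.S\.\s(\d+)")

narrow_match = narrow.search(text).group(0)
wide_match = wide.search(text).group(0)

print(narrow_match)  # 100 U.S. 303
print(wide_match)    # 21 Wall. 162; see also 100 U.S. 303
```

Under these assumptions, the narrow pattern picks out the single citation, while the wide pattern stretches from the first number on the line all the way to the last citation and takes everything in between with it.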
(Excerpt) Read more at examiner.com ...
Thank you! It has been a very interesting journey so far!
Um, actually, Regex is quite versatile and capable of doing substitution all day long......
http://www.java2s.com/Code/Java/Regular-Expressions/QuickdemoofRegularExpressionssubstitution.htm
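A minimal Python sketch of that kind of substitution, assuming a made-up link markup and URL scheme (nothing here is taken from Justia's site):

```python
import re

# Hypothetical sentence for illustration only.
sentence = "See 83 U.S. 36 and 88 U.S. 162."

# Match "<volume> U.S. <page>" and capture the two numbers.
cite = re.compile(r"(\d+)\sU\.S\.\s(\d+)")

# Rewrite each cite as a link; the "/us/<volume>/<page>" path is an assumption.
linked = cite.sub(r"[\1 U.S. \2](/us/\1/\2)", sentence)

print(linked)  # See [83 U.S. 36](/us/83/36) and [88 U.S. 162](/us/88/162).
```

The brackets and path text show up in the output only because the replacement template spells them out. The regex engine reuses what it captured and adds what the programmer told it to add.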
True. However, the regular expression used to process the text clearly changed between 11/6/2006 and 11/18/2008, perhaps multiple times. So, it is impossible to say with certainty whether or not the actual text 83 U.S. 73 was or wasn't in the 11/6/2006 file.
If the regular expression used to parse the 11/6/2006 file unintentionally filtered that text out, subsequent changes to the regular expression could have resulted in that text being included. That is the inherent nature of parsing text with regular expressions. It is unclear at this point whether or not that specific cite was "inserted."
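Here is a rough Python sketch of how that can happen. The source line, the spacing in "U. S.", and both pattern revisions are assumptions for illustration; none of it is Justia's actual code.

```python
import re

# Hypothetical source line; the spaced "U. S." form is assumed for the example.
source = "See 83 U. S. 73 and 100 U.S. 303."

# Two hypothetical revisions of an extraction pattern.
pattern_v1 = re.compile(r"\d+\sU\.S\.\s\d+")     # requires "U.S." with no inner space
pattern_v2 = re.compile(r"\d+\sU\.\s?S\.\s\d+")  # also accepts "U. S."

found_v1 = pattern_v1.findall(source)
found_v2 = pattern_v2.findall(source)

print(found_v1)  # ['100 U.S. 303']
print(found_v2)  # ['83 U. S. 73', '100 U.S. 303']
```

Same source text, two pattern revisions, two different results. The first revision silently drops a cite that the second one picks up, with no change to the underlying file.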
That said, the odds are astronomical that a regular expression unintentionally filtered out the Minor v. Happersett and Slaughterhouse references. I'd have to know more about how they processed those case files to say anything with certainty, but their explanation thus far is laughable. I'm surprised CNET didn't immediately counter that answer with more questions. I would have.
Ask Weazie, the Obama political operative scouring "natural born citizen" from the internet.
Regex is brittle; you have to get it right or the results get unpredictable.
“That said, the odds are astronomical that a regular expression unintentionally filtered out the Minor v. Happersett and Slaughterhouse references.”
I don’t know anything about coding, but I DO know that in every screenshot comparison that Leo or Danae have posted thus far, the changes have involved pre-1875 Supreme Court cases (like Minor and Slaughterhouse).
Before 1875, Supreme Court cases were assigned volume numbers within a series named for the court's Reporter of Decisions (hence "Wall." for Wallace). Starting in 1875, the court adopted the U.S. Reports numbering system, and volume numbers were retroactively assigned to earlier cases.
http://en.wikipedia.org/wiki/Reporter_of_Decisions_of_the_Supreme_Court_of_the_United_States
So whereas Minor had been 21 Wall. 162, it now became 88 U.S. 162. Slaughterhouse had been 16 Wall. 36, and it became 83 U.S. 36.
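Both of those examples fit a fixed offset: Wallace's Reports run from volume 1 to 23, which correspond to 68 through 90 U.S., so the U.S. Reports volume is the Wallace volume plus 67. A minimal Python sketch of that renumbering follows. The function name and pattern are hypothetical, it only handles "Wall." cites, and it is not a claim about how Justia did it.

```python
import re

# 1 Wall. = 68 U.S., so 16 Wall. = 83 U.S. and 21 Wall. = 88 U.S.
WALLACE_OFFSET = 67

# Match "<volume> Wall. <page>" and capture both numbers.
WALL_CITE = re.compile(r"\b(\d+)\sWall\.\s(\d+)")

def wallace_to_us(text):
    """Rewrite Wallace cites in text as U.S. Reports cites (hypothetical helper)."""
    def convert(match):
        wall_volume = int(match.group(1))
        # Wallace's Reports only has volumes 1 through 23; leave anything else alone.
        if not 1 <= wall_volume <= 23:
            return match.group(0)
        return f"{wall_volume + WALLACE_OFFSET} U.S. {match.group(2)}"
    return WALL_CITE.sub(convert, text)

converted = wallace_to_us("Minor v. Happersett, 21 Wall. 162; Slaughterhouse, 16 Wall. 36")
print(converted)  # Minor v. Happersett, 88 U.S. 162; Slaughterhouse, 83 U.S. 36
```

Note that this sketch changes only the volume and reporter abbreviation. The case names and page numbers pass through untouched.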
The pre-2006 Justia pages cited all these older Supreme Court cases by their reporter volume numbers. Then they apparently used some bad code to try to change them to the U.S. Reports numbers.
So if you look at the screenshots from Luria (which show only a small portion of the case), you see several cases cited, three of which are pre-1875 cases. All three of those (Minor, Osborn, Babbitt) got affected (specifically, for all three the case name and pre-1875 citation were replaced by a hyperlink showing the US Reports citation), while all the post-1875 cases weren’t touched.
Leo and Danae have only posted images of, I think, 5 of their supposed 25 cases, and even then mostly just very narrowly cropped screenshots. I’ll go ahead and predict now that if they ever publish more thorough screenshots from those other 20, you’ll see that same pattern generally hold across all of them.
Agreed. I’ve been writing regex for 20 years and I still surprise myself with undesired results. It’s the nature of the beast.
Any of you guys ever have Regex insert new words that weren’t in the text originally?
If that’s the case, then this becomes more curious in my opinion. Why not simply say, “Oh, we were changing all of the pre-1875 case references and that’s likely what happened. Nothing sinister - just a batch program update gone wrong.”
And why remove the change history from Wayback?
If they were simply changing volume numbers on a batch of pre-1875 cases, Stanley could’ve immediately given a very direct, very sensible answer to put the accusations to rest. Why the horsesh*t?
Sure, when I intentionally insert such text.
Now I have had weird stuff like that happen when I am reading from one file to update a second file. The regex processing engine is pretty complex. When you're reading from one file and writing to another, it's a bit like playing twister.
GREAT chatting with you; Thursday is pretty good for me as far as getting married. Surprised you proposed so quickly, though!
I see what you are getting at.... I’m going to do some homework on this in the next 2-3 hours....
Ha! I move quickly. Can’t do Thursday though. I’m celebrating my 19th wedding anniversary on the 6th ... so I’ll be out of town for the weekend.
:)
Thank GOD we are both Mormons. It allows us such marital flexibility. :)
The SCOTUS text shouldn’t need any “updating” though. It hasn’t been changed since the decision was first issued so why would there be any “updating”?
If they were changing their format - to allow for clickable links, for instance, or if (as someone here has suggested) the references were in a different format for older cases than for newer ones and the programmers wanted to make the format uniform throughout - that might be “updating”, but why would that include language pretending that it’s part of the original text of the court’s reasoning?
I’m about as far from a techno-nerd as a person can get so details are lost on me in this regard, but I do hope Laz will look at the specifics and tell us if the Regex explanation is plausible.
The timing of the changes and their correction is a whole ‘nother subject as well.
Yep. Precisely. Not only that, but we predicted they WOULD do that, because they had done it in the past. The best indication of future behavior is past behavior.
Happy anniversary, and congratulations! We just celebrated 20 in June; time flies.