Posted on 10/11/2012 8:11:29 PM PDT by hole_n_one
Yet again, FR fails the big LIVE THREAD test.
How does what is supposed to be the premier conservative site on the 'net continue to fail?
I would have expanded on this, except it took forever to get this simple post through the muck and molasses that we have to wade through on nights like this.
Jim, in all the years I've been at FR, I've never known you to give excuses, nor denials. I'm not expecting to hear them from you now.
I suspect that some FReepers don't fully understand that this is your baby (and now John's as well), and just how frustrating it is to not see these problems quickly resolved.
I spent many years at remote locations, maintaining national defense radars, and I am well aware of the frustration of having an intermittent and/or non-repeatable failure drive one to the point of hair pulling.
On one of those systems, I arrived at day shift to find the maintenance crew had removed and replaced every Line Replaceable Unit (LRU) with one or more of the spares.
They had depleted all of the spares stock based on the Main Computer's software "instructing" them which LRU to replace.
They had followed the automated troubleshooting software's instructions throughout the night, and just before I arrived, it had gone full circle and was now instructing them to replace the first LRU it had called out at the beginning of their shift. Spares that had been called out, and subsequently identified by the computer as failed after retest, were to be found all over the maintenance area benches and shelves.
It was a nightmare. Headquarters was readying an emergency response team and transport to fly them in to "assist" us. The Site Commander was in a bad situation because, if they did have to be sent in, he was probably going to lose his Command position, and his upward mobility in rank would be forever lost.
As the Technical Services "expert", I was on the hot seat as well to find and fix the problem ASAP. This National Missile Early Warning System had been down for over 8 hours, and its expected outage time was no more than 15 minutes (hey, it was an automated, multiple-stage, computer-controlled, self-diagnosing and failure-unit-identifying system, after all).
It took a couple of hours more to find the problem, and at first the maintenance crew I was technically guiding didn't believe the failure I led them to. After they fixed what I showed them, the system was restored and operational within a few minutes. The Site Commander also wanted to be personally shown what the problem was, and some help in verbalizing the failure and fix to the General and his staff.
We just managed to restore the system in time to cancel the visit from the assistance crew (I bet their families liked that, since they didn't have to fly across the country to help us).
As a Technical guy, it was a high stress, frustrating, and yet very satisfying experience.
So, while some here are convinced that I am just a "cheerleader for FR", I say this based on my experience in finding and fixing hardware and/or software problems: you have had, and continue to have, my support.
FR is unique. For those who don't believe it, probably any other site/forum will do.
Oh yeah, what was the problem I found? Two wire-wrapped power distribution standoff posts in the Radar Receiver Cabinets had gotten bent just close enough together to intermittently "arc over", spiking and/or temporarily dropping the voltage to the LRUs below acceptable levels. This confused the computer software's automatic single-point failure diagnostics into reporting all of those (actually functional) LRUs as failures.
Keep on trucking Jim, and tell John to do the same. Hopefully you will both know if/when outside experts are required, and they will be summoned (and be of use).
That $6k can buy a lot of on-demand cloud computing. Any number of services will allow you to spin up extra capacity for peak times then shut it down.
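A minimal sketch of what that on-demand burst could look like, assuming AWS EC2 via boto3 (the region, AMI ID, instance type, and counts are placeholder assumptions for illustration, not FR's actual setup; any comparable provider works the same way):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

def scale_up_for_debate(count=4):
    """Launch extra web servers ahead of an expected traffic spike."""
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder web-server image
        InstanceType="m3.large",          # placeholder size
        MinCount=count,
        MaxCount=count,
    )
    return [inst["InstanceId"] for inst in resp["Instances"]]

def scale_down(instance_ids):
    """Terminate the burst instances once the spike has passed."""
    ec2.terminate_instances(InstanceIds=instance_ids)
```

Spin up an hour before the debate, terminate once traffic falls off; at on-demand rates, a few large instances for a few evenings a month is a tiny fraction of that $6k.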
It sounds like you are between a rock and a hard place, as the old saying goes. Not wanting to insult John, who has, over time, proven that he is a smart fellow. But as to John not wanting help at this point: maybe remind John that pride goeth before the fall. Sometimes, no matter how smart we are, we can't see something that is right in front of our face. A fresh look from someone else might discover something that was missed.
Jim, I come from an old mainframe COBOL/DMSII programming environment.
When we knew our hardware, disk space, and database settings were adequate and still had a problem, we looked to the software.
Even on good days here on FR the response time is slower than it used to be.
I am wondering if there is a programming glitch that causes each transaction or maybe just one to somehow go into an unnecessary loop (or whatever they call it nowadays) before returning the response. This would eat up your memory and cause all other transactions to wait for this one to end.
This may not be noticeable on a slow day but would of course be exacerbated by heavy volume.
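To illustrate the failure mode described above in modern terms (a toy sketch; the handler and condition names are hypothetical, not FR's code): in a single-threaded server, one request stuck in a tight polling loop stalls every request queued behind it.

```python
import time

def resource_ready():
    """Stand-in for a condition that never clears (hypothetical)."""
    return False

def handle_request(req, spin_seconds=2.0):
    """A handler with an accidental busy-wait: it polls a condition in a
    tight loop instead of sleeping, yielding, or giving up."""
    if req == "pathological":
        deadline = time.time() + spin_seconds  # bounded only so the demo ends
        while not resource_ready() and time.time() < deadline:
            pass  # no sleep -> pins the CPU; the real bug has no deadline
    return "response for " + req

# Single-threaded dispatch: requests are served strictly in order, so
# everything queued behind the pathological request simply waits.
t0 = time.time()
for req in ["a", "pathological", "b", "c"]:
    handle_request(req)
    print(f"{req} finished {time.time() - t0:.3f}s after queueing")
```

Requests "b" and "c" each take almost nothing to serve, yet they finish seconds after queueing because the looping transaction ahead of them held the line.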
Just a thought from a mainframe dinosaur.
One other thought. While I know John takes great pride in his work, most times a pair of “fresh eyes” looking at your code can spot a bug that you’ve overlooked time after time.
More than an hour to go till the final debate tonight and Free Republic is barely usable.
Yet again
I disagree with you on a few points:
1. Real time shouldn’t be demanded or expected. There should be cacheability for short time bursts, and since people tend to hit the same pages, the potential for cache misses should be low (see the sketch after this list).
2. Transactional consistency shouldn’t be the rule here. No reason for it. Eventual is fine, and even that might be on the strict side.
3. There are plenty of improvements that can be made on the software here. I’m sure John has done an excellent job of making sure a lot of static content is cacheable, though.
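On points 1 and 3, a minimal sketch of short-TTL caching for hot pages (the fetch_page callable and the 5-second TTL are assumptions for illustration): even a few seconds of caching collapses a live-thread burst of identical hits into one backend fetch per page per window.

```python
import time

_cache = {}  # url -> (expires_at, body)

def cached_fetch(url, fetch_page, ttl=5.0):
    """Serve a page from cache if it's less than `ttl` seconds old;
    otherwise fetch it once and cache it. During a burst, hundreds of
    hits on the same live thread become one backend fetch per window."""
    now = time.time()
    entry = _cache.get(url)
    if entry and entry[0] > now:
        return entry[1]              # cache hit: no backend work
    body = fetch_page(url)           # cache miss: one real fetch
    _cache[url] = (now + ttl, body)
    return body
```

With a 5-second TTL, a thread being refreshed by thousands of viewers costs the database roughly one query every 5 seconds instead of one per refresh, and eventual consistency (point 2) is exactly what makes that acceptable.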
There’s no real reason that the servers should be this slow, and from what John’s indicated, the problem doesn’t seem to be that the database or hardware is stressed. Synchronization points aren’t always obvious, though. In a distributed system, everything pulls on everything else, often in unexpected ways; it’s not a series of isolated components connected by a network. Could be improper backoff. Could be a case of improperly managed blocking queues. Or it could be external altogether and be a DDoS.
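For the improper-backoff case, here is a sketch of the usual remedy, exponential backoff with jitter (the names are illustrative): retrying in a tight loop, or on a fixed short delay, makes every stalled client pile back onto the contended resource in lockstep, which keeps it contended.

```python
import random
import time

def with_backoff(operation, max_retries=6, base=0.1, cap=5.0):
    """Retry `operation` with exponential backoff plus "full jitter".
    The randomized delay spreads retries out so stalled clients don't
    all hit the contended resource again at the same instant."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the failure
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```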
Interesting problem to tackle, I’d love to give it a try.
Another failure on one of the most important days in recent current events.
Your skill in necromancy has increased by 1