You may be interested in another EMC customer - GunBroker.com.
http://www.gunbroker.com/user/EmcCrash/EmcCrashFollowup.asp
This story was posted on the Yahoo Message Board a few months ago, and it is still available on the GunBroker.com web site at the link above.
EMC Storage Array Crash - The Real Story
March 20, 2002
Two months ago we reported the EMC storage array crash that caused us nearly 40 hours of downtime. Many people wrote to us asking for additional information about the crash, but at the time we had none to provide. After a month of demanding a detailed explanation, we finally received some additional information, which we provide below.
It has also been brought to our attention that EMC Corporation disseminated false and misleading information to customers, potential customers, and the press. We received a copy of this information and received independent confirmation from multiple sources that it came directly from EMC Corporation. Because EMC Corporation has attempted to portray us as liars and to blame their failure on us, we feel obligated to reveal the facts of the matter to defend our good name.
The Players
GB Holdings / WebVentures / GunBroker.com - www.gunbroker.com - that's us.
EMC Corporation - www.emc.com - the manufacturer of the Clariion storage array on which our database was being hosted.
ManagedStorage Inc. (MSI) - www.managedstorage.com - ManagedStorage is an EMC partner and is the actual owner of the storage array our data was hosted on. According to the EMC/ManagedStorage partnership press release "As a pure-play storage service provider achieving platinum status in EMC's xSPerience Provider Program, ManagedStorage will work with EMC to bring the benefits of ManagedStorage's unique expertise in storage and networking, and EMC's information infrastructure to Internet-based businesses".
Inflow - www.inflow.com - the data center our systems are hosted in. Inflow also acts as a reseller for third-party services and resells ManagedStorage services under the Inflow brand name StorageFlow.
The Equipment
The storage array is an EMC Clariion; we were never told the model. Our systems were connected to the Clariion via dual redundant Fibre Channel cards attached to redundant Fibre Channel switches.
What Caused the Crash and Downtime?
According to the information we have been given, the problem was caused by a bug in the EMC firmware that caused the EMC storage processor to hang when an MSI technician terminated a telnet session with the EMC array. The EMC storage processor hang was so severe that the system was unable to fail over to the backup storage processor.
Inflow's exact words: "On 18-Jan-2002 at 12:37 PM, an engineer from our SSP was performing remote status checks on host names and IP addresses on our disk platforms. Remote status verification was accomplished by accessing the Clariion disk platform through an IP connection utilizing a telnet session. For this particular status check, the host names and IP addresses were verified at the initial login and password challenge screen. Therefore, the SSP engineer did not proceed to actually login, but executed a "control"-"right bracket" command to terminate the telnet session. The termination of the telnet session in this manner caused the active storage processor to hang, preventing failover to the redundant processor."
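For readers who asked what such a status check looks like in practice, here is a minimal sketch of the kind of operation Inflow describes: connect to the array's management address, read the login banner to confirm the host name, and disconnect without ever logging in. The address and the use of a plain Python socket are our own illustrative assumptions, not MSI's actual tooling; the point is that closing a telnet session at the login prompt is an utterly ordinary operation that should never hang a storage processor.

```python
import socket

# Illustrative values only -- not MSI's actual management network.
ARRAY_MGMT_HOST = "192.0.2.10"   # hypothetical Clariion management IP
TELNET_PORT = 23

def check_array_banner(host, port=TELNET_PORT, timeout=10):
    """Connect to the array's telnet port, read the login banner to
    verify the host name / IP, then disconnect without logging in."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.settimeout(timeout)
        banner = conn.recv(1024).decode("ascii", errors="replace")
        # The engineer only needed the banner; no credentials are sent.
        return banner

if __name__ == "__main__":
    print(check_array_banner(ARRAY_MGMT_HOST))
    # Closing the connection here is the step that, per Inflow, triggered
    # the firmware bug and hung the active storage processor.
```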
EMC Corporation, which provides training, support, and on-site technical assistance to ManagedStorage, dispatched technicians to the Inflow data center to attempt to fix the problem. It was not until nearly eleven hours later that EMC technicians had the array back online.
Because we feared data corruption, we requested that EMC/MSI create a bit-by-bit backup of our data before we restarted our servers. Several more hours passed before this task was completed.
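A bit-by-bit (block-level) copy of this kind can be made with standard tools. The sketch below assumes a host where the affected LUN is visible as /dev/sdb and an image destination path; both are our own example values, not the actual configuration. It simply reads the raw device in fixed-size chunks, writes them to an image file, and records a checksum so the copy can be verified later.

```python
import hashlib

# Hypothetical device and destination paths -- substitute real ones.
SOURCE_DEVICE = "/dev/sdb"            # raw block device exposing the LUN
IMAGE_FILE = "/backup/lun_image.raw"  # destination image file
CHUNK_SIZE = 4 * 1024 * 1024          # 4 MB reads

def block_level_copy(src, dst, chunk_size=CHUNK_SIZE):
    """Copy a block device bit for bit into an image file and return
    the SHA-256 of the data so the copy can be verified afterwards."""
    digest = hashlib.sha256()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    print("image checksum:", block_level_copy(SOURCE_DEVICE, IMAGE_FILE))
```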
It took a few hours to restore the old database from tape. The majority of the remaining downtime was caused by the need to meticulously recover as much data as possible from the corrupted database.
What Caused the Data Corruption?
Inflow, EMC, and ManagedStorage apparently do not want to address this issue because we have been unable to get a solid written response. Our opinion is that EMC's technicians caused the data corruption. During the time that the array was offline we kept asking why it was taking so long to bring the system back up. The response we were given was that there was a 'dirty cache' on the storage processor and that the EMC technicians were having difficulty flushing the cache to physical storage. We believe that the EMC technicians were never able to flush the cache of the hung storage processor and instead chose to reset the equipment, losing the data in the cache and corrupting our database in the process.
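To make the reasoning concrete: a storage processor with write-back caching acknowledges writes as soon as they land in cache and only later flushes them to disk. If the processor is reset while the cache is still dirty, every acknowledged-but-unflushed write vanishes, which is exactly the kind of loss that corrupts a database's on-disk structures. The toy model below is our own illustration of that failure mode, not EMC's implementation.

```python
# Toy model of write-back caching: acknowledged writes still sitting in
# cache are lost if the controller is reset instead of flushed.

class WriteBackController:
    def __init__(self):
        self.disk = {}    # durable storage: block -> data
        self.cache = {}   # dirty cache: acknowledged but not yet on disk

    def write(self, block, data):
        self.cache[block] = data      # host receives an ack here
        return "ack"

    def flush(self):
        self.disk.update(self.cache)  # normal path: dirty blocks reach disk
        self.cache.clear()

    def hard_reset(self):
        self.cache.clear()            # reset path: dirty blocks are simply gone

ctrl = WriteBackController()
ctrl.write(1, "committed transaction record")
ctrl.write(2, "index page update")
ctrl.hard_reset()                     # reset without a successful flush
print(ctrl.disk)                      # {} -- acknowledged writes never reached disk
```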
Of course, if EMC's firmware did not contain a serious bug that compromised the system, this whole situation would never have occurred. It is simply ridiculous that terminating a telnet session could crash the storage processor.
Inflow told us that EMC has since released an update to its FLARE code (EMC's term for the Clariion firmware) that fixes this bug.
Supporting Documentation
Inflow Trouble Ticket revision 1
Inflow Trouble Ticket revision 2
EMC's Response to GunBroker.com
EMC's False and Misleading Statements
"Gunbroker.com [sic] outsources most of its IT operations to a service provider, an EMC customer" - No, we don't outsource "most" of our IT operations. At the time we outsourced storage management to ManagedStorage, a company that is a platinum member of EMC's xSPerience Provider Program and that EMC lists as a "Tier 5 Participant" in their EMC Proven E-Infostructure program. MSI is not simply some random EMC customer; MSI is a company that EMC has partnered with specifically to generate demand for EMC products by providing services to the small business market segment occupied by companies like GB Holdings.
"While EMC Customer Service has been involved in diagnosing the problems experienced by the service provider, EMC storage systems have not been found to be at fault for the outage" - EMC is 100% at fault for the outage and the data corruption. The bug that caused the storage processor to hang was a bug in EMC firmware. Furthermore, EMC technicians were unable to flush the cache and instead reset the equipment, causing loss and corruption of our data.
"Gunbroker.com [sic] was responsible for conducting its own backups and the service provider involved has suggested this was the source of Gunbroker.coms [sic] corrupted database and lengthy downtime" - No, EMC, your buggy firmware and the actions of your technicians were the source of our corrupted database and lengthy downtime. See above.
What the Vendors did After the Crash
We would love to be able to report that EMC, ManagedStorage, and Inflow took responsibility for their problems and worked with us to fix the issues and to cover our financial loss that occurred as a direct result of the problem. Unfortunately this did not happen.
ManagedStorage - because of the possibility that the logical structure of the volume on the EMC was corrupted, we backed our data up off the EMC volume and migrated to a new database server running Windows 2000 (the old server had been on NT 4.0). ManagedStorage was never able to get Windows 2000 to see the EMC volume. After a great many hours of following MSI's advice in trying to resolve this issue, we gave up and canceled the MSI / StorageFlow contract.
EMC Corporation - EMC contacted us directly and asked us to take down the original information we had written on Jan 20 that explained the outage to our customers. Their PR guy suggested that we explain that we had "hardware problems" instead of directly naming EMC Corporation. According to Inflow, EMC also pressured Inflow to try to get us to take the information down.
Inflow - we had high hopes that Inflow would step up to the plate and reimburse our loss. After initially telling us it would take "full responsibility" for the matter, Inflow decided to back out and now refuses to reimburse us for the loss. Inflow's marketing states, "Our Internet data centers and managed services provide 100% availability, performance optimization and comprehensive system management. We help customers ensure that their critical applications are always up, always open for business". They also advertise a "100% customer satisfaction guarantee". Inflow makes a lot of promises, but when the chips are down Inflow does not deliver. As a result of their failure to deliver according to the terms of their SLA and their subsequent decision to hang us out to dry, we have elected to move out of their data center at the end of the current contract.
If you are looking to build a reliable infrastructure that depends on quality vendors who stand behind the products they sell, we strongly recommend that you avoid EMC, Inflow, and ManagedStorage.
Preventing Future Problems
We understand that in the technology business problems are going to occur. What separates the wheat from the chaff is how you deal with those problems. The vendors named in this document serve as an illustration of how some companies handle problems poorly. We are reviewing our own procedures to make sure that we handle problems quickly and minimize the impact of any issues that arise.
Prior to this matter we relied too much on vendors who made grand promises but failed to deliver. Subsequent to the disaster we have rearchitected our infrastructure to take greater control in-house, reevaluated our choice of vendors for key services, and substantially overhauled our disaster recovery plan. Current and future changes will reduce our dependence on third-party vendors, minimize the possibility for loss of data, and allow us to more quickly recover in the event of a catastrophic failure. This has been a painful and costly lesson, but one that we learned well.
As always, we are working hard on behalf of our customers to make sure that GunBroker.com provides the highest level of reliability.