19 Jan

The strangest Oracle problem I ever encountered – can you guess the cause?

Before I joined Blue Gecko, I did independent remote DBA work, and called myself ORA-600 Consulting. Stemming from my hair-raising experiences in the trenches at Amazon in the late ’90s / early 2000s, I decided to specialize in emergency DBA work for companies in the midst of crises (I know, great idea for someone who wanted to get away from the Amazon craziness, right?).

One day in 2009, a company in Florida called my cell phone at 2AM. They described their problem as follows:

We have a 32-bit Intel server running Red Hat Enterprise Linux 4 and Oracle Database Enterprise Edition 9.2.0.1. There are four databases ranging in size from 20G to 100G. The storage is EXT3 filesystems on partitions of an Apple Xserve RAID array configured as RAID 5.

We had a power outage yesterday, and the database server powered down and booted back up. Prior to yesterday, it has not rebooted for about one year. We have been running trouble-free for the previous year. Upon reboot, Oracle started automatically, but all of the databases appeared as they did about one year ago. It is like the database hasn’t been saving the changes we have been making for the past year. None of the inserts, updates or deletes made in the past year are present in the databases. We are absolutely flummoxed. Please help!

I logged into the server and it was just as they described. Even the alert log and messages files ended suddenly about one year prior, and picked up again on the day of the most recent reboot. There was no trace of the intervening 12 months of work. The customer was ready to resort to their backups, but wanted to understand the problem before they proceeded. In addition, restoring backups would mean losing the last 24 hours of transactions, since archivelogs had not gone to tape for that long, and they were missing just like everything else from before the most recent reboot.

They weren’t the only ones who were flummoxed. I just sat there thinking, “where do I start?” After some poking around, though, I solved the problem. Any guesses what went wrong here? I’ll post the solution in about a week. No fair posting the solution if I’ve told you this story before!

13 Responses to “The strangest Oracle problem I ever encountered – can you guess the cause?”

  1. Mark J. Bobak 19. Jan, 2012 at 2:23 pm #

    Wow…that sounds really bizarre….I’m stumped….

    I’ll post again if I come up with an idea….

  2. Tony van Esch 19. Jan, 2012 at 2:35 pm #

    Bizarre problem!

Sounds like a reconfigured filesystem layout (snapshot/dup) that wasn’t committed in /etc/fstab

    After reboot old filesystems were mounted and hence the databases that hadn’t been opened since that duplicate action are back (and the current db’s hidden). Nice one :-)

    Am I close?

    Kind regards,
    Tony

  3. Elham Pishgah 19. Jan, 2012 at 2:40 pm #

Em… I’m very new to Oracle, but if something like this happens I won’t look for the data, I’d look at the TIME! Maybe the system time was messed up.

  4. Noons 19. Jan, 2012 at 7:19 pm #

    Yup, I’d go with the file system "swap" pointing to an old LUN config, along the general lines of Tony’s description. Quite a possible occurrence when dealing with SANs.

  5. Kevin Fries 20. Jan, 2012 at 10:33 pm #

I’m at a site with loads of NFS mounts and I’ve seen many issues due to overlays of the filesystems in the "worng odrer". As Noons points out, it could quite possibly be pointing to the wrong LUNs. My other thought is that it has something to do with a disaster recovery test, or an old copy of the DB that was meant to be saved for an auditing requirement.

  6. gary 24. Jan, 2012 at 4:23 am #

    Given that you solved the problem (I assume without relying on the 24hr out-of-date backup) it means the data changes were somewhere.

    I also infer that the ‘about a year’ is consistent with the last reboot. My guess is that at the last reboot there was a failover to a standby which they had actually been running on for the year. With the outage, they restarted everything and got reconnected to the primary which had been unused for a year rather than the up-to-date standby.

  7. Tangy 01. Feb, 2012 at 5:25 am #

    I hope this is not something related to standby or primary setup. He said alert logs have not been changed. So seems something with filesystem or disks that are mounted. Disks that might be mounted after the DR might not have been saved in fstab.

  8. Shah 07. Feb, 2012 at 8:27 am #

In my opinion, one reason could be the retention policy…

  9. Kevin Fries 16. Feb, 2012 at 11:07 pm #

    Any chance of providing the solution you found? I’m curious and it’s been a while.

  10. Jeremiah Wilton 17. Feb, 2012 at 10:04 am #

    Most of the comments are pretty close. What the customer didn’t tell me was that they had not one, but two, Apple Xserve RAID arrays. One had been used to develop the system, and they had migrated to the other – wait for it – one year prior.

    I found this out by searching through the Linux messages file, in hopes I would find a clue of some sort. I saw two SCSI devices being discovered, one being assigned /dev/sda and the other /dev/sdb. Everything was mounted on /dev/sda, so I mounted /dev/sdb on /mnt just to look at it. There were the database and logs with entries ending in recent timestamps.

    The fstab had a line like:
    /dev/sda1 /u01 ext3 defaults 0 0

So they were mounting based on device name. That meant that whichever disk was discovered first by the SCSI probe became /dev/sda, and its first partition got mounted on /u01. I helped them label the volumes using tune2fs and change the fstab to mount based on disk label. They were much happier after that! They came on board with Blue Gecko when ORA-600 Consulting did, and are still a customer!
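    For anyone curious, the fix looked roughly like this (the label name `u01` and the partition numbers are my illustration here, not necessarily the exact values we used):

    ```shell
    # One-time: stamp a persistent filesystem label onto the partition
    # that actually holds the current data (ext3, so tune2fs works).
    tune2fs -L u01 /dev/sdb1

    # Then, in /etc/fstab, mount by label instead of by device name,
    # so the mount no longer depends on SCSI probe order:
    #
    #   LABEL=u01  /u01  ext3  defaults  0 0
    ```

    With LABEL= in fstab, it no longer matters which array the kernel happens to enumerate as /dev/sda after a reboot.
    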

    Thanks everyone for contributing!

  11. paul 10. Jul, 2012 at 7:41 pm #

    Database incarnation? But it would seem like a filesystem mount issue. Do they use EMC BCVs? ie did they mount a BCV volume that was synced a year ago?

  12. paul 10. Jul, 2012 at 7:42 pm #

    jeez I just read the answer…

  13. Stefan 04. Oct, 2012 at 8:24 am #

Been there.. ;) In my case they had put the VMware machine into backup mode and forgotten about it. Then they ran out of disk space on the ESX host and just removed the changelogs (oops, crashing machines..). All they told me at first was that they had rebooted the machine and Oracle had done a timewarp a year backwards.
