Dear SalesForce.com - data loss isn't cool
Looks like the One Primary Data Store per Production Instance isn't working too well from an architecture point of view for some SalesForce.com customers. In North America, they have 45 instances, and NA14 (one of the 45) failed and had to be restored, losing 5 hours' worth of customer data.
What they are telling their customers:
"The service disruption was caused by a database failure on the NA14 instance, which introduced a file integrity issue in the NA14 database. The issue was resolved by restoring NA14 from a prior backup, which was not impacted by the file integrity issues.
We have determined that data written to the NA14 instance between 9:53 UTC and 14:53 UTC on May 10, 2016 could not be restored."
I believe some downtime is part of the deal when you use a system on-premise or in the cloud. Data loss is not.
Looking at an SF architecture presentation (link to ppt) from a prior DreamForce conference, they have 8,000+ customers in one of these data stores. This reminds me of apartment living, where if one neighbor sets a fire it can burn down the whole building. In my mind, taking advantage of the cloud doesn't mean giving up some physical boundaries with my neighbors. Asking for separate data stores for each customer's data and metadata is a reasonable architecture ask, and one of the reasons I appreciate Dynamics CRM.
Not a lot is disclosed about what happened other than "file integrity issues in the database", but I would have to assume that it crossed multiple customers, since they offer a $10,000 service to restore individual customers here. On a side note, wow! $10K to do a database restore.
SF customers should look carefully at the root cause analysis for this incident when it gets published and do a careful assessment of future risk. In fact, I would put a reminder on your calendar to check in a week whether it has been published yet! This type of failure indicates that not only was the integrity of the primary database lost, but the secondary logging also failed to provide any ability to do point-in-time recovery.
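For readers who have not worked with point-in-time recovery, here is a rough sketch of the idea: start from the last good backup, then replay the logged writes up to the moment you want to recover to. This is only an illustration under my own assumptions, not a description of how Salesforce's systems work. The function name, the log format, and the sample records are all hypothetical.

```python
from datetime import datetime, timezone


def restore_to_point_in_time(backup, log, target):
    """Rebuild state from a backup snapshot plus logged writes up to `target`.

    Hypothetical model: `backup` is a dict of key -> value, and `log` is a
    list of (timestamp, key, value) write records.
    """
    state = dict(backup)  # start from the last good backup, leave it untouched
    for timestamp, key, value in sorted(log, key=lambda entry: entry[0]):
        if timestamp > target:
            break  # stop replaying once we pass the recovery target
        state[key] = value
    return state


def utc(hour, minute):
    # Helper for timestamps on the day of the incident (May 10, 2016, UTC).
    return datetime(2016, 5, 10, hour, minute, tzinfo=timezone.utc)


# Made-up sample data: one record in the backup, three writes after it.
backup = {"account-1": "v1"}
log = [
    (utc(10, 30), "account-1", "v2"),
    (utc(12, 0), "account-2", "new"),
    (utc(14, 50), "account-1", "v3"),
]

# With a usable log, writes made after the backup can be replayed.
recovered = restore_to_point_in_time(backup, log, utc(14, 53))

# Without a usable log, all you get back is the backup itself.
backup_only = dict(backup)
```

In this toy model, losing the log means falling back to `backup_only`, and every write made after the backup is gone. That is the situation the NA14 notice appears to describe.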