High availability clustering is complicated for no good reason

Sat, Mar 15, 2014 at 4:31PM

My research into replication and syncing different systems has constantly resulted in finding other developers' articles that rant on endlessly about how many "gotchas" exist and how many workarounds there are. There is so much manual labor involved in maintaining multiple systems that are supposed to be identical. This has gone on for years, and it doesn't seem to be getting much better unless you have already become an expert at maintaining such a solution and have enough resources to do it right.

I know replication can break, and it can be difficult to fix on a live production system without causing downtime. Becoming an expert at fixing replication problems is hard because we don't have to do it that often, and we are more likely to run into something unknown and dangerous when we aren't very knowledgeable about something yet. I encountered many problems when I was using MySQL master -> slave replication to do backups in the past, so I gave up and switched back to mysqldump. If you have a broken replica, you are going to have considerable downtime while you fix it, and if you have multiple "masters" accepting write operations, fixing it becomes extremely hard because data changes on both sides. I really want to have multiple masters so there is no single point of failure, but it seems like the whole world is screaming "No, that's a bad idea" when I read about doing it with MySQL. Maybe I should just ignore that they have this feature.

If you rely on software you didn't write to manage all this, then you have to bear the whole cost of fixing its issues, and that could be serious if you are relying on a multi-master design you don't understand how to fix.

I've been planning a replication system from scratch that will ignore other systems like database replication, binary logs or distributed filesystems, and instead build my own API that will distribute changes to 1 or more servers in a cluster. Each server will act like a unique instance, other than my custom syncing scripts performing operations in the background to keep them identical in both directions. By taking control and saying NO to all third-party syncing mechanisms, the public web site will be extremely fast for read operations since it won't have any added latency from checking the status of data on other servers. Also, recovering from failure will be far more obvious, because my own solution will have a planned response for every type of failure situation, plus error messages and logging that I actually understand. Even worst-case failures, like rebuilding a table or shutting down a server and its replication until manually corrected, will become easy to control and understand with time. Any future upgrades of the underlying software will be less likely to change the behavior of my syncing system, which will make it easier to upgrade everything.
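To make the idea concrete, here is a rough sketch of what I mean, not a finished design. Everything in it is made up for illustration: the Node class, the in-memory dictionary standing in for each server's database, and the per-peer outbox queue. A write is applied locally right away and queued for every other server, and a background sync step pushes the queue out later, so reads and writes never wait on another machine.

```python
import logging
from collections import deque

log = logging.getLogger("sync")


class Node:
    """One server in the cluster (hypothetical). `data` stands in for its local database."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.online = True   # flipped to False to simulate an unreachable server
        self.peers = []
        self.outbox = {}     # peer name -> queue of changes not yet delivered

    def add_peer(self, peer):
        self.peers.append(peer)
        self.outbox[peer.name] = deque()

    def write(self, key, value):
        """Apply the change locally, then queue it for every peer. No network wait here."""
        self.data[key] = value
        for peer in self.peers:
            self.outbox[peer.name].append((key, value))

    def receive(self, key, value):
        """Apply a change sent by a peer. It is not re-queued, so it never echoes back."""
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        self.data[key] = value

    def sync(self):
        """Background step: deliver queued changes in order, pausing a peer on first failure."""
        for peer in self.peers:
            queue = self.outbox[peer.name]
            while queue:
                key, value = queue[0]
                try:
                    peer.receive(key, value)
                except ConnectionError as exc:
                    log.warning("sync %s -> %s paused with %d pending: %s",
                                self.name, peer.name, len(queue), exc)
                    break
                queue.popleft()  # drop the change only after it was delivered


# Two servers that replicate to each other.
a, b = Node("a"), Node("b")
a.add_peer(b)
b.add_peer(a)

a.write("page:1", "hello")
b.online = False
a.sync()          # b is unreachable, so the change stays queued on a
b.online = True
a.sync()          # delivered once b is back
```

The sketch deliberately leaves out the hard part: if both servers change the same key before they sync, the last one delivered simply wins. A real version would need a rule for those conflicts, plus a queue that survives a restart.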

Big companies like Facebook aren't using MySQL's built-in replication features exclusively; they have built a lot of systems around it using their favorite language. I don't want to rely on their crap because they are dealing with different problems than I am. I'm not going to try to scale to hundreds of machines. I just want to have 2+ and keep it small for quite a while.

Another thing that is common in high availability clustering is that wonderful "free open source" software sometimes turns into a mandatory "enterprise commercial license" for some of the best features. MySQL and Nginx both reserve some of their features exclusively for enterprise customers. Those hidden features may fix some of the replication and scaling headaches you have with the community-supported version of the software. My business plan doesn't include spending thousands per year on third-party support subscriptions. I want to be self-sufficient on all software or I'll stop using it. I think there is an expectation in the software world that big business should pay for all the free stuff the little guys need. It should be easier to function like a big company with resilient services, without having to pay the bill of a large company.

My replication process will probably be slower than some binary log/journal system written in C or whatever these other systems are doing, but it will be much faster to maintain and upgrade because I will understand it. Maybe the best replication solution for X software is your own software. That's my plan. Any thoughts?
