Considerations when implementing high availability for a web application

Wed, Apr 10, 2013 at 3:05PM

If you host a web application, do you want to make it fault tolerant?

Will you use high availability to scale out to multiple servers?

To what extent would you take this?

Below, I discuss how you might answer these questions for yourself. I help broaden your concerns so that you know what you need to look for elsewhere.

High availability costs more with commercial software

When you first add high availability / clustering to your web application, you'll find it is neither easy nor cheap. If you rely on third party or commercial software, you may have to pay extra for a special version that supports high availability. Many software companies realize that it is often only larger companies that implement high availability, so they raise the price significantly, since they feel a large company can absorb the cost and still see good value. A good example is ColdFusion. At the time of this writing, Adobe ColdFusion 10 Standard costs $1499 without a discount. The Standard edition doesn't include support for high availability clustering. To do clustering, you need Adobe ColdFusion 10 Enterprise Edition, which costs $8499 without a discount.

How does a small software company achieve affordable high availability?

If you are currently using commercial solutions for your small business, you really need to evaluate what those solutions will cost as your business grows. Microsoft has programs like Bizspark for ASP.NET and their tools which seem to cost nothing at first, but after years of building on Microsoft technology, you'll have spent a lot of extra money to maintain those solutions. Many offers like the ones Microsoft makes to new web developers are quite cheap at first, but the price ramps up quickly as your needs do. In the end, you may well realize that you need to stop using commercial software.

I used to use ColdFusion, but in preparing for the future, I saw that it was never going to be easy for me to scale ColdFusion out to multiple servers. The license cost was so extreme for me that I needed to look at either another language or an open source CFML engine. So back in 2011, I migrated to Railo because it is free, open source and supports clustering without any commercial licensing. I also switched from Windows to Linux and made sure all of the other Linux software I use is free, open source and supports high availability.

So far, I haven't tried to implement high availability, but it is something that I CAN do now at least as a small business owner. Sure, it may be easy for you to recommend a high availability solution as an employee at some company, but if you are paying the bills out of pocket like I do, would you really go so far? It is a choice for all of us to make.

High availability using free software is still not free.

You may have chosen free solutions for high availability, but it's important to consider the labor costs of implementing and maintaining your solution. Everything is fairly simple when you are updating a single server, but once you need changes to go live on multiple servers, you have to consider quite a few more details. You also need to document and test how to resolve problems when things go wrong. Human error is a constant we have to face every day. Ideally, you'd eventually automate some of the deployment process so that human error doesn't continue to be an issue. If your database stops replicating, or files get "stuck" on only one server, you need to have a plan for how to resolve these things quickly. You also want to routinely review the correctness of the current implementation to ensure there aren't any edge cases that are going unchecked. Often this means you'll need to add new features to your software to support the synchronization and maintenance of multiple servers. You can't scale a significantly complex application by relying on third party solutions alone.

Database clustering considerations

What kind of replication will you use? There are many kinds of replication supported by database software. Some of them have wildly different performance and SQL compatibility implications.

Will replication recover quickly after a crash? You need recovery to be as fast and automated as possible. Your team doesn't have the time to constantly be fixing database issues that should just work.

How long does it take for the slave servers to "catch up"? If you take down a server for maintenance, will it be able to catch up when it comes back online? If your app is so write heavy that replication falls behind, you may need to redesign the application to perform write operations less frequently. This could mean changing how you log information or doing it in bulk INSERT operations instead.
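To make the bulk INSERT idea concrete, here is a minimal sketch in Python. It uses the built-in sqlite3 module only as a stand-in for whatever database you actually replicate, and the table name `request_log` and the function `flush_log_batch` are made up for illustration. The point is simply that many buffered log rows become one transaction instead of one write per request.

```python
import sqlite3

def flush_log_batch(conn, rows):
    """Write all buffered log rows in one transaction instead of one INSERT each."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO request_log (path, status) VALUES (?, ?)", rows
        )

# In-memory database used purely for the demo; a real app would use its own connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE request_log (path TEXT, status INTEGER)")

# Rows collected in memory while handling requests (hypothetical data).
buffer = [("/home", 200), ("/about", 200), ("/missing", 404)]
flush_log_batch(conn, buffer)
buffer.clear()
```

How much this helps replication depends on your database and replication mode, so treat it as something to measure rather than a guarantee.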

Will all queries replicate correctly? If not, you need to train your team to avoid them and perform code reviews to ensure replication compatibility.

How do you perform a minor/major upgrade to the database software when using replication? Sometimes you may have significant downtime if you don't follow the correct upgrade procedure. You usually must follow a precise order of operations. Consult the manual for the details.

How will you migrate to new hardware later? If you need to switch to a new server for better performance, you need to understand how to move your replication solution to the new servers. It may not be safe to simply move the binary files. You might need to dump and re-import. Consult the manual and other resources for best practices.

You'll need to research and understand the features of your database software in great detail before deploying a solution into production. This knowledge takes time, and if you don't test the worst case scenario, you are taking a big risk. You never want to have to learn how to solve a problem when your servers are down. Document your recovery procedures and keep them available to anyone supporting the servers.

How will you handle session variables?

In many cases, you'll want a session to keep making requests to the same server, so that access to its data stays fast by using memory instead of session replication. This is usually referred to as "sticky sessions". There may be cases where you want to replicate sessions and not rely on "sticky sessions" to handle your traffic. You may need to evaluate multiple approaches to determine the best solution. Also, if you want your sessions to survive the server restarting, you'll need some form of replication, whether it uses the database or a distributed cache system.
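As a rough illustration of the "sticky sessions" idea, the sketch below maps a session ID to one server by hashing it. This is only one possible approach, not how any particular load balancer does it, and the server names and the `pick_server` function are hypothetical.

```python
import hashlib

SERVERS = ["web1", "web2", "web3"]  # hypothetical server names

def pick_server(session_id, servers=SERVERS):
    """Map a session ID to one server so repeat requests land on the same machine."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a stable number, then pick a slot.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

The same session ID always gets the same answer as long as the server list doesn't change. If a server is added or removed, many sessions will move to a different machine, which is one reason you may still want replicated sessions.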

One server for write requests or multiple?

When users upload a file or a new blog article, should only one server write those changes to the file system and database and then replicate those changes to the other servers?

Would you rather have multiple servers accepting data instead?

You may have significantly simpler code if you accept all writes on one server. Many web applications are mostly read-only traffic, so this may be appropriate in the beginning. What happens if the server that handles the writing needs to go down? You might need to display a message showing that the server is undergoing maintenance, unless you build the application in such a way that the write-enabled server can be changed to another one in the cluster. When you consider how you handle duplicate file names and other uniqueness problems across multiple servers, it may seem quite difficult to allow multiple servers to perform writing at the same time. Sometimes, you'll be forced to have a single write-enabled server to be able to solve the problem because of how some third party features work. You also need to determine if your single server could handle the load of all read/write requests necessary for your administrative features to function with adequate performance. If not, you may need to make that server more powerful or redesign the application to support writes across all servers in the cluster.
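Here is a small sketch of the single write-enabled server idea, including the ability to move the write role to another machine before maintenance. The `Cluster` class and the server names are invented for this example, and a real setup would also have to move the database and file replication roles, which this doesn't attempt.

```python
class Cluster:
    """Tracks which server accepts writes; every server may serve reads."""

    def __init__(self, servers, writer):
        if writer not in servers:
            raise ValueError("writer must be one of the servers")
        self.servers = list(servers)
        self.writer = writer

    def route(self, is_write):
        """Return the servers allowed to handle this request."""
        if is_write:
            return [self.writer]
        return list(self.servers)

    def promote(self, new_writer):
        """Move the write role, e.g. before taking the old writer down."""
        if new_writer not in self.servers:
            raise ValueError("unknown server")
        self.writer = new_writer

cluster = Cluster(["web1", "web2"], writer="web1")
```

With this, `cluster.route(True)` only ever returns the current writer, and calling `cluster.promote("web2")` switches it without the rest of the application needing to know which machine that is.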

High availability as a fault tolerance solution

It is very hard for most people to justify the time and money needed to go from 99.999% to 99.9999% up-time. It's unfair to simply say "always be fault tolerant". It's completely valid to have a recovery plan that is based on reliable backups and a minimal amount of downtime. I've only seen single hard drive failures a few times in the last 9 years and very few major downtime incidents were actually hardware related. Most of the time, the developers are the ones who cause downtime.
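To put those percentages in perspective, here is a quick calculation of how much downtime per year each up-time target allows. It's just arithmetic, assuming a 365 day year, and the function name is my own.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap years

def downtime_seconds_per_year(availability_percent):
    """Allowed downtime per year for a given up-time percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR

for target in (99.9, 99.99, 99.999, 99.9999):
    print(f"{target}% -> {downtime_seconds_per_year(target):,.1f} seconds/year")
```

Going from 99.999% to 99.9999% means going from roughly 5 minutes of downtime a year to roughly half a minute. That last step is where the cost is hardest to justify.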

Will a fault tolerance solution save you from human error? I think this is a valid question to ask yourself. I often make careful plans, but something still goes wrong. We often don't have the time to test every feature of a complex application, so we test only what we think has changed. Often developers can't know everything that has changed, and this is where bugs occur. It usually takes a user using the application in a way that wasn't tested for the bug to appear. You won't be able to eliminate this kind of downtime unless you perform more thorough testing. It's not just about throwing money into hardware to solve downtime problems.

I can't say fault tolerance isn't a good thing, but everything is a calculated risk. Fault tolerance between 2 data centers would be even better than hosting multiple servers in a single data center. However, the reality of current technology is that the speed of the Internet and latency are still significant problems. Syncing servers between data centers could be more difficult and may cost quite a bit more when you factor in security solutions, load balancing and bandwidth. For the servers I've maintained, I can count the number of major downtime incidents that have occurred in past years. It's a very low number, and they were usually caused by self-inflicted mistakes that fault tolerance wouldn't always have solved. Some errors are only apparent when real users/robots hit the sites. We need to be realistic when we set goals for the technology solutions we provide. We can't be perfect. Even if you employ better testing strategies, almost no one tests at the level required to be 100% sure there are no bugs. To me, this means it's not worth spending double/triple to implement high availability / fault tolerance right now. My customers are happy enough with the current level of reliability. I would still make mistakes even if I had high availability.

How many web sites are actually using high availability?

The large majority of web sites are not using high availability solutions. We can't accurately measure this in reality, but we can assume that a large portion of the web is running simpler software that people have installed on shared hosting accounts, such as CMS, blog and forum apps. Typically these are implemented with the cheapest possible method. Even when you outsource the hosting of your application to another provider, you can't be sure their solution is completely fault tolerant. Sure, their marketing may claim to do things the best way, but have they tested it and do they maintain it properly? You always need some kind of disaster plan (backups) to be confident about the reliability of your data, and you probably will experience some downtime no matter how robust / expensive the solution is. Notable outages have occurred at every major company. If they can't have perfect reliability with their thousands of servers, you probably can't either. Anyone on the Internet is always vulnerable to denial of service attacks and Internet Service Provider outages. High availability is only going to improve your up-time a relative amount - not make it 100%.

Distributed/network filesystem latency considerations

Samba and NFS are quite broken for certain kinds of changes - they don't always update predictably and not all software is able to recognize the change. For example, Railo and Nginx have trouble seeing certain changes made to a file on a Samba/NFS mount. This may be a problem with the software we use more than with Samba per se, but it is something that we have to deal with. To be confident a replicated file is available and functioning on all servers, you'd need to write the software to confirm its availability. Relying on the default behavior of Samba and NFS just doesn't work very well for certain disk changes, especially if the file size hasn't changed or if there is a need to lock files. To keep things faster, caching techniques may be employed, which cause further problems since you don't have guarantees that the cache is correct. You may end up with 404 and 500 errors in your application when the filesystem has not yet reflected the latest changes.
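One way to confirm a replicated file really matches is to compare content hashes rather than trusting file size or timestamps. The sketch below is only an assumption about how such a check could look. The function names are mine, and the demo uses two local temp files standing in for the copies on two servers. In practice you'd want each server to hash its own local copy, since reading through a network mount may hit the same stale cache you are trying to detect.

```python
import hashlib
import tempfile
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file's contents, or None if the file is missing."""
    p = Path(path)
    if not p.is_file():
        return None
    return hashlib.sha256(p.read_bytes()).hexdigest()

def replicas_in_sync(source_path, replica_paths):
    """True only when every replica exists and matches the source byte for byte."""
    expected = file_digest(source_path)
    if expected is None:
        return False
    return all(file_digest(r) == expected for r in replica_paths)

# Demo: the replica has the same size as the source but different content.
with tempfile.TemporaryDirectory() as tmp:
    a, b = Path(tmp, "a.txt"), Path(tmp, "b.txt")
    a.write_text("v2")
    b.write_text("v1")
    stale = replicas_in_sync(a, [b])  # sizes match, contents differ
    b.write_text("v2")
    fresh = replicas_in_sync(a, [b])  # contents now match
```

The demo deliberately uses files of identical size, since that is exactly the case where a size-based check would be fooled.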

Even if your final solution depends on Samba/NFS, it probably will need some configuration options changed. It's not plug and play. You might need to design a queue system for replicating data consistently to all servers so that you have confirmation that things are truly in place before making the new content live. This process will also need to be retried when a server that wasn't available becomes available again, to ensure the filesystem is consistent and not just the database. This would represent a significant redesign in many applications that currently publish content immediately to the public. Having an asynchronous process manage availability of content could be abstracted and integrated with the app in a consistent way, but you simply don't have this without planning for it.
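A replication queue with retries could look something like the following sketch. Everything here is hypothetical: `replicate_all`, the `copy_to_server` callback and the fake copy function exist only to show the shape of the idea, which is that content only goes live once no jobs are left unconfirmed.

```python
from collections import deque

def replicate_all(pending, copy_to_server, max_rounds=3):
    """Try each (server, path) job; re-queue failures for a later round.

    Returns the jobs still unconfirmed after max_rounds.
    """
    queue = deque(pending)
    for _ in range(max_rounds):
        if not queue:
            break
        retry = deque()
        while queue:
            job = queue.popleft()
            if not copy_to_server(*job):
                retry.append(job)  # server unavailable, try again next round
        queue = retry
    return list(queue)

# Fake copy function for the demo: "web2" fails on its first attempt only.
attempts = {}
def fake_copy(server, path):
    attempts[server] = attempts.get(server, 0) + 1
    return server != "web2" or attempts[server] > 1

leftover = replicate_all([("web1", "/a.jpg"), ("web2", "/a.jpg")], fake_copy)
publish = not leftover  # only make the content live when every copy is confirmed
```

A real version would persist the queue somewhere and wait between rounds, so that a server that is down for an hour still gets its files when it comes back.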

Shared memory / global state considerations

You also might be tracking the global state of the application in shared memory. If the application's shared memory controls whether things happen once or not, you may have serious problems when they try to run again on multiple servers. You also may have caches that need to be replicated across multiple servers, and the shared memory will need to detect when it should be flushed intelligently - this will not happen without rewriting some of the application. Sharing state between servers could incur significant overhead or delays, or could require complex changes to the code. Most applications built with CFML (mine included) use shared memory extensively. Determining how they could be scaled to multiple servers requires some thorough analysis of how the application, server and session scopes are utilized. Sure, you can say it is a best practice to avoid doing these things, but that just doesn't happen until you plan for deployment on multiple servers. You could say that using session scope is always bad, but then you are making the programming harder or the performance slower by using an external distributed solution. There is always some sacrifice for each technical decision. There isn't one correct answer.
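For the "things that should happen once" problem, one option is to move the flag out of each server's memory and into something all servers share, such as a database table with a unique key. This is a minimal sketch under that assumption. It uses sqlite3 in memory as a stand-in for a shared database, and the table and function names are invented.

```python
import sqlite3

def claim_task(conn, task_name, server_name):
    """Return True if this server won the right to run task_name."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO task_claims (task, owner) VALUES (?, ?)",
                (task_name, server_name),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # another server already claimed it

# Stand-in for a database that every server in the cluster can reach.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_claims (task TEXT PRIMARY KEY, owner TEXT)")

first = claim_task(conn, "nightly-report-2013-04-10", "web1")
second = claim_task(conn, "nightly-report-2013-04-10", "web2")
```

Only the first server to insert the row gets `True`, so the task runs once. The trade-off is exactly the one described above: every check is now a network round trip instead of a memory lookup.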

Many applications don't support high availability

Almost none of the apps you can install for free support high availability "out of the box". You'll have to handle all the concerns discussed yourself. Sure, you may have a solution in the end, but it was significantly more difficult than just paying for another server and a load balancer. You want to do high availability with Wordpress, Mura CMS or something else? It's not going to be simple and you might find very few resources that assist you. You need to be somewhat of an expert with the application to be confident your configuration is correct. Most of us don't have time to figure out third party software to that level of detail. So this is why we'll continue to host sites that don't support high availability. One could design an application to support high availability and provide a significant amount of documentation, but when things go wrong, the user is going to need to know a lot more about how things work compared to someone who installs something like Wordpress as a single instance. It's quite different to go from single server/single instance to multiple.

High availability as a performance solution

You think you can throw hardware at the performance problem by adding more servers? Well, maybe. If you've solved all of the technical concerns previously discussed, you should be able to do this.

Latency alone could defeat the performance benefit of adding another server. You can take a single server quite far with modern hardware. A single machine can scale further with multiple processors, multiple hard drives and more memory for quite a while before you need to switch to separate machines as a necessity for performance. Instead of using hard drives, you could switch to fast solid state drives. Instead of using the cloud, use dedicated servers or a private cloud. Use a CPU with a faster clock speed or multiple CPUs in a single machine. You could separate the database from the web application on separate machines to improve performance without adding fault tolerance. We shouldn't forget that software or hardware RAID can be used to add redundant hard drives for performance, fault tolerance or both.

When you start using distributed filesystems, replication and network connections to connect to other machines, you are adding latency which can be up to 1000 times slower than operations in memory on a single server. You will be able to scale out to more traffic and requests, but unless you make some clever optimizations to the software, each request may actually be a bit slower due to the overhead.

Also consider that you could run parts of the application on different servers instead of trying to run the entire application on multiple cloned servers. With this approach, you wouldn't incur any performance penalties to synchronize them and you'd be able to scale even further at low cost, since each server is dedicated and optimized to its task. With the way that web apps are moving to do a lot of ajax requests, it could be somewhat easy to move a section of the application to a separate application. This also helps you keep the codebase smaller and easier to test. Scaling out with hardware and redundancy is about solving a completely different problem. You are really saying that you value up-time a great deal more than saving money or being efficient with the resources you pay for.

Do you need to save on hardware or on labor?

Sometimes we can abstract a solution so that the computing resources that it runs on are less important - virtualization is the biggest one, but you could also design an app that has a huge framework which you want people to be able to access at all times. The way we build an application may sometimes require scaling out to many machines because that somehow makes development easier, more secure or cheaper. It is still wasteful in terms of the hardware resources consumed, but it may save you money on labor costs. Making code faster can sometimes be more expensive than paying for more hardware. However, once you reach the point where hardware costs more than labor, you're going to regret that decision. I think that is what you see happen at big companies like Facebook, where historically they started out with a solution that couldn't scale very well (PHP) and then made a slow transition from that technology to something that does scale. They can justify the cost of programming to reduce server costs.

How is my application optimized?

I try to provide a solution that wastes minimal hardware resources, but I still use a dynamic language (CFML), so there is waste involved. A lot of that waste is eliminated by using the shared memory scopes and relying on other Linux software to handle caching at different levels. This also means I'd have to do some rewriting of the software to support high availability correctly. I don't think high availability should be used as a justification for writing slower applications and I want people to value my application for being faster. All of my open source competitors have the same problems with their software, so I don't think my lack of high availability support is unique. It seems that almost no open source CMS supports it in fact. So this would give me a competitive edge if I look at implementing it as a built-in feature. The only CMS products I found with this feature charged for it.

Often a company has no choice but to spend money on more servers if their software is not efficient or the load is too high. We aren't always correct about how things will need to scale and we're often not encouraged to write code that scales because we might not even need it. It's very hard to know what the best technical solution is, since that is actually a moving target. As hardware and latency improve, we are able to implement things that were not possible before and concern ourselves less with hardware. So far, hardware is getting faster at a rate that exceeds the growth of my current web site traffic. However, as I move into cheaper or free applications, I may find I need more hardware.

High availability as a feature

Do you distribute your software? Perhaps, you should consider high availability as a feature to give people more reasons to use your software.

I would love to implement high availability support in my Jetendo CMS application in the future. While high availability is not important to me at the moment, users of my software may see it as a vital feature when it becomes available. I could also charge people a premium for a high availability hosting solution that they can't get elsewhere. Often small developers like me need to show a quality difference to earn someone's business. Many turnkey solutions and do it yourself approaches do a lot of things in an inferior way. High availability is another way to distinguish yourself in an increasingly competitive world.

The solution is different for every business.

My correct solution could be a failure for your company. I've tried to outline the things you should look at when considering high availability.

So there you have it: high availability is a complex and costly nut to crack. If you think you can handle all the nuances of implementing it and can see the value for your business, give it a go.

Do you use high availability and how much has it helped your business? Comment below.
