In a year, how long can you afford to have your service down? Today the "five nines" is the target for major web players. To reach that target, a flexible and resilient infrastructure is mandatory.
|Time per year|
99.999% (five nines)
We consider that a website must have a response time of 5 seconds maximum, and aim for an average response time of 2 to 3 seconds. Load balancing provides an easy way to achieve this, without compromising performance regardless of website sessions.
However, multiplying layers and services can slow down the overall user experience. For that reason, you should consider designing your solution with asynchronous communication mechanisms that rely on communication bus with AMQP (see RabbitMQ or Apache Kafka).
In order to guarantee your customers maximum availability whatever happens in a physical datacenter, you can benefit from our Regions and Availability Zones (AZs). For more information, see About Regions, Endpoints and Availability Zones and Regions, Endpoints and Availability Zones Reference.
What to Leverage
After designing your infrastructure, you can improve or design a new application according to rules that enable you to reach the state of the art of a clustered application. You can also design your software to be fault tolerant. For more information, see Netflix Chaos Monkey.
The main thing to do is to install each service or application on a single Virtual Machine (VM) and create an OUTSCALE machine image (OMI) from it. This enables you to easily replicate a VM and deploy it several times. For more information, see Creating an OMI.
A single service needs to be provided by many VMs (see SPOF). A VM with a service is called a "node", and the collection of "nodes" providing a service is called a "cluster". For more information, see Computer Cluster Wikipedia.
Each cluster contains a load balancer to which traffic is routed, instead of directly to a VM.
- Databases and load balancers do not work the same way. We recommend that you deploy your database services in cluster mode, such as Percona engine for Mysql.
- Each critical service or each job should run on a single VM, such as a 3-tier pattern.
- Apart from databases, we recommend that you make your services stateless to improve your cluster's ability to have nodes that fail without service interruption for your users.
- Consider designing your software solution with asynchronous communication mechanism.
- Your infrastructure must grow in out scale mode, not up scale, that is, you need to add nodes when overloading instead of resizing a single node.
Never hot snapshot the volume of a database. Because of continuous I/O, this can lead to fatal integrity issues with your data.
The best way to save a database is to log all transactions, which enables you to "re-play" transactions and maintain your database up-to-date.
Databases are not designed to be placed behind a load balancer. However, most of them let you configure them as a clusters. For more information about setting up your database, see your database documentation, Eventual consistency, and CAP theorem
Like databases, a communication bus (rabbitMQ, Kafka) enables you to configure them as clusters, instead of placing them behind load balancers.
Self-Healing and Reliability
We highly recommend using supervision tools to be able to terminate or run new nodes when one of them encounters a problem.
Software Stack Organization
Split your services across a 3-tier website with a Frontal (nginx), an Intelligence with your code (python/php/ruby/...) and a Database. Each service listed below must be on a single Virtual Machine:
For each Virtual Machine where a service is installed, create an OMI that you can later use to multiple nodes.