In a year, how long can you afford to have your service down? Today, "five nines" availability is the target for major web players. To reach that target, a flexible and resilient infrastructure is mandatory.
|Availability|Downtime per year|
|99.999% (five nines)|About 5 minutes 15 seconds|
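The downtime budget behind an availability target is simple arithmetic. The sketch below assumes a 365-day year; the function name `downtime_minutes_per_year` is hypothetical and chosen for illustration only.

```python
def downtime_minutes_per_year(availability_percent: float,
                              days_per_year: float = 365.0) -> float:
    """Return the downtime budget, in minutes per year, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    minutes_per_year = days_per_year * 24 * 60
    # The unavailable fraction of the year is what remains after the target.
    return minutes_per_year * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = downtime_minutes_per_year(target)
        print(f"{target}% availability: {budget:.2f} minutes of downtime per year")
```

With these assumptions, five nines leaves a budget of roughly 5.26 minutes per year, which is why such a target cannot rely on manual recovery.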
We consider that a website must respond within 5 seconds at most, and we aim for an average response time of 2 to 3 seconds. Load balancing provides an easy way to achieve this: it spreads requests across several servers, so performance holds up regardless of the number of website sessions.
However, multiplying layers and services can slow down the overall user experience. For that reason, consider designing your solution with asynchronous communication mechanisms that rely on a message bus, for example RabbitMQ (which implements AMQP) or Apache Kafka (which uses its own protocol).
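To illustrate the idea of asynchronous communication, here is a minimal sketch in which a producer hands messages to a queue and returns immediately, while a consumer processes them in the background. It uses Python's in-process `queue.Queue` purely as a stand-in for a real message bus; the names `consume` and `run_demo` and the message contents are hypothetical.

```python
import queue
import threading

STOP = None  # sentinel telling the consumer that no more messages will arrive


def consume(bus: queue.Queue, results: list) -> None:
    """Process messages one by one until the sentinel is received."""
    while True:
        message = bus.get()
        if message is STOP:
            break
        # Stand-in for slow work (sending an email, resizing an image, ...).
        results.append(f"processed {message}")


def run_demo(messages: list) -> list:
    """Publish messages without waiting for them to be processed."""
    bus: queue.Queue = queue.Queue()
    results: list = []
    consumer = threading.Thread(target=consume, args=(bus, results))
    consumer.start()
    for message in messages:
        bus.put(message)  # returns at once; the producer is not blocked by the work
    bus.put(STOP)
    consumer.join()  # only the demo waits, so that it can return the results
    return results


if __name__ == "__main__":
    print(run_demo(["order-1", "order-2"]))
```

In a real deployment the queue would live in a dedicated broker, so that producers and consumers can run on different VMs and fail independently.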
In order to guarantee your customers maximum availability whatever happens in a physical datacenter, you can benefit from our Regions and Availability Zones (AZs). For more information, see About Regions, Endpoints, and Availability Zones and Regions, Endpoints and Availability Zones Reference.
What to Leverage
Going to the Cloud enables you to use mechanisms such as Regions, Availability Zones, and load balancers to design a highly available, load-distributed infrastructure without modifying your application stack.
After designing your infrastructure, you can improve or design a new application according to rules that enable you to reach the state of the art of a clustered application. You can also design your software to be fault tolerant. For more information, see Netflix Chaos Monkey.
The main thing to do is to install each service or application on a single Virtual Machine (VM) and create an OUTSCALE machine image (OMI) from it. This enables you to easily replicate a VM and deploy it several times. For more information, see Creating an OMI.
To avoid a single point of failure (SPOF), a single service needs to be provided by many VMs. A VM with a service is called a "node", and the collection of "nodes" providing a service is called a "cluster". For more information, see Computer cluster on Wikipedia.
Each cluster contains a load balancer that receives incoming traffic. Your Virtual Machines should never receive direct incoming traffic.
- Run each of your critical services or jobs on its own dedicated VM, as in a 3-tier pattern.
- Make your infrastructure scale out rather than scale up: when it is overloaded, add nodes instead of resizing a single node.
- Use several Availability Zones (AZs) to guarantee your service.
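The scale-out rule above can be sketched as a small sizing calculation: given a load and the capacity of one node, add nodes rather than resize one, and keep at least one node per Availability Zone. The function name `nodes_needed`, its parameters, and the figures used are assumptions made for illustration.

```python
import math


def nodes_needed(requests_per_second: float,
                 capacity_per_node: float,
                 zones: int = 2) -> int:
    """Return how many identical nodes are needed to absorb the load.

    At least one node per Availability Zone is kept, even under light load.
    """
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    for_load = math.ceil(requests_per_second / capacity_per_node)
    return max(zones, for_load)


if __name__ == "__main__":
    # Hypothetical figures: 950 requests/s, each node handles 200 requests/s.
    print(nodes_needed(950, 200))
```

This sketch sizes for nominal load only; surviving the loss of a whole Availability Zone would require extra headroom on top of this count.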
Subnets and Security Group Isolation
Because Cloud Computing follows a security-by-design philosophy, we use Virtual Private Clouds (VPCs) and subnets to logically isolate each business layer from the others, and your overall infrastructure from other infrastructures. For more information, see Creating and Managing Subnets in Your VPC.
Three VMs are launched from existing OUTSCALE machine images (OMIs). Each VM has its own dedicated security groups, and is placed in a subnet dedicated to a single business logic.
The mechanism is consistent: for each business logic, you get one OMI, one subnet, and one security group. Elements discussed here appear in red in the graph below:
Instead of growing a single Virtual Machine per service (more CPU, more RAM), which is called scaling up, we prefer to distribute the load across several machines. To do that, the load of each business layer is distributed through a load balancer. For more information, see Load Balancing Unit (LBU).
Databases cannot be managed with Load Balancers.
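As an illustration of how a load balancer spreads the load of one business layer, here is a minimal round-robin sketch. Round-robin is an assumption chosen for simplicity, not a description of how the LBU distributes traffic, and the class name `RoundRobinBalancer` and the node names are hypothetical.

```python
import itertools
from typing import Iterable


class RoundRobinBalancer:
    """Hand each incoming request to the next node of the layer, in turn."""

    def __init__(self, nodes: Iterable[str]) -> None:
        node_list = list(nodes)
        if not node_list:
            raise ValueError("a balancer needs at least one node")
        # cycle() loops over the nodes forever, in the order given.
        self._next_node = itertools.cycle(node_list)

    def route(self, request: object) -> str:
        """Return the name of the node that should handle this request."""
        return next(self._next_node)


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
    for request_id in range(5):
        print(request_id, "->", balancer.route(request_id))
```

Because clients only ever talk to the balancer, nodes can be added or removed behind it without changing the address that clients use.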
Feeding the Multitude
Based on existing OMIs, you can run multiple business nodes. For more information about adding nodes, see Working with Back-end Instances.
We replicate the previous infrastructure and cross flows between load balancers and nodes. These elements appear in red in the graph below.
To protect against a datacenter failure, each horizontal layer (web server, intel, or database) is located in a separate subnet and in a different Availability Zone.
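The placement idea can be sketched as follows: the nodes of each layer are spread over the available zones in turn, and a simple check confirms that every layer keeps at least one node if a zone is lost. The function names and the zone names `az-a` and `az-b` are hypothetical.

```python
from typing import Dict, List


def place_nodes(layers: List[str], zones: List[str],
                nodes_per_layer: int) -> Dict[str, List[str]]:
    """Assign the nodes of each layer to zones in turn (zone 1, zone 2, zone 1, ...)."""
    if not zones:
        raise ValueError("at least one zone is required")
    return {
        layer: [zones[i % len(zones)] for i in range(nodes_per_layer)]
        for layer in layers
    }


def survives_zone_failure(placement: Dict[str, List[str]], failed_zone: str) -> bool:
    """Return True if every layer still has a node outside the failed zone."""
    return all(
        any(zone != failed_zone for zone in node_zones)
        for node_zones in placement.values()
    )


if __name__ == "__main__":
    placement = place_nodes(["web server", "intel", "database"], ["az-a", "az-b"], 2)
    print(placement)
    print(survives_zone_failure(placement, "az-a"))
```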
3DS OUTSCALE provides load balancer replication, and data replication for snapshots and OMIs. The interval between two consecutive snapshots determines your RPO (Recovery Point Objective), that is, the maximum amount of recent data you can lose.
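To make the link between snapshots and RPO concrete, the sketch below computes the longest gap between consecutive snapshots, which is the worst-case window of data loss. The function name `worst_case_rpo` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable


def worst_case_rpo(snapshot_times: Iterable[datetime]) -> timedelta:
    """Return the longest interval between two consecutive snapshots."""
    ordered = sorted(snapshot_times)
    if len(ordered) < 2:
        raise ValueError("at least two snapshots are needed to measure an interval")
    # Pair each snapshot with the one that follows it and keep the widest gap.
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    snapshots = [
        datetime(2024, 1, 1, 0, 0),
        datetime(2024, 1, 1, 6, 0),
        datetime(2024, 1, 1, 10, 0),
    ]
    print(worst_case_rpo(snapshots))
```

If the result is longer than the data loss your business can accept, the snapshot frequency needs to increase.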
Go Further: Self-Healing and Reliability
We highly recommend using supervision tools, which enable you to terminate failing nodes or launch new ones when one of them encounters a problem.
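A supervision tool of this kind typically runs a reconciliation step: terminate the nodes that fail their health check, then launch replacements until the desired count is reached. The sketch below shows one such step. The `reconcile` function and the `launch` and `terminate` callbacks are hypothetical placeholders for whatever your supervision tool or cloud API actually provides.

```python
from typing import Callable, Dict, List


def reconcile(node_health: Dict[str, bool],
              desired_count: int,
              launch: Callable[[], str],
              terminate: Callable[[str], None]) -> List[str]:
    """Run one self-healing pass and return the names of the nodes now in service.

    node_health maps a node name to the result of its latest health check.
    """
    in_service: List[str] = []
    for name, is_healthy in node_health.items():
        if is_healthy:
            in_service.append(name)
        else:
            terminate(name)  # remove the failing node from the cluster
    while len(in_service) < desired_count:
        in_service.append(launch())  # replace it with a fresh node
    return in_service


if __name__ == "__main__":
    counter = iter(range(1000))
    nodes = reconcile(
        {"web-1": True, "web-2": False},
        desired_count=3,
        launch=lambda: f"web-new-{next(counter)}",
        terminate=lambda name: print("terminating", name),
    )
    print(nodes)
```

Running such a pass on a schedule keeps the cluster at its target size without manual intervention, provided that replacement nodes are launched from an up-to-date OMI.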