Ten years of verification, Tencent database RTO<30s, RPO=0 high-availability solution for the first time to fully reveal the secret

10.years of verification, Tencent database RTO<30s, RPO=0 high-availability solution for the first time to fully reveal the secret

In order to help developers better understand and learn distributed database technology, in March 2020, Tencent Cloud Database, Yunjia Community and Tencent TEG Database Working Group launched a three-month online technical salon "Do you want to The secrets of domestic database that you know are all here! "Invite dozens of senior database experts from the goose factory every Tuesday and Thursday evening to interpret the core architecture, technical implementation principles and best practices of the three goose factory self-developed databases, TDSQL, CynosDB/CDB, and TBase. March is the special month of TDSQL. This article will bring a live review of the second "Cracking the High Availability Problem of Distributed Databases: TDSQL High Availability Solution Implementation".

Hello everyone, the topic I am sharing today is TDSQL's multi-location, multi-center high-availability solution. TDSQL is a financial-grade distributed database launched by Tencent. In terms of availability and data consistency, based on the self-developed strong synchronous replication protocol, it has high performance while ensuring cross-IDC double copies of data. On the basis of strong synchronous replication, TDSQL has implemented a set of automatic disaster tolerance switchover scheme to ensure zero data loss before and after switchover, and provide 7×24 hours of continuous high availability services for the business.

In fact, not only the database, any system with high requirements for availability needs a highly available deployment architecture. This will involve some professional terms such as: different places, multiple livelihoods, dual livelihoods, elections, etc., which will be mentioned in today's sharing. In addition, it also includes the familiar two-place three centers, two-place four centers, and same-city dual livelihoods. And other concepts. It is worth mentioning that today's sharing is not only for the database, but also for the deployment architecture of any high-availability system.

In this sharing, we will introduce several typical deployment architectures of TDSQL, as well as the advantages and disadvantages of various architectures. Because in actual production practice, there will be various resource constraints. For example, the disaster recovery effect of having one computer room and two computer rooms in the same city is completely different. For another example, although there are two computer rooms, the specifications of the two computer rooms are quite different. Some computer rooms may have better network links, and some computer rooms may have poor network links, etc.... So, how to use limited resources Building a set of TDSQL with high cost performance or high performance under conditions is the main content we will discuss in this sharing.

Before entering the topic formally, let's review the last time we shared the core features and overall architecture of TDSQL.

Well, let's take a look at the core features of TDSQL first.

Our focus today is on "financial-grade high availability". How does TDSQL achieve more than 99.999% availability? The so-called high availability of five nines means that the unavailable time throughout the year cannot exceed 5 minutes. We know that failure is an unavoidable phenomenon. At the same time, failures are also hierarchical. From software failures, operating system failures, to machine restarts and rack power failures, this is a process from low to high disaster levels. The database needs to consider and respond to higher-level failure scenarios, such as the power failure of the entire computer room or even natural disasters such as earthquakes and explosions in the city where the computer room is located. In the event of such a failure, whether our system can first ensure that the data is not lost, and secondly, how long it takes to restore the service under the premise of ensuring that the data is not lost, these are all issues that need to be considered for financial-grade high-availability databases.

TDSQL database consistency: strong synchronization mechanism is the core guarantee

First of all, let’s review the key feature of TDSQL—strong synchronization mechanism. It is the key to TDSQL to ensure that data will not be lost and error-free. Compared with the semi-synchronous replication of MySQL, the performance of TDSQL strong synchronous replication is closer to asynchronous.

So how does this high-performance strong synchronization work? Strong synchronous replication requires that any request that responds to a successful service must be successfully placed on at least one backup machine in addition to the successful placement on the host. So we see that after a request arrives at the host, it is sent to the standby machine immediately, and the host can answer the business successfully only after one of the two standby machines responds successfully. That is to say, any successful response to the front-end business request must have two copies, one on the primary node and the other on the standby node. Therefore, strong synchronization is a key guarantee for multiple copies of data. By introducing a thread pool model and a flexible scheduling mechanism, TDSQL asynchronousizes the work threads, which greatly improves the performance of strong synchronization. After the transformation, its throughput is close to asynchronous.

After reading strong synchronization, let's review the core architecture of TDSQL. Load balancing is the entrance to the business. The business request reaches the SQL engine module through load balancing, and the SQL engine forwards the SQL to the back-end data node. The upper part of the figure is the management and scheduling module of the cluster. As the master controller of the cluster, it is responsible for resource management and fault scheduling. Therefore, TDSQL is divided into: computing layer, data storage layer, and cluster management nodes. As we have emphasized before, for cluster management nodes, one cluster is sufficient to deploy one set, usually an odd number of deployments such as 3, 5, or 7. Why is it odd? Because when a disaster occurs, an odd number can form an election. In other words, if three nodes are deployed in three computer rooms, if one computer room fails, the other two computer rooms can communicate with each other and cannot connect to the third. In the computer room, you can reach a consensus on the third computer room failure and kick it out. We can understand the management modules of the three computer rooms as the three brains of the cluster. After one of these brains is broken, if the remaining brains can reach half of the initial number, they can continue to provide services, and vice versa.

Deployment practice of high-availability clusters

The above is a review of some core features of TDSQL. Next, let's take a look at the model selection of each module. For a distributed database with separation of computing and storage, how should we choose a machine? We know that if you want an IT system to maximize its value, it is necessary to squeeze out the resources of the machine as much as possible, so that these resources are worthwhile and more efficient. If the CPU of this machine runs very full, but there is no load on the IO, or if the memory is equipped with 128 G, it actually only uses 2 G. This is a deployment with a very low efficiency ratio, which cannot give full play to the overall performance of the machine and the system.

The first is the LVS module. First of all, it is not an internal component of TDSQL as the access layer. The SQL engine of TDSQL is compatible with different load balancing, such as software load LVS, hardware load L5, etc. Load balancing as the access layer is generally a CPU-based service, because it needs to maintain and manage a large number of link requests, which consumes more CPU and memory. Therefore, the recommended configuration here has a relatively high CPU, 16vCPU, 32G memory, and must be a 10 Gigabit Ethernet port. Today, the cost of network card equipment is very low, so 10 Gigabit network cards are generally installed. The vCPU here emphasizes that 16 logical cores are sufficient (maybe there are only 8 physical cores), because most of our programs are multi-threaded.

Let's look at the compute node again. If the cluster is small and the resources are relatively tight, computing nodes can reuse machines with storage nodes, because the models of storage nodes can basically meet the needs of computing nodes, a 16vCPU, 32G+ memory, and 10 Gigabit Ethernet port.

After talking about the access layer and the computing layer, let's look at the data storage layer again. The storage node is responsible for data access and is an IO-intensive service. It is recommended to use a PCI-E SSD and a separate physical machine is required. For the database, we recommend deploying it on a real physical machine, which is more stable than a virtual machine. In addition, it is recommended to build another layer of Raid0, if possible, to make the read and write capabilities of the data nodes stronger. Some students will definitely ask why the data node does not make a Raid5 or Raid10 but directly makes Raid0. Because TDSQL itself has a one-master, multiple-slave architecture, and even more slaves can be added, there is no need for us to continue redundancy in the disk array. Infinite redundancy will only reduce the performance ratio. As a data node, the recommended configuration here is 32vCPU and 64G memory. The Innodb engine used by the data node is an engine that prioritizes the use of cache, which means that large memory has a significant effect on performance improvement. Therefore, we recommend the models with large memory, 10 Gigabit Ethernet, and PCI-E SSD.

Next is the management node: the recommended configuration is 8-core CPU, 16G memory, and 10 Gigabit Ethernet port. The management node task is relatively light, and a cluster only needs a small number of management nodes. If there is no physical machine and a virtual machine can be used, the configuration is obviously lower than the previous computing and storage nodes.

For the backup node, the larger the hard disk is, the better. It is mainly responsible for storing cold data, and ordinary SATA disks can be used.

Don't pay too much attention to the model of the above machine. It is an internal number of Tencent and has no practical meaning. Here is a brief summary, computing nodes rely on CPU and memory, and there are not too many requirements for disks. Although the storage node also has higher requirements for CPU memory, it emphasizes the power of IO (requires PCI-E SSD).

Through the introduction just now, I hope that it can help you to further deepen your understanding of TDSQL, especially this kind of database with separation of computing and storage. From the interpretation of the model configuration, we can clearly understand what type of model is matched to make the system perform the best.

To sum up, if the models are properly matched and the business is used in accordance with the specifications, then the strong performance of the database can be easily played out, that is, higher business support capabilities can be obtained with lower operating costs. Massive businesses are accompanied by cash flow, such as advertising, games, and e-commerce. If a low-cost system can handle such business requests easily and efficiently under full load, the economic effect will be considerable.

Cross-city and cross-computer room-level disaster recovery deployment plan

In the third part, we begin to cut into our main topic. In this chapter, we will introduce several typical deployment schemes, those familiar terms: three centers in the same city, three centers in two places, multiple live in different places, and what meaning does dual live in the same city mean? In other words, what kind of effect can it bring? Next, I will reveal the answers to these questions one by one.

l "3.Centers in the Same City" Architecture

The first part is the three-center architecture in the same city. As the name suggests, the three centers in the same city: In a city, there are three computer rooms A, B, and C. TDSQL still adopts the "one master and two standby" structure. Obviously we need to deploy the three data nodes in three computer rooms, where the main node is in one In the computer room, the two standby nodes are deployed in the other two computer rooms.

Each IDC provides two highly available LV5 as a load balancing system. Why does every IDC have to put an LV5? Because each IDC has its own business, it needs to have an independent load access. From the perspective of the access layer, the three computer rooms are a relatively parallel peer-to-peer architecture. The three computer rooms have their own services. Perhaps the first computer room supports the business of a large area of ​​the country, and the second computer room is another large area. District, reciprocity is such a meaning. This structure is relatively simple, and the overall structure is relatively clear.

The management node is not shown in this figure. We just said that the management node can be regarded as the brain of the entire cluster and is responsible for judging the current global situation. The three computer rooms obviously need to deploy three brains. The reason for the "three" has just been mentioned. When one of the brains has a problem, the other two can form a majority and complete mutual voting to confirm that the faulty brain will be kicked out.

"Single-City Single Center" Architecture

There are several scenarios for the "single center in the same city" architecture:

The first scenario is that IDC resources are relatively tight and there is only one data center. In this scenario, cross-machine room deployment is not possible, and can only be deployed in a cross-rack mode. When the master node fails or the rack where the master node is located fails, it can automatically switch.

The second scenario is that the business pursues the ultimate performance, and even cross-IDC network delay cannot be tolerated. Although the current computer rooms are all optical fiber networks, the network delay between the two computer rooms separated by 50km is less than 1ms. But some special services can't even tolerate a 1 millisecond delay. In this case, we can only deploy the active and standby in the same computer room.

The third category is used as a remote disaster recovery room. As a disaster recovery storage, there is generally no actual business access. It is more likely to be used for backup and archiving, so the investment in its resources is relatively limited.

The fourth is as a test environment, so I won't say more about this.

"Two places and three centers" architecture

Next, I will talk to you about the protagonist of this sharing-the "two places and three centers" architecture, which is a common deployment method for banks and a basic deployment structure required by regulatory requirements. This architecture provides better availability and data consistency at a lower cost through two data centers in the same city and one data center in another place. Automatic switching can be achieved in node abnormalities and IDC abnormalities, which is very suitable for financial scenarios and is the deployment method recommended by TDSQL.

In terms of deployment methods, we look at it from top to bottom. There are two computer rooms in the same city and one disaster recovery computer room in a remote location. The top layer is the brain management module of the cluster, which is deployed across three computer rooms.

The deployment of the management module can be either "2+2+1" or "1+1+1". We know that if we follow the "2+2+1" deployment method, when the first computer room fails, there will be "2+1" brains left. "2+1" is more than half of 5, and the rest "2+1" formed a majority and kicked out the failed node while continuing to provide services.

Continuing to look down, we see that the data node adopts the "one master and three standby" model, and it is strongly synchronized across the computer room and asynchronous with the computer room. Why is it asynchronous in the same computer room and cannot be strengthened synchronously? If it is in strong synchronization with the computer room, since it is closer to the master node than the other two standby nodes across the computer room (the average distance between IDC1 and IDC2 is at least 50 kilometers), each request sent by the business to the master node is It is this strong synchronization node in the same computer room that responds first, and the latest data will always only fall on the standby node in the same computer room. And we hope that the two copies of the data should be located in different computer rooms that are more than 50km apart, so as to ensure that the data can be consistent when switching between master and backup rooms across computer rooms.

Some people may ask, there is no difference between the asynchronous node configured by IDC1 and the non-release. Here is an explanation of why it is better to have this asynchronous node. Let's consider a situation. When the IDC2 of the standby computer room fails and all two nodes in the standby computer room are down, the master node in IDC 1 becomes a single point. At this time, if strong synchronization is turned on, because there is no response from the standby machine, the master node still cannot provide services; but if strong synchronization is turned off to continue to provide services, there is a single point of risk for data. If the master node has a software and hardware failure at this time, the data will be re Can't get it back. A better solution is to add a cross-rack asynchronous node to IDC1. After IDC2 hangs up, this asynchronous node will be upgraded to strong synchronization. In this way, under the premise that there is only one computer room left, we can still guarantee a cross-rack copy, reducing the single-point risk of the host.

After reading the two computer rooms in the main city, let's take a look at the remote disaster recovery computer room. As a remote disaster recovery computer room, it is generally a computer room that is more than 500 kilometers away from the main node and has a delay of more than 10 ms. Under such network conditions, asynchronous replication can only be used to synchronize data between the disaster recovery node and the main city. Therefore, the remote disaster recovery node assumes more of the responsibility of backup, and there will not be too many formal business visits daily. Although it looks like a vase on the surface, it is absolutely impossible without it. If a city-level failure occurs one day, the disaster recovery instance can still recover more than 99% of our data. It is precisely because of this asynchronous weak relationship between the disaster recovery node and the master node that our disaster recovery instance is allowed to be an independent deployment unit in the backup city.

In addition to being used as asynchronous data backup for the remote disaster recovery computer room, another important responsibility is: when a computer room in the main city fails, it forms a majority with another normal computer room in the main city, kicks out the failed computer room and completes the main-standby switchover. . This brain, deployed in a remote place, does not participate in the main city most of the time, and only intervenes when a computer room in the main city fails. Under normal circumstances, the modules of the main city access the brains of the main city, and the modules of the backup city access the brains of the backup city, and cross-access will not cause the problem of excessive delay.

"Two Centers" Architecture

After talking about the "three centers", let's talk about the "two centers" architecture. Specifically, there are only two computer rooms in the same city. According to our previous PPT experience, the deployment of TDSQL in the two computer rooms needs to be deployed in the same computer room asynchronously and across computer rooms with strong synchronization. Therefore, a four-node model is adopted, distributed in 2 IDCs.

However, there is a trade-off in the "two-center" architecture. Only when it is deployed in the backup computer room and the failure is not the backup center, can automatic cross-IDC disaster recovery be realized. But if the backup center fails, in fact, in the asynchronous mode of the same computer room and strong synchronization across the computer room, whether it is deployed in the main computer room or the standby computer room, if a failure occurs, the majority election and automatic failover cannot be successfully completed, or strong Synchronous nodes cannot be promoted to form a majority, or the majority has a random room failure and malfunctions, requiring manual intervention. Therefore, in scenarios with high availability requirements, a 7*24-hour high-availability deployment architecture such as "two locations and three centers" is generally recommended.

Summary of standardized high-availability deployment solutions

Finally, we summarize today’s sharing:

● 1. for cross-city disaster recovery, it is generally recommended to build an independent cluster mode in a remote location, and achieve synchronization through asynchronous replication. The main city and backup can be deployed in different ways, such as the main city with one master and three backups, and the backup city with one master and one backup can be combined flexibly.

● The best solution for live network operation is to add a disaster recovery center with three centers in the same city, followed by a two-site three-center architecture that is standard in the financial industry. Both of these architectures can easily implement abnormal automatic switching of data centers.

● If there are only two data centers, it is impossible to automatically switch any data center abnormally, and some trade-offs are needed.

Not only for the database system, any high-availability system needs to be based on the deployment architecture considerations, this is all this sharing, thank you all.


Q: How long does it take to switch between active and standby in the same city?

A: Within 30 seconds.

Q: Is the main city of two places and three centers set up as a cascade?

A: This question is very good. From the perspective of the main city, this is obviously a cascading relationship. The data is first synchronized from the master of the main city to the slave of the main city, and then synchronized to the master of the secondary city through the slave of the main city. Data is passed down layer by layer.

Q: Will strong synchronization wait for SQL playback?

A: Don't wait, as long as the IO thread pulls the data. Because the binlog based on the row format is idempotent, we have proved it to be reliable through a large number of cases. In addition, the increase of apply will increase the average time consumption and decrease the throughput. Finally, if there is a problem with apply, the TDSQL monitoring platform will immediately identify and alert the DBA to confirm the processing.

Q: The standby machine only saves binlog and does not play back. Can the performance keep up with the master?

A: The standby machine pulls binlog and playback binlog are two different sets of threads, called IO threads and SQL threads, and the two sets of threads do not interfere with each other. The IO thread is only responsible for downloading the binlog to the master, and the SQL thread only replays and pulls the local binlog. The previous question said that the strong synchronization mechanism does not wait for playback, not that the binlog of the standby machine will not be played back.

Q: The write storage nodes of the three centers in the same city are all in IDC1, so is the service delay in IDC2 very large?

A: The computer rooms in the same city now use optical fiber transmission, and the time consumption is basically less than 1 millisecond. There is no need to worry about this access time consumption. Of course, if the equipment room facilities are relatively outdated, or the network links between the distances are extremely unstable, some disaster tolerance capabilities may need to be sacrificed in order to pursue excellent performance.

Q: One master and two backups. Is there a VIP method for failover of SQL engine?

A: Of course, multiple SQL engines are bound to load balancing equipment, and the business accesses TDSQL through VIP. When the SQL engine fails, load balancing will automatically kick it out.

Q: Isn't this a library for each of the three businesses?

A: No, all three businesses are written to the main library. The SQL engine will be routed to the main database, one master and two backups. TDSQL emphasizes that only one master provides services at any time, and the standby machine only provides read services but not write services.

Q: What are the network requirements for multiple SETs in the same city for multiple copies in the same city?

A: Delay within 5 milliseconds.

Q: If two strong synchronization masters and slaves can set one of them to return?

A: The default mechanism of TDSQL strong synchronization is to wait for a strong synchronization standby machine to respond.

Q: If the middle node is down, will the remote node automatically connect to the main node?

A: Of course it will.

Q: What are the advantages of strong synchronization compared with semi-synchronous replication?

A: Comparing strong synchronization with semi-synchronous replication, the most intuitive understanding may be asked, doesn't strong synchronization mean changing the semi-synchronous timeout period to infinite? In fact, this is not the case. The key to TDSQL strong synchronization here is not to solve the problem of the standby machine response, but to solve this increase in the mechanism of waiting for the standby machine, how to ensure high performance and high reliability. In other words, if the performance is not modified on the basis of native semi-synchronization, and only the timeout period is changed to infinite, the performance and asynchrony ratio will actually not reach even half of the asynchrony. This is also unacceptable in our opinion. It is equivalent to sacrificing a large part of performance for data consistency.

The performance of TDSQL strong synchronous replication is based on a lot of optimization and improvement on the basis of native semi-synchronization, making the performance basically close to asynchronous, and can achieve zero data loss-multiple copies, which is a feature of DSQL strong synchronization. In the last live broadcast, we introduced how TDSQL achieves high performance and strong synchronization. For example, after a series of thread asynchronization, the thread pool model is introduced, and a thread scheduling optimization is added.

Q: Which one is used in the arbitration agreement?

A: Majority election.

Okay, if you have other questions in the future, welcome to communicate in our technical exchange group. Today's live broadcast is over here, goodbye. thank you all!

Reference: https://cloud.tencent.com/developer/article/1604247 10.years of verification, Tencent database RTO<30s, RPO=0 high-availability solution for the first time panorama revealed-Cloud + Community-Tencent Cloud