Interpretation of the architectural principles of Tencent's distributed database TDSQL financial-level capabilities

In the era of distributed technology, why do financial customers keep driving the domestic database industry forward?

To help developers better understand and learn distributed database technology, in March 2020 Tencent Cloud Database, the Yunjia (Cloud+) Community and the Tencent TEG Database Working Group jointly launched a three-month online technical salon on domestic databases, "All the secrets of domestic databases you want to know are here!", inviting dozens of senior database experts from the "Goose Factory" (Tencent) to explain the core architecture, technical implementation principles and best practices of Tencent's three self-developed databases, TDSQL, CynosDB/CDB and TBase, every Tuesday and Thursday evening. March is TDSQL month, and this article is a review of the first live session, "Tencent's self-developed distributed database TDSQL: core architecture and feature breakdown".

Hello everyone. The topic I am sharing today is an interpretation of the core architecture of the distributed database TDSQL, a product developed in house that evolved out of massive billing scenarios.

Today’s sharing is divided into five parts:

Chapter 1, Product Introduction and Applicable Scenarios

Chapter 2, TDSQL architecture analysis and module introduction

Chapter 3, Data Consistency Guarantee

Chapter 4, Distributed TDSQL Practice

Chapter 5, Data Synchronization and Backup

1. What is TDSQL: How does Tencent build a financial-grade distributed database

Let's first understand the TDSQL product and its applicable scenarios. The first chapter includes four aspects: usage scenarios, development history, core features, and compatibility.

First of all, TDSQL is an autonomous, controllable, strongly consistent distributed database product launched by Tencent that is compatible with MySQL. Two points deserve emphasis here: first, TDSQL is fully compatible with the MySQL protocol while remaining fully self-developed and controllable, with strong data consistency; second, TDSQL is distributed, with flexible scalability and a high-availability architecture. In the Internet industry, scenarios with massive user traffic are very common. If the database cannot scale out, it is hard to cope with sudden traffic surges such as large e-commerce promotions or red-envelope grabbing during Spring Festival; all of these scenarios test a database's ability to handle massive user traffic.

At present, TDSQL serves more than 500 financial and government-enterprise customers, covering banking, insurance, securities, government affairs, Internet finance and other fields.

Let's look at the past and present of TDSQL. TDSQL can be traced back to 2002. It was not called TDSQL then; it was the database service of Tencent's billing platform department, built on open-source MySQL. From 2002 to 2007, as the company's business grew, Tencent faced increasing pressure from its user base. At that point we proposed a 7×24 no-downtime high-availability design to ensure the database could provide uninterrupted service around the clock. Tencent's value-added businesses were growing ever larger, the business became increasingly sensitive to data, and the requirements for data availability kept rising; we even had to guard against unusual scenarios such as an operator's optical fiber being cut.

From 2007 to 2012, roughly the five years in which the Internet rapidly evolved into the mobile Internet, the company's business also advanced by leaps and bounds, and we began to productize this highly available database. By 2012 the prototype of TDSQL was ready, and as an internal product it began to provide financial-grade data consistency and reliability services within the company.

By 2012, TDSQL had matured within Tencent and become a well-known internal product, but it had not yet been commercialized. 2014 brought a good opportunity: the establishment of WeBank. WeBank looked at TDSQL during database selection, and after repeated testing and verification found that TDSQL already fully met its requirements for data availability and consistency. Taking this opportunity, TDSQL went into production at WeBank and became its only database, covering the bank's core business.

So in 2014, TDSQL completed commercialization and achieved its first privatized (on-premises) deployment. After 2014, TDSQL was rolled out to many banks and financial institutions, drawing on the valuable experience of its successful implementation at WeBank in 2014.

During the 2014 WeBank deployment we stepped into many pitfalls, came to understand the various challenges of privatized deployment environments, and overcame them one by one. After the privatized deployment was completed in 2014, TDSQL went onto the public cloud in 2015, and we continued to polish the product through public cloud services.

So from an internal product in 2012, to privatized deployment in 2014, and then to public cloud deployment in 2015, TDSQL gradually moved from an internal product into the industry and became a commercial database officially offered to the outside world. From 2015 to 2019, TDSQL was rolled out to many banks and financial/government-enterprise customers. One important point, however: although we served many banking and financial customers, there is a hard-to-move piece of cake in the banking field, the bank's traditional core system. Traditional core-system databases have long been monopolized by foreign commercial databases such as Oracle and DB2, and it is very hard for a distributed database like TDSQL to break in.

In 2018 we noticed that Zhangjiagang Rural Commercial Bank needed to replace its core system, established contact, and reached a successful cooperation. Finally, in 2019, we successfully applied Tencent's distributed database to the bank's traditional core system. Zhangjiagang Rural Commercial Bank thus became the first bank in the country to run its traditional core system on a distributed database. Distributed databases are no longer limited to trials in peripheral systems such as a bank's Internet core or Internet banking; they have truly cut into the heart of the banking system, the traditional core. This is a milestone event for the domestic database field.

Therefore, going forward we will continue to push into more complex and newer core business systems and keep polishing our product.

This is the development history of TDSQL.

2. TDSQL core features: extremely challenging "four highs" service and safe operation and maintenance

Let's look at the core features of TDSQL.

First, as a database adapted to financial scenarios, strong data consistency is the foundation of its existence, because data can be neither lost nor wrong. In a financial scenario you cannot estimate the impact in advance: a wrong record might be worth one cent or one hundred million. So strong data consistency is our most fundamental feature: no loss and no errors is the minimum requirement for the database.

The second is financial-grade high availability. TDSQL guarantees availability above 99.999% and supports cross-IDC, multi-data-center, same-city multi-active deployment. We entered the financial scenario first because its challenges are the biggest: China's financial industry faces the strictest regulatory requirements and demands extremely high availability, reliability and consistency of data and services. Requiring 99.999% availability means the database cannot be down for more than about five minutes across the whole year.
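
As a quick sanity check on that figure: a year has 365 × 24 × 60 = 525,600 minutes, and 99.999% availability allows 525,600 × (1 − 0.99999) ≈ 5.3 minutes of downtime per year, which is where the "about five minutes" comes from.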

The third is high performance and low cost. Companies in the Internet age run massive businesses on massive fleets of machines; a mere 10% performance gain may save hundreds or thousands of machines, which is a considerable economic benefit. So high performance at low cost is also a key feature of TDSQL.

The fourth point is linear horizontal scalability. Whether for the Internet or other enterprises, as digitalization develops, for example when traffic surges suddenly or a campaign launches, a single machine's carrying capacity increasingly becomes the bottleneck. So we provide horizontal, linear scaling: when one machine's load, disk or other resources can no longer hold the data, it can be split across multiple machines without much extra thought, and throughput and capacity grow automatically.

The fifth point is enterprise-grade security. Financial data is sensitive, and some sensitive financial data requires a higher level of protection on top of the database itself, such as a database firewall, transparent encryption, and so on.

The sixth point is convenient operation and maintenance. In privatized deployments, the customer's network and deployment environments are often isolated from us; if a bank customer has a problem, we cannot jump in immediately to help. So TDSQL needs a complete set of supporting facilities that are simple and easy to use, can automatically help users locate and resolve problems, and minimize the complexity of operation and maintenance.

3. TDSQL core architecture

Next we will go through the architecture and module division of TDSQL. This chapter lets us cut into the technical details: why it is designed this way, what the benefits of the design are, and how such an architecture achieves high availability, linear scalability and other capabilities.

1. TDSQL system overview

1.1 Resource Pool

Read this picture from the bottom up. The bottom layer is the resource pool, which belongs to the IaaS layer; it can consist of physical machines or virtual machines. As long as machines are added to TDSQL, TDSQL manages database instances on top of this machine resource pool. Physical machines are recommended here: adding a layer of virtualization inevitably introduces some hidden risks in stability and performance.

1.2 Storage node

Above the resource pool are the storage nodes. Here we should emphasize TDSQL's two storage forms: one is the Noshard database, the other is the distributed database (also called the Shard version of TDSQL). Simply put, Noshard is the single-instance version of TDSQL; it makes a series of modifications and improvements on top of MySQL to support TDSQL features such as high availability, strong data consistency and 7×24 automatic failover. The second form is the distributed database with horizontal scalability. So TDSQL actually presents two forms to the outside world, one non-distributed and one distributed. As for the differences between the two forms, and which scenarios suit which, there is a dedicated section later.

1.3 Compute Node

Now look at the compute nodes. A compute node is TDSQL's computing engine; it separates the computing layer from the storage layer. The computing layer mainly handles SQL processing such as lexical analysis, syntax analysis and SQL rewriting; for the distributed database it also coordinates distributed transactions. So the computing layer stores no data and only performs real-time computation on SQL, which makes it CPU-intensive. In addition, the TDSQL compute node also has some OLAP capability and can optimize complex computation, deciding, for example, when to push work down to the storage engine layer and when to aggregate it at the computing layer. That is what the compute node does.

1.4 Chitu Operation Management Platform

Further up is the Chitu operation and management platform. If we compare everything below it to a black box, we want a user interface to manipulate that black box, and this interface is the Chitu platform. Through it the DBA can operate the TDSQL backend, so it is effectively a web management system. All DBA operations can be done from the interface without logging into the backend, and without caring which machine is a compute node or a storage node, or how to add nodes, remove nodes or move a node elsewhere; all of this can be done through the interface. Operating through the interface is far less error-prone; logging into the backend makes mistakes easy, and accidentally restarting a machine may cause a real impact.

1.5 "Bianque" Smart DBA Platform

Besides Chitu, why is there also a "Bianque" intelligent DBA platform? Under normal circumstances the machines are fine, but a machine may fail, a disk may develop bad blocks, or IO performance may degrade over time; an SSD in particular ages, and in its later life throughput and IOPS may drop to some extent, slowing the database's response. If a DBA wants to investigate such a situation, they must first find which instance is affected, which machine is involved, what is wrong with that machine, and check the machine's health; all of this is mechanical work. With the Bianque intelligent platform, when a failure occurs the cause can be analyzed automatically: for example, what made SQL slow down, why an active/standby switch happened, or whether IO suddenly became abnormal or some other reason caused the machine to malfunction.

In addition, the Bianque intelligent DBA platform has an intelligent diagnosis system that the DBA can trigger periodically to diagnose instances. For example, some database instances run with a high CPU all year round, which is usually caused by poorly written SQL. Bianque can easily inspect a user instance, produce a health report and score it: if the instance's CPU is overused and needs expansion but has not been expanded, points are deducted; if indexes on certain tables are missing, points are deducted; and so on, generating a diagnostic report. With Bianque and the Chitu operation and management platform, the DBA's job becomes very easy: a few clicks a day may solve a whole series of troubles, including high availability, performance analysis, lock analysis and more, freeing the DBA from tedious work.

We can also see several smaller modules here. The scheduling system is mainly responsible for overall resource scheduling, such as adding and deleting database instances, handling expiry and invalidation, capacity scheduling (expansion and shrinkage), and multi-tenant management. In other words, it is the scheduler of the entire management console.

There is also a backup system, the cold-backup center, which has a dedicated section later, so I will not repeat it here. In addition we provide some auxiliary service modules, such as auditing and migration services between databases: how TDSQL helps external databases move in, or how data can be moved out of TDSQL. There are also security modules such as data verification, data subscription, a SQL firewall and injection detection; these auxiliary modules help our DBAs or users meet personalized and richer needs.

The above is an overview of the TDSQL system.

2. TDSQL architecture module and its characteristics

Let's look at the core architecture. This diagram is a condensed version of the previous figure, keeping only the core modules.

First, the user's request reaches the SQL engine through load balancing. The SQL engine, as the computing/access layer, fetches data from the backend storage nodes according to the SQL. Whether it is the SQL engine or the backend database instances, there is metadata to manage scheduling: for example, the computing engine needs a route, and the route tells the SQL engine which backend data node a SQL statement should be sent to, and whether it should go to the primary node or a standby node. So we introduced ZooKeeper (ZK) to store routing and similar metadata. Of course, ZK only stores the metadata statically; to maintain and manage this metadata, a set of scheduling and interface components is also needed, namely OSS and Manager/Scheduler. So in this picture TDSQL is divided into three parts overall: management nodes, compute nodes and storage nodes. There are also auxiliary modules that help with personalized requirements, such as backup, message queues and data migration tools. Finally, the load balancer here is not mandatory; users can choose their own hardware load balancer or LVS-style soft load balancing, customized to the actual scenario.
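
To make the request path concrete, here is a minimal sketch, with hypothetical names (`RouteEntry`, `pick_backend`) and not TDSQL's actual code, of how a compute-layer proxy might use the routing metadata cached from ZK to pick a backend: writes always go to the shard's primary, while reads may be offloaded to a standby.

```python
# Hypothetical routing entry as a compute-layer proxy might cache it from ZK.
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class RouteEntry:
    shard_id: int
    primary: str                                   # "host:port" of the master node
    standbys: List[str] = field(default_factory=list)

    def pick_backend(self, is_write: bool, read_from_standby: bool = True) -> str:
        # Writes must go to the primary; reads may use a standby for read/write separation.
        if is_write or not (read_from_standby and self.standbys):
            return self.primary
        return random.choice(self.standbys)

route = RouteEntry(0, "10.0.0.1:4002", ["10.0.0.2:4002", "10.0.0.3:4002"])
print(route.pick_backend(is_write=True))    # always the primary
print(route.pick_backend(is_write=False))   # one of the standbys
```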

After understanding the overall architecture, we will look at the characteristics of each node: how much it depends on the machine, what characteristics it requires of the hardware, and so on.

2.1 Management module: easily manage the entire database backend through the web interface

First let's look at the management module. A cluster only needs one set of management modules, and those machines can generally be reused. The management module also has relatively low hardware requirements; when resources are tight it can even run on virtual machines. Internally, a single set of management modules manages our largest single cluster of nearly ten thousand instances.

The management module includes the key modules mentioned above: Zookeeper (ZK), Scheduler, Manager, OSS, the monitoring collection program, and the Chitu management console. How do they work together? First, the DBA clicks a button on the Chitu console, that is, the web front end initiates an operation, say expanding an instance; this button sends an HTTPS request to the OSS module. The OSS module works a bit like a web server: it receives web requests and forwards them into ZK. So the OSS module is the bridge from the front end to the back end; with it, the whole set of backend working modules can be connected to the front-end web interface.

After OSS receives this request, it creates a task node on ZK. The task node is picked up by the scheduling module, which processes the task and writes the result back to ZK. OSS then picks up the result on ZK and, in response to the original HTTPS request, queries the task and finally returns the result to the front end.

This is a purely asynchronous process, but with this set of management modules we can easily manage the whole TDSQL backend through the web interface. Throughout the process, a monitoring and collection module gathers the audit records and status of the entire flow.
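
The following is a minimal sketch of this kind of asynchronous task handoff, using the kazoo ZooKeeper client; the ZK paths, payloads and the single-process layout are made up for illustration and are not TDSQL's real protocol (it also assumes a ZK server reachable at 127.0.0.1:2181).

```python
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# --- OSS side: turn an HTTPS request ("expand instance set_1") into a task node ---
task_path = zk.create("/tdsql_demo/tasks/task-",
                      json.dumps({"op": "expand", "instance": "set_1"}).encode(),
                      makepath=True, sequence=True)

# --- Scheduler side: pick up the task, do the work, write the result back to ZK ---
data, _ = zk.get(task_path)
task = json.loads(data)
zk.create(task_path + "/result",
          json.dumps({"task": task, "status": "done"}).encode())

# --- OSS side again: poll the result node and answer the original HTTPS request ---
print(json.loads(zk.get(task_path + "/result")[0].decode()))
zk.stop()
```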

2.2 DB module: non-destructive upgrade of the database

The DB module, namely the data node: data access is an IO-intensive service, so the data node, which is our storage node, has higher IO requirements. SSDs are generally recommended, preferably PCI-E SSDs, because for a database service, no matter how fast the CPU is, if IO cannot keep up it is still a small horse pulling a big cart; with only 1,000 IOPS the CPU simply cannot be put to work. So we generally recommend at least 10,000 IOPS or more.

Let's look at the concept of a SET. A SET is a database instance; by default a SET contains one master and two standbys, that is, one Master node and two Slave nodes. There is also an Agent module on each DB node. While MySQL is running, we need to monitor its behaviour and perform operations on it. Why not build these things into MySQL? There is a problem: upgrading the data node may involve a restart, and a restart affects the user's business and the service. So we added an Agent module alongside MySQL to carry out the cluster's operations on MySQL and report MySQL's status. With it, most upgrades of TDSQL data nodes become upgrades of the Agent, and upgrading the Agent has no impact on the business, which achieves a lossless upgrade. In contrast, we do not frequently upgrade the data node's MySQL; generally we leave it alone for six months to a year. This is our DB module, which is also the storage node.

2.3 SQL engine module: distributed complex SQL processing

Next, another important module: the SQL engine. The SQL engine sits at the computing layer and is CPU-intensive, so when selecting hardware we ask for a relatively strong CPU. The second requirement is memory: as the computing/access layer it has to manage connections, and a large number of short or long connections takes a lot of memory, so it has high CPU and memory requirements. On the other hand, it stores no data itself and has no primary/standby distinction, so it does not need much disk.

Let's look at the characteristics of the SQL engine. The SQL engine first pulls metadata from ZK; as a SQL engine it also handles permission verification, read/write separation, statistics collection, protocol emulation and other related work.

Some people may ask: isn't this SQL engine just a kind of middleware? Not quite. If the SQL engine were mere middleware, it could be detached from MySQL. But our SQL engine does lexical and syntax analysis and works as a query engine. Moreover, in a distributed scenario its complexity really shows: it has to handle distributed transactions and maintain a globally unique auto-increment field so that multiple storage nodes share one sequence; in the distributed form it must also restrict some syntax during lexical and syntax analysis; and for complex computation it has to push SQL down to the shards and aggregate the final results. So the SQL engine is a fairly complex module; as a computing layer, it is not as simple as plain middleware.
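
As an aside, one common way for a proxy layer to hand out globally unique auto-increment values across shards is to reserve ranges ("segments") from a shared counter, so that most allocations need no cross-node coordination. The sketch below illustrates that generic idea only; it is not a description of TDSQL's actual mechanism, and all names are hypothetical.

```python
import threading

class SequenceAllocator:
    """Hands out globally unique ids by reserving ranges from a shared counter."""
    def __init__(self, fetch_next_range, batch=1000):
        self._fetch = fetch_next_range   # atomically bumps a counter kept in shared metadata storage
        self._batch = batch
        self._lock = threading.Lock()
        self._next = self._end = 0

    def next_id(self) -> int:
        with self._lock:
            if self._next >= self._end:              # local range exhausted: reserve a new one
                self._next = self._fetch(self._batch)
                self._end = self._next + self._batch
            value = self._next
            self._next += 1
            return value

# Stand-in for the shared counter (in reality it would live in metadata storage).
_counter = {"v": 0}
def fetch_range(batch):
    start = _counter["v"]
    _counter["v"] += batch
    return start

seq = SequenceAllocator(fetch_range, batch=1000)
print([seq.next_id() for _ in range(3)])   # [0, 1, 2]
```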

3. TDSQL financial-level features: strong data consistency guarantee

Earlier we learned about the overall architecture and core features of TDSQL. Next we will focus on its most important feature, the strong data consistency that is indispensable in financial scenarios. We will talk about the data consistency guarantee from four aspects:

1. Primary and standby data replication methods

2. Data replication comparison: TDSQL primary and backup data replication solution VS MySQL native solution

3. Core function: Disaster tolerance switching, strong data consistency, zero loss and zero error

4. Strong data consistency

1. TDSQL primary/standby data replication: high performance and strong synchronization

Before talking about data consistency, let's first understand MySQL's native data replication modes.

The first is asynchronous replication: the master returns success to the client without waiting for any response from the standby. This is unacceptable in financial scenarios; in this mode the data effectively has no multi-copy protection.

The second is semi-synchronous replication: the master waits for the standby's response under certain conditions, but if the response does not arrive in time it still returns success to the business, which means it eventually degrades to the asynchronous mode. This is also unacceptable in financial scenarios.

In addition, native semi-synchronization has a performance defect: under cross-IDC network jitter, request latency spikes severely. So native asynchronous and semi-synchronous replication both have problems and cannot fully fit financial scenarios.

TDSQL introduces strong synchronous replication based on the Raft protocol: after the master receives a business request, it must wait for at least one standby to acknowledge success before returning success to the client; otherwise the request is not answered. Strong synchronization is one of TDSQL's most fundamental features and is key to guaranteeing that data is never lost and never wrong.
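
The following is a conceptual sketch, not TDSQL internals, of the three acknowledgment policies discussed here. `write_local` stands for executing the transaction and flushing the binlog on the master; `send_to_standbys` ships the binlog event and returns one future per standby; both names are made up.

```python
import concurrent.futures, time

def commit(write_local, send_to_standbys, mode, timeout=None):
    write_local()
    futures = send_to_standbys()
    if mode == "async":
        return "ok"                                   # reply without any standby ack
    done, _ = concurrent.futures.wait(
        futures, timeout=timeout,
        return_when=concurrent.futures.FIRST_COMPLETED)
    if done:
        return "ok"                                   # at least one standby acknowledged
    if mode == "semi-sync":
        return "ok (degraded to async)"               # native semi-sync times out and degrades
    return "still waiting"                            # strong sync: never answer without an ack

pool = concurrent.futures.ThreadPoolExecutor(2)
ship = lambda: [pool.submit(time.sleep, 0.01), pool.submit(time.sleep, 0.02)]
print(commit(lambda: None, ship, mode="strong-sync", timeout=1))   # "ok"
```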

At this point some students may ask: isn't your strong synchronization simple? Isn't it just semi-synchronization with the timeout changed to infinite and the number of acknowledging standbys set to 1? It is not. The key to TDSQL's strong synchronization is not merely waiting for the standby's acknowledgment, but how to keep high performance and high reliability after adding this wait. In other words, if you only change the timeout to infinite on top of native semi-synchronization without touching performance, throughput will not even reach half of the asynchronous level. That is unacceptable to us; it would sacrifice a large part of performance for data consistency.

TDSQL's strong synchronous replication makes many optimizations and improvements on top of native semi-synchronization, bringing its performance basically close to that of asynchronous replication.

So the strong synchronization emphasized here also has high-performance characteristics; to be precise, it is high-performance strong synchronization.

So how do we achieve high-performance strong synchronization? Let's continue. TDSQL reworked MySQL's master-slave replication mechanism. First, we introduced a thread pool model.

Native MySQL uses one thread per connection, so the operating system's resource consumption is high: 5,000 connections mean 5,000 threads, and with 10,000 connections and 10,000 threads the operating system can hardly cope. Moreover, the native replication path is quite serial: after a user request is issued, the thread waits for the standby's acknowledgment, and during this time the user thread can do nothing; it can return to the front end only after the standby responds. That is, a large number of threads sit in a waiting state, which is the root cause of semi-synchronization's low efficiency.

After introducing the thread pool model, we also had to consider how to schedule it. We want MySQL to keep a small number of working threads: with 10,000 user connections, the threads actually doing work may number only 50 or 100. A connection that is waiting for an IO response can go to sleep first so that it does not affect throughput. So TDSQL modified native replication: besides the thread pool model, it adds several groups of dedicated threads. What do these threads do? When a user request comes in, completes the write and flushes the binlog, the second step is to wait for the standby's acknowledgment. At this point the newly introduced thread group takes over, the user session is kept, and the worker thread is released to go on processing other connections' requests. Once the standby has acknowledged, another thread group wakes the session up and replies to the client that the commit succeeded.

So after such asynchronization, the previously serial flow is made asynchronous, and we obtain performance close to asynchronous replication. This is the core of TDSQL's strong synchronization rework. We emphasize again: it is not just strong synchronization, but strong synchronization with near-asynchronous performance.
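
The sketch below is a rough illustration of that asynchronized commit path, using made-up queue and session structures rather than TDSQL's implementation: the worker thread no longer blocks on the standby's acknowledgment; a dedicated thread group waits for acks and another group answers the clients.

```python
import queue, threading, time

pending_acks = queue.Queue()     # sessions whose binlog has been shipped to the standbys
to_respond   = queue.Queue()     # sessions for which a standby ack has arrived

def worker(session):
    session["binlog_flushed"] = True     # 1. execute the write and flush the binlog
    pending_acks.put(session)            # 2. park the session and free this worker thread
    # the worker immediately goes back to serving other connections

def ack_waiter():
    while True:
        session = pending_acks.get()     # 3. block here instead of in the worker thread
        session["acked"] = True          #    (pretend the standby ack just arrived)
        to_respond.put(session)

def responder():
    while True:
        session = to_respond.get()       # 4. wake the session and answer the client
        print(f"client {session['client']}: commit OK")

threading.Thread(target=ack_waiter, daemon=True).start()
threading.Thread(target=responder, daemon=True).start()
worker({"client": "conn-42"})
time.sleep(0.1)                          # give the daemon threads time to run in this demo
```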

Let's look at the performance comparison after the rework: asynchronous replication reaches roughly 60,000 TPS with average latency of 10 milliseconds or less. Native semi-synchronization clearly loses about two thirds of that performance, and its latency fluctuates considerably, for example under IDC network jitter. In strong synchronous one-master-two-standby mode, performance is close to asynchronous and there is no extra increase in latency.

Because there is one master and two standbys, unless both standby data centers' networks jitter at the same time, the latency of strong synchronization does not fluctuate significantly. So TDSQL's strong synchronization achieves multiple copies of the data while preserving performance.

With this multi-copy guarantee, how do we achieve automatic failover without losing data? This still requires a set of switchover and election procedures, which leads to TDSQL's disaster recovery switchover.

2. Automatic disaster tolerance switching: strong data consistency, zero loss and zero error

Automatic disaster tolerance switching has become very easy to implement based on the strong synchronization feature. Let's take a look at this structure diagram first:

The SQL engine sends requests to the master node, and the master node replicates synchronously to the two standbys; each node has a corresponding Agent reporting its current status. When the master node fails, the Agent detects it and reports it to ZK, where it is picked up by the Scheduler. The Scheduler first demotes the failed master, turning it into a slave; at this point the whole cluster consists only of slaves, with no master. The two surviving standbys then report their latest binlog points. Because of the strong synchronization guarantee, at least one of them must hold the newest binlog, so once both report their points the Scheduler knows exactly which node's data is newest and promotes that node to master. For example, if Slave1's data is found to be the newest, the Scheduler modifies the route and sets the master to Slave1, completing the active/standby switchover. With this switching mechanism, the whole switchover needs no human intervention and the data before and after the switchover is completely consistent. To summarize: precisely because of the strong synchronization guarantee, when the master fails we can always find the newest data on some standby node and promote that node to master. That is the whole process of TDSQL disaster recovery switchover; it has to be built on top of strong synchronization.
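
A simplified sketch of the election step just described follows; the data structures are illustrative only. After the failed master is demoted, the scheduler compares the binlog positions reported by the surviving standbys and promotes the most up-to-date one.

```python
def elect_new_master(standby_reports):
    """standby_reports: {node_name: (binlog_file_index, binlog_offset)}"""
    # Strong synchronous replication guarantees that at least one standby holds every
    # transaction ever acknowledged to a client, so the maximum position wins.
    return max(standby_reports, key=lambda node: standby_reports[node])

reports = {"slave1": (12, 40960), "slave2": (12, 38021)}
new_master = elect_new_master(reports)
print(new_master)   # slave1 -> the scheduler rewrites the route to point at it
```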

3. Guarantee of data consistency in extreme scenarios

Having covered disaster recovery switchover, let's talk about the follow-up handling of the failed node. There are several situations. One is a fault the node can survive: for example, the data center loses power and power is restored immediately, or the machine restarts unexpectedly for hardware reasons; after restart the node can be brought up again, and we want it to rejoin the cluster quickly. In this case the data is actually not lost, and we do not want to rebuild a full copy and wipe everything the node previously had; instead, the last synchronized position should serve as a breakpoint and only subsequent data should be transferred. With these questions in mind, let's look at the recovery process after a node failure.

Consider a scenario: node A is the master, B and C are slaves, and A's data is being synchronized normally: A+1, A+2, and then A+3 should follow. Before A+3 has been synchronized to any slave, a failure occurs. Based on the data of B and C, C has the newest data, so C is elected as the new master and then continues to synchronize data to B. After a while, node A is brought up and can rejoin the cluster. After rejoining, we find that the request A+3 was never acknowledged by B or C, yet it was written to A's log. A's data is therefore problematic, and we must roll back this A+3 that was never confirmed by a standby, so that it can never be synchronized to other nodes. So the emphasis here is on rolling back such data: every time a node joins as a slave, we check its data and roll back any surplus writes. Of course, this assumed we were lucky and node A could be restarted. Sometimes node A dies and the machine cannot be brought up at all; then we need to replace the node with a new machine, say adding node D. We want it to rebuild its data quickly with as little impact as possible on current online business. How? By physical copy, of course, and it pulls data from a standby, so the primary node is not affected and the data can be rebuilt quickly.
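
A small sketch of that rejoin check, purely illustrative and not TDSQL internals: when a failed node comes back as a slave, any binlog events beyond the point confirmed by the new master must be rolled back before replication resumes.

```python
def rejoin(local_binlog, master_confirmed_pos):
    surplus = [e for e in local_binlog if e["pos"] > master_confirmed_pos]
    for event in reversed(surplus):          # undo A+3-style unacknowledged events, newest first
        print(f"rolling back unconfirmed event at pos {event['pos']}")
    return [e for e in local_binlog if e["pos"] <= master_confirmed_pos]

local = [{"pos": 100}, {"pos": 101}, {"pos": 102}]   # pos 102 is the unacknowledged "A+3"
print(rejoin(local, master_confirmed_pos=101))        # resumes replication from pos 101
```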

4. Distributed TDSQL in practice

In the fourth part we start to talk about distributed TDSQL in practice. The first three chapters covered TDSQL's high availability and strong consistency, which are necessary features for financial scenarios; distributed capability has not yet been involved. Once we talk about distribution, TDSQL's other form opens up. Next, let's discuss how distributed TDSQL differs from single-node TDSQL, how the same series of guarantees is achieved under the distributed architecture, and how it stays transparent and imperceptible to the business.

1. Table sharding

Table sharding. In stand-alone mode, the logical table a user sees is one physical table stored on one physical node (physical machine). In the distributed form, the physical storage behind the logical table the user sees may be broken up and distributed across different physical nodes. The goal of TDSQL table sharding is to be completely transparent to the business: the business sees only a complete logical table and does not know that TDSQL has actually split it evenly across the physical nodes. For example, the data that used to sit on one machine is now evenly distributed over five machines, yet the user is unaware of it. That is the goal TDSQL wants to achieve: in the user's eyes it is one complete logical table, while in the background it has been split up.

How is this table split and distributed in the background? We want it to be transparent and shielded from users so they do not care about the details of data distribution. How then is the data distributed and split? This introduces a concept: the shardkey, TDSQL's sharding key, the field by which TDSQL scatters the data.

We believe the shardkey is a very natural field: data is naturally scattered by some field. For example, within Tencent we like to use the QQ number or WeChat account as the shardkey so data is automatically spread by it; some bank customers prefer customer numbers, ID numbers or bank card numbers as the shardkey. In other words, the data is naturally scattered by such a field. We believe introducing a shardkey adds no extra work: users know their own data best and know which field distributes it most evenly; at the same time, giving users the right to choose the sharding key themselves helps achieve the best global performance of the distributed database.

Some may wonder: should the shardkey be the primary key, or simply as dispersed as possible? Indeed, as TDSQL's sharding key, the more dispersed the better, and it is required to be part of the primary key or of a unique index. Once the sharding key is determined, TDSQL can distribute data evenly by it. For example, in this picture, after sharding by a field, 10,000 rows are evenly distributed across four nodes.

Now that we know the shardkey is the sharding key, how is it actually used? Let's look at that next.

For example, we create table TB1 with several fields, such as ID, which from its name we can tell is a relatively scattered value. Here "ID" is used as the distribution key, so the six rows are evenly distributed to two shards. Of course, after the data has been spread out, we require every SQL statement sent here to carry the shardkey whenever possible: with the shardkey, the statement can be routed to the corresponding shard; without it, the engine does not know which shard to send it to and has to send it to all shards. So we emphasize that SQL should carry the shardkey as much as possible; with it, SQL routing and distribution to the right shard can be achieved.
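
The routing behaviour just described can be pictured with the following sketch; the hash function, shard count and function names are made up for illustration and are not TDSQL's algorithm. A statement that carries the shardkey is sent to exactly one shard, while one without it has to be broadcast to every shard and the results gathered.

```python
NUM_SHARDS = 2

def shard_of(shardkey_value) -> int:
    return hash(str(shardkey_value)) % NUM_SHARDS

def route(sql: str, shardkey_value=None):
    if shardkey_value is not None:
        return [shard_of(shardkey_value)]          # single-shard routing
    return list(range(NUM_SHARDS))                 # scatter to all shards, gather results

print(route("SELECT * FROM tb1 WHERE id = 42", shardkey_value=42))   # e.g. [0]
print(route("SELECT COUNT(*) FROM tb1"))                             # [0, 1]
```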

Having talked about data sharding, let's look at data splitting.

2. Horizontal split

In a distributed deployment, all data may initially sit on one node. When that node hits a performance bottleneck and the data needs to be split, it is very simple with TDSQL: one button on the interface, "one-click expansion", and it splits the data automatically. The splitting process is also easy to understand: it is essentially copying and relocating data, because the data can be divided in half. For example, to split the first shard in two, the lower half of its data is copied to another node. The principle is simple, a data copy; the emphasis is that during the copy the business is not affected at all, and in the end the business is only frozen for about a second.

Why do we call it a second-level freeze? Because in the last step, spreading the data over two nodes involves a routing change: originally all requests for this range went to one shard, and now they must be divided, with the upper half going to one shard and the lower half to another. While the route is being switched, we want the data to be static, with no writes.

Of course, the route switch itself completes in milliseconds, so during a data split the real impact on the business is less than one second, and it is only triggered during the final route-switch freeze phase.
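
Here is a high-level sketch of that split flow under stated assumptions (in-memory lists stand in for shards, and the route table format is invented): copy the second half of the data to a new shard online, then briefly freeze writes only while the route is rewritten, so the business-visible pause stays sub-second.

```python
import time

def split_shard(source, target, route_table, split_point):
    # 1. Long-running, online step: copy/replay rows with key >= split_point to the target.
    target.extend(row for row in source if row["key"] >= split_point)

    # 2. Short freeze: writes are blocked only while the route is being rewritten.
    freeze_start = time.time()
    route_table["ranges"] = [
        {"max_key": split_point, "shard": "shard_1"},
        {"max_key": None,        "shard": "shard_2"},
    ]
    print(f"route switched, writes frozen for {time.time() - freeze_start:.4f}s")

shard1 = [{"key": k} for k in range(10)]
shard2, routes = [], {}
split_shard(shard1, shard2, routes, split_point=5)
```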

After data splitting, we move on to the hardest problem in distributed systems: distributed transactions.

3. Robust and reliable distributed transactions

Single-node transactions are easy to handle, but distributed transactions in a distributed scenario are considerably harder and must cope with a variety of complex situations.

In fact, implementing a distributed transaction is not the hard part; the hard part is guaranteeing its robustness and reliability so it can cope with all kinds of complex scenarios: active/standby switchovers, node crashes and so on. Under all these disaster-tolerance test conditions, how do we ensure the ledger stays balanced, with not a cent more and not a cent less? That is what distributed transactions must consider.

TDSQL's distributed transactions are implemented on the basis of standard two-phase commit, which is also the common approach in the industry. The SQL engine, as the initiator of the distributed transaction, coordinates the resource nodes to complete the distributed transaction processing.

Whether a transaction is distributed is also judged by the shardkey. Concretely, a transaction is initiated through the SQL engine. Suppose the first SQL updates the user information table for user ID A, and the second SQL inserts a row into a journal table for user ID A, and both tables use the user ID as the shardkey. Both SQLs then land on the same shard, so even though a transaction was opened, no distributed transaction is needed; it degenerates to a single-node transaction on a single shard. If, however, a transfer is involved, say from account A to account B, with account A on the first shard and account B on the second, then a distributed transaction is required and the SQL engine must coordinate the whole distributed transaction.
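
The decision just described can be sketched roughly as follows; the interfaces (`prepare`, `commit`, `rollback`) are stand-ins for XA-style calls and are not TDSQL's actual API. If every statement in a transaction hits the same shard, the proxy can fall back to an ordinary single-node commit; only a cross-shard transaction needs two-phase commit.

```python
def commit_transaction(statement_shards):
    shards = set(statement_shards)
    if len(shards) == 1:
        (shard,) = shards
        return f"single-node commit on shard {shard}"
    # Two-phase commit: prepare on every participant, commit only if all prepared.
    prepared = {s: prepare(s) for s in shards}
    if all(prepared.values()):
        for s in shards:
            commit(s)
        return "2PC commit"
    for s in shards:
        rollback(s)
    return "2PC rollback"

def prepare(shard):  return True      # stand-ins for XA PREPARE / COMMIT / ROLLBACK
def commit(shard):   pass
def rollback(shard): pass

print(commit_transaction(["shard_A", "shard_A"]))   # both SQLs hit one shard
print(commit_transaction(["shard_A", "shard_B"]))   # a transfer across two shards
```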

Distributed transactions follow a decentralized design: both the SQL engine and the backend data nodes are highly available and support linear scaling. Distributed transactions are complicated enough that they could fill a course of their own: what abnormal scenarios can occur during two-phase commit, how failures and timeouts are handled, how the transaction's final consistency is guaranteed, and so on. I will not go deeper here; I hope to share that content separately.

Therefore, here is only a summary of distributed transactions, we will not discuss its details:

First, they are based on two-phase commit. We have made many optimizations and bug fixes on top of MySQL's native XA transactions. For example, native XA can produce data inconsistency or loss during an active/standby switchover; TDSQL has made many fixes on that basis so XA transactions can guarantee data consistency.

Second, strong performance. When we first introduced native distributed transactions, their performance was less than half that of single-node transactions. After a series of optimization and tuning, the final performance loss is about 25%, meaning distributed transactions reach roughly 75% of single-node performance.

Third, transparency to the business: the business does not need to care whether a transaction is distributed or not; it just starts a transaction as usual.

The fourth is complete exception tolerance. Whether distributed transactions are robust also needs to consider the ability of fault tolerance.

The fifth is global lock detection. It is also indispensable for lock detection in a distributed environment. TDSQL provides distributed deadlock detection from a global perspective, which can clearly view the lock waiting relationship between multiple distributed transactions.

The sixth point is complete decentralization. Both the SQL engine and the data node support high availability and can scale linearly.

That is the summary of TDSQL distributed transactions. If users want the highest compatibility with MySQL, the Noshard version of TDSQL may be more suitable. But if a single node has hit its resource limits and there is no way to redistribute or expand under a single node, then Shard mode must be chosen. In Shard mode we do place certain restrictions on SQL; a later dedicated course will explain how distributed TDSQL restricts SQL.

We see that both Noshard and Shard have high availability, strong data consistency, and automatic disaster recovery capabilities.

At the same time, TDSQL supports migration from Noshard to Shard. The Noshard instance planned in the early stage may carry the business pressure for a while, but with a sudden increase in business it may no longer cope; then everything needs to move to Shard, and TDSQL has mature tools to help users migrate the data quickly.

5. TDSQL data synchronization and backup

Next is another auxiliary feature of TDSQL: data synchronization and backup.

1. TDSQL data synchronization component

The focus of data synchronization is divided into three scenarios:

The first scenario is data aggregation: synchronizing the data of multiple database instances into one. For example, customers in the insurance industry like to synchronize data from regional database instances across the country into a national database for statistical analysis.

The second is cross-city disaster recovery. In cross-city disaster recovery, the data of a distributed database in one city generally needs to be synchronized to a distributed database in another city, and the two may be heterogeneous. For cost reasons the disaster recovery site may have fewer machines and data centers: the primary city may run a very large sixteen-shard instance while the disaster recovery instance has only two physical shards. Synchronizing from sixteen shards to two shards is synchronization between heterogeneous instances, and that is when the data synchronization component is needed.

The third is migration. Migration between heterogeneous databases synchronizes data from TDSQL to MySQL, Oracle, PostgreSQL and other databases. So one mode is TDSQL to TDSQL, and another is TDSQL to other heterogeneous databases. For example, Zhangjiagang Rural Commercial Bank needed to replace its core system's traditional foreign commercial database with TDSQL, and some risk mitigation was still required. In the end we provided a solution using Oracle as the disaster recovery target for TDSQL: through the data synchronization component, TDSQL data can be synchronized to Oracle in near real time, so that in an extreme case the business could even be switched over to Oracle.

Of course, data migration also reflects TDSQL's open attitude: since users are allowed to migrate data into TDSQL, if one day a user decides TDSQL is not good enough and a better product should replace it, TDSQL also supports moving the data out.

2. TDSQL data backup

Finally, let's talk about TDSQL backup.

TDSQL supports online real-time hot backup, and the backup is taken on the standby machine. Backups include image backups and binlog backups, and image backups can be either physical or logical (also called physical backup and logical backup).

The advantage of physical backup is speed; it operates directly on physical files. The disadvantage is that it can only back up the whole database instance and cannot select specific tables. Logical backup works through SQL and can back up a single table, but for backing up a whole instance it is less efficient than physical backup. For example, with 1 TB of data of which only 100 MB is critical, there is no need for a full physical backup; a logical backup of just the tables we care about saves storage space.

With physical and logical image backups plus the binlog, fixed-point (point-in-time) data recovery is easy. For binlog backup, TDSQL's Agent module performs near-real-time asynchronous backup. For example, we take an image backup at 00:00 every day and back up binlog near-real-time throughout the day. To restore the data as of 06:00, we use the 00:00 image plus the binlog from 00:00 to 06:00.
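
As a sketch of that recovery flow, under the assumption of made-up paths, host and credentials: first the 00:00 image is restored into a fresh instance (how depends on whether the image is physical or logical, so that step is only noted in a comment), then the binlog is replayed up to the 06:00 target using the standard `mysqlbinlog --stop-datetime` option.

```python
BINLOGS = ["/backup/binlog.000101", "/backup/binlog.000102"]   # binlog written since 00:00

# Step 1 (not shown): restore the 00:00 image into a fresh MySQL instance,
# e.g. copy back a physical image or import a logical dump.
# Step 2: replay binlog from 00:00 up to the recovery target, 06:00.
replay_cmd = (
    "mysqlbinlog --stop-datetime='2020-03-01 06:00:00' "
    + " ".join(BINLOGS)
    + " | mysql -h 127.0.0.1 -P 4002 -u recover -p"
)
print(replay_cmd)   # in a real run this command would be executed on the recovery host
```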

Backups are done on the standby machine without affecting the master, and the whole backup process is monitored and alarmed so it can be tracked end to end.

Because the architecture covers a lot of ground, this session also serves as a preview of the whole TDSQL series; later sessions will go into detail on individual chapters of this talk. This sharing mainly aims to help you understand TDSQL's overall architecture and module division, how its key features are designed, and the considerations behind them. After listening to this session, the later topics will be easier to follow.

Okay, the above is all the content shared today, thank you!

6. Q&A

Q: There is no SQL engine module in TDSQL version 1.0, right?

A: As early as 2002 we used a stand-alone version of MySQL as the data access service; from that, TDSQL evolved into an architecture that separates computing and storage, and the SQL engine was introduced at that point.

Q: Does the MySQL of the storage node use the official native?

A: TDSQL has done a lot of tuning on the basis of native MySQL, such as thread pool, strong synchronization optimization, security restrictions, distributed transaction XA optimization, and so on.

Q: A bank's core system needs sharding across databases and tables. How are aggregate queries implemented during development?

A: The SQL engine shields the details of table sharding, so logically the business sees the same thing as in single-node mode, still an ordinary database table. The SQL engine also does data aggregation automatically, so business development does not need to worry about it.

Q: If A+3 is lost and nodes B and C were never synchronized with it, what happens? Suppose machine A can never be recovered.

A: If A+3 was not acknowledged by B or C, that is, it did not satisfy majority confirmation, success is never returned to the business; the business eventually receives a timeout error. If machine A cannot be recovered, a newly added node "clones" a standby via physical copy as fast as possible and replaces node A to provide service.

Q: Can the pt tools or gh-ost be used on the Shard version to add fields? Are there any restrictions?

A: The TDSQL management console provides online DDL, which automatically makes atomic changes across all shards, so the business does not need third-party tools. Distributed TDSQL does not allow the shardkey field to be changed by DDL: for example, if id is the shardkey, changing the shardkey to name is not allowed.

Q: How many shard algorithms does TDSQL have? Must the CREATE TABLE statement use modified syntax?

A: The TDSQL shard algorithm is shielded from the business: it is a hash split based on MySQL partition tables, and users cannot change the algorithm. This, too, is to hide the table-sharding details from the business. It does not mean users can only partition by hash: on top of TDSQL's sharding, users can do secondary partitioning (for example by date/time). The table-creation syntax and usage are covered in a later course on distributed development, so stay tuned.

Q: Who will solve the reliability of zk?

A: On the one hand, ZK itself is deployed for high availability across data centers; with an odd number of ZK nodes, as long as more than half of the ZK nodes in the cluster survive, ZK can keep providing service. On the other hand, even if all ZK nodes go down, the modules themselves are not strongly dependent on ZK: normal read and write requests to database instances are not affected at all; only scheduling-triggered operations such as switchover and expansion cannot be processed.

Q: The teacher just mentioned two modes. If Shard mode is used, does the application layer have requirements on SQL syntax? Or did I mishear?

A: It is compatible with 99% of MySQL's SQL syntax. The main difference between shard mode and noshard mode is the introduction of the shardkey. After introducing the shardkey, to take full advantage of shard mode's performance, it is recommended that all SQL carry the shardkey. In shard mode, some advanced database features such as stored procedures, triggers and views are subject to certain usage restrictions. More detail will come in the dedicated distributed development course.

Q: Data from branches is synchronized and aggregated to the head office. Does synchronization between Oracle and TDSQL work in both directions?

A: Two-way synchronization is supported, but with conditions: near-real-time synchronization from TDSQL to Oracle is available now, while near-real-time synchronization from Oracle to TDSQL is not yet available and will be supported in the future.

Q: How do a distributed table and a non-distributed table do a join?

A: Distributed TDSQL has two kinds of tables: large tables and small tables. A large table is a shard table (distributed table); a small table, also called a broadcast table, is replicated to all data nodes. When joining a distributed table with a non-distributed table, since every physical node that holds the distributed table also holds a redundant copy of the non-distributed table, the join can be completed on a single data node.

Q: Is self-built private cloud supported? If public cloud is used, will the cost rise?

A: Private cloud is supported; most of TDSQL's banking customers use the private cloud deployment model, isolated from the external network. Compared with private cloud, public cloud is clearly cheaper: a private cloud requires building your own data centers and laying fiber, but the benefit is exclusive use of the physical hardware. On the public cloud you share an IDC and network environment with other public cloud users.

Q: Is deployment on Kubernetes (K8S) supported?

A: TDSQL comes with a one-click deployment solution and does not depend on K8S.

Q: With sharding across databases and tables, are join queries on large tables supported?

A: Yes, they are supported.


Reference: https://cloud.tencent.com/developer/article/1600722 — "Interpretation of the architectural principles of Tencent's distributed database TDSQL financial-level capabilities", Tencent Cloud developer community.