To help developers better understand and learn distributed database technology, in March 2020 Tencent Cloud Database, the Yunjia Community, and the Tencent TEG Database Working Group launched a three-month online technical salon, "Everything you want to know about domestic databases is here!" Every Tuesday and Thursday evening, dozens of senior database experts from Tencent interpret the core architecture, technical implementation principles, and best practices of Tencent's three self-developed databases: TDSQL, CynosDB/CDB, and TBase. This article is a review of the third session, "Smooth expansion under hundred-million-level traffic: TDSQL's horizontal expansion practice".
Without further ado, let's get into today's sharing. Today's topic is "Smooth expansion under hundred-million-level traffic: TDSQL's horizontal expansion practice".
Today's sharing covers four parts:

The first part introduces the background of horizontal expansion: why it is necessary, how it compares with vertical expansion, and the problems we generally encounter when expanding horizontally.

The second part briefly introduces how TDSQL performs horizontal expansion, to give everyone an intuitive impression.

The third part covers in detail the design principles behind TDSQL's horizontal expansion. It corresponds to the first part, showing how TDSQL solves the problems generally encountered in horizontal expansion.

The fourth part introduces cases from practice.
1 Background and Challenges of Horizontal Database Expansion
First, let's look at the background of horizontal expansion. The reason for expansion is intuitive: business traffic or data volume grows until the existing capacity or performance can no longer meet the business's needs. Either TPS/QPS is insufficient, latency exceeds what the business can tolerate, or capacity is exhausted, the latter mainly referring to disk or network bandwidth. When we run into this kind of problem, we have to expand. There are two common approaches: vertical expansion and horizontal expansion. The two have different characteristics, and their advantages and disadvantages are very clear.
1.1 Horizontal expansion vs. vertical expansion
First, let's look at vertical expansion. Vertical expansion mainly means upgrading the configuration of the machine or the instance. When you buy a database instance on the cloud, resources are allocated on demand: a business with little traffic may need only two CPU cores and a few gigabytes of memory. As the business grows, the instance may need 20 cores or 40 GB of memory, and on the cloud we can dynamically adjust resources to meet that need, i.e., dynamically add CPU on the same machine. The limit of this approach is reached when the entire machine's CPU and memory have been given to the instance and it is still not enough; then a better machine must be prepared for the expansion. In MySQL this can be done with a master/standby switchover: first set up a standby on the better machine and synchronize the data; once synchronization is complete, perform the switchover so the better machine takes over. During the whole process there is basically no impact on the business: even if the IP changes during the switchover, a front-end proxy or a VIP keeps the business unaffected. The biggest disadvantage of vertical expansion is that it depends on single-machine resources. You can provide a better machine up to a point, but as the business develops faster, you may find that even the best machine you can provide is still not enough, which means you can no longer expand. So the biggest disadvantage of vertical expansion is its reliance on the resources of a single machine.
The other approach is horizontal expansion. Its biggest advantage is that it solves the problem of vertical expansion: in theory, horizontal expansion can go on indefinitely, dynamically adding machines to adapt to business needs.

However, horizontal expansion is more complex than vertical expansion, and it introduces problems of its own. Below we analyze the problems that may be encountered, and later we introduce TDSQL's solutions.

So the advantages of horizontal expansion are obvious: it removes the single-machine limitation of vertical expansion. But it is more complicated and introduces more problems. With these problems in mind, I will next introduce how TDSQL performs horizontal expansion and how it solves the problems just mentioned.
2 TDSQL horizontal expansion practice
First, let's take a look at the architecture of TDSQL. Briefly, TDSQL consists of several parts:
You can see the entire expansion process here: at the beginning, the data sits in one Set, that is, on one node. Expansion then spreads the data out; with 256 Sets, it can be expanded onto 256 machines. There are several key points in the whole process:
In summary, the data is already sharded from the very beginning, and expansion is simply moving shards to new nodes. During expansion the number of nodes grows, from 1 to 2 to 3, eventually up to 256, but the number of shards stays the same. At first, all 256 shards sit on one node; expanded to two nodes, each node may hold 128 shards; expanded to the maximum of 256 nodes, with the data on 256 machines, each machine is responsible for exactly one shard. So the whole expansion is, put simply, shard relocation. We will cover the specific details later.
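The fixed-shard placement described above can be sketched roughly as follows. This is a simplified illustration of the idea, not TDSQL's actual routing code; the function names are hypothetical.

```python
# Minimal sketch of fixed-shard placement: 256 shards, a growing node set.
# Expansion changes which node owns which shards; the shard count never changes.

NUM_SHARDS = 256  # fixed at cluster initialization


def placement(num_nodes: int) -> dict[int, list[int]]:
    """Assign the fixed 256 shards contiguously across `num_nodes` nodes."""
    per_node = NUM_SHARDS // num_nodes
    mapping = {}
    for node in range(num_nodes):
        start = node * per_node
        # The last node takes any remainder so every shard is owned.
        end = NUM_SHARDS if node == num_nodes - 1 else start + per_node
        mapping[node] = list(range(start, end))
    return mapping


# One node holds all 256 shards; two nodes hold 128 each;
# at the maximum of 256 nodes, each node owns exactly one shard.
assert len(placement(1)[0]) == 256
assert [len(s) for s in placement(2).values()] == [128, 128]
assert all(len(s) == 1 for s in placement(256).values())
```

Because the shard count is fixed up front, "expansion" only changes this mapping, which is why it reduces to moving shards between nodes.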
On both private and public clouds, TDSQL provides a unified front-end page for the entire expansion process, which makes it very convenient for users.
Let's look at this example. There are currently two Sets, that is, two nodes, and each node is responsible for part of the routing: the first node handles 0-31, and the other (Set 3) handles the routing range 32-63. To expand from these two nodes, we have an "Add Set" button on the front-end page. Clicking "Add Set" pops up a dialog box in which the previous configuration is pre-selected by default; users can customize it, including which Set, how many resources are needed, and how memory and disk are allocated.
In addition, because the expansion requires a route switch, we can choose the switching time: it can switch automatically, or the business can judge its actual situation and trigger the route switch manually. All of this can be set according to the needs of the business.
After the first step of creation, the "brain" module I just mentioned takes charge of allocating the various resources, initialization, and the entire data-synchronization logic. In the end you will see that the original two nodes have become three. Before the expansion, the first node was responsible for 0-31; now it is responsible only for 0-15, and the new node handles the other part of the routing. Throughout the process, you can go from two nodes to three just by clicking on the web page, and we can continue to add Sets and perform further one-click expansions according to business needs.
3 The design principle behind the horizontal expansion of TDSQL
I have just introduced TDSQL's core architecture, the characteristics of its horizontal expansion, and the front-end operations, to help everyone build an intuitive impression.
In Chapter 3, I will introduce the design principles behind TDSQL's horizontal expansion in detail, focusing on the problems raised in Chapter 1 and how we solve them. These problems must be solved no matter which system is used for horizontal expansion.
3.1 Design principle: How to choose the partition key to balance compatibility and performance
First of all, as just mentioned, the first problem of horizontal expansion is how to split the data. Data splitting is the first step and affects the entire subsequent usage. In TDSQL, the data-splitting logic is expressed in the table-creation syntax: the business specifies the shardkey as a certain field, that is, when designing the table structure, the business selects a field as the partition key. TDSQL then splits the data according to that partition key and also aggregates accesses by it. We want the business to participate in table-structure design and designate a field as the shardkey; in this way, compatibility and performance can be well balanced.
We could also let users create tables without specifying a shardkey and pick a key at random underneath to split the data, but that would hurt subsequent efficiency, such as the performance of the distributed database. We believe that a small amount of participation by the business layer in table-structure design brings a very large performance advantage, balancing compatibility and performance. Moreover, when the business chooses the shardkey, it can look across multiple related tables and select the related column in each as the shardkey. This ensures that after splitting, related data lands on the same node, avoiding a lot of cross-node data interaction in the distributed setting.
When creating tables, the partition table is the one we use most: it splits data across the nodes. We also provide two other types, three in total, mainly for performance. A global table places a full copy of its data on every node; it has no distributed, horizontally split character, but we generally use it for configuration tables with small data volumes and few changes, where data redundancy ensures that subsequent access, especially joins, avoids cross-node data interaction as much as possible. As for the shardkey itself, we hash it; the advantage is that the data is distributed more evenly across the nodes, guaranteeing there are no hot spots.
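The hash-based routing of a shardkey can be sketched as follows. This is an illustration of the general technique with hypothetical names; TDSQL's actual hash function and routing metadata differ.

```python
import hashlib

NUM_SHARDS = 256


def shard_of(shardkey_value: str) -> int:
    """Map a partition-key value to one of the fixed shards via hashing."""
    digest = hashlib.md5(shardkey_value.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


def node_of(shard: int, ranges: dict[int, range]) -> int:
    """Look up which node owns a shard, given each node's routing range."""
    for node, rng in ranges.items():
        if shard in rng:
            return node
    raise KeyError(shard)


# Two nodes, each owning half of the routing range.
ranges = {1: range(0, 128), 2: range(128, 256)}
s = shard_of("user_42")
assert 0 <= s < NUM_SHARDS
assert node_of(s, ranges) in (1, 2)
# The same key always hashes to the same shard, so reads find their writes.
assert shard_of("user_42") == s
```

Hashing spreads keys evenly, which is why a well-chosen shardkey avoids hot spots even when key values themselves are skewed toward recent IDs.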
3.2 Design principle: high availability and high reliability in capacity expansion
As mentioned earlier, the entire expansion process is fairly complex. Can the whole process guarantee high availability and high reliability, and what does the business perceive? How does TDSQL do it?
The first step is the data-synchronization phase. Suppose we have two Sets and find that one of them is running dangerously low on disk, say above 80% usage. We need to expand it. First, we create a new instance by copying an image, and establish a new synchronization relationship with it. Establishing synchronization is imperceptible to the business, and the synchronization itself is real-time.
The second stage is continuously catching up on the data while continuously performing data verification. This may go on for a while. When the replication lag becomes vanishingly small, say we set a threshold of 5 seconds and find the replica has caught up to within 5 seconds, we enter the third stage: the routing-update stage.
In the routing-update stage, we first freeze write requests. If the business writes during this window, we reject the write and ask it to retry after two seconds. The business does experience a second-level impact at this point, but the window is very short. Once writes are frozen, the third instance quickly catches up on all remaining data, and verification passes. At this point we modify the routing and simultaneously perform the related atomic operations: the moved partitions are shielded at the storage layer, which guarantees that even if the SQL access layer has not yet picked up the new routing, no data is written to the wrong place. Because the bottom layer has been changed and the partitions blocked, data consistency is guaranteed. Once the routing is updated, the third Set can receive user requests, and at this point both the first Set and the third Set hold the full data, since synchronization was established between them.
The last step is to delete the redundant data. We use delayed deletion, deleting slowly so that the process does not cause large IO fluctuations that would affect the production service. Throughout the deletion we keep partition shielding in place, and the SQL engine layer also rewrites SQL, so that even while redundant data still exists at the bottom, a query from the user, even a full scan, is guaranteed not to return extra rows.
As you can see, the data-synchronization, verification, and redundancy-removal phases of the expansion take relatively long, because when the data volume is large, copying the image and catching up on the binlog takes a while. But these phases have no impact on the business; the business simply does not perceive that a new synchronization relationship has been added. If a problem occurs while establishing synchronization or creating the new replica, we can replace the machine with another or simply retry, again with no effect on the business. In the routing-update stage, a second-level impact on business write requests is theoretically unavoidable, but we keep this window very short: writes are only frozen once replication is within the relatively small 5-second threshold, so the route-update operation can be initiated immediately after reaching this stage. At the same time, partition shielding at the storage layer ensures that even if different modules update at different times, the data cannot be corrupted. This is how we guarantee high availability and high reliability during expansion, with very little impact on the business.
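The phased switchover above can be sketched as a simple decision function. This is an illustrative simplification with hypothetical names, not TDSQL's code; the 5-second threshold is the one quoted in the text.

```python
# Sketch of the expansion phases: long, invisible sync -> brief write
# freeze once the replica has nearly caught up -> atomic route switch.

CATCHUP_THRESHOLD_SECONDS = 5  # freeze writes only when lag is this small


def next_action(replication_lag_seconds: float, writes_frozen: bool) -> str:
    """Decide the next expansion step from the current replication state."""
    if replication_lag_seconds > CATCHUP_THRESHOLD_SECONDS:
        return "keep syncing"      # long phase; no business impact
    if not writes_frozen:
        return "freeze writes"     # short, second-level business impact
    return "switch routing"        # atomic route update + partition shielding


assert next_action(120.0, False) == "keep syncing"
assert next_action(3.0, False) == "freeze writes"
assert next_action(0.0, True) == "switch routing"
```

The key design point is that the freeze is only entered when the remaining catch-up work is bounded by the threshold, which is what keeps the write-rejection window to seconds.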
3.3 Design principle: Distributed transactions
I have just described the general flow of the expansion phase and how TDSQL handles it. Next, let's look at how to solve the problems that horizontal expansion brings once the expansion is complete. The first is distributed transactions.
After expansion, the data spans multiple nodes. The system originally had only one node; now that data is cross-node, how do we guarantee atomicity? We implement distributed transactions based on two-phase commit. The whole processing logic completely shields this complexity from the business: using the distributed database is the same as using standalone MySQL. If a business SQL statement accesses only one node, an ordinary transaction is used; if a statement or transaction is found to operate on multiple nodes, we use two-phase commit, with logging to guarantee the atomicity of the whole distributed transaction. The implementation is completely decentralized: any of the SQL engines can act as the transaction manager (TM), so performance grows linearly. We have also built a large number of exception-verification mechanisms, with very robust exception handling and global retry mechanisms, and we have passed TPC-C standard verification.
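Two-phase commit can be sketched as follows. This is a minimal illustration of the general protocol, not TDSQL's implementation; the `Participant` class and its methods are hypothetical stand-ins for the data nodes.

```python
# Minimal two-phase-commit sketch: the coordinator asks every participant
# node to prepare; only if all vote yes does it commit, otherwise it
# rolls every participant back, preserving cross-node atomicity.

class Participant:
    def __init__(self, can_commit: bool):
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self) -> bool:          # phase 1: vote
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit: bool):     # phase 2: commit or roll back
        self.state = "committed" if commit else "rolled_back"


def two_phase_commit(participants: list[Participant]) -> bool:
    votes = [p.prepare() for p in participants]
    decision = all(votes)               # one "no" vote aborts everyone
    for p in participants:
        p.finish(decision)
    return decision


# All nodes can commit -> the cross-node transaction commits atomically.
assert two_phase_commit([Participant(True), Participant(True)]) is True
# One node fails to prepare -> every node rolls back.
group = [Participant(True), Participant(False)]
assert two_phase_commit(group) is False
assert all(p.state == "rolled_back" for p in group)
```

A real implementation must also persist the decision in a log before phase 2, so that a crashed coordinator or participant can recover to a consistent outcome.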
3.4 Design principle: How to achieve linear growth in performance during expansion
For horizontal expansion, splitting data across multiple nodes mainly brings two problems: one is the transaction atomicity just discussed, which is solved with distributed transactions; the other is performance.
With vertical expansion, linear performance growth is generally achieved by swapping in a better CPU or similar. With horizontal expansion, the data is split across multiple nodes, so making good use of those nodes for parallel computation, and truly bringing out the advantages of a horizontally distributed database, requires a great many optimization measures. TDSQL has made the following optimizations.
The first is keeping related data on the same node. When designing the table structure, we hope the business participates and designates the related columns as shardkeys, which guarantees that the related backend data lands on one node. Joint queries over that data then need no cross-node work.
Similarly, we improve performance through parallel computation and streaming aggregation: the SQL engine splits a statement and distributes it to the backend nodes, each node computes in parallel, and after the computation the SQL engine performs a second-stage aggregation before returning to the user. To reduce the data pulled from the backend to the SQL engine, we also do condition push-down, pushing more predicates down to the DB. In addition, we use data redundancy to minimize cross-node data interaction.
Let's briefly look at an aggregation example, showing how TDSQL's horizontal expansion is basically imperceptible to the business, which uses it the same way as standalone MySQL. Suppose the business inserts 7 rows; it does not care whether the table lives on one node or several, it only needs to insert 7 rows. The system parses the incoming SQL and automatically rewrites it: computing on the partition key, it finds that 4 of the rows should go to the first node and the other 3 to the second, rewrites the statement accordingly, and inserts the data. To the user this is a single statement, but because it is cross-node, we use two-phase commit here, turning it into multiple SQL statements and guaranteeing that if anything goes wrong, both sides roll back together.
After the data is inserted, suppose the user runs a query, say one using an aggregate function; the user does not know the data is split and sees one complete table. This SQL is likewise rewritten: the system sends it to both nodes, adding or converting functions as needed, for example converting an average into the corresponding partial functions. Each node aggregates its own data first, and the SQL engine then aggregates again on top. The benefit of this extra step is that the engine does not have to cache much data and can do a streaming computation, avoiding large one-shot memory consumption.
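The two-stage aggregation can be sketched as follows. Rewriting AVG into per-node SUM/COUNT partials is a standard technique in distributed SQL engines; the function names here are hypothetical, illustrative only.

```python
# Two-stage aggregation sketch: each node returns a partial (sum, count),
# and the SQL engine merges the partials, so AVG over sharded data is exact
# without pulling every row to one place.

def node_partial(rows: list[float]) -> tuple[float, int]:
    """What each data node computes locally for AVG(col)."""
    return sum(rows), len(rows)


def merge_avg(partials: list[tuple[float, int]]) -> float:
    """Second-stage aggregation performed by the SQL engine."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


node1 = [10.0, 20.0, 30.0, 40.0]   # the 4 rows on the first node
node2 = [50.0, 60.0, 70.0]         # the 3 rows on the second node
result = merge_avg([node_partial(node1), node_partial(node2)])
assert result == sum(node1 + node2) / 7   # same answer as a single table
```

Note that naively averaging the two per-node averages would be wrong when the nodes hold different row counts; shipping (sum, count) pairs is what makes the merged result exact.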
For more complex SQL, such as multi-table queries or more subqueries, if you are interested, follow our subsequent sharing: SQL engine architecture and query practice.
In Chapter 3 above, we introduced the principles of TDSQL's horizontal expansion in some detail: how data is split, the expansion procedure, how problems during expansion are handled, and how TDSQL solves the problems that horizontal expansion brings.
4 Practical Cases of Horizontal Expansion
In Chapter 4, we briefly introduce some practices and cases.
4.1 Practice: How to choose a partition key
We just said that we hope the business participates in table-structure design when creating tables and considers the choice of partition key. How should the partition key be chosen? Here is a brief introduction by application type.
For a user-facing Internet application, a user-related field such as the user ID can serve as the partition key. This ensures that when there are a large number of users, the data is split across the backend nodes by user ID.
For game applications, the logical subject of the business is the player, so a player-related field works; for e-commerce applications, fields of the buyer or seller can serve as the partition key; for the Internet of Things, the device ID can. The general principle in choosing a partition key is to split the data well and avoid skew: the chosen field should distribute the data evenly across the nodes. On the access side, most SQL requests should carry a condition on the partition key, because only then can the distributed advantage really be exploited: with a partition-key condition, the SQL can go directly to a single node; without one, the SQL must be sent to every backend node.
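The difference between a query with and without a partition-key condition can be sketched as follows. This is an illustration of the routing principle with hypothetical names, not TDSQL's planner.

```python
# Sketch: a query whose WHERE clause pins the shardkey goes to exactly one
# node; a query without a shardkey condition must fan out to every node.

def hash_to_node(key: int, num_nodes: int) -> int:
    return key % num_nodes  # stand-in for the real hash routing


def target_nodes(shardkey_in_where, num_nodes: int) -> list[int]:
    """Which nodes a query must be sent to."""
    if shardkey_in_where is None:
        return list(range(num_nodes))            # scatter-gather: all nodes
    return [hash_to_node(shardkey_in_where, num_nodes)]


# e.g. WHERE user_id = 42 -> a single node answers.
assert len(target_nodes(42, 256)) == 1
# No shardkey condition -> all 256 nodes must be queried and merged.
assert len(target_nodes(None, 256)) == 256
```

This is why a missing shardkey condition that was harmless on one node becomes a 256-way fan-out after full expansion.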
As you can see, if you expand to many nodes, from one node to 256, then a poorly written SQL statement may need to aggregate data from all 256 nodes, and performance will suffer.
In summary, we hope the business participates as much as possible when creating tables and designing the table structure. Whether for aggregate functions or the various transactional operations, the system is basically imperceptible to the business, and this small amount of participation at design time can be exchanged for a very large performance improvement.
4.2 Practice: When to expand?
When should we expand? TDSQL collects a large amount of monitoring data: each module monitors the running state of the system on its own machine, and various logs report information as well. Based on this information, we decide when to expand.
Simply put: for disks, if data-disk usage is too high, it is time to expand; likewise for SQL request volume, or when CPU usage approaches 100%. At present, once utilization reaches 80%, expansion is needed. Another situation is that the request volume is still small and resources are sufficient, but the business tells us in advance that an event at some future time will multiply the request volume; in that case we can complete the expansion ahead of time.
Let's look at a few cluster cases on the cloud. This cluster has 4 Sets, and each Set is responsible for part of the shardkey range. The routing information here is 0-127, which means the cluster can eventually be expanded to 128 nodes, a 128-fold expansion. This "128" is determined by the initial business estimate. If the pool is made too large, the cluster could indeed grow to a few thousand nodes in the end, but the data would be very scattered; in practice, the performance of today's cloud or physical machines is already very good, and a scale of thousands is unnecessary.
Here is another cluster with a few more nodes: 8 nodes, each responsible for part of the routing. Here the routing number is 64, so this cluster can eventually be expanded to 64 nodes. These are examples from the cloud.
That is the main content of today's sharing. If you have any questions, please comment and follow along.
Q: Before the expansion, are the tables in the Set already partitioned tables?
A: Yes. Before expansion it is equivalent to this: suppose we have one node and we set the shard count to 256; this value is set at cluster initialization and never changes afterward. A cluster might choose 256 because the data volume is expected to become very, very large later. At this point, all data is on one node. When the user creates a table with the syntax just described, that table is actually split into 256 partitions at the bottom. So even without any expansion, its data is already in 256 pieces; create another table, and it too has 256 pieces. The user may see only two tables, but each has 256 partitions underneath. Expansion is simply moving some of those partitions elsewhere.
Q: How is the consistency of each node's backup files guaranteed when they are restored?
A: The nodes are independent of one another. Each node is responsible for one routing segment and stores only part of the data; after horizontal expansion, each is responsible only for its own part, so their backups have nothing to do with each other. Each node typically runs one master with two replicas, and we have a strong synchronization mechanism that guarantees strong data consistency during replication. You can refer to the earlier sharing, which covers in more detail how TDSQL's one-master-multiple-replica architecture guarantees strong data consistency within a single node.
Q: Can a single point of failure be avoided in two-phase-commit coordination?
A: First of all, in two-phase commit we use the SQL engine as the transaction coordinator, one coordinator per transaction. If other connections come in, other SQL engine instances can coordinate their transactions. Every SQL engine is stateless and can scale horizontally, so there is no centralization: we can add engines freely according to performance needs and achieve linear performance growth. The transaction logs are spread out and recorded in the TDSQL backend data nodes, which run one master with multiple replicas and guarantee strong internal consistency, so there is no single point of failure.
Due to time constraints, today's sharing ends here. If you have further questions, please join our technology group.
TDSQL is one of the three major product lines of the Tencent TEG Database Working Group. It is a financial-grade distributed database developed in house by Tencent, currently widely used in finance, government, the Internet of Things, smart retail, and other industries, with a large number of distributed database best practices.