This article is from Tencent CDC. Please credit the source when reprinting: https://cdc.tencent.com/2018/09/13/frontend-exception-monitor-research/
Front-end monitoring includes behavior monitoring, exception monitoring, performance monitoring, and so on; this article focuses on exception monitoring. The front end and the back end belong to the same monitoring system: each has its own monitoring solution, but the two cannot be separated, because an exception a user hits while operating the application may originate on either side. A mechanism is needed to connect the two so that monitoring stays unified. Therefore, even a discussion limited to front-end exception monitoring cannot draw a strict boundary between front end and back end. The actual system design, and the help monitoring provides to development and the business, should be reflected in the final reports.
Generally speaking, a monitoring system can be divided into four stages: log collection, log storage, statistics and analysis, and reporting and alerting.
Collection stage: collect exception logs, do some processing locally, and report them to the server according to a chosen scheme.
Storage stage: the back end receives the exception logs reported by the front end, processes them, and stores them according to a chosen storage scheme.
Analysis stage: divided into automatic machine analysis and manual analysis. Machine analysis uses preset conditions and algorithms to aggregate and filter the stored log information, discover problems, and trigger alarms. Manual analysis provides a visual data panel that lets system users inspect specific log data and trace an exception back to its root cause.
Alarm stage: divided into alarms and early warnings. An alarm fires automatically at a given severity level, through a configured channel, according to defined trigger rules. An early warning makes a prediction and raises a notice before the exception actually occurs.
A front-end exception is a situation in which users cannot quickly obtain the results they expect while using a web application. Different exceptions have different consequences, ranging from annoying the user to rendering the product unusable and costing the product the user's trust.
Based on the severity of the consequences, front-end exceptions can be divided into the following categories.
a. Error
The content presented on the interface does not match what the user expects: clicking leads to the wrong page, data is inaccurate, an incomprehensible error message appears, the interface is misaligned, or submission jumps to an error page. When this kind of exception occurs, the product itself can still be used, but the user cannot achieve their goal.
b. Unresponsive
The interface stops responding to an operation: a button does nothing when clicked, or the flow cannot continue after a success prompt. When this kind of exception occurs, the product is already partially unusable at the interface level.
c. Broken
The interface cannot achieve the purpose of the operation: clicking does not enter the target page, or clicking does not open the details. When this kind of exception occurs, some functions of the application cannot be used at all.
d. Feign death
The interface freezes and nothing works: users cannot log in or use any in-app function, or an unclosable mask layer blocks all further operations. When this kind of exception occurs, the user is likely to kill the application.
e. Crash
The application frequently exits on its own or becomes inoperable: intermittent crashes, or pages that fail to load or accept no operations after loading. Continuous exceptions of this kind directly drive users away and hurt the product's viability.
The causes of front-end exceptions fall into five main categories:
Logic errors
1) Wrong business-logic conditions 2) Wrong event-binding order 3) Wrong call-stack timing 4) Incorrect operations on JS objects
Data type errors
1) Treating null as an object and reading a property 2) Traversing undefined as an array 3) Using a number in string form directly in addition 4) Required function parameters not passed
Syntax errors
Network errors
1) Slow requests 2) The server returns 200 without data, and the front end traverses the missing data as if everything were normal 3) The network drops while data is being submitted 4) The front end does no error handling when the server returns a 500
System errors
1) Insufficient memory 2) Disk full 3) The shell does not support a needed API 4) Incompatibilities
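The data-type errors listed above are the easiest to reproduce; a minimal sketch (the function names here are illustrative, not from the original article):

```javascript
// Minimal reproductions of the data-type errors listed above.

// 1) Treating null as an object and reading a property throws.
function readPropertyOfNull() {
  const user = null;
  return user.name; // TypeError at runtime
}

// 2) Traversing undefined as if it were an array also throws.
function iterateUndefined() {
  const items = undefined;
  return items.map((x) => x); // TypeError at runtime
}

// 3) A number in string form silently concatenates instead of adding.
function stringNumberAddition() {
  return "1" + 2; // "12", not 3 -- no exception, just wrong data
}

// Helper: run a function and return the name of the exception it raises.
function causes(fn) {
  try { fn(); return null; } catch (e) { return e.name; }
}
```

Note that case 3 raises no exception at all, which is exactly why such defects surface later as "data is inaccurate" errors rather than crashes.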
When an exception occurs, we need its specific details in order to decide what solution to adopt. When collecting exception information, follow the 4W principle:
WHO did WHAT and got WHICH exception in WHICH environment?
a. User information
Information about the user at the moment of the exception, such as the user's current state and permissions. When the user can be logged in on multiple terminals, we also need to distinguish which terminal the exception came from.
b. Behavioral information
What the user was doing when the exception occurred: the interface path; the operation performed; the data used during the operation; the data the API returned to the client at the time; for a submit operation, the data submitted; the previous path; the ID of the previous behavior-log record; and so on.
c. Exception information
The code information of the exception: the DOM element node operated by the user; the exception level; the exception type; the exception description; the code stack information, etc.
d. Environmental information
Network environment; device model and identification code; operating system version; client version; API interface version, etc.
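The four kinds of information above can be flattened into one log record per exception. The following is a sketch only; the field names are illustrative, not a fixed schema from the article:

```javascript
// Sketch of a single exception-log record following the 4W principle.
// All field names are hypothetical examples of how the four groups of
// information could be laid out in one document.
function buildExceptionLog({ user, behavior, error, env }) {
  return {
    time: Date.now(),                // when the log was generated
    // WHO: user information
    userId: user.id,
    userStatus: user.status,         // e.g. "active" / "disabled"
    // WHAT: behavior information
    path: behavior.path,             // interface path where it happened
    action: behavior.action,         // the operation performed
    referer: behavior.referer,       // the previous path
    // WHICH exception: exception information
    errorType: error.name,
    errorMessage: error.message,
    stack: error.stack || "",
    // WHICH environment: environment information
    os: env.os,
    network: env.network,
    appVersion: env.appVersion,
  };
}
```

Since a document database will store these logs, absent fields in a given situation are simply omitted rather than breaking a table schema.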
An interface request generates a requestId
A traceId is generated per stage and is used to track all log records related to one exception
The unique identification code of this log, equivalent to a logId, generated from the specific content of the current log record
The time the current log was generated (collection time)
The user's status at the time (active/disabled)
The user's role list at the time
The user's group at the time; group permissions may affect the result
The user's license at the time; it may have expired
What the user did
The previous path (source URL)
The state and data of the current interface
The data returned by the upstream API
The data that was submitted
The DOM element the user operated on
The node path of that DOM element
The element's custom style sheet
The element's current attributes and values
The type of the error
The error stack information
The column position of the error
The error description (declared by the developer)
The event's x-axis coordinate
The event's y-axis coordinate
The key that triggered the event
Network environment description
Operating system description
This is a very large set of log fields; it covers almost everything that can describe the environment surrounding an exception in detail. Not all of these fields will be collected in every situation, and since we will use a document database to store the logs, this does not affect the actual storage.
Front-end exception capture is divided into global capture and single-point capture. Global capture code is centralized and easy to manage; single-point capture serves as a supplement for special cases, but it is scattered and harder to manage.
a. Global capture
Through the global interface, the capture code is written in one place. The available interfaces are:
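Commonly used global interfaces in the browser include `window.onerror`, the `error` event registered in the capture phase (which also sees resource-loading failures), and `unhandledrejection` for promise errors. A minimal sketch, where `collect` is a placeholder for the project's own reporting entry point:

```javascript
// Sketch of global exception capture using standard browser interfaces.

// Pure helper: normalize anything thrown into a plain log object.
function serializeError(err) {
  if (err instanceof Error) {
    return { type: err.name, message: err.message, stack: err.stack || "" };
  }
  return { type: "UnknownThrow", message: String(err), stack: "" };
}

// Placeholder: a real system would enqueue the log for reporting.
function collect(log) {
  console.log("collected:", log.type, log.message);
}

// Registration only makes sense in a browser environment.
if (typeof window !== "undefined") {
  // Runtime JS errors (message, source, line, column, Error object).
  window.onerror = function (message, source, lineno, colno, error) {
    collect(error ? serializeError(error) : { type: "Error", message, stack: "" });
    return true; // suppress the browser's default console output if desired
  };
  // Capture phase also sees resource-loading errors (img, script, css).
  window.addEventListener("error", (e) => {
    if (e.target && e.target !== window) {
      collect({ type: "ResourceError", message: String(e.target.src || ""), stack: "" });
    }
  }, true);
  // Unhandled promise rejections.
  window.addEventListener("unhandledrejection", (e) => {
    collect(serializeError(e.reason));
  });
}
```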
b. Single-point capture
Wrap individual code blocks in the business code, or instrument points in the logic flow, to capture exceptions in a targeted way:
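A minimal single-point capture wrapper might look like this; `captureException` is a placeholder for whatever reporting entry point the project uses:

```javascript
// Sketch of single-point capture: run a risky code block, log any
// exception it throws, and optionally re-throw it to the caller.
function tryCapture(fn, captureException, rethrow = false) {
  try {
    return fn();
  } catch (err) {
    captureException({ type: err.name, message: err.message });
    if (rethrow) throw err;
    return undefined;
  }
}
```

Usage: `tryCapture(() => riskyParse(payload), log => report(log))` keeps the surrounding flow alive while still recording the failure.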
Due to browser security-policy restrictions, when a cross-origin script throws an error, the details cannot be obtained directly; all that is available is a bare "Script error." This happens, for example, when we include third-party dependencies or host our own scripts on a CDN.
Solutions to Script Error:
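Two widely used fixes are: load the cross-origin script with a `crossorigin` attribute while the CDN responds with an `Access-Control-Allow-Origin` header, so the browser exposes full error details to `window.onerror`; or wrap calls into third-party code in try/catch, since a caught Error object carries the full message and stack regardless of origin. A sketch of the first fix (`doc` is passed in instead of using the global `document` purely so the helper can be exercised outside a browser):

```javascript
// Sketch: inject a cross-origin script tag with CORS enabled so that
// the browser reports full error details instead of "Script error.".
function appendCrossOriginScript(doc, src) {
  const script = doc.createElement("script");
  script.src = src;
  script.crossOrigin = "anonymous"; // pairs with the CDN's Access-Control-Allow-Origin header
  doc.head.appendChild(script);
  return script;
}
```

Remember that the attribute alone is not enough: if the CDN does not send the CORS header, the browser will refuse to execute the script at all.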
For an exception, having only its own information is not enough to fully grasp the problem, because the place where an exception surfaces is not necessarily where it originates. We need to restore the scene of the exception to recover the full picture, and even prevent similar problems on other interfaces. This introduces the concept of "exception recording": recording along the two dimensions of time and space captures the whole process from before the exception up to its occurrence, which helps greatly in locating its source.
When an exception occurs, its source may be far from where it surfaced, and we need to return to the scene to find it. Much like solving a crime in real life, it is easier when a surveillance camera has recorded the act. Focusing only on the exception itself means relying on luck to find the root cause; with the help of exception recording, finding it becomes much easier.
So-called "exception recording" uses technical means to collect the user's operations, recording every one. When an exception occurs, the recording within a certain window is replayed to form a video-like playback, so that the debugger can see the user's operations at the time without having to ask the user.
The figure shows a schematic of an exception recording and restoration scheme from Alibaba. Events and mutations produced by the user's operations on the interface are collected by the product, uploaded to the server, and, after queue processing, stored in the database in order. When the exception needs to be reproduced, these records are taken from the database and replayed sequentially with a suitable technical scheme, restoring the exception.
Generally speaking, we divide the level of collected information into info, warn, error, and so on, and extend from that basis.
When we monitor an exception, it can be placed into four levels, A, B, C, and D, in the "important-urgent" model. Some exceptions, although they occur, neither affect the user's normal use nor are perceived by the user at all; in theory they should still be fixed, but compared with other exceptions they can be handled later.
The alarm strategy will be discussed later. Generally speaking, the closer an exception sits to the upper-right corner, the faster the notification goes out, ensuring that the relevant people receive the information and handle it as soon as possible. Class-A exceptions require a rapid response and even need to reach the responsible owner directly.
In the collection stage, the severity of an exception can be judged from the consequence categories in the first section, and the corresponding reporting scheme chosen when it occurs.
As mentioned earlier, we collect not only the log of the exception itself but also the user-behavior logs related to it, because a single exception log cannot help us quickly locate the root cause of a problem and find a solution. Collecting user-behavior logs, however, requires some technique: we cannot upload a behavior log to the server after every user operation. For an application with many users online at the same time, uploading on every operation would amount to a DDoS attack on our own log server. Instead, we first store these logs locally on the user's client and, once certain conditions are met, package a batch of logs and upload them together.
So how should front-end logs be stored? We cannot simply hold them in a variable: memory would blow up, and a page refresh would lose them all. This naturally leads to front-end data-persistence solutions.
There are currently quite a few persistence options available, mainly: Cookie, localStorage, sessionStorage, IndexedDB, WebSQL, FileSystem, and so on. Choosing among them comes down to comparing capacity, synchronous versus asynchronous APIs, and read/write performance.
All things considered, IndexedDB is the best choice. It has the advantages of large capacity and an asynchronous API, and the asynchronous behavior ensures it will not block interface rendering. IndexedDB is divided into databases, each database into stores, and records can be queried by index; it embodies a complete database-management model and suits structured data management better than localStorage. Its shortcoming is that the API is very cumbersome, nowhere near as simple and direct as localStorage. For this we can use the hello-indexeddb tool, which wraps the complex API in Promises, simplifying operations and making IndexedDB as convenient to use as localStorage. In addition, IndexedDB is a widely supported HTML5 standard, compatible with most browsers, so there is no need to worry about its future.
Next, how should we use IndexedDB to keep our front-end log storage sound?
The figure shows the process and database layout of front-end log storage. When an event, mutation, or exception is captured, an initial log is formed and immediately placed into a staging area (one store of IndexedDB); at that point the main program's collection work is done, and everything afterwards happens in a Web Worker. In the Web Worker, a cyclic task continually takes logs out of the staging area, classifies them, writes the classification results into an index area, and enriches the log records; records that have been reported to the server are moved into an archive area. Once a log has sat in the archive area for more than a certain number of days it has little remaining value, but to guard against special cases it is moved to a recycle area, and after a further period it is cleared from there.
As mentioned above, logs are sorted in a Web Worker and then stored into the index and archive areas. What, then, does the sorting process look like?
Since the reporting discussed below proceeds by index, the front-end log-sorting work mainly consists of building different indexes from log characteristics. When collecting logs, we tag each one with a type for classification and create an index; at the same time, the hash of each log object is computed via object-hashcode as that log's unique identifier.
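The article names the object-hashcode package for this; as a stand-in, a tiny deterministic hash (djb2 over the JSON form) shows the idea of deriving a stable id from a log's content:

```javascript
// Stand-in for object-hashcode: a deterministic djb2 hash over the
// JSON form of a log object, yielding a stable id for identical logs.
function hashLog(log) {
  const str = JSON.stringify(log);
  let h = 5381;
  for (let i = 0; i < str.length; i++) {
    // h * 33 + charCode, kept as an unsigned 32-bit integer
    h = ((h << 5) + h + str.charCodeAt(i)) >>> 0;
  }
  return h.toString(16);
}
```

Any content-addressed hash works here; the important property is that the same log always maps to the same id, so duplicates can be detected later during reporting.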
requestId: tracks front-end and back-end logs together. Since the back end records its own logs, the front end includes a requestId by default when calling an API, so the back end's log records can be matched to the front end's.
traceId: tracks related logs before and after an exception. A traceId is created when the application starts and refreshed when an exception occurs. By collecting the requestIds related to one traceId and combining the logs tied to those requestIds, we end up with all logs relevant to the exception, which are used to review it.
The figure shows an example of using traceId and requestId to find all logs related to one exception. In the figure, hash4 is the exception log; its traceId is traceId2, and the log list contains two records with that traceId. But the hash3 record is not the beginning of the action, because hash3's requestId is reqId2, and reqId2 starts at hash2. Therefore hash2 must also be added to the candidate records for replaying the exception. In summary, we need to find the log records for every requestId associated with the same traceId. It sounds a bit convoluted, but the idea becomes clear with a little thought.
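The lookup described above can be sketched as a pure function: gather every log sharing the exception's traceId, then widen to all logs sharing any of those requestIds, so the replay block starts at the true beginning of each request:

```javascript
// Sketch of the traceId/requestId lookup: from one exception log,
// collect the full set of logs needed to replay the exception.
function collectExceptionBlock(logs, exceptionHash) {
  const exception = logs.find((l) => l.hash === exceptionHash);
  if (!exception) return [];
  // Step 1: all logs that share the exception's traceId.
  const sameTrace = logs.filter((l) => l.traceId === exception.traceId);
  // Step 2: widen to every log sharing any of those requestIds.
  const requestIds = new Set(sameTrace.map((l) => l.requestId));
  return logs.filter((l) => requestIds.has(l.requestId));
}
```

Run against the figure's example (hash1..hash4), this returns hash2, hash3, and hash4: hash2 is pulled in because reqId2 starts there, exactly as the text explains.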
We gather all logs related to one exception into what we call a block, then use the hash set of its logs to derive the block's hash, build an index for it in the index area, and wait for it to be reported.
Reporting is also carried out in a Web Worker; to keep it separate from sorting, a second worker can be used. The reporting flow is roughly: in each cycle, take a batch of indexes from the index area, retrieve the complete log records from the archive area via the hashes in each index, and upload them to the server.
By reporting frequency (importance and urgency), reporting can be divided into four types:
Instant reporting: triggered immediately after the logs are collected, and used only for class-A exceptions. Because of network uncertainty, a class-A report needs a confirmation mechanism: it counts as complete only once the server has confirmed receipt; otherwise a retry loop is needed to ensure the report succeeds.
Batch reporting: store the collected logs locally, and once a certain quantity has accumulated, package and report them in one go, or package and upload at a fixed frequency (time interval). This merges many reports into one and reduces server pressure.
Block reporting: package an exception's scene into a block and report it. It differs from batch reporting: a batch guarantees the completeness and breadth of the logs but includes useless entries, while a block report centers on the exception itself, ensuring that all logs related to that one exception are reported.
User feedback: a button is provided on the interface through which users actively report bugs, which also strengthens interaction with users.
Alternatively, when a detected exception does not affect the user, a prompt box can pop up letting the user choose whether to upload the log; this suits situations involving the user's private data.
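The confirmation loop for instant reporting described above can be sketched as follows; `send` is a placeholder transport that resolves `true` when the server acknowledges receipt:

```javascript
// Sketch of instant reporting with a confirmation mechanism: retry
// until the server acknowledges receipt or the retry budget runs out.
async function reportImmediately(log, send, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const acknowledged = await send(log);
      if (acknowledged) return true; // server confirmed receipt
    } catch (_) {
      // network error: fall through and retry
    }
  }
  return false; // give up; the caller may re-queue the log locally
}
```

In practice the retry would be spaced out (for example with exponential backoff) rather than fired back-to-back, and an unacknowledged log would go back into local storage.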
In rough comparison: instant reporting sends 1 item at a time; batch reporting sends on the order of 100 items at a time; block reporting sends all related items in one report; user feedback is sent after a little delay, 1 item at a time, and is not urgent but important.
Instant reporting, despite the name, is actually performed through a cyclic task, much like a queue. Its point is to submit important exceptions to the monitoring system as soon as possible so that operations staff can discover the problem; its corresponding urgency is therefore high.
The difference between batch and block reporting: a batch report uploads a fixed number of items at a time, for example 1,000 items every 2 minutes, until all have been reported; a block report gathers all logs related to an exception immediately after it occurs, checks which have already been reported in batches, removes those, and uploads the rest. These exception-related logs are relatively more important: they help restore the exception scene as soon as possible and find the root cause.
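The buffering behind batch reporting can be sketched in a few lines; the uploader is injected so the flush destination stays a placeholder:

```javascript
// Sketch of batch reporting: buffer logs locally and flush them as one
// upload once a size threshold is reached (or on demand).
function createBatchReporter(upload, batchSize = 100) {
  let buffer = [];
  return {
    add(log) {
      buffer.push(log);
      if (buffer.length >= batchSize) this.flush();
    },
    flush() {
      if (buffer.length === 0) return;
      const batch = buffer;
      buffer = [];
      upload(batch); // one request carries the whole batch
    },
  };
}
```

A real implementation would also flush on a timer and persist the buffer to IndexedDB so a page refresh does not lose pending logs.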
Feedback submitted by users can be reported slowly.
To ensure that reporting succeeds, a confirmation mechanism is needed, because the server does not write a received log to the database immediately but first places it in a queue. The front end and back end therefore need some extra handling to guarantee that a log has genuinely been persisted to the database.
The figure shows the general reporting flow: when reporting, the client first queries by hash to learn which of the pending logs the server has already saved; those are removed from the set, avoiding repeated reporting and wasted traffic.
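That pre-report hash check can be sketched as follows; `queryExisting` stands in for the server-side lookup and `upload` for the transport:

```javascript
// Sketch of duplicate-free reporting: ask the server which candidate
// hashes it already has, then upload only the missing logs.
async function reportWithoutDuplicates(logs, queryExisting, upload) {
  const hashes = logs.map((l) => l.hash);
  const existing = new Set(await queryExisting(hashes));
  const fresh = logs.filter((l) => !existing.has(l.hash));
  if (fresh.length > 0) await upload(fresh);
  return fresh.length; // number of logs actually uploaded
}
```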
When batches of data are uploaded in one go, large payloads, wasted traffic, and slow transfers are hard to avoid, and a bad network can make the report fail; compressing the data before reporting is therefore another worthwhile measure.
In the combined-reporting case, a single payload can reach a dozen or more kilobytes; for a site with large daily PV, the resulting traffic is considerable, so it is necessary to compress the data before reporting. lz-string is an excellent string-compression library with good compatibility, small code size, high compression ratio, short compression time, and a striking compression rate of around 60%. However, it is based on LZ78, and if the back end cannot decompress it, gzip can be chosen instead; back ends generally have gzip available by default. The pako toolkit ships with gzip compression and is worth trying.
Generally, an independent log server is provided to receive client logs. During reception, the legitimacy and safety of the log content must be screened to prevent attacks, and because log submission is frequent, concurrency from many clients is common. Passing the log entries one by one through a message queue and writing them to the database is a common solution.
The picture shows the architecture of Tencent BetterJS, where the "access layer" and "push center" correspond to the access layer and message queue mentioned here. BetterJS splits the whole of front-end monitoring into modules; the push center both pushes logs to the storage center for persistence and pushes them to other systems (such as the alarm system). But the queue of the log-receiving stage can be viewed independently, as a transition between the access layer and the storage layer.
Logging is dirty work, but it has to be done. For a small application, a single database and single table plus some optimization will do. A large application that wants to provide a standard, efficient log-monitoring service usually needs real effort in its log-storage architecture. The industry now has fairly mature log-storage solutions, chiefly the HBase family, the Dremel family, and the Lucene family. Broadly, the main problems a log-storage system faces are large data volume, irregular data structure, high write concurrency, and heavy query demand. Such a system generally needs to solve write buffering, choose storage media according to log age, and design a sound index system for convenient, fast reads.
Since log-storage solutions are relatively mature, we will not discuss them further here.
The ultimate purpose of a log is to be used. Because log volume is generally huge, finding a needed record in that mass of data requires a good search engine. Splunk is a mature log system, but it is paid; ELK is an open-source implementation modeled on Splunk's framework, combining Elasticsearch, Logstash, and Kibana. ES is a search engine whose storage and indexing are based on Lucene; Logstash is a log-normalization pipeline with input, output, and transformation plugins; Kibana provides visualization and a user interface for query statistics.
A complete log statistics-and-analysis tool needs to provide a variety of convenient panels, giving visual feedback to log administrators and developers.
Different sessions of the same user form different story lines, so a unique identifier needs to be designed for each series of user operations, and operations by the same user on different terminals must also be distinguishable. The user's state, permissions, and other information during an operation likewise need to be reflected in the log system.
To understand how an exception arose, the story lines before and after the operation must be connected and examined: it involves not just one operation by one user, nor even a single page, but the end result of a chain of events.
Application runtime performance: for example, interface load time, API request-duration statistics, unit computation cost, and the time users spend waiting.
The environment in which the application and services run: the application's network environment, operating system, and device hardware; server CPU and memory status; network and bandwidth usage; and so on.
Exception code-stack information: locating the code position and the exception stack where the exception occurred.
By linking the user logs related to an exception, the course of the exception can be played back as an animation.
Statistics and analysis of exceptions are only the foundation; an exception-monitoring system should also be able to push notifications and raise alarms when an exception is found, and even handle it automatically.
a. Monitoring implementation
Monitoring logic can be triggered as soon as log information enters the access layer; when a log carries a relatively high-level exception, an alarm can be sent immediately. The alarm message queue and the log storage queue can be managed separately, in parallel.
Perform statistics on the stored log information and raise alarms on abnormal patterns; in other words, respond to anomalies in the monitoring data itself. By "monitoring anomaly" we mean: regular, expected exceptions are relatively reassuring, while sudden ones are the real trouble. For example, if class-D exceptions suddenly arrive frequently within a short period, then even though class-D exceptions are not urgent, an anomaly in the monitoring curve itself calls for vigilance.
b. Custom trigger conditions
Besides the default alarm conditions configured when the system is built, custom trigger conditions configurable by the log administrator should also be provided.
There are many channels to choose from, such as email, SMS, WeChat, and telephone.
The push frequency can also be set per alarm level: low-risk alarms can be pushed once a day as a report, while high-risk alarms are pushed in a loop every 10 minutes until a handler manually switches the alarm off.
For pushes of log statistics, daily, weekly, monthly, and annual reports can be generated automatically and emailed to the relevant groups.
When an exception occurs, the system can call the ticket system's API to automatically create a bug ticket; when the ticket is closed, the result is fed back to the monitoring system, which records the exception's handling history for display in reports.
Most front-end code is minified and compressed before release, so reported stack information must be restored to source-code positions before the offending source can be located and fixed quickly.
At publish time, deploy only the JS bundles to the server and upload the sourcemap files to the monitoring system. When stack information is displayed in the monitoring system, it is decoded with the sourcemap to obtain the position in the original source.
There is one catch: the sourcemap must match the version running in production, and that version must in turn correspond to a commit in git; only then can stack traces be correctly decoded against the code of the version in which the problem occurred. This can be achieved by setting up CI tasks and adding a step to the integrated deployment process.
The essence of early warning is to preconfigure potentially abnormal conditions. When such a condition triggers, the exception has not actually occurred yet, so user behavior can be examined and corrected in time, preventing the exception or limiting its spread.
How is this done? It is essentially a process of statistics and clustering: aggregate historical exceptions along dimensions such as time, region, and user; find the patterns; have an algorithm add those patterns to the early-warning conditions automatically; and issue a timely warning the next time one is triggered.
Automatically repair errors. For example, if the front end expects an interface to return a number but the interface returns a numeric string, there can be a mechanism by which the monitoring system sends the correct data-type model to the back end, and the back end coerces each field to the model's type when returning data.
Write exception test cases and add them, along with exception-test users, to the automated test system. Each time a new exception is found during testing or operation, it is added to the existing list of exception cases.
Simulate the real environment: have simulated users perform random operations in an emulator, with automated scripts generating and executing random action sequences.
Define what counts as an exception, for example a pop-up box containing specific content. Recording these test results and then running clustering and statistical analysis on them is also very helpful for preventing exceptions.
A user may log in on different terminals, and has different states before and after login. A requestId generated by a specific algorithm can identify a series of operations by one user on a single client; from the log sequence, the user's exact path to the exception can be sorted out.
The front end is packaged as a library: a single global include covers most of the logging, storage, and reporting, while specific methods can be called in special logic to record logs.
The back end is decoupled from the application's own business code and can be an independent service that interacts with third-party applications through interfaces. With integrated deployment, the system can be extended and migrated at any time.
The whole system should be extensible, serving not just a single application: multiple applications can run on it at once, and all applications under one team can be managed on the same platform.
Different people have different permissions in the log system: a visitor can view only their own applications, particularly sensitive statistics can be given separate permissions, and sensitive data can be masked.
Exception monitoring mainly targets code-level errors, but performance anomalies deserve attention too. Performance monitoring mainly includes:
The back-end API has a great impact on the front end: although the front-end code controls the logic, the data the back end returns is its foundation, so monitoring of APIs can be divided into:
Sensitive data must not be collected by the log system. Log-system storage is relatively open, and however important the data inside, most log systems do not store it confidentially; so if the application involves sensitive data, it is best to:
This article has examined the overall framework of front-end exception monitoring without going into specific technical implementations. It touches both front-end and back-end territory and related topics, focusing mainly on the front end; there are parts that overlap with back-end monitoring and parts that diverge. The monitoring requirements and strategy for any given project must be worked out through continuous practice within that project.