Cluster Health Monitoring

1C:Enterprise is used to automate a wide variety of tasks, and reliability matters for every one of them. However, there are two areas of 1C:Enterprise application where the system's reliability is not just desirable but critical: corporate implementations and cloud services.

In these areas, we see two directions for improving reliability:

  1. Improving quality by reducing the number of errors. This applies both to the platform and to application solutions;
  2. Increasing the system's resilience to the consequences of errors.

We are working in both directions, and in this article we want to tell you about the next step in direction number 2.

This step is to increase the resilience of the 1C:Enterprise server to errors that may occur in its worker processes. Such errors can be of various kinds: they may be caused by incorrect platform operation, or they may result from incorrect application code executed by server worker processes.

Errors in worker processes lead to a number of different problems. We could have built a separate mechanism to eliminate each individual issue, but we decided to try to create a comprehensive solution right away. Its working title is the monitoring system. We realize the name is not entirely specific, but we have settled on it.

The essence of the monitoring system can be described with a phrase from a well-known joke: "In Odessa, what is picked up quickly is not considered to have fallen." Seriously speaking, though, the monitoring system has to detect a problem in a timely manner and fix it automatically.

We have implemented the monitoring system in the server agent process. It polls the cluster processes every 10 seconds. A cluster can include several working servers, each managed by its own server agent, so the cluster processes are polled only by the agent that controls the central server.

All processes running in the cluster are polled: cluster managers and worker processes. Processes running on other working servers are polled through the agents of those servers, so the operability of the agents themselves is checked as well.
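As a rough illustration, the polling cycle can be sketched as follows. This is a minimal sketch, not platform code: `list_cluster_processes` and `is_healthy` are hypothetical stand-ins for the agent's internal mechanics.

```python
import time

POLL_INTERVAL_S = 10  # the agent polls the cluster processes every 10 seconds

def poll_cluster_once(processes, is_healthy):
    """One polling pass: return the processes that failed their check."""
    return [p for p in processes if not is_healthy(p)]

def monitoring_loop(list_cluster_processes, is_healthy, cycles):
    """Illustrative loop as it might run inside the central server's agent.
    The real agent runs continuously; `cycles` just bounds this sketch."""
    detected = []
    for _ in range(cycles):
        detected.extend(poll_cluster_once(list_cluster_processes(), is_healthy))
        time.sleep(POLL_INTERVAL_S)
    return detected
```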

The monitoring system checks each process against the following criteria:

  • Connection establishment; a connection must be established within 20 seconds;
  • Standard request (tests execution speed, database connections, and disk operations);
  • The amount of memory used by the process;
  • The number of errors per number of requests (the ratio of EXCP-type events to CALL-type events in the technological log, per minute);
  • Termination of processes removed from the cluster registry; such processes must finish within 20 minutes.
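A simplified sketch of how such per-process checks might be combined. The field names and the function are illustrative assumptions, not the platform's actual implementation, and the Standard request check is omitted for brevity:

```python
from dataclasses import dataclass
from typing import List, Optional

CONNECT_LIMIT_S = 20        # a connection must be established within 20 s
SHUTDOWN_LIMIT_S = 20 * 60  # removed processes must finish within 20 min

@dataclass
class ProcessSnapshot:
    # Illustrative fields only; not the platform's actual data structures.
    connect_seconds: float                      # time to establish a test connection
    memory_bytes: int                           # memory used by the process
    errors_per_request: float                   # EXCP events per CALL event, per minute
    removed_from_registry_at: Optional[float]   # timestamp of removal, or None

def find_problems(p: ProcessSnapshot, memory_limit: int,
                  error_threshold: float, now: float) -> List[str]:
    """Return the names of the criteria this process violates."""
    problems = []
    if p.connect_seconds > CONNECT_LIMIT_S:
        problems.append("connection")
    if p.memory_bytes > memory_limit:
        problems.append("memory")
    if p.errors_per_request > error_threshold:
        problems.append("errors")
    if (p.removed_from_registry_at is not None
            and now - p.removed_from_registry_at > SHUTDOWN_LIMIT_S):
        problems.append("termination")
    return problems
```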

The results of these checks are recorded in the technological log.

To customize the errors-per-requests criterion, we introduced a new option, Tolerance to the number of server errors. It is set as a percentage of the average value across the other processes. For example, suppose you set it to 50, and the average number of errors per request per minute over the last 5 minutes was 100. Then processes that caused more than 150 errors per request per minute will be recognized as problematic.
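The threshold arithmetic from that example can be written down directly; `problem_threshold` is just a name for this sketch, not a platform function:

```python
def problem_threshold(average_errors: float, tolerance_percent: float) -> float:
    """Errors-per-request-per-minute level above which a process is
    recognized as problematic, per the rule described above."""
    return average_errors * (1 + tolerance_percent / 100)

# The article's example: tolerance is 50, and the average across the other
# processes over the last 5 minutes is 100 errors per request per minute.
threshold = problem_threshold(100, 50)  # -> 150.0
```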

A process can be recognized as problematic according to other criteria as well (all except the Standard request criterion). The monitoring system can terminate problem processes on its own, creating a memory dump of the process beforehand. This feature is enabled by the Force terminate problem processes option.
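The order of operations matters here: the dump is taken before termination, so the state that caused the problem is preserved for analysis. A minimal sketch, where `write_memory_dump` and `terminate` are hypothetical callables standing in for platform mechanics:

```python
def force_terminate(process_id, write_memory_dump, terminate):
    """Dump first, then terminate (sketch of Force terminate problem
    processes behavior with hypothetical callables)."""
    dump_path = write_memory_dump(process_id)  # dump is created beforehand
    terminate(process_id)                      # only then is the process killed
    return dump_path
```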

So that you can manage the monitoring system options both interactively and programmatically, we made the necessary improvements to the client-server administration utility, added new methods to the V83.COMConnector object, added new parameters to the cross-platform cluster administration interface, and added new events to the technological log.

As the monitoring system comes into use, it will be important for us to understand how well we have chosen the strategy for determining the "health" of a worker process: whether there are false positives, and whether real problems are detected reliably. In this sense, the hardest parameter for us is the tolerance to the number of server errors, because it is an indirect way of assessing health. Together with you, we will see how well it works.

The monitoring system is not the only solution aimed at improving reliability. In the near future, we will talk about another improvement that will make the cluster behave more predictably when network connections are broken.
