Establishing causality

Establishing causality is one of the most essential tasks of Monitoring and Observability. Let's look at what causality is, why it's vital, and how you can establish it by observing telemetry behavior.

A few words about some basics.

Before venturing into the more advanced topic of establishing causality, let's define some basic terms we will use later in our talk.

What is causality?

Causality is an abstraction that explains why a complex system changes states and ends up in a specific set of conditions at each moment. Causality is a true-or-false fact that one event influences another. Or, to put it differently, causality is a logical measure of the fact that an event "A" affects another event "B" to the point where we can truthfully state that "B" is happening because of "A." In finite-state-machine terms, causality behaves like a non-deterministic automaton: as the definition implies, it can lead to one, several, or no transitions from the current state to another in the finite state machine we are observing.

Observability and causality

What is the challenge of the observability and causality relationship? Daily, management personnel use monitoring and observation instrumentation to control the systems they are responsible for. One of the goals of control is to understand the behavior of the system at each and every moment. And this understanding requires an in-depth understanding of the causality for most and, ideally, every state of the system. So, imagine you are trying to decipher a complex finite state machine using only a limited set of known states and transitions. By studying the behavior of the available telemetry data, you are trying to fill the gaps; in other words, to comprehend not only which state your system is currently in but also how it got there. For a large and complex system, this is a very tough task.

Why is establishing causality necessary?

As I mentioned above, as far as control goes, your environment and the systems you control are nothing but large and complex finite-state machines. Mathematically speaking, we can define a finite state machine as an abstract machine that can be in one (or, for non-deterministic machines, more than one) of a finite number of states at any given time. A state in a finite state machine can be defined as a set of conditions describing the machine at a specific moment. Moving between states is called a transition. With an understanding of the causality, which includes the original state, the transition, and the current state, it is easier to control the system.

Unfortunately, many monitoring practitioners have opted for more straightforward threshold monitoring, letting the observability platform detect when some threshold is crossed and defining a reaction to that threshold. While this approach solves some simple and practical tasks, it does not bring its adepts to where they truly want to be: in control instead of reacting.

Let's look at an example. One of those days for our SRE starts with a message that the primary database is down. Without any causality context, he finds several other down conditions. Trying to reach a resolution quickly, he tries to restart the database. He tries and fails. Next, he finds that the free disk space on the database server is zero. Relieved, he ssh-es onto the DB server, discovers a lot of temp files on that partition, clears them up, restarts the database, and now: success. He closes the ticket. In 15 minutes, the database is down again, and free space is, again, zero.

To cut a long story short, he later found that the temp files were legit, produced by a data ingestion procedure. The ingestion took its source data from another server, where this data was collected from a third-party provider. Due to a bug in an Ansible recipe, the permissions on the folder where the source files were stored were set incorrectly, so processed files were not removed by the ingestion procedure. The ingestion kept re-importing the old data alongside the new, creating too many temp files and crashing the database. And what was the causality? A change in an Ansible recipe, wrong permissions, excessive files in the ingestion IN queue, an excessive number of temp files produced by ingestion, leading to an overuse of disk space, which caused the database to crash.

If you keep your observability platform from helping you establish a relationship between the current and previous states, you have to solve this kind of puzzle every time as if it were new. That leads to more downtime, decreased SLAs, and all sorts of losses. So, establishing causality is vital.

In search of causality

There are several ways to establish causality. I will review two of them. One could be called "user-defined," and the other "searching for causality through patterns of observation." But do not limit yourself to a single method when looking for the best way to detect the root cause in your case. Moreover, remember that the best way is sometimes a combination of methods.

Expert systems

The IT industry has been using expert systems for quite a while. The idea behind an expert system is that knowledge is represented as "if-then" clauses called rules and "assign" clauses called facts. For example, expressed in CLIPS syntax:

(deffacts health-facts
  (MyTemperatureCelsius 38.2))

(defrule fever-check
  (MyTemperatureCelsius ?temp)
  (test (> ?temp 36.6))
  =>
  (printout t "You have a fever: " ?temp crlf))

A combination of statically defined rules and facts is called a "Knowledge Base." In numerous instances, the statically defined Knowledge Base is prepared by a human being who is an expert in that specific field. The other part of the expert system is called the "Inference Engine." The idea behind this software is to produce new facts and rules by applying the Inference Engine's rules to the rules and facts of the Knowledge Base. So the combination of the Knowledge Base and the Inference Engine gives you an expert-driven, extendable system that you can use to detect causality in your environment.

If you need help figuring out where to start with an expert system, I recommend starting with that book. And I can wholeheartedly recommend http://www.clipsrules.net.


What are the benefits and disadvantages of expert systems? First, the Knowledge Base is both their strength and their weakness. If the expert who made the rules and facts for you falls short of being a real expert, the results produced by the system will be inherently flawed. But if the Knowledge Base is good, you will get good, predictable results, bringing you well-detected causality. Second, the cost of maintaining an expert system is high: you must keep the existing rules fresh and accurate while adding new ones. Otherwise, the expert system is a great tool.

Causality detection through telemetry pattern observation

Nothing in complex systems happens "in a vacuum." Spikes in cluster CPU utilization may be tied to load spikes. Abnormalities in network telemetry could result from security-related events that you can detect through patterns in related telemetry. But how can we combine the observed samples across all observed telemetry and use the detected patterns to establish causality?

Telemetry Observation Matrix

The first step is to create a Telemetry Observation Matrix. This matrix is different from the Telemetry Matrix discussed in other chapters. While the Telemetry Matrix is a 2-dimensional structure, where the columns are the sources and the rows are the telemetry types, representing a momentary state of the telemetry produced by the system, the Telemetry Observation Matrix is a 3-dimensional matrix. The x-axis (or columns) holds the sources, the y-axis (or rows) holds the telemetry types, and the z-axis is a vector of telemetry data samples. The size of the x-axis equals the number of sources you have, the size of the y-axis equals the number of telemetry types your system has, and the size of the z-axis equals the size of the "Observability Horizon."
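As a minimal sketch, assuming NumPy and made-up dimensions, the Telemetry Observation Matrix can be represented as a plain 3-dimensional array:

import numpy as np

# Hypothetical dimensions; adjust to your environment.
n_sources = 50          # x-axis: telemetry sources (hosts, services, ...)
n_telemetry_types = 20  # y-axis: telemetry types (CPU, memory, latency, ...)
horizon = 288           # z-axis: Observability Horizon, e.g. 24h of 5-minute samples

# Telemetry Observation Matrix: one vector of historical samples per (source, type).
observation_matrix = np.zeros((n_sources, n_telemetry_types, horizon))

def push_sample(matrix, source_idx, type_idx, value):
    # Shift the horizon one step to the left and store the newest value at the end.
    matrix[source_idx, type_idx, :-1] = matrix[source_idx, type_idx, 1:]
    matrix[source_idx, type_idx, -1] = value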

Observability Horizon

Observability Horizon is the number of telemetry data samples you will use to observe and detect patterns in the data.

The more data samples you have for review, the further back in time you can study the behavior. But more data will not necessarily guarantee that you will reliably detect a pattern. So, choosing the "Observability Horizon" is an empirical, sometimes trial-and-error process. I recommend setting the "Observability Horizon" as large as reasonably practicable within your computing capabilities and then subsampling smaller "Observability Horizons" from the larger dataset, as sketched below.
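Under the same assumptions as the previous sketch (a NumPy observation_matrix with the horizon on the z-axis), subsampling a smaller Observability Horizon is just a slice of the most recent samples:

# A smaller Observability Horizon is simply the most recent slice of the z-axis.
small_horizon = 60  # hypothetical: the last 60 samples
recent_view = observation_matrix[:, :, -small_horizon:]

# Trying several candidate horizons from the same large dataset:
for candidate in (30, 60, 120):
    view = observation_matrix[:, :, -candidate:]
    # ...run the pattern detection described below on view and compare the results...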

For the next trick, we will need a Neural Network

In the "Telemetry Observation Matrix," we have a very extended set of telemetry data associated with various sources, with the historical values of this data placed on a timeline. Now we have all the data we need for a pattern search. And for the pattern search, we will use a simple, forward-propagating neural network. A Feed Forward Network is one of the simplest neural networks: the data travels in one direction, and the connections between nodes do not form a cycle, so data never flows backward through the network. The variant we will use is a Perceptron with a single hidden layer, where the data received on the input layer of nodes is carried to the output layer through that hidden layer. The number of "neurons" in the input layer must match the number of data items in the Observability Horizon. So, if you are collecting telemetry data for a large Observability Horizon and planning to use segments of that data for a pattern search, you have to prepare a Perceptron with a configuration matching the size of your data set.
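As a minimal sketch, and only as one possible stand-in for such a network, here is a single-hidden-layer Perceptron set up with scikit-learn; the hidden-layer size and training parameters are assumptions:

from sklearn.neural_network import MLPRegressor

HORIZON = 5       # must match the arity of your (sub)sampled Observability Horizon
N_CATEGORIES = 3  # upswing, downswing, stable

# A feed-forward network with one hidden layer. The input size (HORIZON) and the
# output size (N_CATEGORIES) are inferred from the training data during fit().
perceptron = MLPRegressor(
    hidden_layer_sizes=(8,),  # assumption: a small hidden layer is enough here
    activation="logistic",
    solver="lbfgs",
    max_iter=5000,
    random_state=1,
)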


Next, you have to train your Perceptron. What is Perceptron training? You feed the Perceptron samples of data of the same arity as your Observability Horizon and specify how close each sample is to one of the patterns you are looking for. For example, let's say we are looking for three types of patterns in the data: upswing, where each next element value in the timeline is greater than the previous one; downswing, where the next element value is smaller; and stable, where the element values stay about the same. In our training data, we will use normalized data; I will explain the purpose of that later on. And now, let's look at a sample of training data.


[0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0]
[0.0, 0.2, 0.4, 0.6, 0.8] [1.0, 0.0, 0.0]
[0.8, 0.6, 0.4, 0.2, 0.0] [0.0, 1.0, 0.0]


What can we see when looking at this training data? On the left is an array of data samples with length (or arity) 5. Five is the size of the Observability Horizon in this sample. On the right side, we can see a pattern classification indicating how close this sample is to each of the pattern categories. Three numbers tell us that there are three categories: the first category is "upswing," the second is "downswing," and the third is "stable." Of course, just three samples are insufficient to train our Perceptron, so please prepare at least a few dozen normalized data samples for each category. And for each sample of the training data, you must specify how close this sample is to the category you are looking for. Before we continue, please note that a sample can match more than one category.

[0.4, 0.4, 0.4, 0.5, 0.5] [0.5, 0.0, 0.5]

For this training sample, it is difficult to say whether it represents a slight deviation from the "stable" category or is an "upswing." So, we can label it as a sample of two different classes with lower "certainty."

After you train your Perceptron with the prepared training dataset for each category, your Artificial Neural Network will be ready to detect patterns in your data.
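Continuing the scikit-learn sketch from above (the perceptron object, the exact numbers, and the final test sample are assumptions, and a real training set needs far more samples), training and a first prediction could look like this:

import numpy as np

# Training samples taken from the examples above; a real training set should
# contain at least a few dozen normalized samples per category.
X_train = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],  # stable
    [0.0, 0.2, 0.4, 0.6, 0.8],  # upswing
    [0.8, 0.6, 0.4, 0.2, 0.0],  # downswing
    [0.4, 0.4, 0.4, 0.5, 0.5],  # ambiguous: part upswing, part stable
])
y_train = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
])

perceptron.fit(X_train, y_train)

# Classify a new, already smoothed and normalized, sample.
sample = np.array([[0.1, 0.3, 0.5, 0.7, 0.9]])
print(perceptron.predict(sample))  # approximate proximity to (upswing, downswing, stable)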

Data smoothing and normalization

So, when you have a Telemetry Observation Matrix with telemetry data samples and a prepared and trained Perceptron, you are ready to feed your data to the Perceptron to see which categories it detects, right? But wait! First, let me remind you that you trained your Perceptron with normalized data. We will discuss what that is in a second. But first, let's talk about data smoothing. Data smoothing is an essential step that takes a vector of numbers as an input and produces a new vector of the same size but with smoothed values. This procedure allows us to reduce the variability of the telemetry values, which helps to determine the actual pattern in the data. But what is variability reduction? The idea is straightforward: by utilizing the Smoothed Moving Average (SMMA), you calculate a running mean for each element of your original data vector containing telemetry from the Observability Horizon. SMMA is an extension of the Simple Moving Average that blends each new value with the previously smoothed value, and one of its key benefits is that it effectively reduces noise in the data. So, reducing the data variability with the SMMA algorithm helps suppress values in your Observability Horizon that do not represent your pattern.
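A minimal sketch of one common SMMA formulation (seeding with the first raw value is an assumption, made so that the output has the same length as the input):

def smma(values, period):
    # Smoothed Moving Average: each smoothed value blends the previous smoothed
    # value with the newest raw sample, giving progressively less weight to
    # older data. Seeded with the first raw value so the output length matches
    # the input length.
    smoothed = [float(values[0])]
    for x in values[1:]:
        smoothed.append((smoothed[-1] * (period - 1) + x) / period)
    return smoothed

# Example: a noisy upswing becomes easier to classify after smoothing.
raw = [10, 14, 11, 18, 16, 22, 20, 27]
print(smma(raw, period=3))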

The next step, before you feed a telemetry data sample to the Perceptron, is called "Data Normalization." Min-Max data normalization produces a new vector whose values are derived from the original vector using Min-Max feature scaling. The practical outcome of the scaling is that all data in the sample is scaled to fit between 0 and 1.
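A minimal sketch of Min-Max feature scaling, continuing the previous snippet (the handling of a perfectly flat sample is an assumption):

def min_max_normalize(values):
    # Min-Max feature scaling: map every value into the [0, 1] range.
    lo, hi = min(values), max(values)
    if hi == lo:
        # A perfectly flat sample carries no shape information; map it to zeros.
        return [0.0] * len(values)
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_normalize(smma(raw, period=3)))  # smoothed, then normalized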

Why scale the values? When you want to see the shape or pattern of the sample, or if you're going to compare or match different samples with different scales, you have to normalize data to make it suitable for pattern matching.


And now we have an array, or vector, of values derived from the original Observability Horizon, but smoothed and normalized. At this point, we also have a Perceptron trained to recognize pattern categories. Everything is ready to produce the Telemetry Pattern Matrix.

Telemetry Pattern Matrix

The Telemetry Pattern Matrix is a 2-dimensional matrix in which the sources are arranged along the x-axis, or the columns, and the telemetry types along the y-axis, or the rows. Each data element of the Telemetry Pattern Matrix is a tuple, where each element indicates the proximity of the telemetry sample to the category with the same index in the tuple, as in the Perceptron training data.

The Telemetry Pattern Matrix is produced from the Telemetry Observation Matrix by feeding the telemetry data sample from each Observability Horizon vector to the trained Perceptron. The outcome, a tuple with proximity information to the known patterns, is stored in each "cell" defined by the source and telemetry type.
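Tying the earlier sketches together (observation_matrix, smma, min_max_normalize, perceptron, HORIZON, and N_CATEGORIES are all assumptions carried over from the previous snippets), building the Telemetry Pattern Matrix could look like this:

import numpy as np

def build_pattern_matrix(observation_matrix, perceptron, period=3):
    n_sources, n_types, _ = observation_matrix.shape
    pattern_matrix = np.zeros((n_sources, n_types, N_CATEGORIES))
    for s in range(n_sources):
        for t in range(n_types):
            # Take the most recent slice matching the arity the Perceptron was
            # trained with, then smooth and normalize it.
            vector = observation_matrix[s, t, -HORIZON:]
            prepared = min_max_normalize(smma(vector, period))
            # Tuple of proximities to (upswing, downswing, stable).
            pattern_matrix[s, t, :] = perceptron.predict([prepared])[0]
    return pattern_matrix

pattern_matrix = build_pattern_matrix(observation_matrix, perceptron)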

Based on the chosen Observability Horizon, we now have robust behavioral data for each telemetry item. With the help of the Telemetry Pattern Matrix, we can match different types of patterns against each other to see which behaviors happen simultaneously. And we can use the Telemetry Pattern Matrix to help us establish causality in our system.

And the next step is...

To help us visualize causality, we can build different heat maps from various pattern matches to detect whether we have clusters of behaviors that happen simultaneously; a minimal example follows below. The next step is to match known telemetry behaviors: we can mix and check different types of telemetry behavior programmatically while trying to match behaviors to groups of telemetry sources and data.
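A minimal sketch of such a heat map, assuming matplotlib and the pattern_matrix from the previous snippet:

import matplotlib.pyplot as plt

# Visualize how strongly each (source, telemetry type) cell matches one chosen
# pattern category, e.g. "upswing" (index 0 in the training tuples). Clusters
# of bright cells are candidates for a shared cause.
upswing_map = pattern_matrix[:, :, 0]

plt.imshow(upswing_map, aspect="auto", cmap="hot")
plt.xlabel("telemetry type")
plt.ylabel("source")
plt.title('Proximity to the "upswing" pattern')
plt.colorbar()
plt.show()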
