At some point I needed to check whether one of my services was sending and receiving data to and from a certain address consistently. I didn’t care about the contents, I just needed to check for the gaps in its network activity.
So, let’s use `tcpdump`, probably the most classic Linux tool for capturing raw TCP traffic.
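The command I ran was along these lines (a sketch: the address and port are the article’s placeholders, and `dump.txt` is an assumed output filename — adjust all of them to your case):

```shell
# Capture traffic to/from one host and port on all interfaces,
# with full date+time stamps (-tttt) and no hostname resolution (-n),
# writing the text output to a file.
# (111.222.333.444 is a placeholder -- substitute a real address.)
tcpdump -i any -tttt -n host 111.222.333.444 and port 12345 > dump.txt
```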
This captures packets going to and from address 111.222.333.444 on port 12345 on every interface. The captured lines are written to a dump file.
- You may replace `any` with the name of a specific interface if you know it.
- The `-tttt` option is important: it makes `tcpdump` print both the date and the time, in the format we will be parsing in this tutorial.
- The `-n` option ensures that addresses won’t be resolved to hostnames. You may omit it if you want.
- Hint! You may optionally add `-A`; the output will then also contain the packet bodies. It is not required for this tutorial, and the parser we are about to create will filter out those body lines anyway, since they don’t contain dates. Omitting it also keeps the dump file smaller.
I use a Jupyter notebook to process the data; it’s much simpler than doing it with a standalone script.
So, I’m importing the required modules in the first cell. Make sure you have all the required libraries installed; you can find the installation guides on their respective pages. Or you can simply install the Anaconda distribution, which comes with all of them preinstalled.
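In my case, the first cell looks roughly like this:

```python
import re                        # timestamp extraction
import datetime                  # date/time handling

import pandas as pd              # Series/DataFrame processing
import matplotlib.pyplot as plt  # underlying plotting library
import seaborn as sns            # bar plots
```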
Parsing the dump file
Now let’s process the file itself. Basically, all we need are the dates and times, and we can extract them with a regex. You may try extracting them with plain Python string tools, like `join`, etc., if you feel like it.
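Here’s a sketch of that cell. The regex matches the `-tttt` timestamp at the start of each packet line, truncated to the minute; the `dump.txt` filename and the helper’s name are my own assumptions:

```python
import re
import pandas as pd

# Each packet line of `tcpdump -tttt` output starts with a timestamp like:
# 2018-11-23 20:19:00.123456 IP 1.2.3.4.12345 > 5.6.7.8.54321: ...
ts_re = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+")

def minute_counts(lines):
    """Count captured packets per minute; lines without a timestamp
    (e.g. packet bodies printed by -A) are skipped."""
    minutes = [m.group(1) + ":00" for line in lines
               if (m := ts_re.match(line))]
    a = pd.Series(minutes).value_counts().sort_index()
    a.index = pd.to_datetime(a.index)
    return a

# In the notebook the lines come straight from the dump file:
# with open("dump.txt") as f:
#     a = minute_counts(f)
```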
The result of this is a series of the number of messages transferred within each minute. A `Series` in pandas is essentially a list of index-value pairs; we have also sorted ours by index, which here is the date and time.
```
...
2018-11-23 20:19:00     106
2018-11-23 20:20:00      40
2018-11-23 20:21:00     138
2018-11-23 20:22:00      41
2018-11-23 20:23:00      76
2018-11-23 20:24:00     128
2018-11-23 20:25:00      64
2018-11-23 20:26:00      80
2018-11-23 20:27:00      32
2018-11-23 20:28:00      72
2018-11-23 23:30:00    1281
2018-11-23 23:31:00    1171
2018-11-23 23:32:00    1130
2018-11-23 23:33:00     932
2018-11-23 23:34:00     569
2018-11-23 23:35:00     464
...
```
Oh, by the way, if you try to display your data (by simply typing `a` in a cell and running it), it will show only some of the entries. If you want to display the whole dataset, you need to lift pandas’ display limit first.
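One common way to do that is via pandas’ display options:

```python
import pandas as pd

# Lift the row limit so long Series/DataFrames are printed in full.
pd.set_option("display.max_rows", None)
```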
Creating the source dataframe
Now let’s split the date and time into two separate columns. We will create a `DataFrame` for that, which is essentially a table. It will have three columns: the date, the time, and the number of messages transferred within the given minute.
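A sketch of that cell, assuming the per-minute Series from the previous step is called `a` (the column names are my own choice):

```python
import pandas as pd

def split_datetime(a):
    """Turn a minute-indexed Series into a date/time/amount table."""
    return pd.DataFrame({
        "date": a.index.date,    # datetime.date per row
        "time": a.index.time,    # datetime.time per row
        "amount": a.values,      # messages in that minute
    })
```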
We have split the time and date because we will create a separate plot for each date. Now let’s get the ordered list of unique dates present in our data; it is likely to be pretty short.
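One way to get that list, assuming the table from the previous step with its `date` column:

```python
import pandas as pd

def unique_dates(df):
    """Sorted list of the distinct dates in the source table."""
    return sorted(df["date"].unique())
```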
Dataframe full of zeroes
There is one problem with the data we have so far. As you can see in the example above, there is a gap between 20:28 and 23:30. If we create the plot from this data as is, those bars will simply end up next to each other. I’m looking for temporal gaps in the data transfer, if you remember, and such a visualization won’t help me much. We need the missing minutes to be present in the source data with the amount set to 0. So the first step is to create the “scaffolding”: a dataframe that contains every minute of each day present in our data.
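A sketch of such a scaffolding, matching the date/time/amount columns used above:

```python
import datetime
import pandas as pd

def scaffolding(dates):
    """A table with every minute of every given day, all amounts zero."""
    rows = [(d, datetime.time(h, m), 0)
            for d in dates
            for h in range(24)
            for m in range(60)]
    return pd.DataFrame(rows, columns=["date", "time", "amount"])
```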
Now, let’s fill it with the real counts to get the zero-padded data. Note that this step is pretty slow, but I haven’t found a faster solution.
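A plausible shape for that update, done row by row (which is exactly why it’s slow); `scaffold` is the all-zeroes table and `source` is the table with the real counts:

```python
import pandas as pd

def fill_counts(scaffold, source):
    """Copy the real per-minute counts into the zero-filled scaffold."""
    filled = scaffold.copy()
    for _, row in source.iterrows():
        # Find the scaffold row for this exact date and minute.
        mask = (filled["date"] == row["date"]) & (filled["time"] == row["time"])
        filled.loc[mask, "amount"] = row["amount"]
    return filled
```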
Took a while, but it’s done! We have our data now. Let’s move to the final step.
We’re going to use the `seaborn` library to create the bar plot. It is built on top of `matplotlib`, so it’s possible to manipulate the underlying `matplotlib` parameters as well. Both of these libraries can be pretty confusing, so I defined several parameters at the top of the cell to control various elements of the plot: font sizes, x-axis label frequency, image size, etc.
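A sketch of the plotting cell, assuming the zero-filled date/time/amount table from above; the parameter names at the top and the output filename pattern are my own:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Tweakable plot parameters (assumed names -- tune to taste).
FIG_SIZE = (30, 10)
FONT_SIZE = 14
X_LABEL_EVERY = 30   # show every 30th minute on the x axis

def plot_day(day_df, date):
    """Bar plot of per-minute message counts for a single day."""
    fig, ax = plt.subplots(figsize=FIG_SIZE)
    sns.barplot(x="time", y="amount", data=day_df, ax=ax, color="steelblue")
    # Thin out the x tick labels so 1440 minutes stay readable.
    for i, label in enumerate(ax.get_xticklabels()):
        label.set_visible(i % X_LABEL_EVERY == 0)
    ax.set_title(str(date), fontsize=FONT_SIZE)
    ax.tick_params(labelsize=FONT_SIZE)
    fig.savefig(f"activity_{date}.png")
    plt.close(fig)
```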
This one also takes a while, and I’m not sure how it can be optimized, but it does what I want it to do.
And here’s one of the resulting plots. I’ve scaled it down significantly for this article; the real image is about 8 times larger. If there are several days in your data, each day’s plot will appear on a separate image.
And that’s about it.
So, we’ve created a plot of the network activity captured by `tcpdump`. In this example there are apparent gaps, which is exactly what I wanted to find out.
This method is pretty slow and consumes quite a lot of RAM. If you manage to optimize this routine somehow, let me know; I’m curious!
Of course, I could have used Filebeat, Logstash, and Elasticsearch with Kibana, but deploying and configuring that stack for such a small, short-term task would be overkill. If I had to monitor logs constantly, it might be viable. By the way, I might write a tutorial on setting up the Elastic stack in Docker for use with Python applications sometime soon, so stay tuned!
As always, thanks for stopping by, and I’ll see you eventually.