Good evening

At some point I needed to check whether one of my services was consistently sending data to and receiving data from a certain address. I didn’t care about the contents; I just needed to check for gaps in its network activity.

# Gathering data

So, let’s use tcpdump, probably the most classic Linux tool for capturing raw network packets.

The capture should record packets going to and from address 111.222.333.444 on port 12345, on every interface. The data is stored in the dump12345 file.

• You may replace any, the value of the -i option, with the name of the interface if you know it.
• The -tttt option is important because it will make tcpdump write both date and time in the format that we will be parsing in this tutorial.
• The -n option ensures that addresses won’t be resolved to hostnames. You may omit it if you want.
• Hint! You may optionally add -A. Then the output will also contain the bodies of the packets. It is not required for this tutorial; the parser we are about to create will filter out the body lines, since they don’t contain dates. If you don’t need the bodies, it is better to leave -A out for now: the dump file will be smaller.
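The capture command described above can be sketched as follows. It is written as a small Python helper so it matches the rest of the tutorial. The helper name build_tcpdump_command and the redirection into dump12345 are assumptions for illustration; the flags are the ones discussed in the list above.

```python
import shlex

def build_tcpdump_command(address, port, interface="any"):
    """Assemble the tcpdump arguments discussed above (hypothetical helper)."""
    return [
        "tcpdump",
        "-i", interface,   # "any" captures on every interface
        "-tttt",           # print date and time on every packet line
        "-n",              # do not resolve addresses to hostnames
        "host", address, "and", "port", str(port),  # capture filter
    ]

cmd = build_tcpdump_command("111.222.333.444", 12345)

# Assumption: the text output is redirected into the dump file by the shell.
print(shlex.join(cmd) + " > dump12345")
```

The printed line can be pasted into a terminal. tcpdump usually needs root privileges, so you will probably have to run it with sudo.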

# Processing data

I use a Jupyter notebook to process the data; it’s much simpler to do it this way than with a script.

## Imports

So, I’m importing the required modules in the first cell. Make sure you have all the required libraries installed. You can find the installation guides on their respective pages. Or you can simply install the Anaconda distribution, which has them all preinstalled (I think?).
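Here is a minimal sketch of such a first cell. The list of modules is an assumption based on the tools mentioned in this article (pandas, seaborn, matplotlib, plus the standard re module); it first checks that the third-party libraries are installed and only then imports them.

```python
import importlib.util
import re  # standard library, used later for parsing

# Assumed third-party requirements for this tutorial.
REQUIRED = ["pandas", "matplotlib", "seaborn"]

# Collect the names of libraries that cannot be found in this environment.
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]

if missing:
    print("Please install first:", ", ".join(missing))
else:
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
```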

## Parsing the dump file

Now let’s process the file itself. Basically, all we need is the dates and times. We can extract them using a regular expression. You may try extracting them with plain Python string methods, like split and join, if you feel like it.
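A minimal sketch of this step is shown below. The regular expression assumes the timestamp layout that tcpdump prints with -tttt (date, then time with fractional seconds); the function name count_per_minute and the sample lines are made up for illustration.

```python
import re
import pandas as pd

# With -tttt a packet line starts like "2018-11-23 20:19:03.123456 IP ...".
# The capture group keeps date, hour and minute and drops the seconds.
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+\s")

def count_per_minute(lines):
    """Return a Series of packet counts indexed by minute, sorted by time."""
    # Lines without a leading timestamp (for example packet bodies) are skipped.
    minutes = [m.group(1) for m in map(TIMESTAMP_RE.match, lines) if m]
    stamps = pd.to_datetime(pd.Series(minutes, dtype="object"),
                            format="%Y-%m-%d %H:%M")
    return stamps.value_counts().sort_index()

# Made-up sample input; the second line imitates a packet body printed by -A.
sample = [
    "2018-11-23 20:19:03.100000 IP 10.0.0.1.40000 > 10.0.0.2.12345: Flags [P.], length 10",
    "E..4..@.@.....",
    "2018-11-23 20:19:45.200000 IP 10.0.0.2.12345 > 10.0.0.1.40000: Flags [.], length 0",
    "2018-11-23 20:20:01.300000 IP 10.0.0.1.40000 > 10.0.0.2.12345: Flags [P.], length 7",
]
a = count_per_minute(sample)
```

For the real dump you would pass the open file instead of the sample list, for example `with open("dump12345", errors="replace") as f: a = count_per_minute(f)`.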

The result of this is a series holding the number of messages transferred within each minute. A Series in pandas is simply a list of index-value pairs. We have also sorted it by index, which is the date and time.

...
2018-11-23 20:19:00     106
2018-11-23 20:20:00      40
2018-11-23 20:21:00     138
2018-11-23 20:22:00      41
2018-11-23 20:23:00      76
2018-11-23 20:24:00     128
2018-11-23 20:25:00      64
2018-11-23 20:26:00      80
2018-11-23 20:27:00      32
2018-11-23 20:28:00      72
2018-11-23 23:30:00    1281
2018-11-23 23:31:00    1171
2018-11-23 23:32:00    1130
2018-11-23 23:33:00     932
2018-11-23 23:34:00     569
2018-11-23 23:35:00     464
...


Oh, by the way, if you try to display your data (by simply typing the variable name a in a cell and invoking the cell), it will show only some of the entries. If you want to display the whole dataset, you need to lift the pandas limit on the number of displayed rows.
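One way to do that is the display.max_rows option of pandas. The sketch below uses a made-up series of 200 values in place of the real data and lifts the limit only temporarily, inside a with block.

```python
import pandas as pd

# Stand-in for the real series: 200 entries, more than pandas shows by default.
a = pd.Series(range(200))

# Default behaviour: the text representation is truncated in the middle.
short_text = repr(a)

# Setting display.max_rows to None removes the limit inside the with block.
with pd.option_context("display.max_rows", None):
    full_text = repr(a)

print(len(short_text.splitlines()), "lines vs", len(full_text.splitlines()), "lines")
```

In a notebook you would call display(a) or print(a) inside the with block instead of building the strings.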

## Creating the source dataframe

Now let’s split the date and time into two separate columns. We will create a DataFrame for that, which is essentially a table. It will have 3 columns: date, time, and the number of messages transferred within the given minute.
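This step might look like the sketch below. The column names date, time and messages are assumptions, and the small series a stands in for the per-minute counts parsed earlier.

```python
import pandas as pd

# Stand-in for the parsed series: counts indexed by minute.
a = pd.Series(
    [106, 40],
    index=pd.to_datetime(["2018-11-23 20:19", "2018-11-23 20:20"]),
)

# Format the index twice, once as a date and once as a time of day.
df = pd.DataFrame({
    "date": a.index.strftime("%Y-%m-%d"),
    "time": a.index.strftime("%H:%M"),
    "messages": a.values,
})
```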

## A list of unique dates

We have split time and date because we will create several plots, one for each date. Now let’s get the ordered list of the dates present in our data. It is likely to be pretty short.
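A short sketch of this step, assuming the dates live in a column called date (a hypothetical name) and are stored as YYYY-MM-DD strings, so that sorting them as text also sorts them chronologically:

```python
import pandas as pd

# Made-up table in the same shape as the source DataFrame.
df = pd.DataFrame({
    "date": ["2018-11-24", "2018-11-23", "2018-11-23"],
    "time": ["00:01", "20:19", "20:20"],
    "messages": [5, 106, 40],
})

# unique() drops the duplicates, sorted() puts the dates in order.
unique_dates = sorted(df["date"].unique())
```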

## DataFrame full of zeroes

There is one problem with the data we have so far. As you can see in the example above, there is a gap between 20:28 and 23:30. If we create the plot using this data, the bars for these two minutes will be placed right next to each other. I’m looking for temporal gaps in data transfer, if you remember, and such a visualization won’t help me much. We need the missing minutes to be in the source data with the amount set to 0. So the first thing is to create the “scaffolding”, which is just a DataFrame that has all the minutes of the days present in unique_dates.
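The scaffolding could be built roughly like this. The column names and the name scaffold are assumptions; unique_dates is filled with made-up values here so that the example runs on its own.

```python
import pandas as pd

unique_dates = ["2018-11-23", "2018-11-24"]  # made-up example dates

# All 1440 minutes of a day as "HH:MM" strings.
minutes_of_day = [f"{h:02d}:{m:02d}" for h in range(24) for m in range(60)]

# One row per (date, minute) pair, with the message count preset to 0.
scaffold = pd.DataFrame(
    [(date, time, 0) for date in unique_dates for time in minutes_of_day],
    columns=["date", "time", "messages"],
)
```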

Now, let’s fill it in with the real counts, so that only the missing minutes stay at 0. Note that this one is pretty slow, but I haven’t found a faster solution.
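One possible way to do this, which is not necessarily the approach used for this article, is a left merge of the scaffolding with the real data followed by fillna. The sketch below uses made-up data and assumed column names; on a few days of data a merge like this should finish quickly.

```python
import pandas as pd

# Scaffolding for one made-up day: every minute, count preset to 0.
minutes_of_day = [f"{h:02d}:{m:02d}" for h in range(24) for m in range(60)]
scaffold = pd.DataFrame({"date": "2018-11-23", "time": minutes_of_day, "messages": 0})

# Real counts, present only for some minutes.
df = pd.DataFrame({
    "date": ["2018-11-23", "2018-11-23"],
    "time": ["20:19", "20:20"],
    "messages": [106, 40],
})

# Keep every scaffold row; minutes without real data get NaN, then 0.
filled = scaffold[["date", "time"]].merge(df, on=["date", "time"], how="left")
filled["messages"] = filled["messages"].fillna(0).astype(int)
```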

Took a while, but it’s done! We have our data now. Let’s move to the final step.

# Plotting

We’re going to use the seaborn library to create the barplot. It is built on top of matplotlib, so it’s possible to manipulate the underlying parameters as well. Both of these libraries are very confusing, so I created several parameters at the top of the cell to control various elements of the plot: font sizes, x-axis label frequency, image size, etc.

This one also takes a while and I’m not sure how it can be optimized. But still, it does what I want it to do.
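A plotting cell along these lines might look like the sketch below. The parameter names, their values, the column names and the function plot_day are all assumptions for illustration; only seaborn’s barplot and the basic matplotlib axes calls are used.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Tunable parameters kept at the top of the cell (hypothetical values).
FIG_SIZE = (40, 8)   # image size in inches
LABEL_EVERY = 60     # put an x-axis label on every 60th minute
FONT_SIZE = 14

def plot_day(day_df, date):
    """Draw one bar per minute for a single day and return the figure."""
    fig, ax = plt.subplots(figsize=FIG_SIZE)
    sns.barplot(data=day_df, x="time", y="messages", color="steelblue", ax=ax)
    # Label only a subset of the minutes so the axis stays readable.
    ticks = list(range(0, len(day_df), LABEL_EVERY))
    ax.set_xticks(ticks)
    ax.set_xticklabels(day_df["time"].iloc[ticks].tolist(),
                       rotation=90, fontsize=FONT_SIZE)
    ax.set_title(date, fontsize=FONT_SIZE)
    return fig

# Made-up data: two hours of one day, with traffic only in the first hour.
times = [f"{h:02d}:{m:02d}" for h in range(2) for m in range(60)]
day_df = pd.DataFrame({"time": times, "messages": [5] * 60 + [0] * 60})
fig = plot_day(day_df, "2018-11-23")
```

With several days in the data, you would call plot_day once per entry of unique_dates, passing only the rows for that date, and save or display each figure separately.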

And here’s one of the resulting plots. I’ve scaled it down significantly for this article; in reality the image is about 8 times larger. If there are several days in your data, the plot for each day will appear in a separate image.

So, we’ve created a plot that shows the network activity we got from tcpdump. In this example there are apparent gaps, which is exactly what I wanted to find out.