Good evening

At some point I needed to check whether one of my services was consistently sending data to and receiving data from a certain address. I didn’t care about the contents; I just needed to find the gaps in its network activity.

Gathering data

So, let’s use tcpdump, probably the most classic Linux tool for collecting raw TCP data.

tcpdump -i any -tttt -n "(dst host 203.0.113.44 and dst port 12345) or (src host 203.0.113.44 and src port 12345)" | tee dump12345

This captures packets going to and coming from address 203.0.113.44 and port 12345 on every interface (203.0.113.44 is a placeholder address; substitute your own). The data is stored in the dump12345 file.

  • You may replace any with the name of a specific interface if you know it.
  • The -tttt option is important: it makes tcpdump print both the date and the time in the format we will be parsing in this tutorial (see the sample line after this list).
  • The -n option ensures that addresses won’t be resolved to hostnames. You may omit it if you want.
  • Hint! You may optionally add -A, so that the output also contains the bodies of the packets. It is not required for this tutorial; the parser we are about to create filters out the data lines that don’t start with a timestamp. Omitting it also keeps the dump file smaller.
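For reference, a captured line looks roughly like this (an illustrative example, not taken from a real capture; the exact layout varies with the tcpdump version and the protocol, but the part we will rely on is the leading timestamp):

2018-11-23 20:19:00.123456 IP 10.0.0.1.54321 > 203.0.113.44.12345: Flags [P.], length 64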

Processing data

I use a Jupyter notebook to process the data; it’s much simpler to do it this way than with a script.

Imports

So, I’m importing the required modules in the first cell. Make sure you have all the required libraries installed; you can find installation instructions on their respective pages. Or you can simply install the Anaconda distribution, which ships with all of them preinstalled.

import re
from os import path
from datetime import datetime, timedelta

# Third-party imports
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

Parsing the dump file

Now let’s process the file itself. Basically, all we need is the dates and times, and we can extract them using a regex. You may also try extracting them with plain Python string tools, like split and slicing, if you feel like it (a split-based sketch follows the regex version below).

FOLDER = '/root'
FILENAME = 'dump12345'

with open(path.join(FOLDER, FILENAME), 'r') as f:

    # filter out lines that do not start with a timestamp (e.g. packet bodies from -A)
    patt = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ ')
    a = filter(patt.match, f)

    # get the timestamp truncated to the minute, as a "YYYY-MM-DD hh:mm" string
    p = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}')
    a = map(lambda x: p.match(x).group(0), a)

    # convert to datetime objects
    a = map(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M'), a)

    # creating the actual series of all datetimes
    a = pd.Series(a)
    # counting the number of entries for each minute. The index is the timestamp; sorting by it.
    a = a.value_counts().sort_index()
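As promised, here is roughly how the same filtering and extraction could be done without regexes (a sketch; the leading-digits check is cruder than the regex, so it may let odd lines through if you captured packet bodies with -A):

# a regex-free variant of the parsing above
minutes = []
with open(path.join(FOLDER, FILENAME), 'r') as f:
    for line in f:
        parts = line.split(' ', 2)
        # a -tttt line starts with "YYYY-MM-DD hh:mm:ss.ffffff ..."
        if len(parts) >= 2 and parts[0][:4].isdigit():
            minutes.append(f"{parts[0]} {parts[1][:5]}")  # keep "YYYY-MM-DD hh:mm"

a = pd.Series(pd.to_datetime(minutes)).value_counts().sort_index()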

The result of this is a series of the number of messages transferred within each minute. A Series in pandas is essentially a list of index-value pairs. Here the index is the date and time, and we have sorted the series by it.

...
2018-11-23 20:19:00     106
2018-11-23 20:20:00      40
2018-11-23 20:21:00     138
2018-11-23 20:22:00      41
2018-11-23 20:23:00      76
2018-11-23 20:24:00     128
2018-11-23 20:25:00      64
2018-11-23 20:26:00      80
2018-11-23 20:27:00      32
2018-11-23 20:28:00      72
2018-11-23 23:30:00    1281
2018-11-23 23:31:00    1171
2018-11-23 23:32:00    1130
2018-11-23 23:33:00     932
2018-11-23 23:34:00     569
2018-11-23 23:35:00     464
...

Oh, by the way, if you try to display your data (by simply typing a in a cell and running the cell), only some of the entries will be shown. If you want to display the whole dataset:

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(a)

Creating the source dataframe

Now let’s split the date and time into two separate columns. For that we will create a DataFrame, which is essentially a table. It will have three columns: date, time, and the number of messages transferred within the given minute.

# creating dataframe with separate fields for a day(date) and time
a = pd.DataFrame(data={'date':[i.date() for i in a.index],
                       'time':[i.time() for i in a.index],
                       'amount':list(a.values),
                      },columns=['date','time','amount'])
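With the sample values shown earlier, the corresponding rows of the new DataFrame would look roughly like this (the row indices are illustrative):

         date      time  amount
0  2018-11-23  20:19:00     106
1  2018-11-23  20:20:00      40
2  2018-11-23  20:21:00     138
...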

A list of unique dates

We have split time and date because we will create several plots, one for each date. Now let’s get the ordered list of the dates present in our data. It is likely to be pretty short.

# getting sorted unique dates. We'll need them later
unique_dates = a['date'].unique()
unique_dates.sort()

Dataframe full of zeroes

There is one problem with the data we have so far. As you can see in the example above, there is a gap between 20:28 and 23:30. If we create the plot from this data as-is, the bars for those two minutes will be drawn right next to each other. I’m looking for temporal gaps in the data transfer, if you remember, and such a visualization won’t help me much. We need the missing minutes to be present in the source data with the amount set to 0. So the first step is to create the “scaffolding”: a dataframe that contains every minute of each of the days present in unique_dates.

# creating a dataframe with an amount of zero for every minute of each day present in the source data
d = [(u, (datetime(year=2000, month=1, day=1, hour=0, minute=0) + timedelta(seconds=i*60)).time(), 0)
            for u in unique_dates for i in range(1440)]
zeroes_df = pd.DataFrame(data=d, columns=['date', 'time', 'amount'])
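The datetime(year=2000, ...) value is a dummy; it is only there to generate the 1440 distinct time-of-day values. If you prefer, pandas can generate them directly (an equivalent sketch, not what I originally used):

# the same scaffold built with pd.date_range instead of the timedelta trick
minutes_of_day = pd.date_range('2000-01-01', periods=1440, freq='min').time
d = [(u, t, 0) for u in unique_dates for t in minutes_of_day]
zeroes_df = pd.DataFrame(data=d, columns=['date', 'time', 'amount'])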

Now, let’s fill it with the actual counts to get the zero-containing data. Note that this step is pretty slow, but I haven’t found a faster solution.

# copying the counts from the source into the full dataframe. Any minute that has no data for it keeps its zero amount
for index, row in a.iterrows():
    zeroes_df.loc[(zeroes_df['date'] == row['date']) & (zeroes_df['time'] == row['time']), ['amount']] = row['amount']

# the original data (the one without zero minutes) is no longer needed, we can assign the new data (with zeroes) to it.
a=zeroes_df
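If this loop is too slow for your dump, a left merge should achieve the same result in one vectorized step. This is a sketch I haven’t benchmarked, not the method used above; a here refers to the source dataframe from before the reassignment:

# a vectorized alternative: left-merge the counts onto the zero scaffold
merged = zeroes_df.merge(a, on=['date', 'time'], how='left', suffixes=('_zero', ''))
merged['amount'] = merged['amount'].fillna(0).astype(int)
a = merged[['date', 'time', 'amount']]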

Took a while, but it’s done! We have our data now. Let’s move to the final step.

Plotting

We’re going to use the seaborn library to create the barplot. It is built on top of matplotlib, so it’s possible to manipulate the underlying parameters as well. Both of these libraries are very confusing, so I created several parameters at the top of the cell to control various elements of the plot: font sizes, x-axis label frequency, image size, etc.

This one also takes a while and I’m not sure how it can be optimized. But still, it does what I want it to do.

%matplotlib inline

# change this to change the absolute size of the image
SIZE_MULT = 8

# every N-th label on x-axis is shown. Change the number if you want.
EVERY_NTH_LABEL_KEPT = 20

# adjust this if the aspect ratio is bad.
figsize=(30*SIZE_MULT, 3*SIZE_MULT)

# size of the tick labels
LABEL_SIZE_MULT = 12

# size of the axis titles ("Time" and "Amount")
AXIS_LABELS_SIZE_MULT = 18

# size of plot title
TITLE_SIZE_MULT = 20

# will create a separate plot for every date
for fecha in unique_dates:
    # grabbing data only for this date
    subdata = a.loc[a['date'] == fecha]

    fig = plt.figure(figsize=figsize)
    pl = sns.barplot(x=subdata['time'], y=subdata['amount'],
                     edgecolor="k", linewidth=1)# black edges

    pl.set_xlabel("Time",fontsize=AXIS_LABELS_SIZE_MULT*SIZE_MULT)
    pl.set_ylabel("Amount",fontsize=AXIS_LABELS_SIZE_MULT*SIZE_MULT)
    pl.set_title(FILENAME+": "+str(fecha),fontsize=SIZE_MULT*TITLE_SIZE_MULT)
    pl.tick_params(labelsize=LABEL_SIZE_MULT*SIZE_MULT)

    for ind, label in enumerate(pl.get_xticklabels()):
        if ind % EVERY_NTH_LABEL_KEPT == 0:
            label.set_visible(True)
            label.set_rotation(90)
        else:
            label.set_visible(False)

    # hiding the zero on y-axis
    pl.get_yticklabels()[0].set_visible(False)
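If you would also like each day’s plot saved to disk rather than only rendered inline, adding something like this at the end of the loop body should do it (an optional extra, not part of the original code):

    # save this date's figure next to the notebook
    fig.savefig(f"{FILENAME}_{fecha}.png", bbox_inches='tight')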

And here’s one of the resulting plots. I’ve scaled it down significantly for this article; in reality the image is about 8 times larger. If there are several days in your data, the plot for each day will appear as a separate image.

The resulting plot

And that’s about it.

Conclusion

So, we’ve created a plot that shows the network activity we captured with tcpdump. In this example there are apparent gaps, which is exactly what I wanted to find out.

This method is pretty slow and consumes quite a lot of RAM. If you manage to optimize this routine somehow, let me know, I’m curious!

Of course, I could have used Filebeat, Logstash, and Elasticsearch with Kibana, but deploying and configuring all of that for such a small and short-term task would be overkill. If I had to monitor logs constantly, then it might be viable. By the way, I might make a tutorial on how to set up and configure the Elastic stack in Docker for use with Python applications in the near future, so stay tuned!

As always, thanks for stopping by, and I’ll see you eventually.