Good evening

I have a TCP server that clients connect to. Pretty straightforward. The problem is that all my clients are on mobile networks and constantly on the move, so losing the connection without notifying the server via FIN or RST is a common situation. The server is then left with a lingering connection and thinks a client is online when in reality it is not.

My first solution was to wait for a while: if a client does not send any data in a given period of time, cut the connection. (As a quick side note, SetDeadline was particularly useful for me; it makes conn.Read fail with an i/o timeout error if it waits for too long.) But there’s a delicate balance to consider: I shouldn’t cut too early, in case the client is just slow at generating data, yet not too late either, because then I am misinformed about the client’s online status, which I need.

My thought was to ping the client. But I don’t want to spam the client with unneeded data. Additionally, I’m not the one in charge of programming the clients, so I’m not sure how they will behave if I send them weird packets.

TCP keepalive - a lightweight ping

TCP keepalive sends packets without (or almost without) a body to make sure that the other side answers with an ACK. It is not a mandatory part of the TCP standard (it is described in RFC 1122, though) and is disabled by default. Still, the majority of modern TCP implementations should support it.

In its simplest implementation (and that will be enough for us today), it has three main parameters:

  • Idle time - after receiving a packet, it waits for this long before sending a ping message, unless another data packet comes.

  • Retry interval - if it sends a ping, and there is no ACK from the other side, try again after this time.

  • Ping amount - how many pings to send (with no ACK from client) before we consider this connection to be dead.

For example, say the idle time is 30 seconds, the retry interval is 5 seconds and the ping amount is 3. Here’s how it works.

We get a data message from the client. Then it stops sending anything. We wait for 30 seconds. Then we send a ping. If we get an ACK, then we wait for another 30-second period before sending another ping; unless a data packet comes in, then this timer is reset again.

If we don’t get an ACK, we wait for 5 seconds and try again. Still no answer after another 5 seconds? Pinging one last time and waiting for the last 5 seconds (yes, we wait for the retry interval amount of seconds after the last ping). Then we consider the connection lost by timeout and disconnect from the server’s side.
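The timeline above boils down to simple arithmetic. Here is a minimal sketch, assuming the timing behaves exactly as just described; detectionTime is a name made up for this illustration, not a library function.

```go
package main

import (
	"fmt"
	"time"
)

// detectionTime returns the worst-case delay between the last packet
// received from a peer and the moment keepalive gives up on it:
// one idle period, then `probes` unanswered pings, each followed by
// a wait of `interval`.
func detectionTime(idle, interval time.Duration, probes int) time.Duration {
	return idle + time.Duration(probes)*interval
}

func main() {
	// The example from the text: 30 s idle, 5 s retry interval, 3 pings.
	fmt.Println(detectionTime(30*time.Second, 5*time.Second, 3)) // prints "45s"
}
```

So with these numbers a vanished client is noticed about 45 seconds after its last packet.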

Default values?

It is said that Windows waits for 2 hours by default before sending keepalive pings. But I use Linux. Getting the defaults is pretty easy, as shown in section 3.1.1 here.

# Idle time
cat /proc/sys/net/ipv4/tcp_keepalive_time

# Retry interval
cat /proc/sys/net/ipv4/tcp_keepalive_intvl

# Ping amount
cat /proc/sys/net/ipv4/tcp_keepalive_probes
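If the server should log these defaults at startup, the same files can be read from Go. This is a minimal sketch for Linux only; parseSysctlInt and readSysctlInt are hypothetical helper names, not part of any library. On a stock Linux kernel the three values are usually 7200, 75 and 9.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseSysctlInt converts the raw contents of a /proc/sys file,
// such as "7200\n", into an integer.
func parseSysctlInt(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// readSysctlInt reads one integer-valued file under /proc/sys.
func readSysctlInt(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parseSysctlInt(string(data))
}

func main() {
	const base = "/proc/sys/net/ipv4/"
	names := []string{"tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"}
	for _, name := range names {
		value, err := readSysctlInt(base + name)
		if err != nil {
			// Not on Linux, or /proc is not mounted: report and move on.
			fmt.Println(name, "unavailable:", err)
			continue
		}
		fmt.Println(name, "=", value)
	}
}
```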

How do we set it in Golang?

Since I have been programming a lot in Golang lately and needed to implement this solution in it, I will discuss it here.

One thing before we start: all of this applies to Linux. I’m not 100% sure it works on OSX, and I’m almost sure that it won’t work on Windows this way.

A special type of connection

First of all, I noticed that I used only net.Conn in my server program. That won’t work: it lacks the specific methods we require. We need TCPConn for that.

That means we need ListenTCP and AcceptTCP instead of Listen and Accept. ListenTCP is called differently: it takes a structure instead of a string for the address, so the call looks something like ListenTCP("tcp", &net.TCPAddr{Port: myClientPort}). If you don’t specify the IP, it defaults to 0.0.0.0, that is, all local addresses. AcceptTCP returns the TCPConn type we need.
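To make the ListenTCP and AcceptTCP calls concrete, here is a minimal sketch. It assumes a loopback listener on an OS-chosen port and connects to itself so that it can run on its own; loopbackPair is a helper invented for this example, and a real server would use its own port and an accept loop.

```go
package main

import (
	"fmt"
	"net"
)

// loopbackPair listens on 127.0.0.1, dials itself once and returns both
// ends of the resulting connection. The server side is a *net.TCPConn.
func loopbackPair() (server *net.TCPConn, client net.Conn, err error) {
	// Port 0 asks the OS for any free port. A real server would put its own
	// port here and could leave IP unset to listen on all addresses.
	ln, err := net.ListenTCP("tcp", &net.TCPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 0})
	if err != nil {
		return nil, nil, err
	}
	defer ln.Close() // closing the listener does not close accepted connections

	client, err = net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return nil, nil, err
	}

	// AcceptTCP returns *net.TCPConn directly, so no type assertion is needed.
	server, err = ln.AcceptTCP()
	if err != nil {
		client.Close()
		return nil, nil, err
	}
	return server, client, nil
}

func main() {
	server, client, err := loopbackPair()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer server.Close()
	defer client.Close()

	// The keepalive setters exist on *net.TCPConn but not on net.Conn.
	if err := server.SetKeepAlive(true); err != nil {
		fmt.Println("enabling keepalive failed:", err)
		return
	}
	fmt.Println("accepted", server.RemoteAddr(), "with keepalive enabled")
}
```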

Utilizing methods provided by Go

There are two methods that may catch your eye while scanning the docs: SetKeepAlive and SetKeepAlivePeriod. The former is pretty simple: you pass true to it and it enables the TCP keepalive mechanism.

But the latter is quite confusing. What exactly do you set with that? The answer was found in this article (a good read, I recommend it): it sets both Idle time and Retry interval. The Ping amount remains at the system default. So when I set it to, say, 5 * time.Second, it waits for 5 seconds, pings and waits for another 5, and then sends up to 8 more pings, since the Linux default Ping amount is 9. I require more flexibility.
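Since which options SetKeepAlivePeriod touches may depend on the Go version and the OS, one way to check on your own machine is to read the socket options back. Below is a minimal, Linux-only sketch. It uses SyscallConn, which is covered further down, and the helper names tcpOption and keepaliveAfterPeriod are made up for this example.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
	"time"
)

// tcpOption reads one integer TCP-level socket option from conn (Linux).
func tcpOption(conn *net.TCPConn, opt int) (int, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return 0, err
	}
	var value int
	var optErr error
	ctrlErr := raw.Control(func(fd uintptr) {
		value, optErr = syscall.GetsockoptInt(int(fd), syscall.IPPROTO_TCP, opt)
	})
	if ctrlErr != nil {
		return 0, ctrlErr
	}
	return value, optErr
}

// keepaliveAfterPeriod opens a throwaway loopback connection, calls
// SetKeepAlivePeriod with the given number of seconds on the accepted side,
// and reports the idle time, retry interval and ping amount the kernel holds.
func keepaliveAfterPeriod(seconds int) (idle, interval, probes int, err error) {
	ln, err := net.ListenTCP("tcp", &net.TCPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return 0, 0, 0, err
	}
	defer ln.Close()

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, 0, 0, err
	}
	defer client.Close()

	conn, err := ln.AcceptTCP()
	if err != nil {
		return 0, 0, 0, err
	}
	defer conn.Close()

	if err = conn.SetKeepAlive(true); err != nil {
		return 0, 0, 0, err
	}
	if err = conn.SetKeepAlivePeriod(time.Duration(seconds) * time.Second); err != nil {
		return 0, 0, 0, err
	}
	if idle, err = tcpOption(conn, syscall.TCP_KEEPIDLE); err != nil {
		return 0, 0, 0, err
	}
	if interval, err = tcpOption(conn, syscall.TCP_KEEPINTVL); err != nil {
		return 0, 0, 0, err
	}
	if probes, err = tcpOption(conn, syscall.TCP_KEEPCNT); err != nil {
		return 0, 0, 0, err
	}
	return idle, interval, probes, nil
}

func main() {
	idle, interval, probes, err := keepaliveAfterPeriod(5)
	if err != nil {
		fmt.Println("could not inspect keepalive settings:", err)
		return
	}
	fmt.Printf("idle=%ds interval=%ds probes=%d\n", idle, interval, probes)
}
```

If the printed interval is also 5, your Go version sets both values, as described above. If it shows the system default instead, only the idle time was changed.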

Going OS-level

This can be achieved by directly manipulating socket options. I won’t go into too much detail; it’s pretty self-explanatory. Here’s how we set Idle time to 30 seconds (we can do it via SetKeepAlivePeriod because we will adjust the other parameters separately), Retry interval to 5 seconds and Ping amount to 3. Referenced some code from the aforementioned article and repository, many thanks!

conn.SetKeepAlive(true)
conn.SetKeepAlivePeriod(time.Second * 30)

// Getting the file handle of the socket
sockFile, sockErr := conn.File()
if sockErr == nil {
    // got socket file handle. Getting descriptor.
    fd := int(sockFile.Fd())
    // Ping amount
    err := syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, syscall.TCP_KEEPCNT, 3)
    if err != nil {
        Warning("on setting keepalive probe count", err.Error())
    }
    // Retry interval
    err = syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, syscall.TCP_KEEPINTVL, 5)
    if err != nil {
        Warning("on setting keepalive retry interval", err.Error())
    }
    // don't forget to close the file. No worries, it will *not* cause the connection to close.
    sockFile.Close()
} else {
    Warning("on setting socket keepalive", sockErr.Error())
}

I have a line somewhere after this in a loop that looks like dataLength, err := conn.Read(readBuf) which blocks until data comes or an error occurs. If keepalive cannot reach the other side, err.Error() will contain connection timed out, which you can handle.
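Rather than matching on the error text, the read loop can test for the underlying error values. This is a sketch under the assumption that, on Linux, a keepalive failure reaches conn.Read as ETIMEDOUT wrapped in a *net.OpError; classifyReadError and simulatedKeepaliveError are names invented here. Note that ETIMEDOUT can also come from unacknowledged writes, so the label is a simplification.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"os"
	"syscall"
)

// classifyReadError maps an error returned by conn.Read to a short label.
func classifyReadError(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, io.EOF):
		return "closed by peer"
	case errors.Is(err, os.ErrDeadlineExceeded):
		// Produced by SetDeadline / SetReadDeadline ("i/o timeout").
		return "deadline exceeded"
	case errors.Is(err, syscall.ETIMEDOUT):
		// "connection timed out": the kernel gave up on the peer.
		return "keepalive timeout"
	default:
		return "other error"
	}
}

// simulatedKeepaliveError builds an error shaped like the one conn.Read
// returns when the connection times out, so the example runs without
// waiting for a real keepalive failure.
func simulatedKeepaliveError() error {
	return &net.OpError{
		Op:  "read",
		Net: "tcp",
		Err: os.NewSyscallError("read", syscall.ETIMEDOUT),
	}
}

func main() {
	fmt.Println(classifyReadError(simulatedKeepaliveError())) // prints "keepalive timeout"
	fmt.Println(classifyReadError(io.EOF))                    // prints "closed by peer"
}
```

Keeping the deadline case separate matters if you use SetDeadline and keepalive together, since both show up as timeouts of a sort.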

An important note on file descriptors

The code above is fine, but only if you don’t invoke it too often. After writing this article, I learned about one little problem the hard way.

The devil hides in a seemingly innocent function, Fd. Take a look at its code.

func (f *File) Fd() uintptr {
    if f == nil {
        return ^(uintptr(0))
    }

    // If we put the file descriptor into nonblocking mode,
    // then set it to blocking mode before we return it,
    // because historically we have always returned a descriptor
    // opened in blocking mode. The File will continue to work,
    // but any blocking operation will tie up a thread.
    if f.nonblock {
        f.pfd.SetBlocking()
    }

    return uintptr(f.pfd.Sysfd)

}

If the file descriptor is in non-blocking mode, it will be switched to blocking mode. So what? Well, according to this, for example, whenever a goroutine makes a blocking system call, Go may create an additional thread to keep the other goroutines running. Given that I use a separate goroutine for each client, imagine the explosion! It quickly hit the 10000-thread limit and crashed.

Putting it into a separate goroutine didn’t help. But there was one thing that did. A word of warning: it was introduced in version 1.11 and will not work in earlier versions. Let’s rewrite that code this way.

// Sets additional keepalive parameters.
// Uses new interfaces introduced in Go1.11, which let us get the connection's file descriptor
// without blocking, and therefore without uncontrolled spawning of threads (not goroutines, actual threads).
func setKeepaliveParameters(conn devconn) {
    rawConn, err := conn.SyscallConn()
    if err != nil {
        Warning("on getting raw connection object for keepalive parameter setting", err.Error())
        // Without a raw connection there is nothing to configure.
        return
    }

    err = rawConn.Control(
        func(fdPtr uintptr) {
            // Got the socket file descriptor. Setting parameters.
            fd := int(fdPtr)
            // Number of probes.
            err := syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, syscall.TCP_KEEPCNT, 3)
            if err != nil {
                Warning("on setting keepalive probe count", err.Error())
            }
            // Wait time after an unsuccessful probe.
            err = syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, syscall.TCP_KEEPINTVL, 5)
            if err != nil {
                Warning("on setting keepalive retry interval", err.Error())
            }
        })
    if err != nil {
        Warning("on accessing the socket for keepalive parameter setting", err.Error())
    }
}

func deviceProcessor(conn devconn) {

    //............

    conn.SetKeepAlive(true)
    conn.SetKeepAlivePeriod(time.Second * 30)

    setKeepaliveParameters(conn)

    //............

    dataLen, err := conn.Read(readBuf)

    //............
}

Recent versions of Go introduced some new interfaces: net.TCPConn implements SyscallConn(), which returns a RawConn object that implements Control. All you have to do is define a function (anonymous in the example above) that takes the file descriptor as a uintptr (the descriptor itself, not a pointer to it). This gives access to the connection’s file descriptor without switching it to blocking mode, thus avoiding uncontrolled thread spawning.
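For anyone who wants to try this outside my server, here is a self-contained, Linux-only variant that returns errors instead of logging them and reads the values back to confirm they took effect. It is a sketch, not the code I run in production: the keepaliveSettings type and the roundTrip helper are invented for this example, and it sets the idle time through TCP_KEEPIDLE directly rather than through SetKeepAlivePeriod.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// keepaliveSettings holds the three parameters in the units Linux expects.
type keepaliveSettings struct {
	IdleSec, IntervalSec, Probes int
}

// apply enables keepalive and sets all three parameters in one Control call.
func (s keepaliveSettings) apply(conn *net.TCPConn) error {
	if err := conn.SetKeepAlive(true); err != nil {
		return err
	}
	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	var optErr error
	ctrlErr := raw.Control(func(fd uintptr) {
		opts := []struct{ name, value int }{
			{syscall.TCP_KEEPIDLE, s.IdleSec},
			{syscall.TCP_KEEPINTVL, s.IntervalSec},
			{syscall.TCP_KEEPCNT, s.Probes},
		}
		for _, o := range opts {
			optErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_TCP, o.name, o.value)
			if optErr != nil {
				return
			}
		}
	})
	if ctrlErr != nil {
		return ctrlErr
	}
	return optErr
}

// readKeepalive fetches the current values of the three parameters.
func readKeepalive(conn *net.TCPConn) (keepaliveSettings, error) {
	var got keepaliveSettings
	raw, err := conn.SyscallConn()
	if err != nil {
		return got, err
	}
	var optErr error
	ctrlErr := raw.Control(func(fd uintptr) {
		targets := []struct {
			name int
			dst  *int
		}{
			{syscall.TCP_KEEPIDLE, &got.IdleSec},
			{syscall.TCP_KEEPINTVL, &got.IntervalSec},
			{syscall.TCP_KEEPCNT, &got.Probes},
		}
		for _, t := range targets {
			*t.dst, optErr = syscall.GetsockoptInt(int(fd), syscall.IPPROTO_TCP, t.name)
			if optErr != nil {
				return
			}
		}
	})
	if ctrlErr != nil {
		return got, ctrlErr
	}
	return got, optErr
}

// roundTrip applies the settings to a throwaway loopback connection
// and returns what the kernel reports afterwards.
func roundTrip(want keepaliveSettings) (keepaliveSettings, error) {
	ln, err := net.ListenTCP("tcp", &net.TCPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return keepaliveSettings{}, err
	}
	defer ln.Close()
	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return keepaliveSettings{}, err
	}
	defer client.Close()
	conn, err := ln.AcceptTCP()
	if err != nil {
		return keepaliveSettings{}, err
	}
	defer conn.Close()
	if err := want.apply(conn); err != nil {
		return keepaliveSettings{}, err
	}
	return readKeepalive(conn)
}

func main() {
	got, err := roundTrip(keepaliveSettings{IdleSec: 30, IntervalSec: 5, Probes: 3})
	if err != nil {
		fmt.Println("keepalive setup failed:", err)
		return
	}
	fmt.Printf("kernel reports: %+v\n", got)
}
```

Doing all three options inside a single Control call keeps the descriptor access in one place and never switches the socket to blocking mode.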

In conclusion

Networking is tricky. And often OS-dependent. This solution will most likely work only on Linux, but it is a good start. Other operating systems have similar parameters; they’re just called differently.

Thanks for tuning in. See you soon!