Egon Elbre

Finding Goroutine Leaks

Originally posted on Storj Blog.

A leaked goroutine at the end of a test can indicate several problems. Let's first take a look at the most common ones before tackling an approach to finding them.

Problem: Deadlock

First, we can have a goroutine that is blocked. As an example:

func LeakySumSquares(ctx context.Context, data []int) (total int, err error) {
	results := make(chan int)

	for _, v := range data {
		v := v
		go func() {
			result := v * v
			results <- result
		}()
	}

	for range data {
		select {
		case value := <-results:
			total += value
		case <-ctx.Done():
			return total, ctx.Err()
		}
	}

	return total, nil
}

In this case, when the context is cancelled, the goroutines that have not yet sent their result stay blocked on the channel send and leak.
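
As an illustration, one possible fix is to give the channel enough buffer so that every goroutine can complete its send even when the caller returns early:

func SumSquares(ctx context.Context, data []int) (total int, err error) {
	// Buffering the channel lets every goroutine finish its send,
	// even if we stop receiving because the context was cancelled.
	results := make(chan int, len(data))

	for _, v := range data {
		v := v
		go func() {
			results <- v * v
		}()
	}

	for range data {
		select {
		case value := <-results:
			total += value
		case <-ctx.Done():
			return total, ctx.Err()
		}
	}

	return total, nil
}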

Problem: Leaked Resource

Many services, connections, and databases have an internal goroutine used for async processing. A leaked goroutine can reveal that such a resource was never closed.

type Conn struct {
	messages chan Message

	close context.CancelFunc
	done  chan struct{}
}

func Dial(ctx context.Context) *Conn {
	ctx, cancel := context.WithCancel(ctx)
	conn := &Conn{
		close:    cancel,
		messages: make(chan Message),
		done:     make(chan struct{}),
	}
	go conn.monitor(ctx)
	return conn
}

func (conn *Conn) monitor(ctx context.Context) {
	defer close(conn.done)
	for {
		select {
		case msg := <-conn.messages:
			conn.handle(msg)
		case <-ctx.Done():
			return
		}
	}
}

func (conn *Conn) Close() {
	conn.close()
	<-conn.done
}

Even if the main loop is properly handled, conn.handle(msg) could still deadlock in other ways.
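
For completeness, a sketch of how such a connection is typically used in a test; forgetting the deferred Close is exactly what leaves the monitor goroutine running:

func TestConn(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	conn := Dial(ctx)
	// Without this deferred Close, the monitor goroutine keeps running
	// after the test finishes and shows up as a leak.
	defer conn.Close()

	// ... exercise conn ...
}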

Problem: Lazy Closing Order

Even if all the goroutines terminate, there can still be ordering problems with regard to resource usage. For example, a goroutine could end up depending on a database, connection, file, or any other resource that gets closed before the goroutine finishes.

Let’s take a common case of the problem:

type Server struct {
	log Logger
	db  *sql.DB
}

func NewServer(log Logger, dburi string) (*Server, error) {
	db, err := sql.Open("postgres", dburi)
	if err != nil {
		return nil, fmt.Errorf("opening database failed: %w", err)
	}
	return &Server{log: log, db: db}, nil
}

func (server *Server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	tag := r.FormValue("tag")
	if tag == "" {
		return
	}

	// update the database in the background
	go func() {
		_, err := server.db.Exec("...", tag)
		if err != nil {
			server.log.Errorf("failed to update tags: %v", err)
		}
	}()
}

func (server *Server) Close() {
	_ = server.db.Close()
}

In this case, when the Server is closed, there could still be goroutines updating the database in the background. Similarly, even the Logger could be closed before the goroutine finishes, causing some other problem.

The severity of such close ordering depends on the context. Sometimes it's a simple extra error in the log; in other cases, it can be a data race or a panic taking the whole process down.

Rule of thumb

Hopefully, it’s clear that such goroutines can be problematic.

One of the best rules of thumb for preventing these issues is:

The location that starts the goroutine must wait for the goroutine to complete even in the presence of context cancellation. Or, it must explicitly transfer that responsibility to some other “service”.

As long as you close the top-level “service” responsible for waiting for everything, a leak becomes visible in tests, because the test cannot finish.
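
Applied to the earlier Server example, following this rule could mean tracking the background updates with a sync.WaitGroup and waiting for them in Close before releasing the database. A minimal sketch, keeping the names from that example:

type Server struct {
	log     Logger
	db      *sql.DB
	pending sync.WaitGroup
}

func (server *Server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	tag := r.FormValue("tag")
	if tag == "" {
		return
	}

	// update the database in the background, but keep track of the goroutine
	server.pending.Add(1)
	go func() {
		defer server.pending.Done()
		_, err := server.db.Exec("...", tag)
		if err != nil {
			server.log.Errorf("failed to update tags: %v", err)
		}
	}()
}

func (server *Server) Close() {
	// wait for the background goroutines before closing the database they use
	server.pending.Wait()
	_ = server.db.Close()
}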

Unfortunately, this rule cannot be applied to third-party libraries, and it's easy to forget to add tracking to a goroutine.

Finding leaks

We could use the total number of goroutines to find leaks at the end of a test; however, that wouldn't work with parallel tests.
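
For reference, the naive version would compare runtime.NumGoroutine() before and after the code under test (DoSomething here is a stand-in); it falls apart as soon as other tests run in parallel:

func TestNoLeaksNaive(t *testing.T) {
	before := runtime.NumGoroutine()

	DoSomething(t) // hypothetical code under test

	// Any goroutine started by a parallel test, or by the runtime itself,
	// skews this count.
	if after := runtime.NumGoroutine(); after > before {
		t.Fatalf("goroutines leaked: %d -> %d", before, after)
	}
}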

One helpful feature in Go is goroutine labels, which can make profiling and stack traces more readable. Interestingly, they are automatically propagated to child goroutines.

This means that if we attach a unique label to a goroutine, we should be able to find all of its child goroutines. However, finding such goroutines is non-trivial.

To attach the label:

func Track(ctx context.Context, t *testing.T, fn func(context.Context)) {
	label := t.Name()
	pprof.Do(ctx, pprof.Labels("test", label), fn)
	if err := CheckNoGoroutines("test", label); err != nil {
		t.Fatal("Leaked goroutines\n", err)
	}
}

Unfortunately, there's currently no easy way to get the goroutines with a given label. However, we can use the profiling endpoints to extract the necessary information. Clearly, this is not very performant.

import (
	"bytes"
	"errors"
	"fmt"
	"runtime/pprof"
	"strings"

	"github.com/google/pprof/profile"
)

func CheckNoGoroutines(key, value string) error {
	var pb bytes.Buffer
	profiler := pprof.Lookup("goroutine")
	if profiler == nil {
		return fmt.Errorf("unable to find profile")
	}
	err := profiler.WriteTo(&pb, 0)
	if err != nil {
		return fmt.Errorf("unable to read profile: %w", err)
	}

	p, err := profile.ParseData(pb.Bytes())
	if err != nil {
		return fmt.Errorf("unable to parse profile: %w", err)
	}

	return summarizeGoroutines(p, key, value)
}

func summarizeGoroutines(p *profile.Profile, key, expectedValue string) error {
	var b strings.Builder

	for _, sample := range p.Sample {
		if !matchesLabel(sample, key, expectedValue) {
			continue
		}

		fmt.Fprintf(&b, "count %d @", sample.Value[0])
		// format the stack trace for each goroutine
		for _, loc := range sample.Location {
			for i, ln := range loc.Line {
				if i == 0 {
					fmt.Fprintf(&b, "#   %#8x", loc.Address)
					if loc.IsFolded {
						fmt.Fprint(&b, " [F]")
					}
				} else {
					fmt.Fprint(&b, "#           ")
				}
				if fn := ln.Function; fn != nil {
					fmt.Fprintf(&b, " %-50s %s:%d", fn.Name, fn.Filename, ln.Line)
				} else {
					fmt.Fprintf(&b, " ???")
				}
				fmt.Fprintf(&b, "\n")
			}
		}
		fmt.Fprintf(&b, "\n")
	}

	if b.Len() == 0 {
		return nil
	}

	return errors.New(b.String())
}

func matchesLabel(sample *profile.Sample, key, expectedValue string) bool {
	values, hasLabel := sample.Label[key]
	if !hasLabel {
		return false
	}

	for _, value := range values {
		if value == expectedValue {
			return true
		}
	}

	return false
}

And a failing test might look like this:

func TestLeaking(t *testing.T) {
	t.Parallel()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	Track(ctx, t, func(ctx context.Context) {
		LeakyThing(ctx)
	})
}

func LeakyThing(ctx context.Context) {
	done := make(chan struct{})
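	// nothing ever receives from done, so both sends below block forever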
	go func() {
		go func() {
			done <- struct{}{}
		}()
		done <- struct{}{}
	}()
}

The full example can be found here: https://go.dev/play/p/KTF9tyEmLor.

Depending on your use case, you may want to adjust this approach: skip certain goroutines, print some extra information, or allow a grace period before failing.
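
For instance, a grace period could be implemented by retrying the check for a short while before giving up, since goroutines often need a moment to wind down after the tested function returns. A sketch using the CheckNoGoroutines helper from above (the wrapper name is made up):

func CheckNoGoroutinesWithin(key, value string, grace time.Duration) error {
	deadline := time.Now().Add(grace)
	for {
		err := CheckNoGoroutines(key, value)
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return err
		}
		time.Sleep(10 * time.Millisecond)
	}
}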

Such an approach can be hooked into your tests or existing system in a multitude of ways.