Attributing Latency in Go with Pprof Tags

A pprof CPU Profile Callgraph Node of Hugo rendering a site, broken down CPU time spent per-Hugo-template.

A latency analysis of a Go program (Hugo), starting with in-app wall-clock counters, then pprof CPU profiling, then adding custom dimensions to slice the profile.


I recently migrated Graphviz's docs from Jekyll to Hugo, in part because the Hugo static-site-generator was faster.

However, recently our Hugo build time has been creeping up, it now takes 13 seconds, which is beginning to feel a bit slow when editing and previewing:

$ hugo ~/projects/
Start building sites … 
hugo v0.88.1+extended darwin/amd64 BuildDate=unknown

                   | EN   
  Pages            | 634  
  Paginator pages  |   0  
  Non-page files   | 146  
  Static files     | 349  
  Processed images |   0  
  Aliases          |  72  
  Sitemaps         |   1  
  Cleaned          |   0  

Total in 13482 ms

I'd like to know why this is slower. Making this a bit trickier, Hugo uses parallelism heavily: with a pool of Goroutines processing each page in parallel.


Hugo can count the clock duration spent processing per-template. Here's the top 5 output:

$ hugo --templateMetrics ~/projects/
Start building sites …                                                                               
hugo v0.88.0-DEV darwin/amd64 BuildDate=unknown                                                      
Template Metrics:                                                                                    
     cumulative       average       maximum                                                          
       duration      duration      duration  count  template                                         
     ----------      --------      --------  -----  --------                                         
  31.661835379s   95.655091ms  243.928529ms    331  partials/sidebar.html                            
  31.531966131s   95.262737ms  243.832708ms    331  partials/sidebar-tree.html                       
  20.116633209s  116.957169ms  221.516518ms    172  attrs/single.html                                
   4.113334482s  132.688209ms  221.705938ms     31  attr-types/single.html                           
   4.051019586s  112.528321ms  212.598791ms     36  output/single.html                               

This is a pretty good start, far better than most software. We can see the effect of parallelism: the overall build takes 13s wall-time, but we see that we've spent 31s processing partials/sidebar.html alone!

This gives me some pointers – perhaps I should focus on sidebar.html, sidebar-tree.html. But perhaps they're just so large because they are the sidebar, rendered for every page? They are invoked 331 times. attrs/single.html is another cumulative duration offender at 20s, but perhaps it's justified given we have 172 pages using this template?

This view is useful, but limited, in particular, it doesn't tell us why these pages are slow.


Let's go a bit deeper than just at the template-level, and see if we can see some patterns in the functions executed.

Go has a built-in sampling CPU profiler, which outputs profiles in pprof format. Hugo supports invoking this using the --profile-cpu flag, and we can open the profile with go tool pprof:

$ hugo --profile-cpu /tmp/pprof
Total in 12100 ms

$ go tool pprof /tmp/pprof
Type: cpu
Time: Sep 18, 2021 at 8:24am (AEST)
Duration: 11.97s, Total samples = 22.06s (184.34%)
Entering interactive mode (type "help" for commands, "o" for options)

Here we start to see the effects of parallelism, from duration: 12s, total samples: 22s – on average using 184%, or on average executing across 1.86 CPUs.

The pprof profiler only samples threads when they are executing on-CPU, and excludes time spent blocked waiting for IO or mutexes. This pprof profile reports 24s of samples, even though above we saw --templateMetrics saying that 31s wall-time was spent processing sidebar.html alone!

It's always important to remember the distinction between wall-time profiling and on-CPU-profiling. Wall-time profiling is most useful for profiling user-visible latency, and on-CPU-profiling is useful for reducing CPU utilisation. Often you care about both, and want to look at both.

Let's open an execution-time callgraph in the browser. I've simplified the graph here by limiting to 20 nodes, otherwise it doesn't fit in the blog so well:

(pprof) nodecount=20
(pprof) web 

Redder, thicker, larger lines and nodes are more common in the profile, and dotted lines represent nodes that have been removed. See: docs on how to interpret the callgraph.

Profiling Graphviz's doc-generation, using Graphviz to render the callgraph. Beautiful.

This shows us a lot of time spent in texttemplate functions, a bit of time in reflect, syscall and memory allocation.

However, we don't have any information about what source templates are driving this. Hugo is kind-of a language interpreter, and this is a common problem when profiling any interpreted language: the profiles only show up the interpreter code, unable to measure the code being interpreted – in our case, the HTML templates driving the template engine.

Fun thought: an ahead-of-time (or even JIT-) compiled templating engine would be trivially profileable. Some templating engines do this!

Custom Go Pprof Profile Labels

Can we relate this profile back to the templates and inputs driving the interpreter? Yes, with Go's "Profile Labels", which output per-sample "tags" into the pprof output.

In Hugo's page-renering code, construct a set of string key-value pair labels with pprof.Labels, then wrap the current context.Context with pprof.Do:

labels := pprof.Labels(
	"target_filename", targetPath,
	"section", p.Section(),
        "template", t)

pprof.Do(ctx, labels, func(ctx context.Context) {
	[... compile the page ...]

This tags each CPU sample with the filename of the page we're compiling, the template, and the top-level section of the site. Then grab a profile:

$ hugo --profile-cpu /tmp/pprof
Total in 13617 ms

$ go tool pprof /tmp/pprof

We can now see the tags we just added, breaking down CPU by template, like the output of --templateMetrics above:

$ go tool pprof /tmp/pprof
(pprof) tags
 template: Total 18.0s
             10.5s (58.11%): attrs/single.html
              2.2s (12.33%): output/single.html
              1.9s (10.67%): attr-types/single.html
              1.6s ( 9.00%): docs/single.html
           740.0ms ( 4.11%): Gallery/single.html

Breaking down CPU by individual page, we see that /faq is the most expensive single page:

target_filename: Total 18.1s
                  180.0ms ( 0.99%): /faq/index.html
                  130.0ms ( 0.72%): /docs/attr-types/startType/index.html
                  130.0ms ( 0.72%): /docs/attrs/arrowtail/index.html
                  120.0ms ( 0.66%): /docs/attrs/sides/index.html
                  120.0ms ( 0.66%): /docs/outputs/json/index.html

Breaking down CPU by top-level site section, we see docs section is most expensive:

 section: Total 17.5s
            16.1s (91.88%): docs
          900.0ms ( 5.15%): Gallery
          210.0ms ( 1.20%): Misc
          130.0ms ( 0.74%): build
          130.0ms ( 0.74%): download
           50.0ms ( 0.29%): _pages

Using these tags, we can drill-down to profile generating a single page, e.g. /faq/index.html (simplified for the blog to 15 nodes):

(pprof) tagfocus=target_filename=/faq/index.html
(pprof) nodecount=15
(pprof) tagshow=none
(pprof) web

Here we see about half the time (.05s/.09s) compiling /faq/index.html is spent in the right-hand branch, calling goldmark-highlighting.RenderFencedCodeBlock.

Now we know why FAQ is slower to compile than other pages: because there is much more syntax highlighting.

I've sent a pull request to add these labels to Hugo.


  • Go's pprof Labels can help you drill-down and slice your profiles by whatever dimension you choose. If your labels have higher cardinality, your profiles will have higher overhead.
  • Go's pprof Labels don't require much code to integrate.
  • I still don't really understand why the site got slower to build. More work is needed to relate the time taken to template code.
  • pprof CPU-time profiles, in a parallel system, are insufficient to understand the wall-clock critical path. We may need either multi-track tracing, or pprof profiles weighted by wall-time.
Mark Hansen

Mark Hansen

Sydney, Australia