Performance counter 1

In this post series, we will explore hardware performance counters on x86 architectures.

Types of hardware performance counters

There are two types of hardware performance counters: fixed-function and programmable. Fixed-function performance counters are implemented in hardware and dedicated to measuring one specific event each. One example is INST_RETIRED.ANY, which counts the number of instructions retired from execution. Another is CPU_CLK_UNHALTED.THREAD, which counts the number of core cycles while the thread is not in a halt state. My x86 machine uses the Skylake microarchitecture, which has 3 fixed-function performance counters. Programmable counters, on the other hand, can be configured to measure whichever events the user needs. Skylake has 4 programmable performance counters per logical processor. A list of programmable events can be found in Skylake Events.

To find out the exact number of fixed-function and programmable performance counters, we can use the cpuid tool.

cpuid -l 0xA # CPUID leaf 0xA (decimal 10), which is Architectural Performance Monitoring
CPU 0:
   Architecture Performance Monitoring Features (0xa/eax):
      version ID                               = 0x4 (4)
      number of counters per logical processor = 0x4 (4)
      bit width of counter                     = 0x30 (48)
      length of EBX bit vector                 = 0x7 (7)
   Architecture Performance Monitoring Features (0xa/ebx):
      core cycle event not available           = false
      instruction retired event not available  = false
      reference cycles event not available     = false
      last-level cache ref event not available = false
      last-level cache miss event not avail    = false
      branch inst retired event not available  = false
      branch mispred retired event not avail   = false
   Architecture Performance Monitoring Features (0xa/edx):
      number of fixed counters    = 0x3 (3)
      bit width of fixed counters = 0x30 (48)

How many events can I monitor at the same time?

Based on the information above, we can conclude that 4 + 3 = 7 events can be counted at the same time. In practice, we can request as many events as we like, because the Linux perf subsystem multiplexes events onto the available performance counters when there are more events than counters. A multiplexed event is only counted part of the time, though. To make sure an event is never multiplexed, we can set pinned = 1 in perf_event_attr, which pins the event to a counter.

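struct perf_event_attr example;
memset(&example, 0, sizeof(example)); // zero every field before setting any
example.size = sizeof(example);
example.pinned = 1;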

Explanation from Linux man page of perf_event_open():

pinned: The pinned bit specifies that the counter should always be
              on the CPU if at all possible.  It applies only to hardware
              counters and only to group leaders.  If a pinned counter
              cannot be put onto the CPU (e.g., because there are not
              enough hardware counters or because of a conflict with some
              other event), then the counter goes into an 'error' state,
              where reads return end-of-file (i.e., read(2) returns 0)
              until the counter is subsequently enabled or disabled.

Test program

Here is a test program to show that if we try to pin 5 events to the 4 programmable performance counters, the fifth event cannot be scheduled and fails to count. The test program also includes two fixed-function counter events to show that they are independent of, and unaffected by, the programmable performance counters. Before we start, we need to relax kernel.perf_event_paranoid to 1.

cat /proc/sys/kernel/perf_event_paranoid
sudo sysctl kernel.perf_event_paranoid=1

/*
 * Test fixed counters with 5 programmable counters already pinned
 * Build: gcc -O2 -Wall -Wextra pin_test.c -o pin_test
 * Run: ./pin_test
 */
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <errno.h>

static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid, int cpu,
                            int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
}

int main(void) {
    int fds[7];
    
    printf("========================================\n");
    printf("Testing Fixed Counters with 5 Programmable Counters Pinned\n");
    printf("========================================\n\n");
    
    // Create 5 programmable counter events (pinned)
    struct perf_event_attr prog_attr;
    uint64_t prog_configs[] = {0x02C4, 0x01C4, 0x00C5, 0x01D1, 0x08D1};
    const char *prog_names[] = {
        "BR_INST_RETIRED.NEAR_CALL",
        "BR_INST_RETIRED.CONDITIONAL",
        "BR_MISP_RETIRED.ALL_BRANCHES",
        "MEM_LOAD_RETIRED.L1_HIT",
        "MEM_LOAD_RETIRED.L1_MISS",
    };
    
    printf("Creating 5 programmable counter events:\n");
    for (int i = 0; i < 5; i++) {
        memset(&prog_attr, 0, sizeof(prog_attr));
        prog_attr.size = sizeof(prog_attr);
        prog_attr.type = PERF_TYPE_RAW;
        prog_attr.config = prog_configs[i];
        prog_attr.disabled = 1;
        prog_attr.pinned = 1;
        prog_attr.exclude_kernel = 1;
        prog_attr.exclude_hv = 1;
        
        fds[i] = perf_event_open(&prog_attr, 0, -1, -1, 0);
        printf("  %d. %s: %s (fd=%d)\n", i+1, prog_names[i],
               fds[i] != -1 ? "SUCCESS" : "FAILED", fds[i]);
    }
    
    printf("\nNow creating fixed counter events:\n");
    
    // Create CYCLES fixed counter
    struct perf_event_attr cycles_attr;
    memset(&cycles_attr, 0, sizeof(cycles_attr));
    cycles_attr.size = sizeof(cycles_attr);
    cycles_attr.type = PERF_TYPE_HARDWARE;
    cycles_attr.config = PERF_COUNT_HW_CPU_CYCLES;
    cycles_attr.disabled = 1;
    cycles_attr.pinned = 1;
    cycles_attr.exclude_kernel = 1;
    cycles_attr.exclude_hv = 1;
    
    fds[5] = perf_event_open(&cycles_attr, 0, -1, -1, 0);
    int cycles_errno = errno;  // save errno before printf can overwrite it
    printf("  6. CPU_CYCLES (FIXED): %s (fd=%d)\n",
           fds[5] != -1 ? "CREATE SUCCESS" : "CREATE FAILED", fds[5]);
    if (fds[5] == -1) {
        printf("     Error: %s\n", strerror(cycles_errno));
    }
    
    // Create INSTRUCTIONS fixed counter
    struct perf_event_attr inst_attr;
    memset(&inst_attr, 0, sizeof(inst_attr));
    inst_attr.size = sizeof(inst_attr);
    inst_attr.type = PERF_TYPE_HARDWARE;
    inst_attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    inst_attr.disabled = 1;
    inst_attr.pinned = 1;
    inst_attr.exclude_kernel = 1;
    inst_attr.exclude_hv = 1;
    
    fds[6] = perf_event_open(&inst_attr, 0, -1, -1, 0);
    int inst_errno = errno;  // save errno before printf can overwrite it
    printf("  7. INSTRUCTIONS (FIXED): %s (fd=%d)\n",
           fds[6] != -1 ? "CREATE SUCCESS" : "CREATE FAILED", fds[6]);
    if (fds[6] == -1) {
        printf("     Error: %s\n", strerror(inst_errno));
    }
    
    printf("\n========================================\n");
    printf("Running workload and reading counters\n");
    printf("========================================\n\n");
    
    // Enable all counters
    for (int i = 0; i < 7; i++) {
        if (fds[i] != -1) {
            ioctl(fds[i], PERF_EVENT_IOC_RESET, 0);
            ioctl(fds[i], PERF_EVENT_IOC_ENABLE, 0);
        }
    }
    
    // Run workload
    volatile uint64_t x = 0;
    for (int i = 0; i < 500000; i++) {
        x += i;
        if (i % 2) x *= 2;
        if (i % 3) x /= 2;
    }
    
    // Disable all counters
    for (int i = 0; i < 7; i++) {
        if (fds[i] != -1) {
            ioctl(fds[i], PERF_EVENT_IOC_DISABLE, 0);
        }
    }
    
    // Read programmable counters
    printf("Reading programmable counters:\n");
    int prog_success = 0, prog_fail = 0;
    for (int i = 0; i < 5; i++) {
        if (fds[i] != -1) {
            uint64_t count = 0;
            ssize_t bytes = read(fds[i], &count, sizeof(count));
            printf("  %d. %s: ", i+1, prog_names[i]);
            if (bytes == sizeof(count)) {
                printf("SUCCESS (count=%llu)\n", (unsigned long long)count);
                prog_success++;
            } else {
                printf("READ FAILED (returned %zd bytes)\n", bytes);
                prog_fail++;
            }
        }
    }
    
    printf("\nReading fixed counters:\n");
    int cycles_ok = 0, inst_ok = 0;
    
    if (fds[5] != -1) {
        uint64_t count = 0;
        ssize_t bytes = read(fds[5], &count, sizeof(count));
        printf("  6. CPU_CYCLES: ");
        if (bytes == sizeof(count) && count > 0) {
            printf("SUCCESS (count=%llu)\n", (unsigned long long)count);
            cycles_ok = 1;
        } else if (bytes == sizeof(count) && count == 0) {
            printf("READ SUCCESS but count=0 (counter not actually counting)\n");
        } else {
            printf("READ FAILED (returned %zd bytes)\n", bytes);
        }
    }
    
    if (fds[6] != -1) {
        uint64_t count = 0;
        ssize_t bytes = read(fds[6], &count, sizeof(count));
        printf("  7. INSTRUCTIONS: ");
        if (bytes == sizeof(count) && count > 0) {
            printf("SUCCESS (count=%llu)\n", (unsigned long long)count);
            inst_ok = 1;
        } else if (bytes == sizeof(count) && count == 0) {
            printf("READ SUCCESS but count=0 (counter not actually counting)\n");
        } else {
            printf("READ FAILED (returned %zd bytes)\n", bytes);
        }
    }
    
    // Cleanup
    for (int i = 0; i < 7; i++) {
        if (fds[i] != -1) close(fds[i]);
    }
    
    printf("\n========================================\n");
    printf("CONCLUSION\n");
    printf("========================================\n");
    printf("Programmable counters: %d working, %d failed\n", prog_success, prog_fail);
    printf("Fixed counters: CPU_CYCLES=%s, INSTRUCTIONS=%s\n",
           cycles_ok ? "WORKING" : "NOT COUNTING",
           inst_ok ? "WORKING" : "NOT COUNTING");
    printf("\n");
    if (prog_success == 4 && prog_fail >= 1) {
        printf("RESULT: 4 programmable counter limit demonstrated!\n");
        printf("- First 4 programmable events counted successfully\n");
        printf("- 5th programmable event failed to count\n");
    }
    if (cycles_ok == 0 && inst_ok == 1) {
        printf("\nCPU_CYCLES ISSUE: Likely being used by NMI watchdog\n");
        printf("- INSTRUCTIONS works (uses Fixed Counter 0)\n");
        printf("- CPU_CYCLES fails (Fixed Counter 1 reserved by kernel)\n");
        printf("- This demonstrates fixed counters are separate from programmable\n");
    }
    printf("========================================\n");
    
    return 0;
}

Test result

From the test result, we can see that the first 4 programmable events are counted successfully, while the fifth fails to count, as expected. But why does the CPU_CYCLES fixed counter fail to count while the INSTRUCTIONS fixed counter works?

========================================
Running workload and reading counters
========================================

Reading programmable counters:
  1. BR_INST_RETIRED.NEAR_CALL: SUCCESS (count=13)
  2. BR_INST_RETIRED.CONDITIONAL: SUCCESS (count=1500025)
  3. BR_MISP_RETIRED.ALL_BRANCHES: SUCCESS (count=120)
  4. MEM_LOAD_RETIRED.L1_HIT: SUCCESS (count=1075971)
  5. MEM_LOAD_RETIRED.L1_MISS: READ FAILED (returned 0 bytes)

Reading fixed counters:
  6. CPU_CYCLES: READ FAILED (returned 0 bytes)
  7. INSTRUCTIONS: SUCCESS (count=9250117)

To find out, we can filter the kernel log for messages related to performance counters.

dmesg | grep -i "performance\|pmu\|perf" | tail -20

[    0.188328] Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
[    0.188328] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.

From the kernel messages, we can see that the NMI watchdog permanently consumes one hw-PMU counter. The NMI (Non-Maskable Interrupt) watchdog is a kernel feature that helps detect CPU lockups by triggering an NMI. It is useful for debugging hangs or deadlocks, but it does slightly affect performance. On this machine the watchdog occupies the fixed counter that CPU_CYCLES would otherwise use. The CPU_CYCLES event fails to count in our test program because we have already pinned events on all 4 programmable performance counters, so there is no fallback counter left on which it can be tracked. If you pin fewer programmable events, CPU_CYCLES should be able to use one of the free programmable performance counters instead.

We can also disable the NMI watchdog temporarily to verify our theory.

cat /proc/sys/kernel/nmi_watchdog
# it should show 1, which means the NMI watchdog is enabled.

# disable the NMI watchdog
echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog

Now re-run the test program. As expected, the two fixed counters are no longer affected by the events pinned on the programmable performance counters.

./pin_test

Reading programmable counters:
  1. BR_INST_RETIRED.NEAR_CALL: SUCCESS (count=13)
  2. BR_INST_RETIRED.CONDITIONAL: SUCCESS (count=1500025)
  3. BR_MISP_RETIRED.ALL_BRANCHES: SUCCESS (count=36)
  4. MEM_LOAD_RETIRED.L1_HIT: SUCCESS (count=1083198)
  5. MEM_LOAD_RETIRED.L1_MISS: READ FAILED (returned 0 bytes)

Reading fixed counters:
  6. CPU_CYCLES: SUCCESS (count=6763690)
  7. INSTRUCTIONS: SUCCESS (count=9250115)

In the next post, we will explore how to set up and measure the events listed in Skylake Events.