Commit f5caf26

bwicaksono authored and willdeacon committed

perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU

The Unified Coherence Fabric (UCF) contains the last level cache and cache
coherent interconnect in the Tegra410 SOC. The PMU in this device can be used
to capture events related to accesses to the last level cache and memory from
different sources.

Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>

1 parent d332424 commit f5caf26

3 files changed: 193 additions & 1 deletion

Documentation/admin-guide/perf/index.rst (1 addition, 0 deletions)

@@ -25,6 +25,7 @@ Performance monitor support
    alibaba_pmu
    dwc_pcie_pmu
    nvidia-tegra241-pmu
+   nvidia-tegra410-pmu
    meson-ddr-pmu
    cxl
    ampere_cspmu
Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst (new file, 106 additions)

@@ -0,0 +1,106 @@
+=====================================================================
+NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
+=====================================================================
+
+The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
+metrics like memory bandwidth, latency, and utilization:
+
+* Unified Coherence Fabric (UCF)
+
+PMU Driver
+----------
+
+The PMU driver describes the available events and configuration of each PMU in
+sysfs. Please see the sections below to get the sysfs path of each PMU. Like
+other uncore PMU drivers, the driver provides a "cpumask" sysfs attribute to
+show the CPU id used to handle the PMU event. There is also an
+"associated_cpus" sysfs attribute, which contains the list of CPUs associated
+with the PMU instance.
+
+UCF PMU
+-------
+
+The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
+distributed last-level cache for CPU memory and CXL memory, and as a cache
+coherent interconnect that supports hardware coherence across multiple
+coherently caching agents, including:
+
+* CPU clusters
+* GPU
+* PCIe Ordering Controller Unit (OCU)
+* Other IO-coherent requesters
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.
+
+Some of the events available in this PMU can be used to measure bandwidth and
+utilization:
+
+* slc_access_rd: counts the number of read requests to the SLC.
+* slc_access_wr: counts the number of write requests to the SLC.
+* slc_bytes_rd: counts the number of bytes transferred by slc_access_rd.
+* slc_bytes_wr: counts the number of bytes transferred by slc_access_wr.
+* mem_access_rd: counts the number of read requests to local or remote memory.
+* mem_access_wr: counts the number of write requests to local or remote memory.
+* mem_bytes_rd: counts the number of bytes transferred by mem_access_rd.
+* mem_bytes_wr: counts the number of bytes transferred by mem_access_wr.
+* cycles: counts the UCF cycles.
+
+The average bandwidth is calculated as::
+
+  AVG_SLC_READ_BANDWIDTH_IN_GBPS  = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
+  AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
+  AVG_MEM_READ_BANDWIDTH_IN_GBPS  = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
+  AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS
+
+The average request rate is calculated as::
+
+  AVG_SLC_READ_REQUEST_RATE  = SLC_ACCESS_RD / CYCLES
+  AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
+  AVG_MEM_READ_REQUEST_RATE  = MEM_ACCESS_RD / CYCLES
+  AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES
+
+More details about the other available events can be found in the Tegra410 SoC
+technical reference manual.
+
+The events can be filtered based on source or destination. The source filter
+indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or
+remote socket. The destination filter specifies the destination memory type,
+e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The
+local/remote classification of the destination filter is based on the home
+socket of the address, not where the data actually resides. The available
+filters are described in
+/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.
+
+The list of UCF PMU event filters:
+
+* Source filter:
+
+  * src_loc_cpu: if set, count events from the local CPU
+  * src_loc_noncpu: if set, count events from local non-CPU devices
+  * src_rem: if set, count events from the CPU, GPU, and PCIe devices of the
+    remote socket
+
+* Destination filter:
+
+  * dst_loc_cmem: if set, count events to local system memory (CMEM) addresses
+  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) addresses
+  * dst_loc_other: if set, count events to local CXL memory addresses
+  * dst_rem: if set, count events to CPU, GPU, and CXL memory addresses of the
+    remote socket
+
+If the source is not specified, the PMU will count events from all sources. If
+the destination is not specified, the PMU will count events to all destinations.
+
+Example usage:
+
+* Count event id 0x0 in socket 0 from all sources and to all destinations::
+
+    perf stat -a -e nvidia_ucf_pmu_0/event=0x0/
+
+* Count event id 0x0 in socket 0 with source filter = local CPU and destination
+  filter = local system memory (CMEM)::
+
+    perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/
+
+* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
+  destination filter = remote memory::
+
+    perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/

drivers/perf/arm_cspmu/nvidia_cspmu.c (86 additions, 1 deletion)
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  *
  */

@@ -21,6 +21,13 @@
 #define NV_CNVL_PORT_COUNT	4ULL
 #define NV_CNVL_FILTER_ID_MASK	GENMASK_ULL(NV_CNVL_PORT_COUNT - 1, 0)
 
+#define NV_UCF_SRC_COUNT	3ULL
+#define NV_UCF_DST_COUNT	4ULL
+#define NV_UCF_FILTER_ID_MASK	GENMASK_ULL(11, 0)
+#define NV_UCF_FILTER_SRC	GENMASK_ULL(2, 0)
+#define NV_UCF_FILTER_DST	GENMASK_ULL(11, 8)
+#define NV_UCF_FILTER_DEFAULT	(NV_UCF_FILTER_SRC | NV_UCF_FILTER_DST)
+
 #define NV_GENERIC_FILTER_ID_MASK	GENMASK_ULL(31, 0)
 
 #define NV_PRODID_MASK	(PMIIDR_PRODUCTID | PMIIDR_VARIANT | PMIIDR_REVISION)
@@ -124,6 +131,36 @@ static struct attribute *mcf_pmu_event_attrs[] = {
 	NULL,
 };
 
+static struct attribute *ucf_pmu_event_attrs[] = {
+	ARM_CSPMU_EVENT_ATTR(bus_cycles, 0x1D),
+
+	ARM_CSPMU_EVENT_ATTR(slc_allocate, 0xF0),
+	ARM_CSPMU_EVENT_ATTR(slc_wb, 0xF3),
+	ARM_CSPMU_EVENT_ATTR(slc_refill_rd, 0x109),
+	ARM_CSPMU_EVENT_ATTR(slc_refill_wr, 0x10A),
+	ARM_CSPMU_EVENT_ATTR(slc_hit_rd, 0x119),
+
+	ARM_CSPMU_EVENT_ATTR(slc_access_dataless, 0x183),
+	ARM_CSPMU_EVENT_ATTR(slc_access_atomic, 0x184),
+
+	ARM_CSPMU_EVENT_ATTR(slc_access_rd, 0x111),
+	ARM_CSPMU_EVENT_ATTR(slc_access_wr, 0x112),
+	ARM_CSPMU_EVENT_ATTR(slc_bytes_rd, 0x113),
+	ARM_CSPMU_EVENT_ATTR(slc_bytes_wr, 0x114),
+
+	ARM_CSPMU_EVENT_ATTR(mem_access_rd, 0x121),
+	ARM_CSPMU_EVENT_ATTR(mem_access_wr, 0x122),
+	ARM_CSPMU_EVENT_ATTR(mem_bytes_rd, 0x123),
+	ARM_CSPMU_EVENT_ATTR(mem_bytes_wr, 0x124),
+
+	ARM_CSPMU_EVENT_ATTR(local_snoop, 0x180),
+	ARM_CSPMU_EVENT_ATTR(ext_snp_access, 0x181),
+	ARM_CSPMU_EVENT_ATTR(ext_snp_evict, 0x182),
+
+	ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
+	NULL
+};
+
 static struct attribute *generic_pmu_event_attrs[] = {
 	ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
 	NULL,
@@ -152,6 +189,18 @@ static struct attribute *cnvlink_pmu_format_attrs[] = {
 	NULL,
 };
 
+static struct attribute *ucf_pmu_format_attrs[] = {
+	ARM_CSPMU_FORMAT_EVENT_ATTR,
+	ARM_CSPMU_FORMAT_ATTR(src_loc_noncpu, "config1:0"),
+	ARM_CSPMU_FORMAT_ATTR(src_loc_cpu, "config1:1"),
+	ARM_CSPMU_FORMAT_ATTR(src_rem, "config1:2"),
+	ARM_CSPMU_FORMAT_ATTR(dst_loc_cmem, "config1:8"),
+	ARM_CSPMU_FORMAT_ATTR(dst_loc_gmem, "config1:9"),
+	ARM_CSPMU_FORMAT_ATTR(dst_loc_other, "config1:10"),
+	ARM_CSPMU_FORMAT_ATTR(dst_rem, "config1:11"),
+	NULL
+};
+
 static struct attribute *generic_pmu_format_attrs[] = {
 	ARM_CSPMU_FORMAT_EVENT_ATTR,
 	ARM_CSPMU_FORMAT_FILTER_ATTR,
@@ -236,6 +285,27 @@ static void nv_cspmu_set_cc_filter(struct arm_cspmu *cspmu,
 	writel(filter, cspmu->base0 + PMCCFILTR);
 }
 
+static u32 ucf_pmu_event_filter(const struct perf_event *event)
+{
+	u32 ret, filter, src, dst;
+
+	filter = nv_cspmu_event_filter(event);
+
+	/* Monitor all sources if none is selected. */
+	src = FIELD_GET(NV_UCF_FILTER_SRC, filter);
+	if (src == 0)
+		src = GENMASK_ULL(NV_UCF_SRC_COUNT - 1, 0);
+
+	/* Monitor all destinations if none is selected. */
+	dst = FIELD_GET(NV_UCF_FILTER_DST, filter);
+	if (dst == 0)
+		dst = GENMASK_ULL(NV_UCF_DST_COUNT - 1, 0);
+
+	ret = FIELD_PREP(NV_UCF_FILTER_SRC, src);
+	ret |= FIELD_PREP(NV_UCF_FILTER_DST, dst);
+
+	return ret;
+}
 
 enum nv_cspmu_name_fmt {
 	NAME_FMT_GENERIC,
@@ -342,6 +412,21 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
 			.init_data = NULL
 		},
 	},
+	{
+		.prodid = 0x2CF20000,
+		.prodid_mask = NV_PRODID_MASK,
+		.name_pattern = "nvidia_ucf_pmu_%u",
+		.name_fmt = NAME_FMT_SOCKET,
+		.template_ctx = {
+			.event_attr = ucf_pmu_event_attrs,
+			.format_attr = ucf_pmu_format_attrs,
+			.filter_mask = NV_UCF_FILTER_ID_MASK,
+			.filter_default_val = NV_UCF_FILTER_DEFAULT,
+			.filter2_mask = 0x0,
+			.filter2_default_val = 0x0,
+			.get_filter = ucf_pmu_event_filter,
+		},
+	},
 	{
 		.prodid = 0,
 		.prodid_mask = 0,
