Commit bf585ba

bwicaksononv authored and willdeacon committed
perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU
Adds PCIE PMU support in the Tegra410 SoC. This PMU is instantiated in each
root complex in the SoC and can capture traffic from PCIE devices to various
memory types. The PMU can filter traffic based on the originating root port or
BDF and on the target memory type (CPU DRAM, GPU memory, CXL memory, or remote
memory).

Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
1 parent bc86281

2 files changed: 368 additions & 5 deletions

Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst

Lines changed: 163 additions & 0 deletions
@@ -6,6 +6,7 @@ The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Unified Coherence Fabric (UCF)
* PCIE

PMU Driver
----------
@@ -104,3 +105,165 @@ Example usage:
destination filter = remote memory::

  perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/

PCIE PMU
--------

This PMU is located in the SoC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors all read/write traffic from the root port(s)
or a particular BDF in a PCIE RC to local or remote memory. There is one PMU per
PCIE RC in the SoC. Each RC can have up to 16 lanes that can be bifurcated into
up to 8 root ports. The traffic from each root port can be filtered using the RP
or BDF filter. For example, specifying "src_rp_mask=0xFF" makes the PMU counter
capture traffic from all RPs. Please see below for more details.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>.

The events in this PMU can be used to measure bandwidth, utilization, and
latency:

* rd_req: count the number of read requests issued by PCIE devices.
* wr_req: count the number of write requests issued by PCIE devices.
* rd_bytes: count the number of bytes transferred by rd_req.
* wr_bytes: count the number of bytes transferred by wr_req.
* rd_cum_outs: count the outstanding rd_req each cycle.
* cycles: count the clock cycles of the SoC fabric connected to the PCIE
  interface.

The average bandwidth is calculated as::

  AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
  AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

  AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
  AVG_WR_REQUEST_RATE = WR_REQ / CYCLES

The average latency is calculated as::

  FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
  AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
  AVG_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
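As a worked example, the formulas above can be evaluated with a short script.
The counter values below are made up for illustration; on real hardware they
would come from "perf stat" readings of the corresponding PMU events over the
measurement window.

```shell
#!/bin/bash
# Illustrative only: evaluate the derived metrics above from sample counter
# values. All numbers are invented for the example.
RD_BYTES=2048000000     # rd_bytes over the window
RD_REQ=16000000         # rd_req over the window
RD_CUM_OUTS=512000000   # rd_cum_outs over the window
CYCLES=2000000000       # cycles over the window
ELAPSED_NS=1000000000   # 1 second measurement window

awk -v b="$RD_BYTES" -v r="$RD_REQ" -v o="$RD_CUM_OUTS" \
    -v c="$CYCLES" -v t="$ELAPSED_NS" 'BEGIN {
  printf "AVG_RD_BANDWIDTH_IN_GBPS = %.3f\n", b / t   # 2.048
  printf "AVG_RD_REQUEST_RATE      = %.3f\n", r / c   # 0.008
  freq = c / t                   # FREQ_IN_GHZ = 2
  lat  = (o / r) / freq          # 32 cycles at 2 GHz
  printf "AVG_LATENCY_IN_NS        = %.3f\n", lat     # 16.000
}'
```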

The PMU events can be filtered based on the traffic source and destination.
The source filter indicates the PCIE devices that will be monitored. The
destination filter specifies the destination memory type, e.g. local system
memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote
classification of the destination filter is based on the home socket of the
address, not on where the data actually resides. These filters can be found in
/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/.

The list of event filters:

* Source filter:

  * src_rp_mask: bitmask of the root ports that will be monitored. Each bit in
    this bitmask represents an RP index in the RC. If a bit is set, all devices
    under the associated RP will be monitored. E.g. "src_rp_mask=0xF" will
    monitor devices in root ports 0 to 3.
  * src_bdf: the BDF that will be monitored. This is a 16-bit value that
    follows the formula: (bus << 8) + (device << 3) + (function). For example,
    the value of BDF 27:01.1 is 0x2709.
  * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in
    "src_bdf" is used to filter the traffic.

  Note that the root-port and BDF filters are mutually exclusive, and the PMU
  in each RC has a single BDF filter shared by all counters. If the BDF filter
  is enabled, the BDF filter value is applied to all events.

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) addresses.
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) addresses.
  * dst_loc_pcie_p2p: if set, count events to local PCIE peer addresses.
  * dst_loc_pcie_cxl: if set, count events to local CXL memory addresses.
  * dst_rem: if set, count events to remote memory addresses.

If the source filter is not specified, the PMU will count events from all root
ports. If the destination filter is not specified, the PMU will count events
to all destinations.

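The src_bdf encoding can be sketched as a small shell helper. The helper name
is hypothetical and only illustrates the (bus << 8) + (device << 3) + (function)
formula stated above:

```shell
#!/bin/bash
# Hypothetical helper: encode hex bus/device/function into the 16-bit
# src_bdf value using (bus << 8) + (device << 3) + (function).
bdf_encode() {
    local bus=$((16#$1)) dev=$((16#$2)) fn=$((16#$3))
    printf '0x%04x\n' $(( (bus << 8) + (dev << 3) + fn ))
}

bdf_encode 27 01 1   # BDF 27:01.1 -> 0x2709
```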
Example usage:

* Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/

* Count event id 0x1 from root ports 0 and 1 of PCIE RC-1 on socket 0,
  targeting just local CMEM of socket 0::

    perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/

* Count event id 0x3 from root ports 0 and 1 of PCIE RC-3 on socket 1,
  targeting just local CMEM of socket 1::

    perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0108,src_bdf_en=0x1/

Mapping the RC# to the lspci segment number can be non-trivial; hence a new
NVIDIA Designated Vendor-Specific Extended Capability (DVSEC) register is added
into the PCIE config space of each RP. This DVSEC has vendor id "10de" and
DVSEC id "0x4". The DVSEC register contains the following information to map
PCIE devices under an RP back to their RC#:

- Bus# (byte 0xc): bus number as reported by the lspci output
- Segment# (byte 0xd): segment number as reported by the lspci output
- RP# (byte 0xe): port number as reported by the LnkCap attribute from lspci
  for a device with Root Port capability
- RC# (byte 0xf): root complex number associated with the RP
- Socket# (byte 0x10): socket number associated with the RP

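The byte offsets above are relative to the base of the DVSEC capability in
config space. A minimal sketch, assuming a made-up capability base of 0x258
(on real hardware the base comes from lspci, as in the mapping script below):

```shell
#!/bin/bash
# Illustrative only: derive config-space offsets of the mapping fields from
# a DVSEC capability base. The base 0x258 is invented for the example.
dvsec4_reg=0x258
for field in Bus:0xc Segment:0xd RP:0xe RC:0xf Socket:0x10; do
    name=${field%%:*}
    off=${field##*:}
    printf '%-8s -> 0x%x\n' "$name" $(( dvsec4_reg + off ))
done
```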
Example script for mapping lspci BDF to RC# and socket#::

  #!/bin/bash
  while read bdf rest; do
    dvsec4_reg=$(lspci -vv -s $bdf | awk '
      /Designated Vendor-Specific: Vendor=10de ID=0004/ {
        match($0, /\[([0-9a-fA-F]+)/, arr);
        print "0x" arr[1];
        exit
      }
    ')
    if [ -n "$dvsec4_reg" ]; then
      bus=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xc))).b)
      segment=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xd))).b)
      rp=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xe))).b)
      rc=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xf))).b)
      socket=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0x10))).b)
      echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket"
    fi
  done < <(lspci -d 10de:)

247+
Example output::
248+
249+
0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00
250+
0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
251+
0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00
252+
0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00
253+
0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00
254+
0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00
255+
0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00
256+
0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00
257+
0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00
258+
0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00
259+
0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00
260+
0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01
261+
000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01
262+
000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01
263+
000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01
264+
000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01
265+
000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01
266+
000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01
267+
000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
268+
000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
269+
000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01
