@@ -6,6 +6,7 @@ The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
66metrics like memory bandwidth, latency, and utilization:
77
88* Unified Coherence Fabric (UCF)
9+ * PCIE
910
1011PMU Driver
1112----------
@@ -104,3 +105,165 @@ Example usage:
104105 destination filter = remote memory::
105106
106107 perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
108+
109+ PCIE PMU
110+ --------
111+
112+ This PMU is located in the SoC fabric connecting the PCIE root complex (RC) and
113+ the memory subsystem. It monitors all read/write traffic from the root port(s)
114+ or a particular BDF (bus/device/function) in a PCIE RC to local or remote memory.
115+ There is one PMU per PCIE RC in the SoC. Each RC can have up to 16 lanes that can
116+ be bifurcated into up to 8 root ports (RPs). The traffic from each root port can be
117+ filtered using the RP or BDF filter. For example, specifying "src_rp_mask=0xFF"
118+ means the PMU counter will capture traffic from all RPs. See below for details.
119+
120+ The events and configuration options of this PMU device are described in sysfs,
121+ see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>.
122+
123+ The events in this PMU can be used to measure bandwidth, utilization, and
124+ latency:
125+
126+ * rd_req: count the number of read requests issued by PCIE devices.
127+ * wr_req: count the number of write requests issued by PCIE devices.
128+ * rd_bytes: count the number of bytes transferred by rd_req.
129+ * wr_bytes: count the number of bytes transferred by wr_req.
130+ * rd_cum_outs: accumulate, every cycle, the number of rd_req still outstanding.
131+ * cycles: count the clock cycles of the SoC fabric connected to the PCIE interface.
132+
133+ The average bandwidth is calculated as::
134+
135+ AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
136+ AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS
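As a rough illustration of the bandwidth formula above, the following sketch divides a byte count by an elapsed time in nanoseconds. The function name ``avg_bandwidth_gbps`` and the sample numbers are invented for this example. In practice the byte count would come from the rd_bytes or wr_bytes counter and the elapsed time from the perf stat run.

```shell
#!/bin/bash
# Hypothetical helper (not part of the driver or of perf).
# bytes / ns is already GB/s, because 1 byte/ns equals 10^9 bytes/s.
avg_bandwidth_gbps() {
    awk -v b="$1" -v t="$2" 'BEGIN {
        if (t == 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.3f\n", b / t
    }'
}

# Made-up sample: 2,000,000,000 bytes counted during 1 second (1e9 ns).
avg_bandwidth_gbps 2000000000 1000000000
```

With these made-up inputs the sketch prints 2.000, that is, 2 GB/s.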
137+
138+ The average request rate is calculated as::
139+
140+ AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
141+ AVG_WR_REQUEST_RATE = WR_REQ / CYCLES
142+
143+
144+ The average latency is calculated as::
145+
146+ FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
147+ AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
148+ AVG_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
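The three latency steps above can be chained as in the sketch below. The helper name ``avg_rd_latency_ns`` and the sample counter values are assumptions made for illustration. Real inputs would be the rd_cum_outs, rd_req and cycles counts plus the elapsed time of the measurement.

```shell
#!/bin/bash
# Hypothetical helper: derive the average read latency in ns from raw counts.
# Arguments: rd_cum_outs rd_req cycles elapsed_ns
avg_rd_latency_ns() {
    awk -v outs="$1" -v req="$2" -v cyc="$3" -v ns="$4" 'BEGIN {
        # Guard against empty measurements before dividing.
        if (req == 0 || cyc == 0 || ns == 0) { print "n/a"; exit }
        freq_ghz   = cyc / ns      # FREQ_IN_GHZ
        lat_cycles = outs / req    # AVG_LATENCY_IN_CYCLES
        printf "%.1f\n", lat_cycles / freq_ghz
    }'
}

# Made-up sample: rd_cum_outs=600000 over 1000 reads, with 2e9 fabric
# cycles counted in 1e9 ns (a 2 GHz fabric clock).
avg_rd_latency_ns 600000 1000 2000000000 1000000000
```

For the sample values this gives 600 cycles per read at 2 GHz, so the sketch prints 300.0.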
149+
150+ The PMU events can be filtered based on the traffic source and destination.
151+ The source filter indicates the PCIE devices that will be monitored. The
152+ destination filter specifies the destination memory type, e.g. local system
153+ memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote
154+ classification of the destination filter is based on the home socket of the
155+ address, not where the data actually resides. These filters can be found in
156+ /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/.
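To see which filters a given system actually exposes, the format directory can be walked with a few lines of shell. This is only a sketch: the function name ``list_pcie_pmu_formats`` is made up, and the optional root argument exists solely so the function can be tried against a scratch directory.

```shell
#!/bin/bash
# Hypothetical helper: print every format attribute of every PCIE PMU
# instance found under the given event_source root.
list_pcie_pmu_formats() {
    root=${1:-/sys/bus/event_source/devices}
    for dev in "$root"/nvidia_pcie_pmu_*; do
        [ -d "$dev/format" ] || continue      # no such PMU on this system
        for fmt in "$dev"/format/*; do
            [ -f "$fmt" ] || continue
            printf '%s: %s = %s\n' "${dev##*/}" "${fmt##*/}" "$(cat "$fmt")"
        done
    done
}

list_pcie_pmu_formats
```

On a machine without these PMU devices the function prints nothing.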
157+
158+ The list of event filters:
159+
160+ * Source filter:
161+
162+ * src_rp_mask: bitmask of root ports that will be monitored. Each bit in this
163+ bitmask represents the RP index in the RC. If the bit is set, all devices under
164+ the associated RP will be monitored. For example, "src_rp_mask=0xF" monitors
165+ devices under root ports 0 to 3.
166+ * src_bdf: the BDF that will be monitored. This is a 16-bit value that
167+ follows the formula: (bus << 8) + (device << 3) + (function). For example, the
168+ value of BDF 27:01.1 is 0x2709.
169+ * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in
170+ "src_bdf" is used to filter the traffic.
171+
172+ Note that the root port and BDF filters are mutually exclusive, and the PMU in
173+ each RC supports only one BDF filter, shared by all of its counters. If the BDF
174+ filter is enabled, its value is applied to all events.
175+
176+ * Destination filter:
177+
178+ * dst_loc_cmem: if set, count events targeting local system memory (CMEM) addresses.
179+ * dst_loc_gmem: if set, count events targeting local GPU memory (GMEM) addresses.
180+ * dst_loc_pcie_p2p: if set, count events targeting local PCIE peer addresses.
181+ * dst_loc_pcie_cxl: if set, count events targeting local CXL memory addresses.
182+ * dst_rem: if set, count events targeting remote memory addresses.
183+
184+ If the source filter is not specified, the PMU will count events from all root
185+ ports. If the destination filter is not specified, the PMU will count events
186+ to all destinations.
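Because each bit of src_rp_mask selects one root port index, the mask can be assembled from a list of indices. The sketch below assumes a made-up helper name, ``rp_mask``, and limits indices to 0 through 7 on the basis that an RC has at most 8 root ports.

```shell
#!/bin/bash
# Hypothetical helper: build a src_rp_mask value from root port indices.
rp_mask() {
    mask=0
    for rp in "$@"; do
        case $rp in
            [0-7]) mask=$(( mask | (1 << rp) )) ;;   # set the bit for this RP
            *) echo "invalid root port index: $rp" >&2; return 1 ;;
        esac
    done
    printf '0x%x\n' "$mask"
}

rp_mask 0 1 2 3    # root ports 0 to 3
```

The call shown prints 0xf, matching the "src_rp_mask=0xF" example for root ports 0 to 3.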
187+
188+ Example usage:
189+
190+ * Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all
191+ destinations::
192+
193+ perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/
194+
195+ * Count event id 0x1 from root ports 0 and 1 of PCIE RC-1 on socket 0,
196+ targeting just local CMEM of socket 0::
197+
198+ perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/
199+
200+ * Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all
201+ destinations::
202+
203+ perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/
204+
205+ * Count event id 0x3 from root ports 0 and 1 of PCIE RC-3 on socket 1,
206+ targeting just local CMEM of socket 1::
207+
208+ perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/
209+
210+ * Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all
211+ destinations::
212+
213+ perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0108,src_bdf_en=0x1/
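The src_bdf value can be computed rather than worked out by hand. The following sketch applies the formula (bus << 8) + (device << 3) + (function) to a "bus:device.function" string with hexadecimal fields and no segment prefix. The helper name ``bdf_to_src_bdf`` and the sample BDF are assumptions for illustration only.

```shell
#!/bin/bash
# Hypothetical helper: encode "bus:device.function" (hex fields, no segment
# prefix) into the 16-bit src_bdf value.
bdf_to_src_bdf() {
    bdf=$1
    bus=${bdf%%:*}                   # text before the ':'
    dev=${bdf#*:}; dev=${dev%%.*}    # text between ':' and '.'
    fn=${bdf##*.}                    # text after the '.'
    printf '0x%04x\n' $(( (0x$bus << 8) + (0x$dev << 3) + 0x$fn ))
}

# Made-up sample BDF: bus 0x3a, device 0x02, function 1.
bdf_to_src_bdf 3a:02.1
```

For the sample BDF this prints 0x3a11, that is 0x3a00 + 0x10 + 0x1. The result can then be passed as src_bdf together with src_bdf_en=0x1.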
214+
215+ Mapping the RC# to the lspci segment number can be non-trivial, so an NVIDIA
216+ Designated Vendor-Specific Extended Capability (DVSEC) register is added to the PCIE
217+ config space of each RP. This DVSEC has vendor ID "10de" and DVSEC ID "0x4". It contains the
218+ following information to map PCIE devices under the RP back to its RC# (byte offsets are relative to the DVSEC base):
219+
220+ - Bus# (byte 0xc): bus number as reported by the lspci output
221+ - Segment# (byte 0xd): segment number as reported by the lspci output
222+ - RP# (byte 0xe): port number as reported by the LnkCap attribute from lspci for a device with Root Port capability
223+ - RC# (byte 0xf): root complex number associated with the RP
224+ - Socket# (byte 0x10): socket number associated with the RP
225+
226+ Example script for mapping lspci BDF to RC# and socket#::
227+
228+ #!/bin/bash
229+ while read -r bdf rest; do
230+     dvsec4_reg=$(lspci -vv -s "$bdf" | awk '
231+         /Designated Vendor-Specific: Vendor=10de ID=0004/ {
232+             match($0, /\[([0-9a-fA-F]+)/, arr);  # three-argument match() requires gawk
233+             print "0x" arr[1];
234+             exit
235+         }
236+     ')
237+     if [ -n "$dvsec4_reg" ]; then
238+         bus=$(setpci -s "$bdf" "$(printf '0x%x' $((${dvsec4_reg} + 0xc))).b")
239+         segment=$(setpci -s "$bdf" "$(printf '0x%x' $((${dvsec4_reg} + 0xd))).b")
240+         rp=$(setpci -s "$bdf" "$(printf '0x%x' $((${dvsec4_reg} + 0xe))).b")
241+         rc=$(setpci -s "$bdf" "$(printf '0x%x' $((${dvsec4_reg} + 0xf))).b")
242+         socket=$(setpci -s "$bdf" "$(printf '0x%x' $((${dvsec4_reg} + 0x10))).b")
243+         echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket"
244+     fi
245+ done < <(lspci -d 10de:)
246+
247+ Example output::
248+
249+ 0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00
250+ 0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
251+ 0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00
252+ 0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00
253+ 0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00
254+ 0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00
255+ 0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00
256+ 0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00
257+ 0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00
258+ 0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00
259+ 0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00
260+ 0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01
261+ 000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01
262+ 000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01
263+ 000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01
264+ 000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01
265+ 000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01
266+ 000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01
267+ 000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
268+ 000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
269+ 000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01
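A line of the mapping output can be turned into the name of the PMU device to pass to perf. The sketch below rests on an assumption that is not stated above: that the <socket-id> and <pcie-rc-id> parts of the device name are the decimal forms of the hexadecimal Socket and RC values printed by setpci. The helper name ``pmu_name_for_line`` is likewise made up.

```shell
#!/bin/bash
# Hypothetical helper: map one line of the example output to a PMU name.
# Assumption: device-name ids are the decimal forms of the hex Socket/RC.
pmu_name_for_line() {
    rc=$(printf '%s\n' "$1" | sed -n 's/.*RC=\([0-9a-fA-F]*\).*/\1/p')
    socket=$(printf '%s\n' "$1" | sed -n 's/.*Socket=\([0-9a-fA-F]*\).*/\1/p')
    if [ -z "$rc" ] || [ -z "$socket" ]; then
        return 1                     # line does not look like mapping output
    fi
    printf 'nvidia_pcie_pmu_%d_rc_%d\n' "0x$socket" "0x$rc"
}

pmu_name_for_line "000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01"
```

For the sample line this prints nvidia_pcie_pmu_1_rc_1. The RP value from the same line gives the bit position to set in src_rp_mask.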