Commit 3dd6e04

Move Troubleshooting to its own page. (#166)
* Move Troubleshooting to its own page.
  Signed-off-by: Hannah Troisi <htroisi@pixielabs.ai>
* Fix image.
  Signed-off-by: Hannah Troisi <htroisi@pixielabs.ai>
* Remove deploy diagram.
  Signed-off-by: Hannah Troisi <htroisi@pixielabs.ai>
1 parent 1245242 commit 3dd6e04

3 files changed

Lines changed: 420 additions & 95 deletions

File tree

content/assets/troubleshoot-flow.svg

Lines changed: 292 additions & 0 deletions

content/en/01-about-pixie/05-faq.md

Lines changed: 0 additions & 95 deletions
@@ -36,19 +36,6 @@ order: 5
3636
- [How do I send alerts?](#how-do-i...-how-do-i-send-alerts)
3737
- [How do I delete a cluster?](#how-do-i...-how-do-i-delete-a-cluster)
3838

39-
### Troubleshooting
40-
41-
- [How do I get the Pixie debug logs?](#troubleshooting-how-do-i-get-the-pixie-debug-logs)
42-
- [My deployment is stuck / fails.](#troubleshooting-my-deployment-is-stuck-fails)
43-
- [Why does my cluster show as unavailable / unhealthy in the Live UI?](#troubleshooting-why-does-my-cluster-show-as-unavailable-unhealthy-in-the-live-ui)
44-
- [Why does my cluster show as disconnected in the Live UI?](#troubleshooting-why-does-my-cluster-show-as-disconnected-in-the-live-ui)
45-
- [Why can’t I see data?](#troubleshooting-why-can't-i-see-data)
46-
- [Why can’t I see data after enabling Data Isolation Mode?](#troubleshooting-why-can't-i-see-data-after-enabling-data-isolation-mode)
47-
- [Why can’t I see application profiles / flamegraphs for my pod / node?](#troubleshooting-why-can't-i-see-application-profiles-flamegraphs-for-my-pod-node)
48-
- [Why is the vizier-pem pod’s memory increasing?](#troubleshooting-why-is-the-vizier-pem-pod's-memory-increasing)
49-
- [Troubleshooting tracepoint scripts.](#troubleshooting-troubleshooting-pixie-tracepoint-scripts)
50-
- [How do I get help?](#troubleshooting-how-do-i-get-help)
51-
5239
## General
5340

5441
### What is Pixie?
@@ -164,85 +151,3 @@ For comprehensive alerting, we recommend integrating with third-party observabil
164151
### How do I delete a cluster?
165152

166153
The UI does not currently support deleting clusters. If you’d like to rename your cluster, you can redeploy Pixie with the cluster name flag. See the [install guides](/installing-pixie/install-schemes/) for specific instructions.
167-
168-
## Troubleshooting
169-
170-
### How do I get the Pixie debug logs?
171-
172-
Install Pixie’s [CLI tool](/installing-pixie/install-schemes/cli) and run `px collect-logs`. This command will output a zipped file named `pixie_logs_<datestamp>.zip` in the working directory. The selected kube-context determines the Kubernetes cluster that outputs the logs, so make sure that you are pointing to the correct cluster.
173-
174-
### My deployment is stuck / fails
175-
176-
*Deploy with CLI gets stuck at “Wait for PEMs/Kelvin”*
177-
178-
This step of the deployment waits for the newly deployed Pixie PEM pods to become ready and available. This step can take several minutes.
179-
180-
If some `vizier-pem` pods are not ready, use kubectl to check the individual pod’s events or check Pixie’s debug logs (which also include pod events).
181-
182-
If pods are still stuck in pending, but there are no Pixie specific errors, check that there is no resource pressure (memory, CPU) on the cluster.
183-
184-
*Deploy with CLI fails to pass health checks.*
185-
186-
This step of the deployment checks that the `vizier-cloud-connector` pod can successfully run a query on the `kelvin` pod. These queries are brokered by the `vizier-query-broker` pod. To debug a failing health check, check the Pixie debug logs for those pods for specific errors.
187-
188-
*Deploy with CLI fails waiting for the Cloud Connector to come online.*
189-
190-
This step of the deployment checks that the Cloud Connector can successfully communicate with Pixie Cloud. To debug this step, check the Pixie debug logs for the `vizier-cloud-connector` pod, check the firewall, etc.
191-
192-
### Why does my cluster show as unavailable / unhealthy in the Live UI?
193-
194-
Confirm that all of the `pl` and `px-operator` namespace pods are ready and available using `px debug pods`. Deploying Pixie usually takes anywhere between 5-7 minutes. Once Pixie is deployed, it can take a few minutes for the UI to show that the cluster is healthy.
195-
196-
To debug, follow the steps in the “Deploy with CLI fails to pass health checks” section in the [above question](/about-pixie/faq/#my-deployment-is-stuck-fails). As long as the Kelvin pod, plus at least one PEM pod is up and running, then your cluster should not show as unavailable.
197-
198-
### Why does my cluster show as disconnected in the Live UI?
199-
200-
`Cluster '<CLUSTER_NAME>' is disconnected. Pixie instrumentation on 'CLUSTER_NAME' is disconnected. Please redeploy Pixie to the cluster or choose another cluster.`
201-
202-
This error indicates that the `vizier-cloud-connector` pod is not able to connect to the cloud properly. To debug, check the events / logs for the `vizier-cloud-connector` pod. Note that after deploying Pixie, it can take a few minutes for the UI to show the cluster as available.
203-
204-
### Why can’t I see data?
205-
206-
*Live UI shows an error.*
207-
208-
Error `Table 'http_events' not found` is usually an issue with deploying Pixie onto nodes with unsupported kernel versions. Check that your kernel version is supported [here](/installing-pixie/requirements/).
209-
210-
Error `Invalid Vis Spec: Missing value for required arg service.` occurs when a script has a required argument that is missing a value. Required script arguments are denoted with an asterisk after the argument name. For example, px/service has a required variable for service name. Select the required argument drop-down box in the top left and enter a value.
211-
212-
Error `Unexpected error rpc error: code = Unknown desc = rpc error: code = Canceled desc = context canceled` is associated with a query timing out. Try reducing the `start_time` window.
213-
214-
*Live UI does not show an error, but data is missing.*
215-
216-
It is possible that you need to adjust the `start_time` window. The `start_time` window expects a negative relative time (e.g. `-5m`) or an absolute time in the format `2020-07-13 18:02:5.00 +0000`.
217-
218-
If specific services / requests are missing, it is possible that Pixie doesn't support the encryption library used by that service. You can see the list of encryption libraries supported by Pixie [here](/about-pixie/data-sources/#encryption-libraries).
219-
220-
If specific services / requests are missing, it is possible that your application was not built with [debug information](/reference/admin/debug-info). See the [Data Sources](/about-pixie/data-sources) page to see which protocols and/or encryption libraries require a build with debug information.
221-
222-
### Why can’t I see application profiles / flamegraphs for my pod / node?
223-
224-
Continuous profiling currently only supports Go/C++/Rust. The [Roadmap](/about-pixie/roadmap) contains plans to expand this support to Java, Ruby, Python, etc.
225-
226-
### Why is the vizier-pem pod’s memory increasing?
227-
228-
This is expected behavior. Pixie stores the data it collects in-memory on the nodes in your cluster; data is not sent to any centralized backend cloud outside of the cluster. So what you are observing is simply the data that it is collecting.
229-
230-
Pixie has a minimum 1GiB memory requirement per node. The default deployment is 2GiB of memory. This limit can be configured with the `--pem_memory_limit` flag when deploying Pixie. Using a value less than 1GiB is not currently recommended.
231-
232-
### Troubleshooting Pixie tracepoint scripts
233-
234-
*I’m not seeing any data for my distributed bpftrace script.*
235-
236-
Rather than query data already collected by the Pixie Platform, Distributed bpftrace Deployment scripts extend the Pixie platform to collect new data sources by deploying tracepoints when the script is run. The first time this type of script is run, it will deploy the probe and query the data (but there won't be much data at this point). Re-running the script after the probe has had more time to gather data will produce more results.
237-
238-
*I'm getting an error that tracepoints failed to deploy.*
239-
240-
Run the `px/tracepoint_status` script. It should show a longer error message in the "Tracepoint Status" table.
241-
242-
*How do I remove a tracepoint table?*
243-
244-
It is not currently possible to remove a table. Instead, we recommend renaming the table name (e.g. table_name_0) while debugging the script.
245-
246-
### How do I get help?
247-
248-
Ask a question on our [Community Slack](https://slackin.px.dev/) or file an issue on [GitHub](https://github.com/pixie-io/pixie/issues).
Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
1+
---
2+
title: "Troubleshooting"
3+
metaTitle: "About Pixie | Troubleshooting"
4+
metaDescription: "Troubleshooting problems with Pixie."
5+
order: 6
6+
---
7+
8+
This page describes how to troubleshoot Pixie. We frequently answer questions on our [community Slack](https://slackin.px.dev/) channel and in response to [GitHub issues](https://github.com/pixie-io/pixie/issues). You can also check those two places to see if your question has already been addressed. To better understand how Pixie's various components interact, please see the [Architecture](/about-pixie/architecture/overview) overview.
9+
10+
#### Troubleshooting Deployment
11+
12+
- [How do I check the status of Pixie's components?](#troubleshooting-deployment-how-do-i-check-the-status-of-pixie's-components)
13+
- [How do I get the Pixie debug logs?](#troubleshooting-deployment-how-do-i-get-the-pixie-debug-logs)
14+
- [My deployment is stuck / fails.](#troubleshooting-deployment-my-deployment-is-stuck-fails)
15+
- [Why does my cluster show as unavailable / unhealthy in the Live UI?](#troubleshooting-deployment-why-does-my-cluster-show-as-unavailable-unhealthy-in-the-live-ui)
16+
17+
#### Troubleshooting Operation
18+
19+
- [Why does my cluster show as disconnected in the Live UI?](#troubleshooting-operation-why-does-my-cluster-show-as-disconnected-in-the-live-ui)
20+
- [Why can’t I see data?](#troubleshooting-operation-why-can't-i-see-data)
21+
- [Why can’t I see application profiles / flamegraphs for my pod / node?](#troubleshooting-operation-why-can't-i-see-application-profiles-flamegraphs-for-my-pod-node)
22+
- [Why is the vizier-pem pod’s memory increasing?](#troubleshooting-operation-why-is-the-vizier-pem-pod's-memory-increasing)
23+
- [Troubleshooting tracepoint scripts.](#troubleshooting-operation-troubleshooting-pixie-tracepoint-scripts)
24+
25+
## Troubleshooting Deployment
26+
27+
### How do I check the status of Pixie's components?
28+
29+
To get an initial overview of Pixie's health, list all `Vizier` pods with `px debug pods` and verify that each one is in the `RUNNING` phase:
30+
31+
```shell
32+
$ px debug pods
33+
NAME PHASE RESTARTS MESSAGE REASON START TIME
34+
pl/pl-nats-0 RUNNING 0 2022-04-08T13:17:15-07:00
35+
pl/vizier-certmgr-76f6f89ddf-6sm76 RUNNING 0 2022-04-08T13:17:26-07:00
36+
pl/vizier-cloud-connector-57c7588c67-56p5j RUNNING 0 2022-04-08T13:17:26-07:00
37+
pl/vizier-metadata-0 RUNNING 1 2022-04-08T13:17:27-07:00
38+
pl/vizier-proxy-79bd7d9b55-w5zv5 RUNNING 0 2022-04-08T13:17:27-07:00
39+
pl/vizier-query-broker-75478b59d4-smjt2 RUNNING 0 2022-04-08T13:17:27-07:00
40+
px-operator/vizier-operator-7955d5669d-wbwzz RUNNING 0 2022-04-08T13:16:56-07:00
41+
pl/kelvin-8665676895-7dcgg RUNNING 0 2022-04-08T13:17:26-07:00
42+
pl/vizier-pem-bjkbm RUNNING 0 2022-04-08T13:17:27-07:00
43+
pl/vizier-pem-znglq RUNNING 0 2022-04-08T13:17:27-07:00
44+
```
45+
46+
Cloud components can be checked by running `kubectl get pods -n plc`.
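
As a rough sketch, the same check can be scripted with `kubectl` so that only unhealthy pods are printed. The helper name `pixie_pods_not_running` is hypothetical, and the namespaces are taken from the output above (`pl`, `px-operator`) plus `plc` for self-hosted Pixie Cloud:

```shell
# Hypothetical helper: print every pod in the given namespaces whose
# STATUS column (third column of `kubectl get pods`) is not Running/Completed.
pixie_pods_not_running() {
  local ns
  for ns in "$@"; do
    kubectl get pods -n "$ns" --no-headers 2>/dev/null \
      | awk -v ns="$ns" '$3 != "Running" && $3 != "Completed" { print ns "/" $1 ": " $3 }'
  done
}
```

Calling `pixie_pods_not_running pl px-operator` prints nothing when every pod is healthy.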
47+
48+
### How do I get the Pixie debug logs?
49+
50+
Install Pixie’s [CLI tool](/installing-pixie/install-schemes/cli) and run `px collect-logs`. This command will output a zipped file named `pixie_logs_<datestamp>.zip` in the working directory. The selected kube-context determines the Kubernetes cluster that outputs the logs, so make sure that you are pointing to the correct cluster.
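
One possible way to guard against collecting logs from the wrong cluster is to print the active kube-context first. The wrapper below is only a sketch; the function name `collect_pixie_logs` is an assumption, while `px collect-logs` and the archive name pattern come from the text above:

```shell
# Hypothetical wrapper: show which cluster is targeted, collect the logs,
# then print the newest archive found in the working directory.
collect_pixie_logs() {
  echo "Collecting logs from context: $(kubectl config current-context)"
  px collect-logs
  # px writes pixie_logs_<datestamp>.zip here; list the most recent one.
  ls -t pixie_logs_*.zip 2>/dev/null | head -n 1
}
```

Run `collect_pixie_logs` from the directory where the zip file should be written.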
51+
52+
### My deployment is stuck / fails
53+
54+
We recommend running through the following troubleshooting flow to determine where the deployment has failed.
55+
56+
<svg title='Troubleshooting the Deployment of Pixie' src='troubleshoot-flow.svg' />
57+
58+
*Deploy with CLI gets stuck at “Wait for PEMs/Kelvin”*
59+
60+
This step of the deployment waits for the newly deployed Pixie PEM pods to become ready and available. This step can take several minutes.
61+
62+
If some `vizier-pem` pods are not ready, use kubectl to check the individual pod’s events or check Pixie’s debug logs (which also include pod events).
63+
64+
If pods are still stuck in `Pending` but there are no Pixie-specific errors, check that there is no resource pressure (memory, CPU) on the cluster.
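
A minimal sketch of that check with `kubectl`, assuming Pixie is deployed in the `pl` namespace as in the pod listing earlier on this page. The function name `inspect_pending_pems` is hypothetical:

```shell
# Hypothetical helper: show pending pods and the events that explain them.
inspect_pending_pems() {
  # Pods in the pl namespace that have not been scheduled or started yet.
  kubectl get pods -n pl --field-selector=status.phase=Pending
  # Namespace events, oldest first; scheduling failures such as
  # insufficient memory or CPU are reported here.
  kubectl get events -n pl --sort-by=.lastTimestamp
}
```

For a single pod, `kubectl describe pod <pod-name> -n pl` shows the same events alongside the pod's resource requests.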
65+
66+
*Deploy with CLI fails to pass health checks.*
67+
68+
This step of the deployment checks that the `vizier-cloud-connector` pod can successfully run a query on the `kelvin` pod. These queries are brokered by the `vizier-query-broker` pod. To debug a failing health check, check the Pixie debug logs for those pods for specific errors.
69+
70+
*Deploy with CLI fails waiting for the Cloud Connector to come online.*
71+
72+
This step of the deployment checks that the Cloud Connector can successfully communicate with Pixie Cloud. To debug this step, check the Pixie debug logs for the `vizier-cloud-connector` pod and confirm that no firewall is blocking the cluster's outbound connection to Pixie Cloud.
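
If you only need the Cloud Connector's recent output rather than the full debug bundle, something like the following may be enough. It assumes `vizier-cloud-connector` runs as a Deployment in the `pl` namespace (suggested by the pod name in the listing earlier on this page, but not confirmed here); the function name is hypothetical:

```shell
# Hypothetical helper: tail the Cloud Connector logs.
# Assumption: vizier-cloud-connector is a Deployment in the pl namespace.
cloud_connector_logs() {
  kubectl logs -n pl deployment/vizier-cloud-connector --tail="${1:-100}"
}
```

`cloud_connector_logs 500` shows the last 500 lines; with no argument it shows the last 100.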
73+
74+
### Why does my cluster show as unavailable / unhealthy in the Live UI?
75+
76+
Confirm that all of the `pl` and `px-operator` namespace pods are ready and available using `px debug pods`. Deploying Pixie usually takes between 5 and 7 minutes. Once Pixie is deployed, it can take a few minutes for the UI to show that the cluster is healthy.
77+
78+
To debug, follow the steps in the “Deploy with CLI fails to pass health checks” section in the [above question](/about-pixie/faq/#my-deployment-is-stuck-fails). As long as the Kelvin pod and at least one PEM pod are up and running, your cluster should not show as unavailable.
79+
80+
## Troubleshooting Operation
81+
82+
### Why does my cluster show as disconnected in the Live UI?
83+
84+
`Cluster '<CLUSTER_NAME>' is disconnected. Pixie instrumentation on 'CLUSTER_NAME' is disconnected. Please redeploy Pixie to the cluster or choose another cluster.`
85+
86+
This error indicates that the `vizier-cloud-connector` pod is not able to connect to the cloud properly. To debug, check the events / logs for the `vizier-cloud-connector` pod. Note that after deploying Pixie, it can take a few minutes for the UI to show the cluster as available.
87+
88+
### Why can’t I see data?
89+
90+
*Live UI shows an error.*
91+
92+
Error `Table 'http_events' not found` is usually an issue with deploying Pixie onto nodes with unsupported kernel versions. Check that your kernel version is supported [here](/installing-pixie/requirements/).
93+
94+
Error `Invalid Vis Spec: Missing value for required arg service.` occurs when a script has a required argument that is missing a value. Required script arguments are denoted with an asterisk after the argument name. For example, px/service has a required variable for service name. Select the required argument drop-down box in the top left and enter a value.
95+
96+
Error `Unexpected error rpc error: code = Unknown desc = rpc error: code = Canceled desc = context canceled` is associated with a query timing out. Try reducing the `start_time` window.
97+
98+
*Live UI does not show an error, but data is missing.*
99+
100+
It is possible that you need to adjust the `start_time` window. The `start_time` window expects a negative relative time (e.g. `-5m`) or an absolute time in the format `2020-07-13 18:02:5.00 +0000`.
101+
102+
If specific services / requests are missing, it is possible that Pixie doesn't support the encryption library used by that service. You can see the list of encryption libraries supported by Pixie [here](/about-pixie/data-sources/#encryption-libraries).
103+
104+
If specific services / requests are missing, it is possible that your application was not built with [debug information](/reference/admin/debug-info). See the [Data Sources](/about-pixie/data-sources) page to see which protocols and/or encryption libraries require a build with debug information.
105+
106+
### Why can’t I see application profiles / flamegraphs for my pod / node?
107+
108+
Continuous profiling currently only supports Go/C++/Rust. The [Roadmap](/about-pixie/roadmap) contains plans to expand this support to Java, Ruby, Python, etc.
109+
110+
### Why is the vizier-pem pod’s memory increasing?
111+
112+
This is expected behavior. Pixie stores the data it collects in-memory on the nodes in your cluster; data is not sent to any centralized backend cloud outside of the cluster. So what you are observing is simply the data that it is collecting.
113+
114+
Pixie has a minimum 1GiB memory requirement per node. The default deployment sets a 2GiB memory limit. This limit can be configured with the `--pem_memory_limit` flag when deploying Pixie. Using a value less than 1GiB is not currently recommended.
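
For illustration, a CLI deployment with an explicit limit might look like the sketch below. The `--pem_memory_limit` flag is the one named above; the value format (`2Gi`, a Kubernetes-style quantity) and the wrapper name `deploy_with_pem_limit` are assumptions:

```shell
# Hypothetical wrapper: deploy Pixie with an explicit PEM memory limit.
# Assumption: the flag accepts a Kubernetes-style quantity such as 2Gi.
deploy_with_pem_limit() {
  local limit="${1:-2Gi}"
  px deploy --pem_memory_limit="$limit"
}
```

For example, `deploy_with_pem_limit 1Gi` requests the minimum recommended limit.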
115+
116+
### Troubleshooting Pixie tracepoint scripts
117+
118+
*I’m not seeing any data for my distributed bpftrace script.*
119+
120+
Rather than query data already collected by the Pixie Platform, Distributed bpftrace Deployment scripts extend the Pixie platform to collect new data sources by deploying tracepoints when the script is run. The first time this type of script is run, it will deploy the probe and query the data (but there won't be much data at this point). Re-running the script after the probe has had more time to gather data will produce more results.
121+
122+
*I'm getting an error that tracepoints failed to deploy.*
123+
124+
Run the `px/tracepoint_status` script. It should show a longer error message in the "Tracepoint Status" table.
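
The script can be run from the Live UI or, as sketched here, from the CLI with `px run`. The wrapper name `tracepoint_status` is hypothetical:

```shell
# Hypothetical wrapper: run the tracepoint status script against the
# cluster selected by the current kube-context.
tracepoint_status() {
  px run px/tracepoint_status
}
```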
125+
126+
*How do I remove a tracepoint table?*
127+
128+
It is not currently possible to remove a table. Instead, we recommend renaming the table (e.g. to `table_name_0`) while debugging the script.
