What is it about?
OpenShift has seen a lot of traction since the release of its third version, based on Kubernetes, a couple of years ago. More and more companies, after a thorough evaluation of OpenShift Container Platform (OCP), have built an on-premises or cloud-based PaaS. As a next step, they have started to run their applications on OCP. One of the important aspects of running applications in production is the capacity to quickly restore services to the normal service level after an incident, followed by the identification and resolution of the underlying problem. In this respect, I want to present in this blog a few approaches for troubleshooting Java applications running on OpenShift. Similar approaches can be taken with other languages.
Debugging applications during development phase can be done thanks to features like:
- Debug mode for resolving issues during startup.
- Port forwarding for connecting an IDE like JBDS to an application running in a remote container and debugging it with breakpoints and object inspection.
This has been presented in blogs like here and here.
In this blog, on the contrary, I want to focus on troubleshooting applications in production and to cover things like capturing heap and thread dumps and measuring resource consumption per thread. These are techniques that have more than once been helpful in the past for resolving deadlocks, memory leaks or performance degradation due to excessive garbage collection, for instance.
Let’s get into the heart of the matter!
Is there something going on?
It all begins by noticing that there is something fishy. OpenShift provides several mechanisms for scrutinizing the health of your application.
Health and readiness
Built-in health and readiness checks that support custom probes are the first defense mechanism. OpenShift reacts to a negative answer to the former by stopping the faulty container and starting a fresh one. The reaction to a negative answer to the latter is to take the pod out of the work rotation: no request is forwarded by the service and router layers to the faulty container.
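If your application exposes an HTTP health endpoint, such probes can be added to an existing deployment configuration from the command line. A minimal sketch, assuming a deployment configuration named myapp and a /health endpoint served on port 8080:
$ oc set probe dc/myapp --liveness --get-url=http://:8080/health --initial-delay-seconds=30
$ oc set probe dc/myapp --readiness --get-url=http://:8080/health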
Logs
Container logs can be accessed in several ways:
- $ oc logs <pod-name>
- the web console
- Kibana for the aggregated logs
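When a container has already been restarted, for instance after a crash or a failed health check, the logs of the previous container instance are often the most interesting ones. They can still be retrieved, here with the pod name used in the examples further down:
$ oc logs --previous tomcat8-4-37683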
Metrics
CPU and memory consumption is visible in the web console and accessible through Hawkular, Prometheus, a monitoring system integrated with one of these, or directly through Heapster or the kubelet API. Custom metrics can also be gathered through JMX or Prometheus endpoints. Trends over a given time range can be visualized in Prometheus, Hawkular or CloudForms.
JMX
Besides custom metrics, JMX also gives access to information on the JVM or on the application itself if specific MBeans have been implemented, which is the case for EAP for instance.
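As an early example, the JVM memory MBean can be read through the Jolokia endpoint described later in this blog. A minimal sketch, assuming the master is used as a reverse proxy and the pod exposes Jolokia on port 8778, as in the later examples:
$ curl -H "Authorization: Bearer $(oc whoami -t)" \
 'https://osemaster.sandbox.com:8443/api/v1/namespaces/test/pods/https:tomcat8-4-37683:8778/proxy/jolokia/read/java.lang:type=Memory/HeapMemoryUsage'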
Events
When a pod is killed and restarted because it ran out of memory or because the health check did not give a positive answer, an event gets created. It is visible in the web console, through the CLI ("oc get events") or via a monitoring system watching events.
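Events can also be watched as they occur or sorted by time from the CLI, which helps to correlate restarts and failed probes with other observations:
$ oc get events -w
$ oc get events --sort-by='.lastTimestamp'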
What to do next?
Let’s find the root
Before the rise of containers, when one wanted to investigate memory leaks, deadlocks or garbage collections impacting performance, a few well-known techniques allowed capturing the information required for improving the situation, like:
- thread dumps
- heap dumps
- capture of thread resource consumption (RAM, CPU)
- analysis of garbage collection cycles (duration and frequency)
Techniques for capturing static states through snapshots at critical points in time or series of snapshots for getting the evolution have been described in articles like this or this.
The good news is that all this can be done with containers running on OpenShift with little difference. Let’s look at that in more detail.
MBeans: the Java Console
The xPaaS images ship with the Jolokia library, which provides a REST API in front of JMX. It can also be used with other Java applications. If the container has a port named "jolokia", a link to the Java Console is displayed on the pod page of the OpenShift web console.
Here are a few MBean operations that are useful for our purpose.
- HotSpotDiagnostic.dumpHeap
The com.sun.management.HotSpotDiagnostic MBean allows taking heap dumps through the dumpHeap operation (the MBean page in the Java Console also provides the URL for invoking it). The operation takes two arguments:
  - outputFile - the system-dependent file name
  - live - if true, dump only live objects, i.e. objects that are still reachable
- DiagnosticCommand.gcClassHistogram
com.sun.management.DiagnosticCommand.gcClassHistogram provides the heap histogram, which is useful for seeing the classes, and the number of their instances, that have been loaded in memory.
- DiagnosticCommand.threadPrint
com.sun.management.DiagnosticCommand.threadPrint is a convenient way of getting a thread dump.
- GarbageCollector.LastGcInfo
The LastGcInfo attribute of the GarbageCollector MBeans (com.sun.management.GarbageCollectorMXBean) provides information on garbage collections. Especially interesting is how often they happen and how long they last.
Other MBeans may provide valuable information depending on the investigation.
Command line
It is possible to gather information similar to what has been described in the previous chapter through the command line. For that, the name of the pod, and possibly the name of the container, need to be retrieved first:
$ oc get pods -n test
NAME READY STATUS RESTARTS AGE
tomcat8-4-37683 1/1 Running 2 4d
$ oc describe pod tomcat8-4-37683
Name: tomcat8-4-37683
Namespace: test
Security Policy: restricted
Node: osenode3.sandbox.com/192.168.122.143
Start Time: Wed, 02 Aug 2017 22:07:15 +0200
Labels: app=tomcat8
deployment=tomcat8-4
deploymentconfig=tomcat8
Status: Running
IP: 10.128.1.3
Controllers: ReplicationController/tomcat8-4
Containers:
tomcat8:
The examples below will be given for the pod tomcat8-4-37683 with IP address 10.128.1.3 and container name tomcat8. This needs to be changed to the values specific to the pod and container the application is running on.
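The pod IP and the container names can also be extracted directly with jsonpath, which is handy in scripts; the values below are the ones from the describe output above:
$ oc get pod tomcat8-4-37683 -o jsonpath='{.status.podIP}'
10.128.1.3
$ oc get pod tomcat8-4-37683 -o jsonpath='{.spec.containers[*].name}'
tomcat8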
- Memory dump
If the JDK is available inside the container, jcmd (previously jmap) can be used. In many companies, for security reasons (reducing the attack surface), only JREs are installed on production machines. In that case, it is a bit trickier: gdb may not be available in your container for taking a core dump, from which a memory dump could be extracted with jmap. The only approach I am aware of then is using the JMX API through Jolokia. In most cases, your Java process will be the only process running inside the container and will have the id 1.
Note: It is important to be aware that taking a heap dump will freeze the JVM and impact the application performance. An option to limit the impact is to only take a histogram.
Here is how this can be done with jcmd.
jcmd
$ oc exec tomcat8-4-37683 -c tomcat8 -- jcmd 1 GC.heap_dump /tmp/heap.hprof
$ oc rsync tomcat8-4-37683:/tmp/heap.hprof .
$ oc exec tomcat8-4-37683 -- rm /tmp/heap.hprof
$ oc exec tomcat8-4-37683 -c tomcat8 -- jcmd 1 GC.class_histogram
With Jolokia, it is a bit more involved from the command line. The following examples that connect directly to the pods are to be run from the master where the required certificates and the connectivity are available. Alternatively, the master can be used as a reverse proxy, only for GET requests as far as I have seen. As part of a monitoring solution, one should also be careful not to put unnecessary load on the master.
Jolokia
Direct connection
$ curl -k -H "Authorization: Bearer $(oc sa get-token management-admin -n management-infra)" \
-H "Content-Type: application/json" \
--cert /etc/origin/master/master.proxy-client.crt \
--key /etc/origin/master/master.proxy-client.key \
'https://10.128.1.3:8778/jolokia/' \
-d '{"mbean":"com.sun.management:type=HotSpotDiagnostic", "operation":"dumpHeap", "arguments":["/tmp/heap.hprof","true"], "type":"exec"}' \
oc rsync tomcat8-4-37683:/tmp/heap.hprof . \
oc exec tomcat8-4-37683 -- rm /tmp/heap.hprof
$ curl -k -H "Authorization: Bearer $(oc sa get-token management-admin -n management-infra)" \
-H "Content-Type: application/json" \
--cert /etc/origin/master/master.proxy-client.crt \
--key /etc/origin/master/master.proxy-client.key \
'https://10.128.1.3:8778/jolokia/' \
-d '{"mbean":"com.sun.management:type=DiagnosticCommand", "operation":"gcClassHistogram", "arguments":["-all"], "type":"exec"}' \
| sed 's/\\n/\n/g' > histogram.hprof
Using the master as reverse proxy
$ curl -H "Authorization: Bearer $(oc whoami -t)" \
-H "Content-Type: application/json" \
'https://osemaster.sandbox.com:8443/api/v1/namespaces/test/pods/https:tomcat8-4-37683:8778/proxy/jolokia/exec/com.sun.management:type=HotSpotDiagnostic/dumpHeap/!/tmp!/heap.hprof/true'
$ oc rsync tomcat8-4-37683:/tmp/heap.hprof .
$ oc exec tomcat8-4-37683 -- rm /tmp/heap.hprof
$ curl -H "Authorization: Bearer $(oc whoami -t)" \
-H "Content-Type: application/json" \
'https://osemaster.sandbox.com:8443/api/v1/namespaces/test/pods/https:tomcat8-4-37683:8778/proxy/jolokia/exec/com.sun.management:type=DiagnosticCommand/gcClassHistogram/-all' | sed 's/\\n/\n/g' > histogram.hprof
The analysis of the memory dump can be performed with Eclipse Memory Analyzer Tool (MAT):
http://www.eclipse.org/mat/
- Thread dump
Apart from jcmd (previously jstack), it is also possible to take a thread dump by sending the SIGQUIT(3) signal to the Java process.
jcmd
$ oc exec tomcat8-4-37683 -c tomcat8 -- jcmd 1 Thread.print
Kill signal
$ oc exec tomcat8-4-37683 -c tomcat8 -- kill -3 1
Note: This won’t terminate the Java process unless it was started with -Xrs, which disables signal handling in the JVM. The thread dump is written to the standard output, which is redirected to the container logs. The result can be seen with:
$ oc logs tomcat8-4-37683
[...]
Full thread dump OpenJDK 64-Bit Server VM (25.131-b12 mixed mode):
"Attach Listener" #24 daemon prio=9 os_prio=0 tid=0x00007ff2f4001000 nid=0x36 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Thread-6" #23 daemon prio=5 os_prio=0 tid=0x00007ff2e8003800 nid=0x26 waiting on condition [0x00007ff2ec3fa000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000d3b4e070> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
[...]
Jolokia
Direct connection
$ curl -k -H "Authorization: Bearer $(oc sa get-token management-admin -n management-infra)" \
-H "Content-Type: application/json" \
--cert /etc/origin/master/master.proxy-client.crt \
--key /etc/origin/master/master.proxy-client.key \
'https://10.128.1.3:8778/jolokia/' \
-d '{"mbean":"com.sun.management:type=DiagnosticCommand", "operation":"threadPrint", "arguments":["-l"], "type":"exec"}' \
| sed 's/\\n/\n/g' | sed 's/\\t/\t/g' > thread.out
Using the master as reverse proxy
$ curl -H "Authorization: Bearer $(oc whoami -t)" \
-H "Content-Type: application/json" \
'https://osemaster.sandbox.com:8443/api/v1/namespaces/test/pods/https:tomcat8-4-37683:8778/proxy/jolokia/exec/com.sun.management:type=DiagnosticCommand/threadPrint/-l' \
| sed 's/\\n/\n/g' | sed 's/\\t/\t/g' > thread.out
- Resource consumption
It is often convenient to see how much of the resources is consumed by each of the threads listed by the previous command. This can easily be done with the top command:
$ oc exec tomcat8-4-37683 -c tomcat8 -- top -b -n 1 -H -p 1 > top.out
$ cat top.out
top - 07:16:34 up 19:50, 0 users, load average: 0.00, 0.08, 0.12
Threads: 23 total, 0 running, 23 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 0.7 sy, 0.0 ni, 96.7 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 2915764 total, 611444 free, 1074032 used, 1230288 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 1622968 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 1000060+ 20 0 3272048 272060 16084 S 0.0 9.3 0:00.03 java
15 1000060+ 20 0 3272048 272060 16084 S 0.0 9.3 0:09.97 java
16 1000060+ 20 0 3272048 272060 16084 S 0.0 9.3 0:00.14 java
The ninth column shows the CPU consumption and the tenth column the memory consumption per thread.
Using the thread dump analyzer, it is possible to map resource consumption with thread information through the thread id (native-id in TDA).
Additional information on TDA can be found at:
https://github.com/irockel/tda
If you don’t want to use a tool you will need to convert the thread id to hexadecimal to match the nid field in the thread dump. Something like this should do the job:
$ echo "obase=16; <thread-id>" | bc
- Garbage collections
Frequent and long-lasting garbage collections may be a symptom of an issue with the application. Full garbage collections may also add unacceptable latency to request processing, as they "stop the world". Gathering information on garbage collection is another frequent activity when troubleshooting performance issues and optimizing applications.
Jstat
Jstat is often used for capturing information on garbage collection. Unfortunately, it does not work out of the box with many images on OpenShift. The reason is that Java stores the relevant data in /tmp/hsperfdata_<username>. Containers on OpenShift run by default with a high-numbered user id, something like 1000060000, and containers in different projects use different user ids. This is for security reasons: this user id does not match any real user on the host system, which adds an additional protection layer. A consequence of this dynamic allocation of user ids is that there is no match for them in /etc/passwd inside the container, hence the data cannot be stored by the Java process in /tmp/hsperfdata_<username>. There are workarounds. One is to have a writable /etc/passwd, which is updated at startup. A finer approach is to use nss_wrapper. Jstat works out of the box with the Jenkins image provided with OpenShift, for instance.
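A common form of the writable /etc/passwd workaround, along the lines of the OpenShift image guidelines, is to append an entry for the current user id in the image entrypoint. A minimal sketch (the USER_NAME variable and the nologin shell are assumptions):
if ! whoami &> /dev/null; then
  if [ -w /etc/passwd ]; then
    # Add an /etc/passwd entry for the arbitrary user id assigned by OpenShift
    echo "${USER_NAME:-default}:x:$(id -u):0:${USER_NAME:-default} user:${HOME}:/sbin/nologin" >> /etc/passwd
  fi
fi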
jstat
$ oc exec tomcat8-4-37683 -- jstat -gcutil 1 10000 5 > jstat.out
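In this invocation, 1 is the PID of the Java process inside the container, 10000 is the sampling interval in milliseconds and 5 is the number of samples to take.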
Jolokia
Direct connection
$ curl -k -H "Authorization: Bearer $(oc sa get-token management-admin -n management-infra)" \
-H "Content-Type: application/json" \
--cert /etc/origin/master/master.proxy-client.crt \
--key /etc/origin/master/master.proxy-client.key \
'https://10.129.0.2:8778/jolokia/' \
-d '{"mbean":"java.lang:type=GarbageCollector,name=PS MarkSweep", "attribute":"LastGcInfo", "type":"read"}'
$ curl -k -H "Authorization: Bearer $(oc sa get-token management-admin -n management-infra)" \
-H "Content-Type: application/json" \
--cert /etc/origin/master/master.proxy-client.crt \
--key /etc/origin/master/master.proxy-client.key \
'https://10.129.0.2:8778/jolokia/' \
-d '{"mbean":"java.lang:type=GarbageCollector,name=PS Scavenge", "attribute":"LastGcInfo", "type":"read"}'
Using the master as reverse proxy
$ curl -H "Authorization: Bearer $(oc whoami -t)" \
-H "Content-Type: application/json" \
'https://osemaster.sandbox.com:8443/api/v1/namespaces/test/pods/https:tomcat8-4-37683:8778/proxy/jolokia/read/java.lang:type=GarbageCollector,name=PS%20MarkSweep/LastGcInfo'
$ curl -H "Authorization: Bearer $(oc whoami -t)" \
-H "Content-Type: application/json" \
'https://osemaster.sandbox.com:8443/api/v1/namespaces/test/pods/https:tomcat8-4-37683:8778/proxy/jolokia/read/java.lang:type=GarbageCollector,name=PS%20Scavenge/LastGcInfo'
- Evolution in time
If no monitoring solution is in place to gather the information required for troubleshooting, the commands above can easily be grouped into a script that gets launched at regular intervals to capture the dynamic aspects: object and thread creation, the evolution of memory and CPU consumption, etc. As the creation of a complete heap dump is intrusive, it is recommended to limit the memory capture to histograms.
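As an illustration, here is a minimal sketch of such a capture script; the pod name, container name, interval and output directory are assumptions to adapt, and a JDK (for jcmd) is assumed to be available inside the container:
#!/bin/bash
# Periodically capture thread dumps, class histograms and per-thread resource usage
POD=tomcat8-4-37683    # pod to observe (assumption)
CONTAINER=tomcat8      # container name (assumption)
INTERVAL=60            # seconds between snapshots (assumption)
OUTDIR=captures
mkdir -p "$OUTDIR"
while true; do
  TS=$(date +%Y%m%d-%H%M%S)
  # Thread dump
  oc exec "$POD" -c "$CONTAINER" -- jcmd 1 Thread.print > "$OUTDIR/threads-$TS.txt"
  # Class histogram rather than a full heap dump to limit the impact on the JVM
  oc exec "$POD" -c "$CONTAINER" -- jcmd 1 GC.class_histogram > "$OUTDIR/histogram-$TS.txt"
  # Per-thread CPU and memory consumption
  oc exec "$POD" -c "$CONTAINER" -- top -b -n 1 -H -p 1 > "$OUTDIR/top-$TS.txt"
  sleep "$INTERVAL"
done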
Shooting oneself in the foot
One issue that operations teams often face is that the resolution of an incident has the consequence that the data which could have been used for identifying the root cause and fixing the underlying problem is lost. ITIL theory says:
- The objective of the Incident Management Lifecycle is to restore the service as quickly as possible to meet SLA.
- The objective of the Problem Management function is to identify the underlying causal factor, which may relate to multiple incidents.
In the case of a deadlock or of performance degradation due to a memory leak, the response is often to restart the application to restore a sane state as quickly as possible. Unfortunately, by doing that you have just deleted the application state that could have provided you with critical information for the resolution of the underlying problem. More than once I have been like an astronomer patiently waiting for the comet to come back, in my case for the next incident to happen, with all my tooling in place for capturing the golden dust of data that would allow me to identify the cause of the incident... unfortunate. OpenShift provides advanced orchestration capabilities and, here again, they can come to the rescue.
Ideally, your application would run with more than a single instance for high availability, but even if that were not the case, the following approach would help. The rationale is pretty simple: instead of killing the faulty container, you just take it out of the request-serving pool, the service and route layers in OpenShift. This, unfortunately, would not help if the application were, for instance, consuming from a messaging system. Services and routes know the pods the requests are to be sent to based on a label/selector mechanism. You just need to change the label on the faulty pod to take it out of the pods backing the service and processing requests. Let's look at how it can be done with a quick start.
- Create an application.
$ oc new-app --name=myapp jboss-webserver30-tomcat8-openshift~https://github.com/openshiftdemos/os-sample-java-web.git
- Scale the deployment to two instances (optional)
$ oc scale dc myapp --replicas=2
- Change the label on the "faulty" pod
$ oc get pods
NAME READY STATUS RESTARTS AGE
myapp-1-build 0/1 Completed 0 3m
myapp-1-gd2fn 1/1 Running 0 1m
myapp-1-q1k3n 1/1 Running 0 2m
$ oc describe service myapp
Name: myapp
Namespace: test
Labels: app=myapp
Selector: app=myapp,deploymentconfig=myapp
Type: ClusterIP
IP: 172.30.225.169
Port: 8080-tcp 8080/TCP
Endpoints: 10.128.1.6:8080,10.129.0.250:8080
Port: 8443-tcp 8443/TCP
Endpoints: 10.128.1.6:8443,10.129.0.250:8443
Port: 8778-tcp 8778/TCP
Endpoints: 10.128.1.6:8778,10.129.0.250:8778
$ oc label pod myapp-1-gd2fn --overwrite deploymentconfig=mydebug
pod "myapp-1-gd2fn" labeled
$ oc get pods
NAME READY STATUS RESTARTS AGE
myapp-1-build 0/1 Completed 0 8m
myapp-1-gd2fn 1/1 Running 0 5m
myapp-1-gtstj 1/1 Running 0 5s
myapp-1-q1k3n 1/1 Running 0 7m
$ oc describe service myapp
Name: myapp
Namespace: test
Labels: app=myapp
Selector: app=myapp,deploymentconfig=myapp
Type: ClusterIP
IP: 172.30.225.169
Port: 8080-tcp 8080/TCP
Endpoints: 10.128.1.7:8080,10.129.0.250:8080
Port: 8443-tcp 8443/TCP
Endpoints: 10.128.1.7:8443,10.129.0.250:8443
Port: 8778-tcp 8778/TCP
Endpoints: 10.128.1.7:8778,10.129.0.250:8778
Immediately after changing the label, OpenShift starts a new pod to satisfy the deployment configuration requirement of having two running instances → incident resolved. The faulty container is, however, still running, isolated. You can see in the output above that the new pod with the IP 10.128.1.7 has replaced the container with the IP 10.128.1.6 in the list of endpoints of the service.
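The isolated pod remains reachable for the troubleshooting techniques described earlier, either with oc exec or, if you prefer querying Jolokia from your workstation, through a port-forward. A minimal sketch, assuming Jolokia listens on port 8778:
$ oc port-forward myapp-1-gd2fn 8778:8778
The Jolokia endpoint is then reachable at https://localhost:8778/jolokia/; depending on how the agent is secured in the image, the same authentication as in the direct connection examples may still be required.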
The service layer is by default implemented through iptables. It is easy to see that the faulty pod has been taken out of the service forwarding:
$ iptables -L -n -t nat | grep test/myapp:8080
KUBE-MARK-MASQ all -- 10.129.0.250 0.0.0.0/0 /* test/myapp:8080-tcp */
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 /* test/myapp:8080-tcp */ tcp to:10.129.0.250:8080
KUBE-MARK-MASQ all -- 10.128.1.7 0.0.0.0/0 /* test/myapp:8080-tcp */
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 /* test/myapp:8080-tcp */ tcp to:10.128.1.7:8080
In a similar way, it is possible in OpenShift to look at the HAProxy configuration and see that it has been updated. You would need to log into the router pod ("oc rsh <router-pod>") for that. Here is the relevant part of the configuration file:
/var/lib/haproxy/conf/haproxy.config
# Plain http backend
backend be_http_test_myapp
[...]
server 625c4787bdcee4668557b3d7edb1c168 10.128.1.7:8080 check inter 5000ms cookie 625c4787bdcee4668557b3d7edb1c168 weight 100
server 4c8bc31e32a06e7a7905c04aa02eb97b 10.129.0.250:8080 check inter 5000ms cookie 4c8bc31e32a06e7a7905c04aa02eb97b weight 100
Now that the incident has been resolved, we can use some of the techniques described previously to troubleshoot the application and resolve the problem. When you are finished, the pod can just be killed.
$ oc delete pod myapp-1-gd2fn
If you decided instead to label it back to the previous value, the replication controller would notice that there is one pod more than the desired number and would kill one of the pods to come back to the desired state.
That’s it!