Autonomous Operator Troubleshooting

If you run into issues with the Autonomous Operator, you can troubleshoot by examining the logs and events that it generates.

The Autonomous Operator generates logs that can be used for auditing and troubleshooting purposes. This page describes logging that is specific to the Autonomous Operator itself. For information about Couchbase cluster logging, refer to Manage Couchbase Server Logging.

Overview

The Autonomous Operator generates logs that include information about itself and the various other Kubernetes components that make up the Operator deployment. These logs are distinct from the logs that are generated by the Couchbase Server application.

This page provides information about how to collect and scrutinize logging information that is produced by the Autonomous Operator. When troubleshooting the Autonomous Operator, it is important to first rule out Kubernetes itself as the root cause of the problem. The Kubernetes Troubleshooting Guide contains a great deal of helpful information about debugging applications within a Kubernetes cluster.

Familiarity with the Operator’s configuration settings can be helpful when troubleshooting the Autonomous Operator.

Collecting Autonomous Operator Logs

Using kubectl or oc, you can print the Autonomous Operator logs to standard console output.

Kubernetes
OpenShift

On Kubernetes, start by getting the name of the Autonomous Operator pod.

$ kubectl get po -lapp=couchbase-operator

NAME                                  READY     STATUS    RESTARTS   AGE
couchbase-operator-1917615544-h20bm   1/1       Running   0          20h

Use the pod name to get the logs.

$ kubectl logs couchbase-operator-1917615544-h20bm

time="2018-01-23T22:56:34Z" level=info msg="couchbase-operator v1.1.0 (release)" module=main
time="2018-01-23T22:56:34Z" level=info msg="Obtaining resource lock" module=main
time="2018-01-23T22:56:34Z" level=info msg="Starting event recorder" module=main
time="2018-01-23T22:56:34Z" level=info msg="Attempting to be elected the couchbase-operator leader" module=main
time="2018-01-23T22:56:51Z" level=info msg="I'm the leader, attempt to start the operator" module=main
time="2018-01-23T22:56:51Z" level=info msg="Creating the couchbase-operator controller" module=main

Alternatively, you can specify the Autonomous Operator deployment to get the logs.

$ kubectl logs deployment/couchbase-operator

Because there is only one instance of the Autonomous Operator in the deployment, the underlying command automatically selects the correct pod and prints its logs.

On OpenShift, start by getting the name of the Autonomous Operator pod.

$ oc get po -lapp=couchbase-operator

NAME                                  READY     STATUS    RESTARTS   AGE
couchbase-operator-1917615544-h20bm   1/1       Running   0          20h

Use the pod name to get the logs.

$ oc logs couchbase-operator-1917615544-h20bm

time="2018-01-23T22:56:34Z" level=info msg="couchbase-operator v1.1.0 (release)" module=main
time="2018-01-23T22:56:34Z" level=info msg="Obtaining resource lock" module=main
time="2018-01-23T22:56:34Z" level=info msg="Starting event recorder" module=main
time="2018-01-23T22:56:34Z" level=info msg="Attempting to be elected the couchbase-operator leader" module=main
time="2018-01-23T22:56:51Z" level=info msg="I'm the leader, attempt to start the operator" module=main
time="2018-01-23T22:56:51Z" level=info msg="Creating the couchbase-operator controller" module=main

Alternatively, you can specify the Autonomous Operator deployment to get the logs.

$ oc logs deployment/couchbase-operator

Because there is only one instance of the Autonomous Operator in the deployment, the underlying command automatically selects the correct pod and prints its logs.

If you’re troubleshooting the Autonomous Operator, watch for the following messages, which indicate that the Operator is unable to reconcile a Couchbase cluster into the desired state:

Logs with level=error
Operator is unable to get cluster state after N retries
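
The two indicators above can be searched for in one pass. The following is a minimal sketch, not an official tool: the function name filter_operator_problems is hypothetical, and the second pattern assumes the log line contains the literal text "unable to get cluster state".

```shell
# Hypothetical helper: keep only the log lines that suggest the Operator
# cannot reconcile a cluster. Reads Operator log lines on standard input.
filter_operator_problems() {
  # Matches error-level lines, or the retry message (assumed wording).
  grep -E 'level=error|unable to get cluster state'
}

# Usage (assumes the deployment name used on this page):
#   kubectl logs deployment/couchbase-operator | filter_operator_problems
#   oc logs deployment/couchbase-operator | filter_operator_problems
```

Note that grep exits with a non-zero status when nothing matches, which in this case simply means neither indicator was found in the logs.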

Profiling the Autonomous Operator

For more advanced troubleshooting, the Autonomous Operator supports the Go language pprof feature and serves profiling data on its default listen address localhost:8080. You can access this endpoint by running a remote shell or forwarding the port to your local system.

Kubernetes
OpenShift

On Kubernetes, to access goroutine stack traces using a shell:

$ kubectl exec -it couchbase-operator-599bcf47f-8wswh -- sh

$ wget -O- 'http://localhost:8080/debug/pprof/goroutine?debug=1' | less

To access Go memory usage using a port forward:

$ kubectl port-forward couchbase-operator-599bcf47f-8wswh 8080:8080

$ go tool pprof localhost:8080/debug/pprof/heap
(pprof) traces

On OpenShift, to access goroutine stack traces using a shell:

$ oc exec -it couchbase-operator-599bcf47f-8wswh -- sh

$ wget -O- 'http://localhost:8080/debug/pprof/goroutine?debug=1' | less

To access Go memory usage using a port forward:

$ oc port-forward couchbase-operator-599bcf47f-8wswh 8080:8080

$ go tool pprof localhost:8080/debug/pprof/heap
(pprof) traces
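
When a profile needs to be shared or analyzed later, it can be convenient to save it to a file rather than open it interactively. The sketch below assumes that a port forward to localhost:8080 is active and that the Operator exposes the standard Go net/http/pprof paths under /debug/pprof/, as the examples above suggest. The helper name pprof_url and the file name heap.pprof are hypothetical.

```shell
# Hypothetical helper: build the URL for a named pprof profile.
#   $1: profile name, for example heap or goroutine
#   $2: optional host:port, defaulting to the address used on this page
pprof_url() {
  printf 'http://%s/debug/pprof/%s\n' "${2:-localhost:8080}" "$1"
}

# Usage, with the port forward running in another terminal:
#   curl -s -o heap.pprof "$(pprof_url heap)"
#   go tool pprof heap.pprof
```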

Kubernetes Events

Kubernetes Events provide insights into what is happening inside a Kubernetes cluster. They record significant occurrences and changes in the state of resources, such as the creation, deletion, or failure of pods, nodes, services, and other Kubernetes objects.

They can be used to monitor changes that have occurred in the cluster, and can be helpful when troubleshooting issues with the Autonomous Operator. However, they expire after a certain period of time, typically one hour. You can use the Kubernetes Event Collector tool to collect and store events for longer periods of time.

The Kubernetes Event Collector (KEL) watches for Kubernetes events within a namespace and stores them in a buffer that can be stashed. It can be deployed and configured using Helm:

$ helm install event-collector charts/event-collector

For more details about the tool and how to use it, refer to the repo README: https://github.com/couchbase/couchbase-k8s-event-collector
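
Events that have not yet expired can also be inspected directly with kubectl or oc. The following is a minimal sketch under stated assumptions: the helper name warning_events is hypothetical, and the filter assumes the default table output of recent kubectl versions, in which TYPE is the second column.

```shell
# Hypothetical helper: keep the header row plus events whose TYPE column
# is "Warning". Reads the table printed by `kubectl get events` on stdin.
warning_events() {
  awk 'NR == 1 || $2 == "Warning"'
}

# Usage (sorted so the most recent events appear last):
#   kubectl get events --sort-by=.lastTimestamp | warning_events
```

If the column layout differs in your environment, the server-side filter kubectl get events --field-selector type=Warning avoids parsing the table altogether.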