Tracing

      +
      Tracing and Metrics provide fine-grained insight into how an application is performing, and helps to diagnose when it is not.

      Health indicators can tell you a lot about the performance of an application. Monitoring them is vital both during its development and production lifecycle.

      For a database, performance is best encapsulated via per-request performance. In addition to having insight down to the request level, it is also important to understand how the application behaves as a whole. For both the micro and macro-view, tracing and metrics are powerful tools in the toolbox.

      Request Tracing allows you to analyze the latency characteristics of individual requests and answer questions about where the overall wall-clock time is spent. If a request takes 500 milliseconds to execute, it is possible to understand what fraction was spent on the server and network, and how much time was spent in the client. Couchbase’s built-in tracer will perform the aggregations and log the top slowest operations at a configurable interval to provide efficient ways to start debugging and troubleshooting.

      Metrics, by contrast, provide insight into how the application (or SDK for that matter) is performing as a whole. Individual requests are aggregated, and our default meter implementation will log aggregated percentiles on a per operation-type basis. This makes it possible to quickly pinpoint slow operation types (like a get operation,) and then further pinpoint the causes through tracing.

      Request Tracing

      Tracing provides granular timing information down the individual phases of a request/operation. For a simple overview, consider the following output for an individual request — which is logged from the default tracer that ships with the SDK, and is easily enabled and configured in the cluster options.

      {
        "last_local_id": "01a657-d761-384d-25d7-0ca8c81ba9c2e7",
        "last_local_socket": "192.168.106.1:50128",
        "last_operation_id": "0x4b27",
        "last_remote_socket": "192.168.106.129:11210",
        "last_server_duration_us": 8,
        "operation_name": "cb.get",
        "total_duration_us": 1003,
        "total_server_duration_us": 8
      }
      Ensure you have logging enabled to at least info level for tracing and metrics logging.

      It is possible to identify how long the operation took to encode, how much time it spent on the server, to which node it got sent to and much more. Only requests over a configurable threshold are logged and reported, so it is possible to fine tune these settings depending on the performance characteristics of each environment.

      Configuration options are available to not only tune the thresholds, but also the logging interval and sample size.

      Metrics

      Request metrics provide insight into the sum of all individual requests flowing through an SDK. The overall timings (which can be compared to the total_duration_us value from the tracing section above) are aggregated and dumped with percentile values on each configurable interval. Here is an example:

      {
        "meta": {
          "emit_interval_s": 600,
        },
        "operations": {
          "kv": {
              "upsert": {
              "total_count": 120,
              "percentiles_us": {
                  "50.0": 5,
                  "90.0": 10,
                  "99.0": 33,
                  "99.9": 55,
                  "100.0": 101
              }
              }
          }
        }
      }

      The JSON above shows the maximum latency as well as lower percentiles for individual operations. Note that they are aggregated across all nodes and include potential retries, so this really shows end-to-end latency from a users/SDK perspective.

      Additional Information

      A practical look at working with metrics is available in the Slow Operations Logging howto docs.