Rebalance Reference

Couchbase Server creates a report for every rebalance that is performed. This section explains how to obtain the report, and how to read it.

Obtaining a Rebalance Report

Couchbase Server automatically creates a rebalance report for every rebalance that occurs on the cluster. The report’s content consists of a JSON document, which provides statistics for every service that has been involved in the rebalance. On conclusion of the rebalance, the report can be accessed in any of the following ways:

By means of Couchbase Web Console, as described in Add a Node and Rebalance.
By means of the REST API, as described in Getting Cluster Tasks.
By accessing the directory /opt/couchbase/var/lib/couchbase/logs/reblance on any of the cluster nodes. A rebalance report is maintained here for (up to) the last five rebalances performed. Each report is provided as a *.json file, whose name indicates the time at which the report was run — for example, rebalance_report_2020-03-17T11:10:17Z.json.

For more information on logging, see Manage Logging.

Reading a Rebalance Report

Each rebalance report consists of a JSON document whose principal object is stage_info. This itself contains an object corresponding to each rebalance stage: one stage has occurred for each of the services deployed on the cluster. Therefore, if all six services have been deployed, six stages occur in a successful rebalance; and objects for analytics, eventing, search, index query, and data are provided.

Among the details provided for each service are the times at which rebalance-processes started and ended, the durations of rebalance-processes, and the numbers of documents already handled and still to be handled. When rebalance concludes successfully, a report is duly generated, with all fields corresponding to the successful completion of the rebalance-processes they represent. In cases where rebalance is interrupted (for example, by the user’s left-clicking the Stop Rebalance button, in Couchbase Web Console, or due to auto-failover), a generated report will describe a partially unsuccessful rebalance; indicating, in certain fields, an incomplete or unstarted rebalance-process, by means of the value false.

Standard Fields

Standard fields are provided for each stage. These are as follows:

startTime. The point in time at which the stage was started (or possibly, has been determined not to have started). Specified as a UTC date/time string.
completedTime. The point in time at which the stage was completed (or possibly, has been determined not to have completed). Specified as a UTC date/time string.
timeTaken. The amount of elapsed time since startTime. Specified as an integer, in milliseconds. The value is undefined, if the stage has been determined not to have started.
totalProgress. The current, total progress that has been achieved for the stage. Specified as a percentage. In a completed report, the value is expected to be 100. This is an optional field, which may not appear if not implemented for the stage.
perNodeProgress. The current progress that has been achieved for the stage for each node. The value is an object, which itself provides a progress figure for each node. Each progress figure is specified as a floating-point number between 0.0 and 1.0. In a completed report, the value is expected to be 1. This is an optional field, which may not appear if not implemented for the stage.
subStage. Details on any substages that the stage may contain. For each substage, the standard fields are identical to those for each stage. This is an optional field, which does not appear, if the stage contains no substages. (An example is delta_recovery, which is a substage reflected in the data output, when Delta Recovery Warmup has been performed on one or more nodes running the Data Service, during rebalance of that service.)
events. An array of strings, each of which provides information on a significant event. The information may be a warning, an error message, or otherwise informational. This is an optional field, which does not appear if not implemented for the stage, or if no significant events have occurred.
details. Further details of the stage. This is an optional field, which is currently implemented for the data stage only (information provided below).
rebalanceID. A 32-bit string that is the identifier for the rebalance: for example, "4f64c7dd5a1a452d3bcc3668307d64a6".
nodesInfo. Information on the nodes of the cluster, in relation to the rebalance. The following objects are provided — each consists of an array of strings, each of which represents a node. Where the category has no nodes, the array is empty.
- active_nodes. The nodes included in the rebalance.
- keep_nodes. The nodes that were intended to be part of the cluster, following rebalance.
- eject_nodes. The nodes that were to be ejected by rebalance.
- delta_nodes. The nodes on which Delta Recovery was to be performed.
- failed_nodes. The nodes that were failed over prior to rebalance, and will be ejected from the cluster on successful rebalance.
master_node. The master node for the cluster, on which the Master Services run — these are sometimes referred to as the Orchestrator. If this node is rebalanced out of the cluster, a new Orchestrator is elected by the Cluster Manager, on a surviving node. For further information, see ns-server.
completionMessage. A message that signifies the way in which rebalance ended. For example, "Rebalance completed successfully", or "Rebalance stopped by user".

If the report is downloaded by means of Couchbase Web Console, additional information in provided, concerning the download.

Standard Fields Example

The following example displays the initial section of a rebalance report, following a rebalance performed on a cluster consisting of the following nodes:

10.143.194.101, running the Data, Query, and Index Services.
10.143.194.102, running the Data and Search Services.
10.143.194.103, running the Analytics and Eventing Services.
10.143.194.104, running the Data Service.

The rebalance follows an edit made to both buckets resident on the cluster, beer-sample and travel-sample, changing the number of replicas from 1 to 2, for each. This excerpt starts at the top of the structure, and stops at the point at which the Data Service details begin (these are described further below).

{
  "stageInfo": {
    "analytics": {
      "totalProgress": 100,
      "perNodeProgress": {
        "ns_1@10.143.194.103": 1
      },
      "startTime": "2020-03-18T00:34:13.415-07:00",
      "completedTime": "2020-03-18T00:34:14.724-07:00",
      "timeTaken": 1310
    },
    "eventing": {
      "totalProgress": 100,
      "perNodeProgress": {
        "ns_1@10.143.194.103": 1
      },
      "startTime": "2020-03-18T00:34:14.724-07:00",
      "completedTime": "2020-03-18T00:34:14.939-07:00",
      "timeTaken": 214
    },
    "search": {
      "totalProgress": 100,
      "perNodeProgress": {
        "ns_1@10.143.194.102": 1,
        "ns_1@10.143.194.104": 1
      },
      "startTime": "2020-03-18T00:34:12.453-07:00",
      "completedTime": "2020-03-18T00:34:12.758-07:00",
      "timeTaken": 306
    },
    "index": {
      "totalProgress": 100,
      "perNodeProgress": {
        "ns_1@10.143.194.101": 1
      },
      "startTime": "2020-03-18T00:34:12.758-07:00",
      "completedTime": "2020-03-18T00:34:13.415-07:00",
      "timeTaken": 656
    },
    "data": {
      "totalProgress": 100,
      "perNodeProgress": {
        "ns_1@10.143.194.101": 1,
        "ns_1@10.143.194.102": 1,
        "ns_1@10.143.194.104": 1
      },
      "startTime": "2020-03-18T00:33:36.969-07:00",
      "completedTime": "2020-03-18T00:34:12.452-07:00",
      "timeTaken": 35483,
      "details": {
        .
        .
        .

Each service thus has its stage-information provided in an object named after the service. Progress is provided for the whole cluster, and per node. Start times and completion times are provided, as are elapsed times. The Data Service contains the additional object, details, which is described immediately below.

Data Service Details

The details provided for the Data Service are provided per bucket. Therefore, if the cluster contains the buckets travel-sample and bucket-sample, the details object provides a correspondingly named structure for each.

The structure for each bucket may provide:

compactionInfo. Information on compaction, if it is performed for the bucket. If compaction is not performed, the compactionInfo structure is not provided. If the compactionInfo structure is provided, it gives the averageTime required for the bucket’s compaction, per node, in milliseconds.
vbucketLevelInfo. Information on the phases whereby the vBuckets were moved during rebalance. If no vBucket movement occurred, the vbucketLevelInfo structure is not provided. If the vbucketLevelInfo structure is provided, it includes the following:
- Fields that provide the averageTime for the move, backfill, takeover, and persistence phases for the bucket. For an explanation of these terms, see Rebalance and the Data Service. Times are provided in milliseconds, to fourteen decimal places. The totalCount of vBuckets and remainingCount are also provided for the move phase: in a completed report, the remainingCount is expected to be zero.
- vbucketInfo. Detailed information for each moved vbucket that corresponds to the specified bucket. Note that vBuckets that were not moved are not included. The information is as follows:
  - id. The vBucket id, which is an integer between 0 and 1023 (or on MacOS, between 0 and 63).
  - beforeChain. The chain that this vBucket was part of, prior to rebalance. Each chain consists of one or more nodes, on each of which was located a vBucket containing an identical set of documents; one of the vBuckets being the active vBucket, and the others (if other nodes are indeed specified) being the replica vBuckets. The beforeChain is specified as an array of strings, each of which specifies a node; in the form "ns_1@10.143.194.101". The first node in the list is the master node, on which was located the active vBucket: any additional nodes in the list each hosted a replica vBucket.
  - afterChain The chain of which this vBucket is a part, following rebalance.
  - move. The startTime and completedTime for the move process that occurred, specified in each case as a UTC date/time string; plus the timeTaken for the move, in milliseconds.
  - backfill, takeover, and persistence information, specified in the same way as the move information.
  - replicationInfo. Node status-changes that have occurred due to rebalance. An object is provided for each node on which the vBucket has been promoted from replica to active, or has received mutations, or has been created. For each affected node, the node is identified, and its inDocsTotal (number of documents received or mutated) and inDocsLeft (number of documents still to be received or mutated) are specified, as integers.

Data Service Details Example

The following example provides part of the details section from the rebalance report described above, in Standard Field Example. General information is provided on the beer-sample bucket, and specific information for the beer-sample vBucket whose id is 0.

"details": {
  "beer-sample": {
    "compactionInfo": {
      "perNode": {
        "ns_1@10.143.194.101": {
          "averageTime": 465.3636363636364
        },
        "ns_1@10.143.194.102": {
          "averageTime": 267
        },
        "ns_1@10.143.194.104": {
          "averageTime": 174.9090909090909
        }
      }
    },
    "vbucketLevelInfo": {
      "move": {
        "averageTime": 4082.2177734375,
        "totalCount": 1024,
        "remainingCount": 0
      },
      "backfill": {
        "averageTime": 80.076171875
      },
      "takeover": {
        "averageTime": 71.41837732160313
      },
      "persistence": {
        "averageTime": 57.41973298599805
      },
      "vbucketInfo": [
        {
          "id": 0,
          "beforeChain": [
            "ns_1@10.143.194.101",
            "ns_1@10.143.194.102"
          ],
          "afterChain": [
            "ns_1@10.143.194.102",
            "ns_1@10.143.194.101",
            "ns_1@10.143.194.104"
          ],
          "move": {
            "startTime": "2020-03-18T00:42:58.748-07:00",
            "completedTime": "2020-03-18T00:43:02.341-07:00",
            "timeTaken": 3593
          },
          "backfill": {
            "startTime": "2020-03-18T00:42:59.446-07:00",
            "completedTime": "2020-03-18T00:42:59.524-07:00",
            "timeTaken": 77
          },
          "takeover": {
          "startTime": "2020-03-18T00:43:01.548-07:00",
          "completedTime": "2020-03-18T00:43:01.571-07:00",
          "timeTaken": 22
        },
        "persistence": {
          "startTime": "2020-03-18T00:43:01.528-07:00",
          "completedTime": "2020-03-18T00:43:01.548-07:00",
          "timeTaken": 21
        },
        "replicationInfo": {
          "ns_1@10.143.194.102": {
            "node": "ns_1@10.143.194.102",
            "inDocsTotal": 0,
            "inDocsLeft": 0
          },
          "ns_1@10.143.194.104": {
            "node": "ns_1@10.143.194.104",
            "inDocsTotal": 9,
            "inDocsLeft": 0
          }
        }
      },
      .
      .
      .

The compactionInfo object contains the average time taken for compaction per node. The vbucketLevelInfo object provides an overall averageTime for each phase whereby vBuckets were moved, during rebalance.

The vbucketInfo section provides a sequence of objects, one for each vBucket that was moved during rebalance. In this example, only the first of these (id 0) is shown. The beforeChain indicates that prior to rebalance, the active vBucket resided on node 10.143.194.101, with its single replica on node 10.143.194.102. The afterChain indicates that following rebalance, the active vBucket resides on node 10.143.194.102, and the two replicas reside on nodes 10.143.194.101 and 10.143.194.104 respectively.

Date/time strings for startTime and completedTime, and an integer for timeTaken, are provided for each of the move stages for this vBucket.

The replicationInfo object shows that 10.143.194.192 has become the host for the active bucket; with no documents having needed to be moved onto this node — since they already resided there within a replica vBucket, which was promoted to active during rebalance. It also shows that 9 documents were moved onto node 10.143.194.104 — since the rebalance process placed an additional replica there, in accordance with the pre-rebalance bucket-reconfiguration.