Indexes

  • Capella Columnar
  • reference
    +
    You use indexes to speed up queries on remote and standalone collections.

    Indexes can speed up queries if you apply them properly. The sections in this topic describe scenarios in which you can use indexes to speed up query processing.

    You cannot index external collections. To make your queries on external data stores more efficient, when you create the collection you can choose to specify a location path that is as specific as possible. See Design a Location Path.

    Indexes

    An index is a materialized access path for data in a collection. You can create more than one index on the same collection. Each index name must be unique within a collection. Creating an index fails if there is an existing index with the same name in the target collection and IF NOT EXISTS is not specified.

    For each JSON document ingested into a collection, the system computes the indexed key for each index. The system computes the index key of a secondary index as follows:

    • Extract the target field values from the JSON document according to the specified path of fields.

    • For non-numeric typed index fields—that is, the specified type is STRING—if the specified type of the field is the same as the actual field value’s type, the value becomes part of the indexed key, otherwise the system cannot build the indexed key.

    • For numeric typed index fields—that is, the specified type is DOUBLE or BIGINT—the actual field value can be cast to the specified type and it becomes part of the indexed key. If the cast is impossible, the system cannot build the indexed key.

    After the system builds the indexed key, it’s inserted into the secondary index. If the system cannot build the indexed key, there is no entry made in the index for this object.

    Secondary indexes are automatically maintained by the system during data ingestion—that is, when a corresponding remote link is connected and populating its collections. In addition, they’re automatically rebalanced when their shadow collections are rebalanced (scaled up or scaled down).

    Standard Indexes

    This section contains observations about standard, non-array indexes.

    By default, standard indexes store NULL values. This allows prefix search to use composite key secondary indexes. However, when you create the index, you can exclude NULL and MISSING values from the index. If this is the case, the indes does not accelerate queries that select objects based on the existence of a NULL field.

    Indexing for Selection Queries

    The query optimizer chooses to use a secondary index for query execution if both of the following conditions are met:

    • The query contains a conjunctive equality or range predicate over one or more fields, or a join predicate: see the next section. The conjunctive predicate has a form:

      QualifiedName Operator Literal ( AND field Operator Literal )+

      where Operator is =, >, >=, <, <=, or BETWEEN;

    • There is an index with a key, such that the corresponding fields in the predicate form a prefix of that key. For example, suppose that there is an index on collection foo with key fields c_s and c_d.

     CREATE INDEX idx_s_d ON foo(c_s:STRING, c_d:DOUBLE);

    The following query uses the index because it has an equality predicate = on a field c_s that is a prefix of the indexed key c_s, c_d:

     SELECT f.c_x as res
     FROM foo f
     WHERE f.c_s = 'world';

    To prevent an available index from being used for a particular query predicate—for example, because there are many, many matching objects—you can include a skip-index hint as shown in the following example.

    The query optimizer automatically makes these decisions for you in most cases.
     SELECT f.c_x as res
     FROM foo f
     WHERE f.c_s /*+ skip-index */ = 'world';

    If multiple indexes are eligible access paths, there can be two cases:

    • Two or more indexes are sharing the same prefix and the predicate is on that prefix. For example: indexA is on (c1, c2), indexB is on (c1, c3) and the predicate is on c1 (c1 = 100). In this case, the query optimizer picks one of the indexes.

    • No indexes share the same prefix and the predicate refers to fields from each individual index. For example: indexA in on (c1) and indexB is on (c2), the predicate is on c1 and c2 (c1 = 100 and c2 = 200). In this case, both indexes will be used to retrieve matched primary keys and then the key sets will be intersected to further filter retrieved primary keys.

    Indexing for Join Queries

    SQL++ for Capella columnar supports joins from standard SQL in the following forms:

    • Inner join:

      SELECT * FROM ds_outer, ds_inner WHERE <predicate>;
      SELECT * FROM ds_outer JOIN ds_inner ON <predicate>;
      SELECT * FROM ds_outer INNER JOIN ds_inner ON <predicate>;
    • Left outer join:

      SELECT * FROM ds_outer LEFT JOIN ds_inner ON <predicate>;
      SELECT * FROM ds_outer LEFT OUTER JOIN ds_inner ON <predicate>;
    • Right outer join:

      SELECT * FROM ds_outer RIGHT JOIN ds_inner ON <predicate>;
      SELECT * FROM ds_outer RIGHT OUTER JOIN ds_inner ON <predicate>;

    ds_outer is the outer collection and ds_inner is the inner collection, in the order in which they appear in the FROM clause.

    The join predicate is an equality or range predicate that refers to fields from both branches of the join, in the form of:

    expr_outer OP expr_inner

    Where:

    • OP is <, , =, >=, >, or BETWEEN

    • expr_inner is a field from the inner collection

    • expr_outer is a field from the outer collection

    Array Indexes

    Capella Columnar also provides array indexes, which enable you to index values within an array, or fields within an object nested in an array. You can create an array index by providing a sequence of UNNEST and SELECT keywords to identify the field to index.

    Array indexes accelerate a query that involves some array-valued field. This enables fast evaluation of predicates in queries involving arrays or arrays of nested objects. For brevity, all further mentions of array-valued fields are also applicable to multiset-valued fields.

    In Capella Columnar, array indexes are not meant to serve as covering indexes. Instead, array indexes are meant only to accelerate queries involving multi-valued fields.

    There are also some differences between array indexes and standard indexes concerning how the query optimizer uses them. See Array Index Parameter.

    Quantification Queries

    A common use case for array indexes involves quantifying some or all elements within an array. Quantification queries have two variants: existential and universal.

    • Existential queries ask if any element in some array satisfies a given predicate. Membership queries are a specific type of existential query, asking if any element in some array is equal to a particular value.

    • Universal queries ask if all elements in some array satisfy a particular predicate. Empty arrays are not stored in an array index, meaning that you must also specify that the array is non-empty to tell Capella Columnar that it’s possible to use an array index as an access method for the given query.

    The examples that follow suppose the existence of a collection named products, containing two fields: productno, an integer, and categories, an array of strings in the Commerce dataset. You can follow the instructions for the Commerce example dataset to set up a standalone collection for this data.

    [
      { "productno": 347, "categories": ["Food"]},
      { "productno": 193, "categories": ["Drink"]},
      { "productno": 460, "categories": ["Food", "Frozen"]}
    ]

    You can create an array index on the categories field of the products collection as follows.

     CREATE INDEX pCategoriesIdx
     ON products (UNNEST categories:STRING)
     EXCLUDE UNKNOWN KEY;
    EXCLUDE UNKNOWN KEY is required for array indexes.

    Suppose you want to find all products that have the category "Food". The following membership query uses the pCategoriesIdx index.

     SELECT p
     FROM products p
     WHERE "Food" IN p.categories;

    You can rewrite this query as an explicit existential quantification query with an equality predicate. This also uses the pCategoriesIdx index:

     SELECT p
     FROM products p
     WHERE SOME c IN p.categories SATISFIES c = "Food";

    You can create an array index on the qty and price fields in the items array of the orders collection as follows.

     CREATE INDEX oItemsQtyPriceIdx
     ON orders (UNNEST items SELECT qty:BIGINT, price:DOUBLE)
     EXCLUDE UNKNOWN KEY;

    Now suppose you want to find all orders that only have items with large quantities and low prices. The following universal quantification query uses the oItemsQtyPriceIdx index:

     SELECT o
     FROM orders o
     WHERE LEN(o.items) > 0 AND
           (EVERY i IN o.items SATISFIES i.qty > 100 AND i.price < 5.00);

    Take note of the LEN(o.items) > 0 conjunct. Array indexes cannot be used for queries with potentially empty arrays.

    Explicit Unnesting Queries

    You can also use array indexes to accelerate queries that involve the explicit unnesting of array fields. You can express the same membership / existential example above using an explicit UNNEST query. To keep the same cardinality as the query above, that is, to undo the UNNEST, the query adds a DISTINCT clause. The pCategoriesIdx index is still used.

     SELECT DISTINCT p
     FROM products p, p.categories c
     WHERE c = "Food";

    As another example, suppose that you want to find all orders that have some item with a large quantity. The following query uses the oItemsQtyPriceIdx index, using only the qty field.

     SELECT DISTINCT o
     FROM orders o, o.items i
     WHERE i.qty > 100 AND i.price > 0;

    In this case, even though you do not want to filter the results by price, you must specify a dummy predicate on the price field so that the query optimizer can select the required index.

    Join Queries

    Finally, array indexes can also be used for index nested-loop joins if the field being joined is located within an array. You can create an array index on the itemno field in the items array of the orders collection as follows.

     CREATE INDEX oProductIDIdx
     ON orders (UNNEST items SELECT itemno:BIGINT)
     EXCLUDE UNKNOWN KEY;

    Now suppose you want to find all products located in a specific order. You can accomplish this with the join query that follows. If an index is possible for the join, the optimizer uses it if it’s the most cost-effective option. However, if you specify a join hint like indexnl as in the example that follows, Capella Columnar uses the index even if it’s more expensive than a hash join.

     SELECT DISTINCT p
     FROM products p JOIN orders o
     ON SOME i IN o.items SATISFIES i.itemno /*+ indexnl */ = to_bigint(p.productno)
         WHERE o.custid = "C41";

    Arrays in Arrays

    Array indexes are not just limited to arrays of depth 1. You can generalize array indexes to arbitrary depth, as long as an object encapsulates each array. For example, suppose the orders collection includes the qty field in a double-nested items array.

    {
      "orderno": 2001,
      "items0": [
        {
          "items1": [
            {
              "qty": 100,
              // ...
            }
          ]
        }
      ]
    }

    The following statement indexes the qty field in a double-nested items array.

     CREATE INDEX oItemItemQtyIdx
     ON orders (UNNEST items0 UNNEST items1 SELECT qty:INT)
     EXCLUDE UNKNOWN KEY;

    Similarly, suppose the orders collection includes the qty field in a triple-nested items array.

    {
      "orderno": 3001,
      "items0": [
        {
          "items1": [
            {
              "items2": [
                {
                  "qty": 100,
                  // ...
                }
              ]
            }
          ]
        }
      ]
    }

    The following statement indexes the qty field in a triple-nested items array.

     CREATE INDEX oItemItemItemQtyIdx
     ON orders (UNNEST items0 UNNEST items1 UNNEST items2 SELECT qty:BIGINT)
     EXCLUDE UNKNOWN KEY;

    The queries that follow use the indexes above. The first query uses the oItemItemQtyIdx index through nested existential quantification. The second query uses the oItemItemItemQtyIdx index with three unnesting clauses.

    SELECT o
    FROM orders o
    WHERE SOME o0 IN o.items0 SATISFIES (
        SOME o1 IN o0.items1 SATISFIES o1.qty = 100
    );
    
    SELECT DISTINCT o
    FROM orders o, o.items0 o0, o0.items1 o1, o1.items2 o2
    WHERE o2.qty = 100;