GiST index and siglen

by 12 February 2018

This post is based on the talk by Oleg Bartunov on pgConf2018.RU

It is imperative that a user be able to construct new
access methods to provide efficient access to
instances of nontraditional base types
Michael Stonebraker, Jeff Anton, Michael Hirohama.
Extendability in POSTGRES , IEEE Data Eng. Bull. 10 (2)
pp.16-23, 1987

Introduction

On February 5-7, the largest conference on PostgreSQL PgConf2018.Ru was held at Lomonosov Moscow State University. Interesting topics were sounded at the conference. For example, Bruce Momjian talked about security in Postgres, Alexander Korotkov talked about the pluggable storages, and Oleg Bartunov told about the appearance of SQL2016 standard  in 11 Release. In this blog we will try to figure out what to expect from JSON in the PostgreSQL 11.

Generalized Search Tree ( GiST)

Access Method is a hierarchy of predicates in which each predicate is executed for all subnodes of this hierarchy. GiST provides methods for navigating through the tree, effective tree and tree updates. A good example of using GiST indexes is searching by arrays:

S1 = {1,2,3,5,6,9}
S2 = {1,2,5}
S3 = {0,5,6,9}
S4 = {1,4,5,8}
S5 = {0,9}
S6 = {3,5,6,7,8}
S7 = {4,7,9}

Query:

{2,9}

Intarray – Access Method for array of integers. We can use operators overlap and contains.

RD-TREE

Tom Lane and nesting doll (Matryoshka)

RD-Tree can be compared with a nested doll. Each node contains a portion of the previous node.

The picture shows how the RD-Tree search is performed.

Searching in RD-Tree

FTS Index (GiST): RD-Tree

When developers add new functionality to PostgreSQL, they sometimes add some constants to the code. For example, the length of the signature in the GIST index is 252 bytes for FTS Index. And what if we do not need such a long signature and it is redundant?

We can shows how signatures are used in PostgreSQL using the example of Latin proverbs:

id | proverb
---+-----------------------
 1 | Ars longa, vita brevis
 2 | Ars vitae
 3 | Jus vitae ac necis
 4 | Jus generis humani
 5 | Vita nostra brevis

FTS Index (GiST): RD-Tree

As you understand the longer the signature, the more accurate the search, but this leads to an increase in the index size and slowing down the search. If the signature is not long enough, it can lead to false search and you will have to delete them at the check phase. Below is the case of a false signature:

FTS (false drop)

Opclass parameters: GiST

Parametrized GiST opclass should specify optional 10th (GIST_OPCLASSOPT_PROC) support function with signature internal (options internal, validate bool). GiST examples:

In the PostgreSQL core:

  • tsvector_ops(siglen)

In the contrib module for intarray:

  • gist__int_ops(num_ranges)
  • gist__intbig_ops(siglen)

In the contrib module for ltree:

  • gist_ltree_ops(siglen)
  • gist__ltree_ops(siglen)

In the contrib module for pg_trgm:

  • gist_trgm_ops(siglen)

Compare the size of the index and the execution time of queries, depending on the length of the signature for intarray (NULL-free arrays of integers). Create a test table and three indexes:

SELECT i AS id,
(select array_agg(round(random() * (100 - 1)) + 1)::int[]
FROM generate_series (0, (random()*100 + i * 0)::int))
AS arr INTO arrays FROM generate_series(1,1000000) i;
 Table "public.arrays" – 255MB size
Column | Type | Collation | Nullable | Default
--------+-----------+-----------+----------+---------
id | integer | | |
arr | integer[] | | |
Indexes:
 "arrays_arr_idx" gist (arr gist__intbig_ops) -- siglen = 252
 "arrays_arr_idx1" gist (arr gist__intbig_ops (siglen='64'))
 "arrays_arr_idx2" gist (arr gist__intbig_ops (siglen='32'))

The first index has a default signature length of 252 bytes, the second 64 bytes and the third 32 bytes. Executed two queries:

Q1: SELECT count(*) FROM arrays WHERE arr @> '{1,2,3}';
Q2: SELECT count(*) FROM arrays WHERE arr <@ '{1,2,3}';

Below you can see the results:

 List of relations
Schema | Name | create | Size | Q1 | Q2
--------+-----------------+--------+--------+--------+-
public | arrays_arr_idx | 28 sec| 451 MB | 641 | 387
public | arrays_arr_idx1 | 17 sec| 143 MB | 535 | 164
public | arrays_arr_idx2 | 15 sec| 86 MB | 527 | 73
(3 rows)

As you can see sometimes it is useful to be able to manage the size of the signature in the index.

Instead of concluding

To be continued.

The next part will be a review of the GIN indexes and the new SQL2016 standard in PostgreSQL 11.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *