Sampling in PostgreSQL

by 16 June 2017

Sometimes there are situations when you need to select a certain number of random entries from the table. Since version 9.5, this feature has appeared. Using TABLESAMPLE, you can not select all the data from the table, but only some of them.

Synopsis:

SELECT [...] FROM table_name TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]

A TABLESAMPLE clause after a table_name indicates that the specified sampling_method should be used to retrieve a subset of the rows in that table.  This sampling precedes the application of any other filters such as WHERE clauses. The standard PostgreSQL distribution includes two sampling methods, BERNOULLI and SYSTEM, and other sampling methods can be installed in the database via extensions. The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.

Suppose we have a table in which the transaction id is stored, the transaction amount and the transaction date:

CREATE TABLE foo (
 id SERIAL PRIMARY KEY,
 amount NUMERIC(10,2),
 created_at TIMESTAMP WITHOUT TIME ZONE
 );

In the table we will add 100000 records:

INSERT INTO foo (amount, created_at)
 SELECT
 (round(CAST(random() * 100000 AS NUMERIC), 2)),
 now() - random() * CAST(CONCAT(CAST(random()*10 AS INTEGER)::TEXT,' day') AS INTERVAL)
 FROM generate_series(1, 100000);

VACUUM ANALYZE foo;

The pageinspect module provides functions that allow you to inspect the contents of database pages at a low level:

CREATE EXTENSION pageinspect;

Count how many rows fit on one page:

SELECT count(*) FROM heap_page_items(get_raw_page('foo', 0));
 count
 -----
 157

Now, try to query the table with TABLESAMPLE SYSTEM (0.1):

SELECT count(*) FROM foo TABLESAMPLE SYSTEM (0.1);

count
 -----
 157

Oh! The number of rows are equal! Why did this happen? Why did not we get 100 records, but 157?

The answer is simple. Because PostgreSQL stores the data in a table as an 8 kb page array (by default, this parameter can be changed when building the server from source code) and in the method of sampling SYSTEM simply takes the necessary number of random pages for a given number of percentages and gives them “as is.” In this case, 157 entries are placed in one page.

Another example:

SELECT count(*) FROM foo TABLESAMPLE SYSTEM (1);
count
 -----
 1570

Now Postgres selected 10 pages. In our case, the ideal variant is shown, when all the data was inserted by one query. Therefore, the number of rows on the page is the same. In reality, the number of rows can be different on each of the pages. As you have already understood, the SYSTEM method is not suitable for working with more complex aggregate functions, since it builds a sample at the block level, determining for each block the chance to use it, and returns all the rows from each enabled block. In order to obtain a more distributed sample, you can use the BERNOULLI sampling method, which scans the entire table and selects random records.

For example:

SELECT min(created_at), max(created_at) FROM foo TABLESAMPLE BERNOULLI(0.1);

Compare the results:

Method min(created_at) max(created_at)
No 2017-03-10 09:40:15.864453 2017-03-20 09:31:28.800323
System(0.1) 2017-03-11 01:25:00.236133 2017-03-20 09:31:28.800323
Bernoulli(0.1) 2017-03-10 18:18:29.025916 2017-03-20 09:31:28.800323

The sampling by the SYSTEM method works faster, but the accuracy is lower, while the BERNOULLI sampling works slower, but the accuracy is higher. You can choose the trade-off between speed and accuracy. Also note that for sampling a new type of scan is used: Sample scan.

Dependence of sampling accuracy on the coefficient

IMG_2.png (836×517)

Dependence of sampling accuracy on the coefficient

 

As you can see, the SYSTEM method loses more in performance as the percentage of the sample increases. This is logical, since BERNOULLI did the full scan as it does, while SYSTEM should return 10 times more pages.

As a result, it can be noted that the SYSTEM method at a small percentage of the sample works much faster than BERNOULLI, but it gives a less random sample of records. But with the fascination of the percent this advantage is lost.

Using the optional keyword REPEATABLE, we can specify a seed for the random variable generator. If two queries have the same sampling method, sampling percentage and seed, then the same sample will be selected for the two queries.

SELECT max(amount) FROM foo TABLESAMPLE SYSTEM (0.1) REPEATABLE (100);
max
------
97513.63

Let’s repeat:

SELECT max(amount) FROM foo TABLESAMPLE SYSTEM (0.1) REPEATABLE (100);
max
------
97513.63

 

Resume:

Sampling methods in PostgreSQL help solve some problems, where it is not necessary to know the exact value, but speed is very important. But you should use these methods with caution.

Comments

  • No comment

Leave a Reply

Your email address will not be published. Required fields are marked *