Joins using LIKE or why PostgreSQL FTS is a powerful alternative

Joins using LIKE or why PostgreSQL FTS is a powerful alternative

Introduction

In the scope of the migration project from Oracle to PostgreSQL, the one of our clients DBA team faced a complicated performance issue. This issue seemed simple at first glance, but it took a lot of time and research. The formula for this issue is simple: LIKE/ILIKE operator + JOIN operator. Let’s take a closer look at this below.

Part 1. Slow JOIN using LIKE operator

A fairly common situation of using the LIKE/ILIKE operator is a join of two strings when one contains another string. For example:

SELECT 'Microsoft Windows XP' ILIKE '%Windows%XP%';
------------- ?column? t (1 row)

This construct lends itself well to indexing. Postgres can use B-Tree index for like/ilike search only with text_pattern_ops/varchar_pattern_ops operators or “C” COLLATION. But in this case index can be used if the LIKE predicate is a plain text.

SELECT * FROM foo WHERE value ILIKE '%Windows%XP%';

But what if we need to join two tables using LIKE? In this case, using B-tree indexes is not possible. Unfortunately, PostgreSQL does not support such index joins. But it’s not as bad as you might think. PostgreSQL has other indexes that can handle these joins. Of course, we are talking about the GiST and GIN indexes.

These indexes can be used with pg_trgm extension. GiST/GIN indexes support trigram-based index searches for LIKE, ILIKE, ~ and ~* queries. For more effective searching Postgres can mix Btree and GiST/GIN indexes in a general index using btree_gist/btree_gin extension. These extensions are included in the standard contrib package and you can easily install them by executing:

CREATE EXTENSION pg_trgm;
CREATE EXTENSION btree_gin; --(or btree_gist)

This is what we thought when we started solving a performance issue for our client. But…there are nuances. If the tables contain a lot of rows, then the search time can be quite significant. Let’s look at our example. There are two tables, first(“foo”) with tens of millions of rows, and the second(“bar”) with only about 1600 rows. Let’s create btree_gin index:

CREATE INDEX test1 ON foo USING BTREE (attr_id, value gin_trgm_ops);

And execute the query like:

EXPLAIN (ANALYZE, VERBOSE, BUFFERS) 
select epg_sav.attr_id
       ,sav.id
       ,sav.value
       ,epg_sav.attr_value
from foo sav
join bar epg_sav on epg_sav.attr_id = sav.attr_id
where sav.value ilike epg_sav.attr_value
and sav.attr_id in (1,4,5,6,14,18,32075,32080,32086,32106,32115,32117);

Let’s check the plan. I have displayed only the part of the plan that is important to us. This is the number of buffers that were read during the execution of the query. 

QUERY PLAN
----------------------------------------------------------------------------------------------------------
Nested Loop (cost=177.38..644316.64 rows=603266 width=81) (actual time=294.146..536430.966 rows=7976 loops=1)
Output: epg_sav.group_id, epg_sav.attr_id, sav.id, sav.value, epg_sav.attr_value
Buffers: shared hit=15129127
...
Timing: Generation 1.128 ms, Inlining 59.265 ms, Optimization 58.373 ms, Emission 40.618 ms, Total 159.383 ms
Execution Time: 536479.459 ms
(21 rows)

shared hit = 15129127 indicates that Postgres read about 115GB of data. That’s a lot for such a query. But why is it so? The index(test1) is in use, but the query is still slow and reads a lot of data. It’s all about the nature of trigrams. The inhibition of pg_trgm, I suspect, is due to the fact that trigram search is very inaccurate. Index scan returns a lot of redundant data and then this data is rechecked against the table. In conclusion more pages are readed in total than in both tables contains. In PostgreSQL 13, this issue fixed. But it is not yet known how much this will speed up the search in our case.

Part 2. Full-text search

As the above solution not provide required performance wise we considered to use Full Text Search (FTS) functionality available OOTB. But for this we need to change the search syntax, since FTS is different from LIKE.

Step 1

Create a field that will be automatically generated with a special type of tsvector for FTS. In this field, add the “attr_id” field that previously participated in the B-Tree index:

ALTER TABLE foo ADD COLUMN ts_vector_value tsvector GENERATED ALWAYS AS (to_tsvector('english', coalesce(attr_id::text,'')||' '||coalesce(value,''))) STORED;

Step 2

Create a GIN index on tsvector field:

CREATE INDEX test2 ON foo USING GIN (ts_vector_value);

Step 3

Rewrite the data of the “bar” table to support FTS search syntax like:

COPY bar (attr_id, attr_value) FROM stdin;
6 6 <-> Adobe & Acrobat
6 6 <-> Adobe & Reader
5 5 <-> Microsoft & Office
6 6 <-> Microsoft <-> Project
4 4 <-> Mozilla <-> Firefox
1 1 <-> Adobe & Reader & X & 10.:*

General rules for updates:
1. Searching for two words in the same string without specific order between words: WORD1 & WORD2
2. Phrase search searching for consequential words: WORD1 <-> WORD2
3. Prefix search: WORD1:*

Step 4

Rewrite the query and remove “epg_sav.attr_id = sav.attr_id” join, because we have it in tsvector. Replace query to FTS syntax like:

EXPLAIN ANALYZE select epg_sav.group_id, 
epg_sav.attr_id, 
sav.id, 
sav.value , 
epg_sav.attr_value 
from foo sav 
join bar epg_sav on ts_vector_value @@ to_tsquery(epg_sav.attr_value) and sav.attr_id in (1,4,5,6,14,18,32075,32080,32086,32106,32115,32117);

And check the query plan:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------ Nested Loop (cost=1794.89..486914427.95 rows=22432777 width=88) (actual time=151.487..1560.651 rows=7976 loops=1) -> Seq Scan on bar epg_sav (cost=0.00..34.12 rows=1612 width=53) (actual time=0.010..0.380 rows=1612 loops=1) -> Bitmap Heap Scan on foo sav (cost=1794.89..301916.92 rows=13916 width=100) (actual time=0.926..0.966 rows=5 loops=1612) Recheck Cond: (ts_vector_value @@ to_tsquery((epg_sav.attr_value)::text)) Rows Removed by Index Recheck: 2 Filter: (attr_id = ANY ('{1,4,5,6,14,18,32075,32080,32086,32106,32115,32117}'::bigint[])) Rows Removed by Filter: 0 Heap Blocks: exact=8362 -> Bitmap Index Scan on test2 (cost=0.00..1791.41 rows=223599 width=0) (actual time=0.817..0.817 rows=7 loops=1612) Index Cond: (ts_vector_value @@ to_tsquery((epg_sav.attr_value)::text)) Planning Time: 0.265 ms JIT: Functions: 9 Options: Inlining true, Optimization true, Expressions true, Deforming true Timing: Generation 1.464 ms, Inlining 8.519 ms, Optimization 84.110 ms, Emission 55.854 ms, Total 149.946 ms Execution Time: 1562.618 ms (16 rows)

Summary

As you can see, the problem was solved. Query execution time has become quite small compared to LIKE/ILIKE. The solution described above is, of course, a special case. Perhaps in your case the LIKE operator will be OK, but sometimes its capabilities may not be enough. And maybe FTS can help you. Full-text search is a powerful functionality in PostgreSQL and can significantly optimize queries.

Vadim Yatsenko
No Comments

Post a Comment

Comment
Name
Email
Website

This site uses Akismet to reduce spam. Learn how your comment data is processed.