Strange cost estimation for foreign tables

Strange cost estimation for foreign tables


As you know postgres_fdw extension using to access data stored in external PostgreSQL servers.  postgres_fdw provides has more transparent and standards-compliant syntax for accessing remote tables than dblink. Additionaly, this extension provides some options to control how cost estimation is done. One of them is “use_remote_estimate”.  The estimated cost of scanning a foreign table should be whatever it costs to be done on the remote server (and overhead for communication).

Let’s take a look at some problems of using postgres_fdw with which we are faced.


  1. PostgreSQL 10.4 with contrib module (postgres_fdw)
  2. PostgreSQL 9.6.9 (foreign server)
  3. 10 FDW tables with complex joins

I will skip creating an foreign server, user and tables, since they completely repeat the documentation. Below is the query on which our research will be based.

from "SO" so
  inner join "Re_Tr" rt on rt."E_C_D_ID" = so."Or_ID"
  inner join "BGr" rtg on rt."Gr_ID" = rtg."Gr_ID"
  inner join "Acc" cu_acc on cu_acc."Acc_ID" = so."cu_accID"
  inner join "Acc" ve_acc on ve_acc."Acc_ID" = so."ve_accID"
  left join "AccAt" vattr on vattr."Ob_ID" = ve_acc."Acc_ID" and vattr."ATID" = 'DATA'
  left join "Sub_Sc" s on s."sub_sc_ID" = rt."sub_sc_ID"
  left join "Pl_P" pp on pp."Pl_Pe_ID" = rt."Pl_Pe_ID"
  left join "Pl" sp on sp."Pl_ID" = pp."Pl_ID"
  left join "Pl" subscr_p on subscr_p."Pl_ID" = s."Pl_ID"
  left join "BMR" r on r."res_ID" = rt."res_ID"
  left join "Ve_CM" as ve_cat_m_ex_m on ve_cat_m_ex_m."TC_ID" = rt."TC_ID" and ve_cat_m_ex_m."DeT_ID" = rt."DeT_ID" and ve_cat_m_ex_m."ve_accID" = ve_acc."Ve_Ac_ID"
  left join "Ve_CM" as ve_cat_m_an_m on ve_cat_m_an_m."TC_ID" = rt."TC_ID" and ve_cat_m_an_m."DeT_ID" = 0 and ve_cat_m_an_m."ve_accID" = ve_acc."Ve_Ac_ID"
where rt."Or_ID" in (119221,137241, ...);

As you see, it’s complicated, but execution time on the local server is ~ 360 ms.  PostgreSQL optimally uses indexes and returned 1954 rows.

But let’s check what about foreign server.

First step

The documentation advises to use option “use_remote_estimate” = true. postgres_fdw will execute EXPLAIN commands remotely to obtain cost estimates.

Cost Estimation Options:

  • use_remote_estimate = true
  • fdw_startup_cost = 100
  • fdw_tuple_cost = 0.01

The following is a query plan:


 Foreign Scan  (cost=9651.73..211442.85 rows=6196 width=1348) (actual time=64.035..393.319 rows=1954 loops=1)
   Relations: ((((((((((((public."SO" so) INNER JOIN (public."Re_Tr" rt)) INNER JOIN (public."BGr" rtg)) INNER JOIN (public."Acc" cu_acc)) INNER JOIN (public."Acc" ve_acc)) LEFT JO
IN (public."AccAt" vattr)) LEFT JOIN (public."Sub_Sc" s)) LEFT JOIN (public."Pl_P" pp)) LEFT JOIN (public."Pl" sp)) LEFT JOIN (public."Pl" subscr_p)) LEFT JOIN (public."BMR" r)) LEFT JOIN (public."VertexCatMappin
g" ve_cat_m_ex_m)) LEFT JOIN (public."Ve_CM" ve_cat_m_an_m)
 Planning time: 9648.467 ms
 Execution time: 396.197 ms
(4 rows)

Postgres has executed the table joins on the remote server(push join). And it is right. The execution time is 396 ms, but what about planning time? 10 seconds! This is several times more than the execution time. The documentation says that planning time can increase, but in our case it’s catastrophic! This behavior is associated with a lot of EXPLAIN queries that are sent for cost estimation. For one join PostgreSQL sends to the remote server 7 explain queries. For 4 joins more than 30…

Second step

Fortunately postgres_fdw allows to collect and use local statistics. This should solve the problem with long planning. Let’s perform ANALYZE for all our tables:


and set “use_remote_estimate” to false:

ALTER SERVER server_fdw OPTIONS (SET use_remote_estimate 'false');

Cost Estimation Options:

  • use_remote_estimate = false
  • fdw_startup_cost = 100
  • fdw_tuple_cost = 0.01

New query plan is below:

 Hash Left Join  (cost=1971373.11..5324301.20 rows=2807 width=1348) (actual time=122731.428..122853.973 rows=1954 loops=1)
   Hash Cond: (((rt."TC_ID")::text = (ve_cat_m_an_m."TC_ID")::text) AND (ve_acc."Ve_Ac_ID" = ve_cat_m_an_m."ve_accID"))
   ->  Hash Left Join  (cost=1971037.37..5323775.37 rows=2807 width=1097) (actual time=122717.161..122835.705 rows=1954 loops=1)
         Hash Cond: (((rt."TC_ID")::text = (ve_cat_m_ex_m."TC_ID")::text) AND (rt."DeT_ID" = ve_cat_m_ex_m."DeT_ID") AND (ve_acc."Ve_Ac_ID" = ve_cat_m_ex_m."Ve
         ->  Hash Left Join  (cost=1970701.63..5323334.36 rows=2807 width=1088) (actual time=122702.809..122819.557 rows=1954 loops=1)
               Hash Cond: (rt."res_ID" = r."res_ID")
               ->  Hash Left Join  (cost=1970132.57..5322740.61 rows=2807 width=862) (actual time=122679.760..122794.669 rows=1954 loops=1)
                     Hash Cond: (s."Pl_ID" = subscr_p."Pl_ID")
                     ->  Hash Left Join  (cost=1934261.95..5281604.39 rows=2807 width=580) (actual time=121274.045..121406.249 rows=1954 loops=1)
                           Hash Cond: (pp."Pl_ID" = sp."Pl_ID")
                           ->  Hash Left Join  (cost=1898391.32..5240660.17 rows=2807 width=298) (actual time=117525.046..119646.255 rows=1954 loops=1)
                                 Hash Cond: (rt."sub_sc_ID" = s."sub_sc_ID")
                                 ->  Hash Join  (cost=1884438.98..5225061.23 rows=2807 width=266) (actual time=116165.580..118263.740 rows=1954 loops=1)
                                       Hash Cond: (so."cu_accID" = cu_acc."Acc_ID")
                                       ->  Hash Join  (cost=1871399.89..5211071.54 rows=2807 width=245) (actual time=115673.048..117761.266 rows=1954 loops=1)
                                             Hash Cond: (rt."Gr_ID" = rtg."Gr_ID")
                                             ->  Hash Left Join  (cost=1797011.59..5130473.65 rows=2807 width=240) (actual time=112213.395..114320.987 rows=1954 loops=1)
                                                   Hash Cond: (ve_acc."Acc_ID" = vattr."Ob_ID")
                                                   ->  Hash Join  (cost=1764463.28..5096750.73 rows=2807 width=238) (actual time=111110.581..113213.117 rows=1954 loops=1)
                                                         Hash Cond: (so."ve_accID" = ve_acc."Acc_ID")
                                                         ->  Hash Join  (cost=1751631.19..5083187.04 rows=2807 width=231) (actual time=110727.987..112822.988 rows=1954 loops=1)
                                                               Hash Cond: (rt."E_C_D_ID" = so."Or_ID")
                                                               ->  Hash Left Join  (cost=39516.03..3262604.60 rows=2807 width=223) (actual time=911.791..1016.880 rows=1954 loops=1)
                                                                     Hash Cond: (rt."Pl_Pe_ID" = pp."Pl_Pe_ID")
                                                                     ->  Foreign Scan on "Re_Tr" rt  (cost=100.00..3222366.98 rows=2807 width=223) (actual time=7.079..87.948 rows=1954 loops=1)
                                                                     ->  Hash  (cost=36844.49..36844.49 rows=156683 width=8) (actual time=904.511..904.511 rows=156683 loops=1)
                                                                           Buckets: 131072  Batches: 4  Memory Usage: 2557kB
                                                                           ->  Foreign Scan on "Pl_P" pp  (cost=100.00..36844.49 rows=156683 width=8) (actual time=1.409..830.206 rows=156683 loops=1)
                                                               ->  Hash  (cost=1326977.76..1326977.76 rows=22156192 width=16) (actual time=109810.216..109810.216 rows=22156193 loops=1)
                                                                     Buckets: 131072  Batches: 512  Memory Usage: 3057kB
                                                                     ->  Foreign Scan on "SO" so  (cost=100.00..1326977.76 rows=22156192 width=16) (actual time=1.950..96762.129 rows=22156193 loops=1)
                                                         ->  Hash  (cost=10991.83..10991.83 rows=105861 width=11) (actual time=382.399..382.399 rows=105861 loops=1)
                                                               Buckets: 131072  Batches: 2  Memory Usage: 3304kB
                                                               ->  Foreign Scan on "Acc" ve_acc  (cost=100.00..10991.83 rows=105861 width=11) (actual time=0.648..344.895 rows=105861 loops=1)
                                                   ->  Hash  (cost=31278.73..31278.73 rows=77327 width=6) (actual time=1102.532..1102.532 rows=79251 loops=1)
                                                         Buckets: 131072  Batches: 2  Memory Usage: 2572kB
                                                         ->  Foreign Scan on "AccAt" vattr  (cost=100.00..31278.73 rows=77327 width=6) (actual time=2.953..1051.579 rows=79251 loops=1)
                                             ->  Hash  (cost=53068.03..53068.03 rows=1226501 width=9) (actual time=3335.270..3335.270 rows=1226501 loops=1)
                                                   Buckets: 131072  Batches: 32  Memory Usage: 2558kB
                                                   ->  Foreign Scan on "BGr" rtg  (cost=100.00..53068.03 rows=1226501 width=9) (actual time=1.034..2890.583 rows=1226501 loops=1)
                                       ->  Hash  (cost=10991.83..10991.83 rows=105861 width=25) (actual time=492.360..492.360 rows=105861 loops=1)
                                             Buckets: 65536  Batches: 2  Memory Usage: 3510kB
                                             ->  Foreign Scan on "Acc" cu_acc  (cost=100.00..10991.83 rows=105861 width=25) (actual time=2.289..442.974 rows=105861 loops=1)
                                 ->  Hash  (cost=10303.24..10303.24 rows=179608 width=36) (actual time=1359.293..1359.293 rows=179608 loops=1)
                                       Buckets: 65536  Batches: 4  Memory Usage: 3505kB
                                       ->  Foreign Scan on "Sub_Sc" s  (cost=100.00..10303.24 rows=179608 width=36) (actual time=2.274..1213.749 rows=179608 loops=1)
                           ->  Hash  (cost=29483.50..29483.50 rows=126250 width=286) (actual time=1701.231..1701.231 rows=126250 loops=1)
                                 Buckets: 16384  Batches: 16  Memory Usage: 2619kB
                                 ->  Foreign Scan on "Pl" sp  (cost=100.00..29483.50 rows=126250 width=286) (actual time=1.914..1525.445 rows=126250 loops=1)
                     ->  Hash  (cost=29483.50..29483.50 rows=126250 width=286) (actual time=1329.513..1329.513 rows=126250 loops=1)
                           Buckets: 16384  Batches: 16  Memory Usage: 2631kB
                           ->  Foreign Scan on "Pl" subscr_p  (cost=100.00..29483.50 rows=126250 width=286) (actual time=0.647..1225.394 rows=126250 loops=1)
               ->  Hash  (cost=518.16..518.16 rows=4072 width=230) (actual time=23.025..23.025 rows=4072 loops=1)
                     Buckets: 4096  Batches: 1  Memory Usage: 1090kB
                     ->  Foreign Scan on "BMR" r  (cost=100.00..518.16 rows=4072 width=230) (actual time=1.804..20.670 rows=4072 loops=1)
         ->  Hash  (cost=262.52..262.52 rows=4184 width=29) (actual time=14.284..14.284 rows=4184 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 324kB
               ->  Foreign Scan on "Ve_CM" ve_cat_m_ex_m  (cost=100.00..262.52 rows=4184 width=29) (actual time=0.610..12.060 rows=4184 loops=1)
   ->  Hash  (cost=272.98..272.98 rows=4184 width=25) (actual time=14.223..14.223 rows=4184 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 308kB
         ->  Foreign Scan on "Ve_CM" ve_cat_m_an_m  (cost=100.00..272.98 rows=4184 width=25) (actual time=0.482..12.174 rows=4184 loops=1)
 Planning time: 12.344 ms
 Execution time: 122882.575 ms
(63 rows)

The query plan has changed and now postgres scans each table separately and uses hash joins. Planning time has decreased, but the execution time is huge. The query plan is not optimal. Thus, we could not decide using local statistics. If we compare the cost of these two plans, we will see that the cost of the second is more than 25 times (200 000 vs 5 000 000).

Third step

Since PostgreSQL used separate scans for each table, we can try to tell the planner that the cost of creating an foreign scan is very large. In this way, we will incline it to use fewer foreign scans.

ALTER SERVER server_fdw OPTIONS (SET fdw_startup_cost '100000000000000000');

Cost Estimation Options:

  • use_remote_estimate = false
  • fdw_startup_cost = 100000000000000000
  • fdw_tuple_cost = 0.01

New query plan:

Foreign Scan (cost=100000000000000000.00..100000077915810560.00 rows=2807 width=1348) (actual time=24.871..183.216 rows=1954 loops=1)
Relations: ((((((((((((public."SO" so) INNER JOIN (public."Re_Tr" rt)) INNER JOIN (public."BGr" rtg)) INNER JOIN (public."Acc" cu_acc)) INNER JOIN (public."Acc" ve_acc)) LEFT JO
IN (public."AccAt" vattr)) LEFT JOIN (public."Sub_Sc" s)) LEFT JOIN (public."Pl_P" pp)) LEFT JOIN (public."Pl" sp)) LEFT JOIN (public."Pl" subscr_p)) LEFT JOIN (public."BMR" r)) LEFT JOIN (public."VertexCatMappin
g" ve_cat_m_ex_m)) LEFT JOIN (public."Ve_CM" ve_cat_m_an_m)
Planning time: 12.045 ms
Execution time: 185.141 ms
(4 rows)

Great! We achieved minimum planning time and minimum execution time. But the cost of the query looks awesome. But, of course, such settings are not recommended for use in production, since they can affect other queries.

We became interested in this behavior postgres_fdw and looked into the source code. And that’s what we saw there:

* If we're not using remote estimates, stop here. We have no way to
* estimate whether any join clauses would be worth sending across, so
* don't bother building parameterized paths.
if (!fpinfo->use_remote_estimate)

If use_remote_estimate is false, then the ability to do joins remotely is not considered. The problem of push join because no information on indexes on the local server, so the local server can not accurately estimate the cost of the plan. Now PostgreSQL considers it very conservative, as a sequential search in the cross-product. Not surprisingly, it turns out so big.


We shared this problem with our colleague Konstantin Knizhnik (Postgres Professional). He helped us understand the issue and proposed a solution:

The availability of statistics for all tables allows you to evaluate the selectivity (the number of rows returned). If we use this selectivity to edit the value (assuming that we will not build the entire cross-product, but only the selected number of rows), then the cost is normal.

More details about this you can read here. We express gratitude to Konstantin for help! We hope that its patch will be accepted!


Vadim Yatsenko
No Comments

Post a Comment


This site uses Akismet to reduce spam. Learn how your comment data is processed.