
PostgreSQL is not using a straightforward index


















I have a PostgreSQL 10.6 database on Amazon RDS. My table is like this:



CREATE TABLE dfo_by_quarter (
    release_key int4 NOT NULL,
    country varchar(100) NOT NULL,
    product_group varchar(100) NOT NULL,
    distribution_type varchar(100) NOT NULL,
    "year" int2 NOT NULL,
    "date" date NULL,
    quarter int2 NOT NULL,
    category varchar(100) NOT NULL,
    units numeric(38,6) NOT NULL,
    sales_value_eur numeric(38,6) NOT NULL,
    sales_value_usd numeric(38,6) NOT NULL,
    sales_value_local numeric(38,6) NOT NULL,
    data_status bpchar(1) NOT NULL,
    panel_market_units numeric(38,6) NOT NULL,
    panel_market_sales_value_eur numeric(38,6) NOT NULL,
    panel_market_sales_value_usd numeric(38,6) NOT NULL,
    panel_market_sales_value_local numeric(38,6) NOT NULL,
    CONSTRAINT pk_dpretailer_dfo_by_quarter PRIMARY KEY (release_key, country, category, product_group, distribution_type, year, quarter),
    CONSTRAINT fk_dpretailer_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id)
);


I understand the primary key implies a unique index.



If I simply ask how many rows I have when filtering on non-existing data (release_key = 1 returns nothing), I can see it uses the index:



EXPLAIN
SELECT COUNT(*)
FROM dpretailer.dfo_by_quarter
WHERE release_key = 1

Aggregate  (cost=6.32..6.33 rows=1 width=8)
  ->  Index Only Scan using pk_dpretailer_dfo_by_quarter on dfo_by_quarter  (cost=0.55..6.32 rows=1 width=0)
        Index Cond: (release_key = 1)


But if I run the same query on a value that returns data, it scans the table, which is bound to be more expensive...



EXPLAIN
SELECT COUNT(*)
FROM dpretailer.dfo_by_quarter
WHERE release_key = 2

Finalize Aggregate  (cost=47611.07..47611.08 rows=1 width=8)
  ->  Gather  (cost=47610.86..47611.07 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=46610.86..46610.87 rows=1 width=8)
              ->  Parallel Seq Scan on dfo_by_quarter  (cost=0.00..46307.29 rows=121428 width=0)
                    Filter: (release_key = 2)


I get that using the index when there is no data makes sense and is driven by the stats on the table (I ran ANALYZE before the tests).
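
For what it's worth, one can look at exactly what the planner believes about that column; this is a diagnostic sketch against the standard pg_stats view, not output from my actual tests:

SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'dpretailer'
  AND tablename = 'dfo_by_quarter'
  AND attname = 'release_key';   -- the statistics ANALYZE gathered for release_key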



But why is it not using my index when there is data?



Surely, it must be quicker to scan part of an index (because release_key is the first column) rather than scanning the entire table?



I must be missing something...?



Update 2019-03-07



Thank you for your comments, which are very useful.



This simple query was just me trying to understand why the index was not used...



But I should have known better (I am new to PostgreSQL but have many years of experience with SQL Server), and it makes sense that it is not used, for the reasons you commented on:



  • bad selectivity, because my criterion only matches about 20% of the rows (a quick way to check this is sketched after the list)

  • bad table design (too fat, which we knew and are now addressing)

  • index not "covering" the query, etc...

So let me change my question slightly, if I may...



Our table will be normalised into facts/dimensions (no more varchars in the wrong place).



We do only inserts, never updates, and so few deletes that we can ignore them.



The table size will not be huge (on the order of tens of millions of rows).



Our queries will ALWAYS specify an exact release_key value.



Our new version of the table would look like this:



CREATE TABLE dfo_by_quarter (
    release_key int4 NOT NULL,
    country_key int2 NOT NULL,
    product_group_key int2 NOT NULL,
    distribution_type_key int2 NOT NULL,
    category_key int2 NOT NULL,
    "year" int2 NOT NULL,
    "date" date NULL,
    quarter int2 NOT NULL,
    units numeric(38,6) NOT NULL,
    sales_value_eur numeric(38,6) NOT NULL,
    sales_value_usd numeric(38,6) NOT NULL,
    sales_value_local numeric(38,6) NOT NULL,
    CONSTRAINT pk_milly_dfo_by_quarter PRIMARY KEY (release_key, country_key, category_key, product_group_key, distribution_type_key, year, quarter),
    CONSTRAINT fk_milly_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id),
    CONSTRAINT fk_milly_dim_dfo_category FOREIGN KEY (category_key) REFERENCES milly.dim_dfo_category(category_key),
    CONSTRAINT fk_milly_dim_dfo_country FOREIGN KEY (country_key) REFERENCES milly.dim_dfo_country(country_key),
    CONSTRAINT fk_milly_dim_dfo_distribution_type FOREIGN KEY (distribution_type_key) REFERENCES milly.dim_dfo_distribution_type(distribution_type_key),
    CONSTRAINT fk_milly_dim_dfo_product_group FOREIGN KEY (product_group_key) REFERENCES milly.dim_dfo_product_group(product_group_key)
);


With that in mind, in a SQL Server environment, I could solve this by having a "Clustered" primary key (the entire table being sorted), or having an index on the primary key with INCLUDE option for the other columns required to cover the queries (Units, Values, etc).



Question 1



In PostgreSQL, is there an equivalent to the SQL Server clustered index? A way to actually sort the entire table? I suppose it might be difficult because PostgreSQL does not do updates "in place", hence it might make sorting expensive...



Or is there a way to create something like a SQL Server index WITH INCLUDE (units, values)?



Update: I came across the CLUSTER command, which is the closest thing, I suppose.
It would be suitable for us.
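
For reference, this is roughly what such a covering index would look like on PostgreSQL 11 or later, where CREATE INDEX gained an INCLUDE clause (not available on my 10.6 instance; the index name is made up):

CREATE INDEX dfo_by_quarter_release_year_covering
    ON milly.dfo_by_quarter (release_key, "year", product_group_key)
    INCLUDE (units, sales_value_eur);   -- payload columns stored only in the leaf pages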



Question 2



With the query below



EXPLAIN (ANALYZE, BUFFERS)
WITH "rank_query" AS
(
SELECT
ROW_NUMBER() OVER(PARTITION BY "year" ORDER BY SUM("main"."units") DESC) AS "rank_by",
"year",
"main"."product_group_key" AS "productgroupkey",
SUM("main"."units") AS "salesunits",
SUM("main"."sales_value_eur") AS "salesvalue",
SUM("sales_value_eur")/SUM("units") AS "asp"
FROM "milly"."dfo_by_quarter" AS "main"

WHERE
"release_key" = 17 AND
"main"."year" >= 2010
GROUP BY
"year",
"main"."product_group_key"
)
,BeforeLookup
AS (
SELECT
"year" AS date,
SUM("salesunits") AS "salesunits",
SUM("salesvalue") AS "salesvalue",
SUM("salesvalue")/SUM("salesunits") AS "asp",
CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END AS "productgroupkey"
FROM
"rank_query"
GROUP BY
"year",
CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END
)
SELECT BL.date, BL.salesunits, BL.salesvalue, BL.asp
FROM BeforeLookup AS BL
INNER JOIN milly.dim_dfo_product_group PG ON PG.product_group_key = BL.productgroupkey;


I get this



Hash Join  (cost=40883.82..40896.46 rows=558 width=98) (actual time=676.565..678.308 rows=663 loops=1)
  Hash Cond: (bl.productgroupkey = pg.product_group_key)
  Buffers: shared hit=483 read=22719
  CTE rank_query
    ->  WindowAgg  (cost=40507.15..40632.63 rows=5577 width=108) (actual time=660.076..668.272 rows=5418 loops=1)
          Buffers: shared hit=480 read=22719
          ->  Sort  (cost=40507.15..40521.09 rows=5577 width=68) (actual time=660.062..661.226 rows=5418 loops=1)
                Sort Key: main.year, (sum(main.units)) DESC
                Sort Method: quicksort  Memory: 616kB
                Buffers: shared hit=480 read=22719
                ->  Finalize HashAggregate  (cost=40076.46..40160.11 rows=5577 width=68) (actual time=648.762..653.227 rows=5418 loops=1)
                      Group Key: main.year, main.product_group_key
                      Buffers: shared hit=480 read=22719
                      ->  Gather  (cost=38710.09..39909.15 rows=11154 width=68) (actual time=597.878..622.379 rows=11938 loops=1)
                            Workers Planned: 2
                            Workers Launched: 2
                            Buffers: shared hit=480 read=22719
                            ->  Partial HashAggregate  (cost=37710.09..37793.75 rows=5577 width=68) (actual time=594.044..600.494 rows=3979 loops=3)
                                  Group Key: main.year, main.product_group_key
                                  Buffers: shared hit=480 read=22719
                                  ->  Parallel Seq Scan on dfo_by_quarter main  (cost=0.00..36019.74 rows=169035 width=22) (actual time=106.916..357.071 rows=137171 loops=3)
                                        Filter: ((year >= 2010) AND (release_key = 17))
                                        Rows Removed by Filter: 546602
                                        Buffers: shared hit=480 read=22719
  CTE beforelookup
    ->  HashAggregate  (cost=223.08..238.43 rows=558 width=102) (actual time=676.293..677.167 rows=663 loops=1)
          Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
          Buffers: shared hit=480 read=22719
          ->  CTE Scan on rank_query  (cost=0.00..139.43 rows=5577 width=70) (actual time=660.079..672.978 rows=5418 loops=1)
                Buffers: shared hit=480 read=22719
  ->  CTE Scan on beforelookup bl  (cost=0.00..11.16 rows=558 width=102) (actual time=676.296..677.665 rows=663 loops=1)
        Buffers: shared hit=480 read=22719
  ->  Hash  (cost=7.34..7.34 rows=434 width=4) (actual time=0.253..0.253 rows=435 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 24kB
        Buffers: shared hit=3
        ->  Seq Scan on dim_dfo_product_group pg  (cost=0.00..7.34 rows=434 width=4) (actual time=0.017..0.121 rows=435 loops=1)
              Buffers: shared hit=3
Planning time: 0.319 ms
Execution time: 678.714 ms


Does anything spring to mind?



If I read it properly, it means my biggest cost by far is the initial scan of the table... but I don't manage to make it use an index...



I had created an index I hoped would help but it got ignored...



CREATE INDEX eric_silly_index ON milly.dfo_by_quarter(release_key, YEAR, date, product_group_key, units, sales_value_eur);

ANALYZE milly.dfo_by_quarter;


I also tried to cluster the table, but with no visible effect either:



CLUSTER milly.dfo_by_quarter USING pk_milly_dfo_by_quarter; -- took 30 seconds (uidev)

ANALYZE milly.dfo_by_quarter;


Many thanks



Eric










postgresql indexing amazon-rds postgresql-performance

asked Mar 6 at 17:50 by Eric Mamet, edited Mar 7 at 22:59
























  • How many rows in your table? How many with release_key = 2? Is there (concurrent) write load on the table? Do you see an index-only scan after VACUUM dpretailer.dfo_by_quarter?

    – Erwin Brandstetter
    Mar 6 at 18:44











  • Also, can you replace the EXPLAINs with output from EXPLAIN (ANALYZE, BUFFERS)? That would give us concrete timings and shared cache hits/misses.

    – Ancoron
    Mar 6 at 18:48















3 Answers
Generally, while possible, a PK spanning 7 columns, several of them varchar(100), is not optimized for performance, to say the least.

Such an index is large to begin with and tends to bloat quickly if you have updates on involved columns.



I would operate with a surrogate PK, a serial (or bigserial if you have that many rows). Or IDENTITY. See:



  • Auto increment table column

And a UNIQUE constraint on all 7 to enforce uniqueness (all are NOT NULL anyway).



If you have lots of counting queries with the only predicate on release_key, consider an additional plain btree index on just that column.



The data type varchar(100) for so many columns may not be optimal. Some normalization might help.



More advice depends on missing information ...
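
A minimal sketch of that layout (names are placeholders; measure columns trimmed for brevity):

CREATE TABLE dfo_by_quarter (
    dfo_by_quarter_id bigserial PRIMARY KEY,   -- surrogate key
    release_key       int4         NOT NULL,
    country           varchar(100) NOT NULL,
    product_group     varchar(100) NOT NULL,
    distribution_type varchar(100) NOT NULL,
    "year"            int2         NOT NULL,
    quarter           int2         NOT NULL,
    category          varchar(100) NOT NULL,
    -- measures (units, sales_value_*, ...) as in the original table
    CONSTRAINT dfo_by_quarter_nat_uni UNIQUE
        (release_key, country, category, product_group, distribution_type, "year", quarter)
);

-- plain btree for queries whose only predicate is on release_key:
CREATE INDEX dfo_by_quarter_release_key_idx ON dfo_by_quarter (release_key);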






– Erwin Brandstetter, answered Mar 6 at 18:50























  • Voting +1 for this answer since adding an index on the filter column increases the chance of getting an index-only scan. Once that is in place (remember that creating a foreign key does not create an index), the primary key should be irrelevant.

    – Ancoron
    Mar 6 at 19:08











  • I ran this particular query purely as an exercise to try and get an index scan; the original query was far more complex. In "real scenarios", I need the other columns. There was no other operation running concurrently. There are probably about 300,000 rows for this release_key value and a few million overall. At this stage, my table is "fat" (varchar columns) because it was a quick and dirty implementation, but I'll make it look more like facts/dimensions very soon (hence surrogate keys instead of varchars).

    – Eric Mamet
    Mar 6 at 22:23












  • @EricMamet: Then the question might be misleading. All details matter. Postgres decides the query plan based on estimated costs. "A few millions" is too fuzzy. It could be 2 million or 9 million, which makes a big difference.

    – Erwin Brandstetter
    Mar 6 at 22:28












  • @ErwinBrandstetter I understand that and I'll come back with more details tomorrow. In SQL Server, a request like this would always use the index because it would not need to look at the table at all and would only sequentially scan a subset of the index (regardless of the number of rows). I was wondering whether postgresql would behave the same. Obviously not!

    – Eric Mamet
    Mar 6 at 22:43











  • @EricMamet: It depends. For index-only scans like we see in your first query plan, some preconditions have to be met.

    – Erwin Brandstetter
    Mar 7 at 0:13
































The answer to my initial question (why is PostgreSQL not using my index on something like SELECT COUNT(*)?) can be found in the documentation...



Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT



In particular: "This means that every time a row is read from an index, the engine has to also read the actual row in the table to ensure that the row hasn't been deleted."



This goes a long way toward explaining why I can't get PostgreSQL to use my indexes when, from a SQL Server perspective, it obviously "should".
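
One practical note (a sketch, not from the linked page): VACUUM maintains the visibility map that lets index-only scans skip those heap checks for all-visible pages, and EXPLAIN ANALYZE reports how many checks remained:

VACUUM (ANALYZE) dpretailer.dfo_by_quarter;

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM dpretailer.dfo_by_quarter
WHERE release_key = 1;
-- on the Index Only Scan node, "Heap Fetches: N" counts the rows whose
-- visibility still had to be confirmed against the table itself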


















































Because release_key isn't actually a unique column, it's not possible from the information you've provided to know whether or not the index should be used. If a high percentage of rows have release_key = 2, or even a smaller percentage match on a large table, it may not be efficient to use the index.

In part this is because Postgres indexes are indirect: the index actually contains a pointer to the location on disk in the heap where the real tuple lives. So looping through an index requires reading an entry from the index, reading the tuple from the heap, and repeating. For a large number of tuples it's often more valuable to scan the heap directly and avoid the indirect disk access penalty.

Edit:
You generally don't want to be using CLUSTER in PostgreSQL; it's not how indexes are maintained, and it's rare to see it in the wild for that reason.



Your updated query with no data gives this plan:



                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on beforelookup bl  (cost=8.33..8.35 rows=1 width=98) (actual time=0.143..0.143 rows=0 loops=1)
   Buffers: shared hit=4
   CTE rank_query
     ->  WindowAgg  (cost=8.24..8.26 rows=1 width=108) (actual time=0.126..0.126 rows=0 loops=1)
           Buffers: shared hit=4
           ->  Sort  (cost=8.24..8.24 rows=1 width=68) (actual time=0.060..0.061 rows=0 loops=1)
                 Sort Key: main.year, (sum(main.units)) DESC
                 Sort Method: quicksort  Memory: 25kB
                 Buffers: shared hit=4
                 ->  GroupAggregate  (cost=8.19..8.23 rows=1 width=68) (actual time=0.011..0.011 rows=0 loops=1)
                       Group Key: main.year, main.product_group_key
                       Buffers: shared hit=1
                       ->  Sort  (cost=8.19..8.19 rows=1 width=64) (actual time=0.011..0.011 rows=0 loops=1)
                             Sort Key: main.year, main.product_group_key
                             Sort Method: quicksort  Memory: 25kB
                             Buffers: shared hit=1
                             ->  Index Scan using pk_milly_dfo_by_quarter on dfo_by_quarter main  (cost=0.15..8.18 rows=1 width=64) (actual time=0.003..0.003 rows=0 loops=1)
                                   Index Cond: ((release_key = 17) AND (year >= 2010))
                                   Buffers: shared hit=1
   CTE beforelookup
     ->  HashAggregate  (cost=0.04..0.07 rows=1 width=102) (actual time=0.128..0.128 rows=0 loops=1)
           Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
           Buffers: shared hit=4
           ->  CTE Scan on rank_query  (cost=0.00..0.03 rows=1 width=70) (actual time=0.127..0.127 rows=0 loops=1)
                 Buffers: shared hit=4
 Planning Time: 0.723 ms
 Execution Time: 0.485 ms
(27 rows)


So PostgreSQL is entirely capable of using the index for your query, but the planner is deciding that it's not worth it (i.e., the cost of using the index directly is higher than the cost of the parallel sequential scan).



If you set enable_indexscan = off; with no data, you get a bitmap index scan (as I'd expect). If you set enable_bitmapscan = off; with no data, you get a (non-parallel) sequential scan.



You should see the plan change back (with large amounts of data) if you set max_parallel_workers = 0.
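
For completeness, this is the kind of session-local experiment meant here (diagnostics only, never production settings; the sample query is illustrative):

SET enable_indexscan = off;    -- planner falls back to a bitmap index scan
SET enable_bitmapscan = off;   -- ...and then to a plain sequential scan
SET max_parallel_workers = 0;  -- or max_parallel_workers_per_gather = 0

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM milly.dfo_by_quarter WHERE release_key = 17;

RESET ALL;                     -- restore the session defaults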



But looking at your query's explain results, I'd very much expect using the index to be more expensive and take longer than using the parallel sequential scan. In your updated query you're still scanning a very high percentage of the table and a large number of rows, and you're also forcing heap access by referencing fields not in the index. Postgres 11 (I believe) adds covering indexes, which would theoretically allow you to make this query be driven by the index alone, but I'm not at all convinced it would actually be worth it in this example.































  • I don't have access to the database right now, so I won't be able to supply more details this evening. Coming from a SQL Server background, I understand the "indirect" aspect of the index, and the same would apply to SQL Server if I referred to any column not covered by the index. But in my example, I made a point of doing just a COUNT(*) which, in SQL Server, would be resolved without accessing the underlying table, since the row count could be worked out from the index.

    – Eric Mamet
    Mar 6 at 22:31











  • @EricMamet I updated my answer for your updated question.

    – jcoleman
    Mar 8 at 13:03










    Your Answer






    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "1"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55029323%2fpostgresql-is-not-using-a-straight-forward-index%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1














    Generally, while possible, a PK spanning 7 columns, several of which being varchar(100) is not optimized for performance, to say the least.



    Such an index is large to begin with and tends to bloat quickly, if you have updates on involved columns.



    I would operate with a surrogate PK, a serial (or bigserial if you have that many rows). Or IDENTITY. See:



    • Auto increment table column

    And a UNIQUE constraint on all 7 to enforce uniqueness (all are NOT NULL anyway).



    If you have lots of counting queries with the only predicate on release_key consider an additional plain btree index on just that column.



    The data type varchar(100) for so many columns may not be optimal. Some normalization might help.



    More advise depends on missing information ...






    share|improve this answer























    • Voting +1 for this answer since adding an index on the filter column increases the chance of getting an index-only scan. Once that is in-place (remember that creating a foreign key does not create and index), the primary key should be irrelevant.

      – Ancoron
      Mar 6 at 19:08











    • * I ran this particular query purely as an exercise to try and get an index scan. The original query was far more complex. * In "real scenarios", I need the other columns. * There was no other operation concurrently * There is probably about 300,000 rows for this release_key value and a few millions overall * At this stage, my table is "fat" (varchar columns) because it was a quick and dirty implementation but I'll make it look more like fact/dimensions very soon (therefore surrogate keys instead of varchars)

      – Eric Mamet
      Mar 6 at 22:23












    • @EricMamet: Then the question might be misleading. All details matter. Postgres decides the query plan based on estimated costs. "A few millions" is too fuzzy. Could be 2 or 9 millions, which makes a big difference.

      – Erwin Brandstetter
      Mar 6 at 22:28












    • @ErwinBrandstetter I understand that and I'll come back with more details tomorrow. In SQL Server, a request like this would always use the index because it would not need to look at the table at all and would only sequentially scan a subset of the index (regardless of the number of rows). I was wondering whether postgresql would behave the same. Obviously not!

      – Eric Mamet
      Mar 6 at 22:43











    • @EricMamet: It depends. For index-only scans like we see in your first query plan, some preconditions have to be met.

      – Erwin Brandstetter
      Mar 7 at 0:13















    1














    Generally, while possible, a PK spanning 7 columns, several of which being varchar(100) is not optimized for performance, to say the least.



    Such an index is large to begin with and tends to bloat quickly, if you have updates on involved columns.



    I would operate with a surrogate PK, a serial (or bigserial if you have that many rows). Or IDENTITY. See:



    • Auto increment table column

    And a UNIQUE constraint on all 7 to enforce uniqueness (all are NOT NULL anyway).



    If you have lots of counting queries with the only predicate on release_key consider an additional plain btree index on just that column.



    The data type varchar(100) for so many columns may not be optimal. Some normalization might help.



    More advise depends on missing information ...






    answered Mar 6 at 18:50
    Erwin Brandstetter












    • Voting +1 for this answer since adding an index on the filter column increases the chance of getting an index-only scan. Once that is in place (remember that creating a foreign key does not create an index), the primary key should be irrelevant.

      – Ancoron
      Mar 6 at 19:08











    • * I ran this particular query purely as an exercise to try and get an index scan. The original query was far more complex. * In "real scenarios", I need the other columns. * There was no other operation running concurrently. * There are probably about 300,000 rows for this release_key value and a few million overall. * At this stage, my table is "fat" (varchar columns) because it was a quick and dirty implementation, but I'll make it look more like fact/dimensions very soon (hence surrogate keys instead of varchars)

      – Eric Mamet
      Mar 6 at 22:23












    • @EricMamet: Then the question might be misleading. All details matter. Postgres decides the query plan based on estimated costs. "A few million" is too fuzzy. It could be 2 million or 9 million, which makes a big difference.

      – Erwin Brandstetter
      Mar 6 at 22:28












    • @ErwinBrandstetter I understand that, and I'll come back with more details tomorrow. In SQL Server, a query like this would always use the index, because it would not need to look at the table at all and would only sequentially scan a subset of the index (regardless of the number of rows). I was wondering whether PostgreSQL would behave the same. Obviously not!

      – Eric Mamet
      Mar 6 at 22:43











    • @EricMamet: It depends. For index-only scans like we see in your first query plan, some preconditions have to be met.

      – Erwin Brandstetter
      Mar 7 at 0:13

















    The answer to my initial question (why is PostgreSQL not using my index for something like SELECT COUNT(*)?) can be found in the documentation:



    Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT



    In particular: "This means that every time a row is read from an index, the engine has to also read the actual row in the table to ensure that the row hasn't been deleted."



    This explains a lot about why I couldn't get PostgreSQL to use my indexes when, from a SQL Server perspective, it obviously "should".
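    A hedged way to see this mechanism at work (the release_key value is borrowed from the plan in the other answer; the output will vary): after a VACUUM the visibility map is current, which is one of the preconditions for a true index-only scan:

    -- Refresh statistics and the visibility map, then check the plan.
    VACUUM (ANALYZE) dfo_by_quarter;

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*)
    FROM dfo_by_quarter
    WHERE release_key = 17;
    -- "Heap Fetches: 0" under an Index Only Scan node means the table itself
    -- was never visited.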






    answered Mar 8 at 12:14
    Eric Mamet














    Because release_key isn't actually a unique column, it's not possible, from the information you've provided, to know whether or not the index should be used. If a high percentage of rows have release_key = 2, or even if a smaller percentage match on a large table, it may not be efficient to use the index.



    In part this is because Postgres indexes are indirect -- that is, the index actually contains a pointer to the location on disk in the heap where the real tuple lives. So looping through an index requires reading an entry from the index, reading the tuple from the heap, and repeating. For a large number of tuples it's often cheaper to scan the heap directly and avoid the indirect disk-access penalty.



            Edit:
            You generally don't want to be using CLUSTER in PostgreSQL; it's not how indexes are maintained, and it's rare to see that in the wild for that reason.



            Your updated query with no data gives this plan:



                                                                               QUERY PLAN
            ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
             CTE Scan on beforelookup bl  (cost=8.33..8.35 rows=1 width=98) (actual time=0.143..0.143 rows=0 loops=1)
               Buffers: shared hit=4
               CTE rank_query
                 ->  WindowAgg  (cost=8.24..8.26 rows=1 width=108) (actual time=0.126..0.126 rows=0 loops=1)
                       Buffers: shared hit=4
                       ->  Sort  (cost=8.24..8.24 rows=1 width=68) (actual time=0.060..0.061 rows=0 loops=1)
                             Sort Key: main.year, (sum(main.units)) DESC
                             Sort Method: quicksort  Memory: 25kB
                             Buffers: shared hit=4
                             ->  GroupAggregate  (cost=8.19..8.23 rows=1 width=68) (actual time=0.011..0.011 rows=0 loops=1)
                                   Group Key: main.year, main.product_group_key
                                   Buffers: shared hit=1
                                   ->  Sort  (cost=8.19..8.19 rows=1 width=64) (actual time=0.011..0.011 rows=0 loops=1)
                                         Sort Key: main.year, main.product_group_key
                                         Sort Method: quicksort  Memory: 25kB
                                         Buffers: shared hit=1
                                         ->  Index Scan using pk_milly_dfo_by_quarter on dfo_by_quarter main  (cost=0.15..8.18 rows=1 width=64) (actual time=0.003..0.003 rows=0 loops=1)
                                               Index Cond: ((release_key = 17) AND (year >= 2010))
                                               Buffers: shared hit=1
               CTE beforelookup
                 ->  HashAggregate  (cost=0.04..0.07 rows=1 width=102) (actual time=0.128..0.128 rows=0 loops=1)
                       Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
                       Buffers: shared hit=4
                       ->  CTE Scan on rank_query  (cost=0.00..0.03 rows=1 width=70) (actual time=0.127..0.127 rows=0 loops=1)
                             Buffers: shared hit=4
             Planning Time: 0.723 ms
             Execution Time: 0.485 ms
            (27 rows)


    So PostgreSQL is entirely capable of using the index for your query, but the planner is deciding that it's not worth it (i.e., the estimated cost of using the index directly is higher than the estimated cost of the parallel sequential scan).



    If you set enable_indexscan = off; with no data, you get a bitmap index scan (as I'd expect). If you set enable_bitmapscan = off; with no data, you get a (non-parallel) sequential scan.



            You should see the plan change back (with large amounts of data) if you set max_parallel_workers = 0;.
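    A sketch of how to run those experiments in a single session (the count query is just a stand-in; planner toggles are for diagnostics, not production):

    SET enable_indexscan = off;   -- planner should fall back to a bitmap index scan
    SET enable_bitmapscan = off;  -- with both off, only a sequential scan remains
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM dfo_by_quarter WHERE release_key = 17;
    RESET enable_indexscan;
    RESET enable_bitmapscan;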



    But looking at your query's explain results, I'd very much expect using the index to be more expensive and take longer than using the parallel sequential scan. In your updated query you're still scanning a very high percentage of the table and a large number of rows, and you're also forcing heap access by selecting fields that are not in the index. Postgres 11 (I believe) adds covering indexes, which would theoretically allow you to make this query be driven by the index alone, but I'm not at all convinced that in this example it would actually be worth it.
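    For completeness, a sketch of such a covering index on Postgres 11+ (index name invented; the key and INCLUDE columns simply mirror what the plan above touches):

    -- INCLUDE stores the extra columns in the index leaf pages, so an index-only
    -- scan can answer the query without heap visits, visibility map permitting.
    CREATE INDEX dfo_by_quarter_covering_idx
        ON dfo_by_quarter (release_key, year)
        INCLUDE (product_group_key, units);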






    edited Mar 8 at 13:02
    answered Mar 6 at 22:12
    jcoleman












            • I don't have access to the database right now so I won't be able to supply more details this evening. Coming from a SQL Server background, I understand the "indirect" aspect of the index and the same would apply to SQL Server if I referred to any column not covered by the index. But in my example, I made a point to do just a COUNT(*) that, in SQL Server, would be resolved without accessing the underlying table since the row count could be worked out from the index.

              – Eric Mamet
              Mar 6 at 22:31











            • @EricMamet I updated my answer for your updated question.

              – jcoleman
              Mar 8 at 13:03
















