PostgreSQL is not using a straightforward index



I have a PostgreSQL 10.6 database on Amazon RDS. My table is like this:



CREATE TABLE dfo_by_quarter (
    release_key int4 NOT NULL,
    country varchar(100) NOT NULL,
    product_group varchar(100) NOT NULL,
    distribution_type varchar(100) NOT NULL,
    "year" int2 NOT NULL,
    "date" date NULL,
    quarter int2 NOT NULL,
    category varchar(100) NOT NULL,
    units numeric(38,6) NOT NULL,
    sales_value_eur numeric(38,6) NOT NULL,
    sales_value_usd numeric(38,6) NOT NULL,
    sales_value_local numeric(38,6) NOT NULL,
    data_status bpchar(1) NOT NULL,
    panel_market_units numeric(38,6) NOT NULL,
    panel_market_sales_value_eur numeric(38,6) NOT NULL,
    panel_market_sales_value_usd numeric(38,6) NOT NULL,
    panel_market_sales_value_local numeric(38,6) NOT NULL,
    CONSTRAINT pk_dpretailer_dfo_by_quarter PRIMARY KEY (release_key, country, category, product_group, distribution_type, year, quarter),
    CONSTRAINT fk_dpretailer_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id)
);


I understand that a primary key implies a unique index.
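Indeed, psql confirms the index backing the primary key (output trimmed to the relevant part):

\d dpretailer.dfo_by_quarter
(column list omitted)
Indexes:
    "pk_dpretailer_dfo_by_quarter" PRIMARY KEY, btree (release_key, country, category, product_group, distribution_type, year, quarter)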



If I simply ask how many rows I have when filtering on non-existent data (release_key = 1 returns nothing), I can see that it uses the index:



EXPLAIN
SELECT COUNT(*)
FROM dpretailer.dfo_by_quarter
WHERE release_key = 1

Aggregate  (cost=6.32..6.33 rows=1 width=8)
  ->  Index Only Scan using pk_dpretailer_dfo_by_quarter on dfo_by_quarter  (cost=0.55..6.32 rows=1 width=0)
        Index Cond: (release_key = 1)


But if I run the same query on a value that does return data, it scans the table, which is bound to be more expensive:



EXPLAIN
SELECT COUNT(*)
FROM dpretailer.dfo_by_quarter
WHERE release_key = 2

Finalize Aggregate  (cost=47611.07..47611.08 rows=1 width=8)
  ->  Gather  (cost=47610.86..47611.07 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=46610.86..46610.87 rows=1 width=8)
              ->  Parallel Seq Scan on dfo_by_quarter  (cost=0.00..46307.29 rows=121428 width=0)
                    Filter: (release_key = 2)


I get that using the index when there is no data makes sense and is driven by the stats on the table (I ran ANALYZE before the tests).



But why is it not using my index when there is data?



Surely it must be quicker to scan part of an index (because release_key is the first column) than to scan the entire table?



I must be missing something...?



Update 2019-03-07



Thank you for your comments, which are very useful.



This simple query was just me trying to understand why the index was not used...



But I should have known better (I am new to PostgreSQL but have many years of experience with SQL Server), and it makes sense that the index is not used, as you commented:



  • bad selectivity, because my criterion matches about 20% of the rows

  • bad table design (too fat, which we knew and are now addressing)

  • the index not "covering" the query, etc.

So let me change my question slightly, if I may...



Our table will be normalised into facts/dimensions (no more varchars in the wrong place).



We only do inserts, never updates, and so few deletes that we can ignore them.



The table will not be huge (on the order of tens of millions of rows).



Our queries will ALWAYS specify an exact release_key value.



Our new version of the table would look like this:



CREATE TABLE dfo_by_quarter (
    release_key int4 NOT NULL,
    country_key int2 NOT NULL,
    product_group_key int2 NOT NULL,
    distribution_type_key int2 NOT NULL,
    category_key int2 NOT NULL,
    "year" int2 NOT NULL,
    "date" date NULL,
    quarter int2 NOT NULL,
    units numeric(38,6) NOT NULL,
    sales_value_eur numeric(38,6) NOT NULL,
    sales_value_usd numeric(38,6) NOT NULL,
    sales_value_local numeric(38,6) NOT NULL,
    CONSTRAINT pk_milly_dfo_by_quarter PRIMARY KEY (release_key, country_key, category_key, product_group_key, distribution_type_key, year, quarter),
    CONSTRAINT fk_milly_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id),
    CONSTRAINT fk_milly_dim_dfo_category FOREIGN KEY (category_key) REFERENCES milly.dim_dfo_category(category_key),
    CONSTRAINT fk_milly_dim_dfo_country FOREIGN KEY (country_key) REFERENCES milly.dim_dfo_country(country_key),
    CONSTRAINT fk_milly_dim_dfo_distribution_type FOREIGN KEY (distribution_type_key) REFERENCES milly.dim_dfo_distribution_type(distribution_type_key),
    CONSTRAINT fk_milly_dim_dfo_product_group FOREIGN KEY (product_group_key) REFERENCES milly.dim_dfo_product_group(product_group_key)
);


With that in mind, in a SQL Server environment I could solve this with a "clustered" primary key (the entire table being sorted), or with an index on the primary key using the INCLUDE option for the other columns required to cover the queries (units, values, etc.).



Question 1



In PostgreSQL, is there an equivalent to the SQL Server clustered index? A way to actually sort the entire table? I suppose it might be difficult, because PostgreSQL does not do updates "in place", which might make keeping the table sorted expensive...



Or, is there a way to create something like a SQL Server index WITH INCLUDE (units, values)?
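(For what it's worth: PostgreSQL 11 added covering indexes with an INCLUDE clause, the direct analogue of SQL Server's INCLUDE, though it is not available on my 10.6 instance. A sketch against the new table, with a made-up index name:)

-- Requires PostgreSQL 11+. The INCLUDE columns are stored only in the
-- index leaf pages, so an index-only scan can cover the aggregates.
-- The index name is made up for illustration.
CREATE INDEX idx_dfo_by_quarter_release_year
    ON milly.dfo_by_quarter (release_key, "year")
    INCLUDE (product_group_key, units, sales_value_eur);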



Update: I came across the CLUSTER command, which is the closest thing, I suppose. It would be suitable for us.



Question 2



With the query below



EXPLAIN (ANALYZE, BUFFERS)
WITH "rank_query" AS
(
    SELECT
        ROW_NUMBER() OVER (PARTITION BY "year" ORDER BY SUM("main"."units") DESC) AS "rank_by",
        "year",
        "main"."product_group_key" AS "productgroupkey",
        SUM("main"."units") AS "salesunits",
        SUM("main"."sales_value_eur") AS "salesvalue",
        SUM("sales_value_eur") / SUM("units") AS "asp"
    FROM "milly"."dfo_by_quarter" AS "main"
    WHERE
        "release_key" = 17 AND
        "main"."year" >= 2010
    GROUP BY
        "year",
        "main"."product_group_key"
)
, BeforeLookup AS
(
    SELECT
        "year" AS date,
        SUM("salesunits") AS "salesunits",
        SUM("salesvalue") AS "salesvalue",
        SUM("salesvalue") / SUM("salesunits") AS "asp",
        CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END AS "productgroupkey"
    FROM "rank_query"
    GROUP BY
        "year",
        CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END
)
SELECT BL.date, BL.salesunits, BL.salesvalue, BL.asp
FROM BeforeLookup AS BL
INNER JOIN milly.dim_dfo_product_group PG ON PG.product_group_key = BL.productgroupkey;


I get this plan:



Hash Join  (cost=40883.82..40896.46 rows=558 width=98) (actual time=676.565..678.308 rows=663 loops=1)
  Hash Cond: (bl.productgroupkey = pg.product_group_key)
  Buffers: shared hit=483 read=22719
  CTE rank_query
    ->  WindowAgg  (cost=40507.15..40632.63 rows=5577 width=108) (actual time=660.076..668.272 rows=5418 loops=1)
          Buffers: shared hit=480 read=22719
          ->  Sort  (cost=40507.15..40521.09 rows=5577 width=68) (actual time=660.062..661.226 rows=5418 loops=1)
                Sort Key: main.year, (sum(main.units)) DESC
                Sort Method: quicksort  Memory: 616kB
                Buffers: shared hit=480 read=22719
                ->  Finalize HashAggregate  (cost=40076.46..40160.11 rows=5577 width=68) (actual time=648.762..653.227 rows=5418 loops=1)
                      Group Key: main.year, main.product_group_key
                      Buffers: shared hit=480 read=22719
                      ->  Gather  (cost=38710.09..39909.15 rows=11154 width=68) (actual time=597.878..622.379 rows=11938 loops=1)
                            Workers Planned: 2
                            Workers Launched: 2
                            Buffers: shared hit=480 read=22719
                            ->  Partial HashAggregate  (cost=37710.09..37793.75 rows=5577 width=68) (actual time=594.044..600.494 rows=3979 loops=3)
                                  Group Key: main.year, main.product_group_key
                                  Buffers: shared hit=480 read=22719
                                  ->  Parallel Seq Scan on dfo_by_quarter main  (cost=0.00..36019.74 rows=169035 width=22) (actual time=106.916..357.071 rows=137171 loops=3)
                                        Filter: ((year >= 2010) AND (release_key = 17))
                                        Rows Removed by Filter: 546602
                                        Buffers: shared hit=480 read=22719
  CTE beforelookup
    ->  HashAggregate  (cost=223.08..238.43 rows=558 width=102) (actual time=676.293..677.167 rows=663 loops=1)
          Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
          Buffers: shared hit=480 read=22719
          ->  CTE Scan on rank_query  (cost=0.00..139.43 rows=5577 width=70) (actual time=660.079..672.978 rows=5418 loops=1)
                Buffers: shared hit=480 read=22719
  ->  CTE Scan on beforelookup bl  (cost=0.00..11.16 rows=558 width=102) (actual time=676.296..677.665 rows=663 loops=1)
        Buffers: shared hit=480 read=22719
  ->  Hash  (cost=7.34..7.34 rows=434 width=4) (actual time=0.253..0.253 rows=435 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 24kB
        Buffers: shared hit=3
        ->  Seq Scan on dim_dfo_product_group pg  (cost=0.00..7.34 rows=434 width=4) (actual time=0.017..0.121 rows=435 loops=1)
              Buffers: shared hit=3
Planning time: 0.319 ms
Execution time: 678.714 ms


Does anything spring to mind?



If I read it properly, my biggest cost by far is the initial scan of the table... but I don't manage to make it use an index...



I had created an index I hoped would help, but it got ignored...



CREATE INDEX eric_silly_index ON milly.dfo_by_quarter(release_key, YEAR, date, product_group_key, units, sales_value_eur);

ANALYZE milly.dfo_by_quarter;


I also tried to cluster the table, but with no visible effect either.



CLUSTER milly.dfo_by_quarter USING pk_milly_dfo_by_quarter; -- took 30 seconds (uidev)

ANALYZE milly.dfo_by_quarter;


Many thanks



Eric










  • How many rows in your table? How many with release_key = 2? Is there (concurrent) write load on the table? Do you see an index-only scan after VACUUM dpretailer.dfo_by_quarter?

    – Erwin Brandstetter
    Mar 6 at 18:44











  • Also, can you replace the EXPLAINs with output from EXPLAIN (ANALYZE, BUFFERS)? That would give us concrete timings and shared cache hits/misses.

    – Ancoron
    Mar 6 at 18:48















postgresql indexing amazon-rds postgresql-performance






asked Mar 6 at 17:50 by Eric Mamet, edited Mar 7 at 22:59



3 Answers


















Generally, while possible, a PK spanning 7 columns, several of them varchar(100), is not optimized for performance, to say the least.



Such an index is large to begin with and tends to bloat quickly if you have updates on the involved columns.



I would operate with a surrogate PK, a serial (or bigserial if you have that many rows). Or IDENTITY. See:



  • Auto increment table column

And a UNIQUE constraint on all 7 columns to enforce uniqueness (they are all NOT NULL anyway).



If you have lots of counting queries with the only predicate on release_key, consider an additional plain btree index on just that column.



The data type varchar(100) for so many columns may not be optimal. Some normalization might help.



More advice depends on the missing information ...
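A minimal sketch of that layout, with made-up constraint and index names (the measure columns stay as in the original table):

CREATE TABLE dfo_by_quarter (
    dfo_by_quarter_id bigserial PRIMARY KEY,  -- surrogate key
    release_key       int4 NOT NULL,
    country           varchar(100) NOT NULL,
    category          varchar(100) NOT NULL,
    product_group     varchar(100) NOT NULL,
    distribution_type varchar(100) NOT NULL,
    "year"            int2 NOT NULL,
    quarter           int2 NOT NULL,
    -- measure columns unchanged from the original table
    CONSTRAINT uq_dfo_by_quarter
        UNIQUE (release_key, country, category, product_group, distribution_type, year, quarter)
);

-- Plain btree index for counting queries whose only predicate is release_key:
CREATE INDEX idx_dfo_by_quarter_release_key ON dfo_by_quarter (release_key);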






answered Mar 6 at 18:50 by Erwin Brandstetter

  • Voting +1 for this answer, since adding an index on the filter column increases the chance of getting an index-only scan. Once that is in place (remember that creating a foreign key does not create an index), the primary key should be irrelevant.

    – Ancoron
    Mar 6 at 19:08











  • * I ran this particular query purely as an exercise to try and get an index scan. The original query was far more complex. * In "real scenarios", I need the other columns. * There was no other operation concurrently * There is probably about 300,000 rows for this release_key value and a few millions overall * At this stage, my table is "fat" (varchar columns) because it was a quick and dirty implementation but I'll make it look more like fact/dimensions very soon (therefore surrogate keys instead of varchars)

    – Eric Mamet
    Mar 6 at 22:23












  • @EricMamet: Then the question might be misleading. All details matter. Postgres decides the query plan based on estimated costs. "A few millions" is too fuzzy. Could be 2 or 9 million, which makes a big difference.

    – Erwin Brandstetter
    Mar 6 at 22:28












  • @ErwinBrandstetter I understand that, and I'll come back with more details tomorrow. In SQL Server, a request like this would always use the index, because it would not need to look at the table at all and would only sequentially scan a subset of the index (regardless of the number of rows). I was wondering whether PostgreSQL would behave the same. Obviously not!

    – Eric Mamet
    Mar 6 at 22:43











  • @EricMamet: It depends. For index-only scans like we see in your first query plan, some preconditions have to be met.

    – Erwin Brandstetter
    Mar 7 at 0:13
































The answer to my initial question (why is PostgreSQL not using my index for something like SELECT COUNT(*)?) can be found in the documentation...



Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT



In particular: "This means that every time a row is read from an index, the engine has to also read the actual row in the table to ensure that the row hasn't been deleted."



This explains a lot about why I don't manage to get PostgreSQL to use my indexes when, from a SQL Server perspective, it obviously "should".
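Concretely, that heap check is what the visibility map avoids: after a VACUUM, pages known to be all-visible can be answered from the index alone. A quick experiment along the lines suggested in the comments:

VACUUM ANALYZE milly.dfo_by_quarter;

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM milly.dfo_by_quarter
WHERE release_key = 17;

-- On an "Index Only Scan" node, "Heap Fetches: 0" means the visibility map
-- let the count be answered from the index without touching the table.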






Because release_key isn't actually a unique column, it's not possible, from the information you've provided, to know whether or not the index should be used. If a high percentage of rows have release_key = 2, or even if a smaller percentage of rows match on a large table, it may not be efficient to use the index.



In part this is because Postgres indexes are indirect -- that is, the index actually contains a pointer to the location on disk in the heap where the real tuple lives. So looping through an index requires reading an entry from the index, reading the tuple from the heap, and repeating. For a large number of tuples it's often more valuable to scan the heap directly and avoid the indirect disk access penalty.



Edit:
You generally don't want to be using CLUSTER in PostgreSQL; it's not how indexes are maintained, and it's rare to see it in the wild for that reason.



Your updated query with no data gives this plan:



                                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on beforelookup bl  (cost=8.33..8.35 rows=1 width=98) (actual time=0.143..0.143 rows=0 loops=1)
  Buffers: shared hit=4
  CTE rank_query
    ->  WindowAgg  (cost=8.24..8.26 rows=1 width=108) (actual time=0.126..0.126 rows=0 loops=1)
          Buffers: shared hit=4
          ->  Sort  (cost=8.24..8.24 rows=1 width=68) (actual time=0.060..0.061 rows=0 loops=1)
                Sort Key: main.year, (sum(main.units)) DESC
                Sort Method: quicksort  Memory: 25kB
                Buffers: shared hit=4
                ->  GroupAggregate  (cost=8.19..8.23 rows=1 width=68) (actual time=0.011..0.011 rows=0 loops=1)
                      Group Key: main.year, main.product_group_key
                      Buffers: shared hit=1
                      ->  Sort  (cost=8.19..8.19 rows=1 width=64) (actual time=0.011..0.011 rows=0 loops=1)
                            Sort Key: main.year, main.product_group_key
                            Sort Method: quicksort  Memory: 25kB
                            Buffers: shared hit=1
                            ->  Index Scan using pk_milly_dfo_by_quarter on dfo_by_quarter main  (cost=0.15..8.18 rows=1 width=64) (actual time=0.003..0.003 rows=0 loops=1)
                                  Index Cond: ((release_key = 17) AND (year >= 2010))
                                  Buffers: shared hit=1
  CTE beforelookup
    ->  HashAggregate  (cost=0.04..0.07 rows=1 width=102) (actual time=0.128..0.128 rows=0 loops=1)
          Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
          Buffers: shared hit=4
          ->  CTE Scan on rank_query  (cost=0.00..0.03 rows=1 width=70) (actual time=0.127..0.127 rows=0 loops=1)
                Buffers: shared hit=4
Planning Time: 0.723 ms
Execution Time: 0.485 ms
(27 rows)


So PostgreSQL is entirely capable of using the index for your query, but the planner is deciding that it's not worth it (i.e., the cost of using the index directly is higher than the cost of the parallel sequential scan).



If you set enable_indexscan = off; with no data, you get a bitmap index scan (as I'd expect). If you set enable_bitmapscan = off; with no data, you get a (non-parallel) sequential scan.



You should see the plan change back (with large amounts of data) if you set max_parallel_workers = 0;.
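For reference, those experiments can be run safely at the session level and undone with RESET, e.g.:

SET enable_indexscan = off;    -- planner falls back to a bitmap index scan
SET enable_bitmapscan = off;   -- now only a (non-parallel) seq scan remains
SET max_parallel_workers = 0;  -- disallows parallel workers for this session

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM milly.dfo_by_quarter WHERE release_key = 17;

RESET enable_indexscan;
RESET enable_bitmapscan;
RESET max_parallel_workers;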



But looking at your query's EXPLAIN results, I'd very much expect using the index to be more expensive and to take longer than the parallel sequential scan. In your updated query you're still scanning a very high percentage of the table and a large number of rows, and you're also forcing heap access by referencing fields not in the index. Postgres 11 (I believe) adds covering indexes, which would theoretically allow you to drive this query from the index alone, but I'm not at all convinced it would actually be worth it in this example.






  • I don't have access to the database right now, so I won't be able to supply more details this evening. Coming from a SQL Server background, I understand the "indirect" aspect of the index, and the same would apply to SQL Server if I referred to any column not covered by the index. But in my example, I made a point of doing just a COUNT(*), which in SQL Server would be resolved without accessing the underlying table, since the row count could be worked out from the index.

    – Eric Mamet
    Mar 6 at 22:31











  • @EricMamet I updated my answer for your updated question.

    – jcoleman
    Mar 8 at 13:03











    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1














    Generally, while possible, a PK spanning 7 columns, several of which being varchar(100) is not optimized for performance, to say the least.



    Such an index is large to begin with and tends to bloat quickly, if you have updates on involved columns.



    I would operate with a surrogate PK, a serial (or bigserial if you have that many rows). Or IDENTITY. See:



    • Auto increment table column

    And a UNIQUE constraint on all 7 to enforce uniqueness (all are NOT NULL anyway).



    If you have lots of counting queries with the only predicate on release_key consider an additional plain btree index on just that column.



    The data type varchar(100) for so many columns may not be optimal. Some normalization might help.



    More advise depends on missing information ...






    share|improve this answer























    • Voting +1 for this answer since adding an index on the filter column increases the chance of getting an index-only scan. Once that is in-place (remember that creating a foreign key does not create and index), the primary key should be irrelevant.

      – Ancoron
      Mar 6 at 19:08











    • * I ran this particular query purely as an exercise to try and get an index scan. The original query was far more complex. * In "real scenarios", I need the other columns. * There was no other operation concurrently * There is probably about 300,000 rows for this release_key value and a few millions overall * At this stage, my table is "fat" (varchar columns) because it was a quick and dirty implementation but I'll make it look more like fact/dimensions very soon (therefore surrogate keys instead of varchars)

      – Eric Mamet
      Mar 6 at 22:23












    • @EricMamet: Then the question might be misleading. All details matter. Postgres decides the query plan based on estimated costs. "A few millions" is too fuzzy. Could be 2 or 9 millions, which makes a big difference.

      – Erwin Brandstetter
      Mar 6 at 22:28












    • @ErwinBrandstetter I understand that and I'll come back with more details tomorrow. In SQL Server, a request like this would always use the index because it would not need to look at the table at all and would only sequentially scan a subset of the index (regardless of the number of rows). I was wondering whether postgresql would behave the same. Obviously not!

      – Eric Mamet
      Mar 6 at 22:43











    • @EricMamet: It depends. For index-only scans like we see in your first query plan, some preconditions have to be met.

      – Erwin Brandstetter
      Mar 7 at 0:13















    1














    Generally, while possible, a PK spanning 7 columns, several of which being varchar(100) is not optimized for performance, to say the least.



    Such an index is large to begin with and tends to bloat quickly, if you have updates on involved columns.



    I would operate with a surrogate PK, a serial (or bigserial if you have that many rows). Or IDENTITY. See:



    • Auto increment table column

    And a UNIQUE constraint on all 7 to enforce uniqueness (all are NOT NULL anyway).



    If you have lots of counting queries with the only predicate on release_key consider an additional plain btree index on just that column.



    The data type varchar(100) for so many columns may not be optimal. Some normalization might help.



    More advise depends on missing information ...






    share|improve this answer























    • Voting +1 for this answer since adding an index on the filter column increases the chance of getting an index-only scan. Once that is in-place (remember that creating a foreign key does not create and index), the primary key should be irrelevant.

      – Ancoron
      Mar 6 at 19:08











    • * I ran this particular query purely as an exercise to try and get an index scan. The original query was far more complex. * In "real scenarios", I need the other columns. * There was no other operation concurrently * There is probably about 300,000 rows for this release_key value and a few millions overall * At this stage, my table is "fat" (varchar columns) because it was a quick and dirty implementation but I'll make it look more like fact/dimensions very soon (therefore surrogate keys instead of varchars)

      – Eric Mamet
      Mar 6 at 22:23












    • @EricMamet: Then the question might be misleading. All details matter. Postgres decides the query plan based on estimated costs. "A few millions" is too fuzzy. Could be 2 or 9 millions, which makes a big difference.

      – Erwin Brandstetter
      Mar 6 at 22:28












    • @ErwinBrandstetter I understand that and I'll come back with more details tomorrow. In SQL Server, a request like this would always use the index because it would not need to look at the table at all and would only sequentially scan a subset of the index (regardless of the number of rows). I was wondering whether postgresql would behave the same. Obviously not!

      – Eric Mamet
      Mar 6 at 22:43











    • @EricMamet: It depends. For index-only scans like we see in your first query plan, some preconditions have to be met.

      – Erwin Brandstetter
      Mar 7 at 0:13













    0














    The answer to my initial question (why is PostgreSQL not using my index for something like SELECT COUNT(*)) can be found in the documentation:

    Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT

    In particular: "This means that every time a row is read from an index, the engine has to also read the actual row in the table to ensure that the row hasn't been deleted."

    This goes a long way toward explaining why I can't get PostgreSQL to use my indexes when, from a SQL Server perspective, it obviously "should".
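
    One way (illustrative, not from the original post) to see this first-hand, assuming the plain index on release_key from the earlier answer exists: VACUUM the table so its visibility map is up to date, then check whether the plan switches to an index-only scan:

        -- Hypothetical experiment: a recent VACUUM refreshes the visibility
        -- map, which is a precondition for index-only scans.
        VACUUM (ANALYZE) dfo_by_quarter;
        EXPLAIN (ANALYZE, BUFFERS)
        SELECT COUNT(*) FROM dfo_by_quarter WHERE release_key = 2;
        -- Look for "Index Only Scan" with "Heap Fetches: 0"; without a
        -- recent VACUUM, Postgres must still visit the heap rows to check
        -- visibility, which is exactly the behavior quoted above.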






    answered Mar 8 at 12:14

    Eric Mamet
















            0














            Because release_key isn't actually a unique column, it's not possible, from the information you've provided, to know whether the index should be used. If a high percentage of rows have release_key = 2, or even a smaller percentage of rows match on a large table, it may not be efficient to use the index.



            In part this is because Postgres indexes are indirect -- that is, the index contains a pointer to the location on disk in the heap where the real tuple lives. So looping through an index requires reading an entry from the index, then reading the tuple from the heap, and repeating. For a large number of tuples it's often cheaper to scan the heap directly and avoid the indirect disk access penalty.



            Edit:
            You generally don't want to be using CLUSTER in PostgreSQL; it's not how indexes are maintained, and it's rare to see that in the wild for that reason.



            Your updated query with no data gives this plan:



                                                                                    QUERY PLAN
            ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
             CTE Scan on beforelookup bl  (cost=8.33..8.35 rows=1 width=98) (actual time=0.143..0.143 rows=0 loops=1)
               Buffers: shared hit=4
               CTE rank_query
                 ->  WindowAgg  (cost=8.24..8.26 rows=1 width=108) (actual time=0.126..0.126 rows=0 loops=1)
                       Buffers: shared hit=4
                       ->  Sort  (cost=8.24..8.24 rows=1 width=68) (actual time=0.060..0.061 rows=0 loops=1)
                             Sort Key: main.year, (sum(main.units)) DESC
                             Sort Method: quicksort  Memory: 25kB
                             Buffers: shared hit=4
                             ->  GroupAggregate  (cost=8.19..8.23 rows=1 width=68) (actual time=0.011..0.011 rows=0 loops=1)
                                   Group Key: main.year, main.product_group_key
                                   Buffers: shared hit=1
                                   ->  Sort  (cost=8.19..8.19 rows=1 width=64) (actual time=0.011..0.011 rows=0 loops=1)
                                         Sort Key: main.year, main.product_group_key
                                         Sort Method: quicksort  Memory: 25kB
                                         Buffers: shared hit=1
                                         ->  Index Scan using pk_milly_dfo_by_quarter on dfo_by_quarter main  (cost=0.15..8.18 rows=1 width=64) (actual time=0.003..0.003 rows=0 loops=1)
                                               Index Cond: ((release_key = 17) AND (year >= 2010))
                                               Buffers: shared hit=1
               CTE beforelookup
                 ->  HashAggregate  (cost=0.04..0.07 rows=1 width=102) (actual time=0.128..0.128 rows=0 loops=1)
                       Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
                       Buffers: shared hit=4
                       ->  CTE Scan on rank_query  (cost=0.00..0.03 rows=1 width=70) (actual time=0.127..0.127 rows=0 loops=1)
                             Buffers: shared hit=4
             Planning Time: 0.723 ms
             Execution Time: 0.485 ms
            (27 rows)


            So PostgreSQL is entirely capable of using the index for your query, but the planner is deciding that it's not worth it (i.e., the cost of using the index directly is estimated to be higher than the cost of the parallel sequential scan).



            If you set enable_indexscan = off; with no data, you get a bitmap index scan (as I'd expect). If you set enable_bitmapscan = off; with no data, you get a (non-parallel) sequential scan.



            You should see the plan change back (with large amounts of data) if you set max_parallel_workers = 0;.
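
            For instance, such experiments can be kept local to a single transaction (these are standard planner settings, shown purely as a diagnostic, not a production change):

                -- Illustrative diagnostic: forbid plain index scans for this
                -- transaction only and compare the resulting plan.
                BEGIN;
                SET LOCAL enable_indexscan = off;
                EXPLAIN (ANALYZE, BUFFERS)
                SELECT COUNT(*) FROM dfo_by_quarter WHERE release_key = 2;
                ROLLBACK;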



            But looking at your query's explain results, I'd very much expect using the index to be more expensive and take longer than the parallel sequential scan. In your updated query you're still scanning a very high percentage of the table and a large number of rows, and you're also forcing heap access by selecting fields not in the index. Postgres 11 (I believe) adds covering indexes, which would theoretically allow this query to be driven by the index alone, but I'm not at all convinced it would actually be worth it in this example.
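
            For reference, a covering index on Postgres 11+ might look like the sketch below; the index name and INCLUDE list are illustrative, picked from the columns visible in the plan above:

                -- Hypothetical covering index (PostgreSQL 11+): the key columns
                -- drive the search, and the INCLUDE-d columns let the scan avoid
                -- heap access for the aggregated fields.
                CREATE INDEX idx_dfo_by_quarter_covering
                    ON dfo_by_quarter (release_key, year)
                    INCLUDE (product_group_key, units);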






            edited Mar 8 at 13:02

            answered Mar 6 at 22:12

            jcoleman
            • I don't have access to the database right now so I won't be able to supply more details this evening. Coming from a SQL Server background, I understand the "indirect" aspect of the index and the same would apply to SQL Server if I referred to any column not covered by the index. But in my example, I made a point to do just a COUNT(*) that, in SQL Server, would be resolved without accessing the underlying table since the row count could be worked out from the index.

              – Eric Mamet
              Mar 6 at 22:31











            • @EricMamet I updated my answer for your updated question.

              – jcoleman
              Mar 8 at 13:03




















