Sunday, March 09, 2008

Simulating procedural logic

Sometimes I see people having great difficulties in describing how to fetch data for a report.
They are unable to reason by sets and tend to describe things in procedural terms.
Here I'm posting a small example of how you can write a query that reproduces that procedural reasoning and lets the optimizer do the work of translating it into efficient SQL.
Say someone has a table structure like this, a main table named guys holding their id and name and two tables bads_attributes and goods_attributes, if you are a bad guy your attributes will be in the bads_attributes table and vice versa.
Looks ugly? It is, but you'll find it around, sooner or later :-(
And it's not even the worst case scenario, confronted with a similar requirement I've heard that stored procedures where proposed as the ideal solution.
The table structure:

DROP TABLE IF EXISTS `test`.`guys`;
CREATE TABLE `test`.`guys` (
`guy_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`guy_name` varchar(45) NOT NULL,
PRIMARY KEY (`guy_id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS `test`.`bads_attributes`;
CREATE TABLE `test`.`bads_attributes` (
`attribute_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`guy_id` int(10) unsigned NOT NULL,
`attribute_name` varchar(45) NOT NULL,
PRIMARY KEY (`attribute_id`,`guy_id`),
KEY `FK_bads_attributes_1` (`guy_id`),
CONSTRAINT `FK_bads_attributes_1` FOREIGN KEY (`guy_id`) REFERENCES `guys` (`guy_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS `test`.`goods_attributes`;
CREATE TABLE `test`.`goods_attributes` (
`attribute_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`guy_id` int(10) unsigned NOT NULL,
`attribute_name` varchar(45) NOT NULL,
PRIMARY KEY (`attribute_id`,`guy_id`),
KEY `FK_goods_attributes_1` (`guy_id`),
CONSTRAINT `FK_goods_attributes_1` FOREIGN KEY (`guy_id`) REFERENCES `guys` (`guy_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;



Now let's retrieve those attributes with a query that mimics the logic described above

  1. SELECT
  2. g.guy_id,
  3. g.guy_name,
  4. CASE
  5. WHEN EXISTS (SELECT 0 FROM bads_attributes b WHERE b.guy_id = g.guy_id)
  6. THEN (SELECT group_concat(b.attribute_name separator ', ') FROM bads_attributes b WHERE b.guy_id = g.guy_id GROUP BY b.guy_id)
  7. WHEN EXISTS (SELECT 0 FROM goods_attributes a WHERE a.guy_id = g.guy_id)
  8. THEN (SELECT group_concat(a.attribute_name separator ', ') FROM goods_attributes a WHERE a.guy_id = g.guy_id GROUP BY a.guy_id)
  9. ELSE 'no attributes for this guy'
  10. END right_attributes
  11. FROM
  12. guys g

This goes after the reasoning described above, if you are found in the bads_attributes then your data is retrieved from there, the same for goods_attributes.
Output is

+--------+----------+----------------------------+
| guy_id | guy_name | right_attributes |
+--------+----------+----------------------------+
| 1 | Paolo | Fichissimo |
| 2 | Carlo | Fico |
| 3 | Ciccio | Ugly, Uglier |
| 4 | Bender | Ugliest |
| 5 | New kid | no attributes for this guy |
+--------+----------+----------------------------+
5 rows in set (0.02 sec)

It's explain plan is

+----+--------------------+-------+------+-----------------------+-----------------------+---------+---------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+------+-----------------------+-----------------------+---------+---------------+------+-------------+
| 1 | PRIMARY | g | ALL | NULL | NULL | NULL | NULL | 5 | |
| 5 | DEPENDENT SUBQUERY | a | ref | FK_goods_attributes_1 | FK_goods_attributes_1 | 4 | test.g.guy_id | 1 | Using where |
| 4 | DEPENDENT SUBQUERY | a | ref | FK_goods_attributes_1 | FK_goods_attributes_1 | 4 | test.g.guy_id | 1 | Using index |
| 3 | DEPENDENT SUBQUERY | b | ref | FK_bads_attributes_1 | FK_bads_attributes_1 | 4 | test.g.guy_id | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | b | ref | FK_bads_attributes_1 | FK_bads_attributes_1 | 4 | test.g.guy_id | 1 | Using index |
+----+--------------------+-------+------+-----------------------+-----------------------+---------+---------------+------+-------------+
5 rows in set (0.00 sec)

Well, it could be worse, note that I'm using InnoDB tables and I've declared foreing keys, whose indexes are picked up by the optimizer.

Let's a more SQLish version of this query

  1. SELECT
  2. g.guy_id,
  3. g.guy_name,
  4. COALESCE(
  5. group_concat(b.attribute_name separator ', '),
  6. group_concat(a.attribute_name separator ', ')
  7. ) right_attributes
  8. FROM
  9. guys g LEFT OUTER JOIN bads_attributes b
  10. ON g.guy_id = b.guy_id
  11. LEFT OUTER JOIN goods_attributes a
  12. ON g.guy_id = a.guy_id
  13. GROUP BY
  14. g.guy_id,
  15. g.guy_name

Output is the same

+--------+----------+---------------------------+
| guy_id | guy_name | right_attributes |
+--------+----------+---------------------------+
| 1 | Paolo | Fichissimo |
| 2 | Carlo | Fico |
| 3 | Ciccio | Ugly, Uglier |
| 4 | Bender | Ugliest |
| 5 | New kid | NULL |
+--------+----------+---------------------------+
5 rows in set (0.00 sec)

And the explain plan

+----+-------------+-------+------+-----------------------+-----------------------+---------+---------------+------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-----------------------+-----------------------+---------+---------------+------+----------------+
| 1 | SIMPLE | g | ALL | NULL | NULL | NULL | NULL | 5 | Using filesort |
| 1 | SIMPLE | b | ref | FK_bads_attributes_1 | FK_bads_attributes_1 | 4 | test.g.guy_id | 1 | |
| 1 | SIMPLE | a | ref | FK_goods_attributes_1 | FK_goods_attributes_1 | 4 | test.g.guy_id | 1 | |
+----+-------------+-------+------+-----------------------+-----------------------+---------+---------------+------+----------------+
3 rows in set (0.00 sec)

Well it looks certainly better, but there's a nasty filesort (this is a very small data set) which should be checked against a much larger dataset.

Doing the same on Firebird 2.1 beta 2 (which supports LIST() a function similar to MySQL's group_concat()) leads to:

  1. SELECT
  2. g.guy_id,
  3. g.guy_name,
  4. CASE
  5. WHEN EXISTS (SELECT 0 FROM bads_attributes b WHERE b.guy_id = g.guy_id)
  6. THEN (SELECT CAST(list(b.attribute_name) AS varchar(5000)) FROM bads_attributes b WHERE b.guy_id = g.guy_id
  7. GROUP BY b.guy_id
  8. )
  9. WHEN EXISTS (SELECT 0 FROM goods_attributes a WHERE a.guy_id = g.guy_id)
  10. THEN (SELECT CAST(list(a.attribute_name) AS varchar(5000)) FROM goods_attributes a WHERE a.guy_id = g.guy_id
  11. GROUP BY a.guy_id
  12. )
  13. ELSE 'no attributes for this guy'
  14. END right_attributes
  15. FROM
  16. guys g

and

  1. SELECT
  2. g.guy_id,
  3. g.guy_name,
  4. CAST(
  5. COALESCE(
  6. list(b.attribute_name),
  7. list(a.attribute_name)
  8. ) AS varchar(5000)) right_attributes
  9. FROM
  10. guys g
  11. LEFT OUTER JOIN bads_attributes b
  12. ON g.guy_id = b.guy_id
  13. LEFT OUTER JOIN goods_attributes a
  14. ON g.guy_id = a.guy_id
  15. GROUP BY
  16. g.guy_id,
  17. g.guy_name


Note that both queries need an explicit cast as list's results in Firebird are blobs.
The respective explain plans show that the set oriented one is better.

Prepare time: 00:00:00.
Field #01: GUYS.GUY_ID Alias:GUY_ID Type:INTEGER
Field #02: GUYS.GUY_NAME Alias:GUY_NAME Type:STRING(45)
Field #03: . Alias:RIGHT_ATTRIBUTES Type:STRING(5000)
PLAN (B INDEX (FK_BADS_ATTRIBUTES_1))
PLAN (B ORDER FK_BADS_ATTRIBUTES_1 INDEX (FK_BADS_ATTRIBUTES_1))
PLAN (A INDEX (FK_GOODS_ATTRIBUTES_1))
PLAN (A ORDER FK_GOODS_ATTRIBUTES_1 INDEX (FK_GOODS_ATTRIBUTES_1))
PLAN (G NATURAL)


Executing...
Done.
116 fetches, 0 marks, 0 reads, 0 writes.
0 inserts, 0 updates, 0 deletes, 21 index, 5 seq.
Delta memory: 54852 bytes.
Execute time: 00:00:00.

and

Prepare time: 00:00:00.
Field #01: GUYS.GUY_ID Alias:GUY_ID Type:INTEGER
Field #02: GUYS.GUY_NAME Alias:GUY_NAME Type:STRING(45)
Field #03: . Alias:RIGHT_ATTRIBUTES Type:STRING(5000)
PLAN JOIN (SORT (JOIN (G NATURAL, B INDEX (FK_BADS_ATTRIBUTES_1))), A INDEX (FK_GOODS_ATTRIBUTES_1))


Executing...
Done.
66 fetches, 0 marks, 0 reads, 0 writes.
0 inserts, 0 updates, 0 deletes, 9 index, 5 seq.
Delta memory: 38168 bytes.
Execute time: 00:00:00.

2 comments:

Edwin F. López A. said...

Great!!!! I've been looking for a similar solution for ages!! Thank you. One question: Does Postgresql have a similar function for correlated querys??

pabloj said...

Yes, query should work unchanged on PostgreSQL apart from the LIST() function.