so many trails ... so little time: Simulating procedural logic

Sometimes I see people having great difficulties in describing how to fetch data for a report.
They are unable to reason by sets and tend to describe things in procedural terms.
Here I'm posting a small example of how you can write a query that reproduces that procedural reasoning and lets the optimizer do the work of translating it into efficient SQL.
Say someone has a table structure like this, a main table named guys holding their id and name and two tables bads_attributes and goods_attributes, if you are a bad guy your attributes will be in the bads_attributes table and vice versa.
Looks ugly? It is, but you'll find it around, sooner or later :-(
And it's not even the worst case scenario, confronted with a similar requirement I've heard that stored procedures where proposed as the ideal solution.
The table structure:

DROP TABLE IF EXISTS `test`.`guys`;
CREATE TABLE  `test`.`guys` (
`guy_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`guy_name` varchar(45) NOT NULL,
PRIMARY KEY (`guy_id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS `test`.`bads_attributes`;
CREATE TABLE  `test`.`bads_attributes` (
`attribute_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`guy_id` int(10) unsigned NOT NULL,
`attribute_name` varchar(45) NOT NULL,
PRIMARY KEY (`attribute_id`,`guy_id`),
KEY `FK_bads_attributes_1` (`guy_id`),
CONSTRAINT `FK_bads_attributes_1` FOREIGN KEY (`guy_id`) REFERENCES `guys` (`guy_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS `test`.`goods_attributes`;
CREATE TABLE  `test`.`goods_attributes` (
`attribute_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`guy_id` int(10) unsigned NOT NULL,
`attribute_name` varchar(45) NOT NULL,
PRIMARY KEY (`attribute_id`,`guy_id`),
KEY `FK_goods_attributes_1` (`guy_id`),
CONSTRAINT `FK_goods_attributes_1` FOREIGN KEY (`guy_id`) REFERENCES `guys` (`guy_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;

Now let's retrieve those attributes with a query that mimics the logic described above

SELECT
 g.guy_id,
 g.guy_name,
 CASE
  WHEN EXISTS (SELECT 0 FROM bads_attributes b WHERE b.guy_id = g.guy_id)
   THEN (SELECT group_concat(b.attribute_name separator ', ') FROM bads_attributes b WHERE b.guy_id = g.guy_id GROUP BY b.guy_id)
  WHEN EXISTS (SELECT 0 FROM goods_attributes a WHERE a.guy_id = g.guy_id)
   THEN (SELECT group_concat(a.attribute_name separator ', ') FROM goods_attributes a WHERE a.guy_id = g.guy_id GROUP BY a.guy_id)
  ELSE 'no attributes for this guy'
 END right_attributes
FROM
 guys g

This goes after the reasoning described above, if you are found in the bads_attributes then your data is retrieved from there, the same for goods_attributes.
Output is

+--------+----------+----------------------------+
| guy_id | guy_name | right_attributes           |
+--------+----------+----------------------------+
|      1 | Paolo    | Fichissimo                 |
|      2 | Carlo    | Fico                       |
|      3 | Ciccio   | Ugly, Uglier               |
|      4 | Bender   | Ugliest                    |
|      5 | New kid  | no attributes for this guy |
+--------+----------+----------------------------+
5 rows in set (0.02 sec)

It's explain plan is

+----+--------------------+-------+------+-----------------------+-----------------------+---------+---------------+------+-------------+
| id | select_type        | table | type | possible_keys         | key         | key_len | ref           | rows | Extra       |
+----+--------------------+-------+------+-----------------------+-----------------------+---------+---------------+------+-------------+
|  1 | PRIMARY            | g     | ALL  | NULL                  | NULL         | NULL    | NULL          |    5 |             |
|  5 | DEPENDENT SUBQUERY | a     | ref  | FK_goods_attributes_1 | FK_goods_attributes_1 | 4       | test.g.guy_id |    1 | Using where |
|  4 | DEPENDENT SUBQUERY | a     | ref  | FK_goods_attributes_1 | FK_goods_attributes_1 | 4       | test.g.guy_id |    1 | Using index |
|  3 | DEPENDENT SUBQUERY | b     | ref  | FK_bads_attributes_1  | FK_bads_attributes_1  | 4       | test.g.guy_id |    1 | Using where |
|  2 | DEPENDENT SUBQUERY | b     | ref  | FK_bads_attributes_1  | FK_bads_attributes_1  | 4       | test.g.guy_id |    1 | Using index |
+----+--------------------+-------+------+-----------------------+-----------------------+---------+---------------+------+-------------+
5 rows in set (0.00 sec)

Well, it could be worse, note that I'm using InnoDB tables and I've declared foreing keys, whose indexes are picked up by the optimizer.

Let's a more SQLish version of this query

SELECT
 g.guy_id,
 g.guy_name,
 COALESCE(
  group_concat(b.attribute_name separator ', '),
  group_concat(a.attribute_name separator ', ')
 ) right_attributes
FROM 
 guys g LEFT OUTER JOIN bads_attributes b 
  ON g.guy_id = b.guy_id
 LEFT OUTER JOIN goods_attributes a 
  ON g.guy_id = a.guy_id
GROUP BY
 g.guy_id,
 g.guy_name

Output is the same

+--------+----------+---------------------------+
| guy_id | guy_name | right_attributes          |
+--------+----------+---------------------------+
|      1 | Paolo    | Fichissimo                |
|      2 | Carlo    | Fico                      |
|      3 | Ciccio   | Ugly, Uglier              |
|      4 | Bender   | Ugliest                   |
|      5 | New kid  | NULL                      |
+--------+----------+---------------------------+
5 rows in set (0.00 sec)

And the explain plan

+----+-------------+-------+------+-----------------------+-----------------------+---------+---------------+------+----------------+
| id | select_type | table | type | possible_keys         | key  | key_len | ref           | rows | Extra          |
+----+-------------+-------+------+-----------------------+-----------------------+---------+---------------+------+----------------+
|  1 | SIMPLE      | g     | ALL  | NULL                  | NULL  | NULL    | NULL          |    5 | Using filesort |
|  1 | SIMPLE      | b     | ref  | FK_bads_attributes_1  | FK_bads_attributes_1  | 4       | test.g.guy_id |    1 |                |
|  1 | SIMPLE      | a     | ref  | FK_goods_attributes_1 | FK_goods_attributes_1 | 4       | test.g.guy_id |    1 |                |
+----+-------------+-------+------+-----------------------+-----------------------+---------+---------------+------+----------------+
3 rows in set (0.00 sec)

Well it looks certainly better, but there's a nasty filesort (this is a very small data set) which should be checked against a much larger dataset.

Doing the same on Firebird 2.1 beta 2 (which supports LIST() a function similar to MySQL's group_concat()) leads to:

SELECT
 g.guy_id,
 g.guy_name,
 CASE
  WHEN EXISTS (SELECT 0 FROM bads_attributes b WHERE b.guy_id = g.guy_id)
   THEN (SELECT CAST(list(b.attribute_name) AS varchar(5000)) FROM bads_attributes b WHERE b.guy_id = g.guy_id 
GROUP BY b.guy_id
)
  WHEN EXISTS (SELECT 0 FROM goods_attributes a WHERE a.guy_id = g.guy_id)
   THEN (SELECT CAST(list(a.attribute_name) AS varchar(5000))  FROM goods_attributes a WHERE a.guy_id = g.guy_id 
GROUP BY a.guy_id
)
  ELSE 'no attributes for this guy'
 END right_attributes
FROM
 guys g

and

SELECT
 g.guy_id,
 g.guy_name,
 CAST(
  COALESCE(
   list(b.attribute_name),
   list(a.attribute_name)
 ) AS varchar(5000)) right_attributes
FROM 
 guys g
LEFT OUTER JOIN bads_attributes b 
 ON g.guy_id = b.guy_id
LEFT OUTER JOIN goods_attributes a 
 ON g.guy_id = a.guy_id
GROUP BY
 g.guy_id,
 g.guy_name

Note that both queries need an explicit cast as list's results in Firebird are blobs.
The respective explain plans show that the set oriented one is better.

Prepare time: 00:00:00.
Field #01: GUYS.GUY_ID Alias:GUY_ID Type:INTEGER
Field #02: GUYS.GUY_NAME Alias:GUY_NAME Type:STRING(45)
Field #03: . Alias:RIGHT_ATTRIBUTES Type:STRING(5000)
PLAN (B INDEX (FK_BADS_ATTRIBUTES_1))
PLAN (B ORDER FK_BADS_ATTRIBUTES_1 INDEX (FK_BADS_ATTRIBUTES_1))
PLAN (A INDEX (FK_GOODS_ATTRIBUTES_1))
PLAN (A ORDER FK_GOODS_ATTRIBUTES_1 INDEX (FK_GOODS_ATTRIBUTES_1))
PLAN (G NATURAL)


Executing...
Done.
116 fetches, 0 marks, 0 reads, 0 writes.
0 inserts, 0 updates, 0 deletes, 21 index, 5 seq.
Delta memory: 54852 bytes.
Execute time: 00:00:00.

and

Prepare time: 00:00:00.
Field #01: GUYS.GUY_ID Alias:GUY_ID Type:INTEGER
Field #02: GUYS.GUY_NAME Alias:GUY_NAME Type:STRING(45)
Field #03: . Alias:RIGHT_ATTRIBUTES Type:STRING(5000)
PLAN JOIN (SORT (JOIN (G NATURAL, B INDEX (FK_BADS_ATTRIBUTES_1))), A INDEX (FK_GOODS_ATTRIBUTES_1))


Executing...
Done.
66 fetches, 0 marks, 0 reads, 0 writes.
0 inserts, 0 updates, 0 deletes, 9 index, 5 seq.
Delta memory: 38168 bytes.
Execute time: 00:00:00.

2 comments:

Edwin F. López A. said...: Great!!!! I've been looking for a similar solution for ages!! Thank you. One question: Does Postgresql have a similar function for correlated querys??; 7:06 PM
pabloj said...: Yes, query should work unchanged on PostgreSQL apart from the LIST() function.; 7:12 PM

so many trails ... so little time

Sunday, March 09, 2008

Simulating procedural logic

2 comments:

About Me

Blog Archive

Links

Labels

Followers