Monday, September 08, 2008

Random selection, with a bias ...

Say you want to randomly select your employee of the month, but not so randomly, better, you'd like to give your best employees a bigger chance to be selected based on their rating.
This is just an example, you could be randomly displaying ads from your customers, but giving an higher chance to be displayed to those who are paying more, there can be a million other example, but I hope you got the sense of this.
In other terms this means to add some skew to your data, so that it's distribution is no more uniform.
Apart from the statistical implications of this, let's see the sql in action, I'll be using Firebird's Employee sample database, first of all we'll add a column to the employee table, holding the employee rating, then set the rating to 5 for employee 2, 10 for employee 4, 15 for employee 5, 1 for all other employees.
Now that we have rated each employee we need the magic to have them randomly selected with a bias ...
This is accomplished with a "sequence table" which will do just one thing, hold a list of numbers starting from the minimum rating value to the max rating (you'll probably want to preload it with a lot of numbers starting from 1).
Here is the sql:

CREATE TABLE CONSECUTIVE_NUMBER(
NUM Integer NOT NULL,
CONSTRAINT PK_CONSECUTIVE_NUMBER_1 PRIMARY KEY (NUM)
);

And load it:

insert Into Consecutive_number (num) Values (1);
Insert Into Consecutive_number (num) Values (2);
Insert Into Consecutive_number (num) Values (3);
Insert Into Consecutive_number (num) Values (4);
Insert Into Consecutive_number (num) Values (5);
Insert Into Consecutive_number (num) Values (6);
Insert Into Consecutive_number (num) Values (7);
Insert Into Consecutive_number (num) Values (8);
Insert Into Consecutive_number (num) Values (9);
Insert Into Consecutive_number (num) Values (10);
Insert Into Consecutive_number (num) Values (11);
Insert Into Consecutive_number (num) Values (12);
Insert Into Consecutive_number (num) Values (13);
Insert Into Consecutive_number (num) Values (14);
Insert Into Consecutive_number (num) Values (15);
Insert Into Consecutive_number (num) Values (16);
Insert Into Consecutive_number (num) Values (17);
Insert Into Consecutive_number (num) Values (18);
Insert Into Consecutive_number (num) Values (19);
Insert Into Consecutive_number (num) Values (20);

Now we are ready for the magic, we'll be joining our employee table with this sequence table, generating a specific cross join which will hold as many rows for each employee as it's rating, thus, when doing a random select on this data there will be a bigger chance to choose some employees, i.e. employee number 5 will have 15 times the chance to be selected than employee number 8 and 3 times the chances of employee number 2.
Here is the query, embedded in a view:

CREATE VIEW
RANDOM_SKEWED_EMPLOYEE (EMP_NO)
AS
/* write select statement here */
SELECT
FIRST 1
a.EMP_NO
FROM
EMPLOYEE a
INNER JOIN
CONSECUTIVE_NUMBER c
ON
c.NUM BETWEEN 1 AND a.RATING
ORDER BY
rand();

The join clause does the magic, as I said.
Now extract a small sample:

24
5
36
85
5
2
12
24
144
5
5
20
94
5
4
127
5
94
85
4
2
34
136
4
9

Let's see it's distribution, even on such a small sample you'll see that those 3 employees appear more often than others and with more or less the expected relative ratio:
Emp_no Times
2 5
4 9
5 15
9 4
14 1
20 1
28 2
29 2
36 1
37 1
52 2
71 1
72 1
83 1
113 1
114 1
118 2
121 1
127 1
138 2
144 1

Of course a larger sample will be closer to the desired result.
This example is buit on Firebird, but it should work in any database.
HTH

Note that PostgreSQL's generate_series() instruction might help a lot ;)

No comments: