Fetching random rows from a database table

August 24, 2009

Let’s assume that you have the following problem: there is a relation (table) in your database from which you want to fetch a random row.

Why care? It could be the product table, and you want to present random suggestions to your users. I can think of numerous other examples where fetching a random row is needed: maybe you want to fetch a random post from the blog_posts table, feature a random user, etc.

If you search the web for a solution, most people (e.g. [1], [2]) suggest a variant of the following code:

SELECT column FROM table
ORDER BY RAND()
LIMIT 1

My experience dictates that this solution cannot be fast. If you have a table with just 500 tuples, whichever solution you choose (even the worst option of all: bringing everything into memory and then choosing a random tuple) would not make a big difference in running time. But what happens when you have really large data sets? That’s when the problems begin. In fact, I am not the only one who believes that “ORDER BY RAND()” is the wrong approach to this problem ([3], [4], [5]).

I could just use the alternatives presented in the posts mentioned above, but I really like experimenting with this kind of problem, checking my own results and being 100% sure (you can never be sure with the numerous different implementations of DB engines out there). Also, in my case, I want to check generic solutions that (1) do not assume that the selection of the random tuples is based on the primary key and (2) can use attributes that may have holes in their domain (i.e. not all values from 1 to max(ID) are present in the table).

I’ll use a MySQL server, version 5.1 (the database most people outside big corporations use nowadays), and create a sufficiently large data set for my experiments. I define “sufficiently large” as a table that will allow me to see the differences between the possible queries, so a test table with 10,000,000 tuples and some dummy columns to increase the tuple size will be enough. Also, I’ll use a modified version of the code that was posted [here], as it is better presented than my initial tests.

This is a good example of why I do not trust results found on the web: even though the solution proposed by Peter is really fast, it does not by any means access only one tuple! Running the code given by Peter (and not my modified version), one can easily see that it accesses a lot more than one tuple.
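If you want to check this kind of claim yourself, one way (an aside, not part of the test harness used below) is to look at MySQL’s handler counters right after running the query under test:

FLUSH STATUS;  -- reset the session status counters
-- ... run the query under test here ...
SHOW SESSION STATUS LIKE 'Handler_read%';  -- how many rows the engine actually touched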

Let’s build a test table:

DELIMITER $$

DROP PROCEDURE IF EXISTS `testp` $$
CREATE PROCEDURE `testp`( IN size INT )
BEGIN
  DECLARE iterator_v INT DEFAULT 0;

  DROP TABLE IF EXISTS test;
  CREATE TABLE test (
    id INT NOT NULL AUTO_INCREMENT,
    a  INT NOT NULL,
    b  INT NOT NULL,
    c  INT NOT NULL,

    PRIMARY KEY (id)
  ) ENGINE=MyISAM;

  -- Insert `size` rows; 0 in the auto_increment column generates a new id.
  REPEAT
    INSERT INTO test
    VALUES ( 0, ROUND(1000000000*RAND(),0), ROUND(1000000000*RAND(),0), ROUND(1000000000*RAND(),0));
    SET iterator_v = iterator_v + 1;
  UNTIL iterator_v = size
  END REPEAT;

END $$

DELIMITER ;

Table test is just a dummy table with an ID and three additional columns to make the row size more than 4 bytes. I also have an InnoDB version of the script, but I would never suggest inserting 10,000,000 tuples into an InnoDB table just for testing (just create a MyISAM table and then convert it to InnoDB, or turn off indexes, logging, etc., but this is too much work for a simple test). Either way, I can assure you that the results are similar for both engines.
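For reference, the conversion mentioned in the parenthesis above is a one-liner (expect it to take a while on 10,000,000 rows, since the whole table is rebuilt):

ALTER TABLE test ENGINE=InnoDB;  -- rebuild the table in the new storage engine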

Attribute a is filled with random values from 0 to 1,000,000,000 and does not have a unique constraint, so I am going to use ‘a’ in my examples.

Let’s fill table test with 10 million random rows:

CALL testp(10000000);

The resulting table is 162MB in size and has a 98MB index on id, so I think that it should be sufficient for my experiments.

Fetching a single random row

Let’s see which one of the following queries is better for fetching a random row:

1. Q1 =

SELECT * FROM test ORDER BY RAND() LIMIT 1;

In my system, it needs ~7.5 seconds in order to return the results.

Why? If you execute “EXPLAIN Q1”, you’ll see that all 10,000,000 rows of the table are accessed in order to add a temporary RAND() column and then sort them by this column.
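To illustrate (exact output varies per version, but the pattern should hold):

EXPLAIN SELECT * FROM test ORDER BY RAND() LIMIT 1;
-- Expect type: ALL (a full table scan) with rows ~10,000,000 and
-- 'Using temporary; Using filesort' in the Extra column.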

2.  Q2 =

SELECT * FROM test
WHERE a >= (SELECT FLOOR( MAX(a) * RAND()) FROM test )
ORDER BY a LIMIT 1;

This is a query that I have seen somewhere, which is why I am presenting it here.

I didn’t even think for a moment about running it: if you execute “EXPLAIN Q2”, you’ll see that 10,000,000 * 10,000,000 rows must be accessed by the DB engine.
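The reason is visible in the plan (a sketch of what to look for):

EXPLAIN SELECT * FROM test
WHERE a >= (SELECT FLOOR( MAX(a) * RAND()) FROM test )
ORDER BY a LIMIT 1;
-- The inner query should show up as UNCACHEABLE SUBQUERY: because of
-- RAND() its result cannot be reused, so it is re-evaluated for every
-- row of the outer scan, hence the 10^7 * 10^7 row accesses.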

3. Q3a =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(a) FROM test )) AS a
) AS r2 ON r1.a >= r2.a
ORDER BY r1.a
LIMIT 1

In my system, it needs ~2.2 seconds in order to return the result. This is already much faster, but I am paying here for my decision to use an attribute other than the primary key! If the ID were not defined, or if attribute ‘a’ were the primary key, then I could send the following query to the DB:

Q3b =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(a) FROM test )) AS a
) AS r2 ON r1.a >= r2.a
LIMIT 1

In my modified DB (without ID in table test), it only needs ~0.83 seconds in order to return the result.

And what if I use the primary key (ID) in the original DB?

Q3c =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM test )) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 1

The result is always returned in less than 0.001 seconds (actually, 0.0003 seconds, but don’t trust any metric below 0.001 seconds)!

So why is query Q3c so much faster than Q3b? The index on ID is used instead of a full table scan 😉
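You can see it in the plan (again, a sketch; details vary):

EXPLAIN SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM test )) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 1;
-- r2 is a materialized derived table holding a single random ID, and r1
-- should show type: range on the PRIMARY key instead of a full scan.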

Lessons learned:

(a) Just use the primary key for the selection! Using another attribute offers nothing when fetching a single random row, while it significantly slows down the query…

(b) Repeat after me: indexes are the second best thing ever implemented in DB systems. Don’t forget to use them!
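As an aside that I have not benchmarked in these experiments: if you really must select on a non-key attribute such as ‘a’, a secondary index should give it the same advantage:

ALTER TABLE test ADD INDEX idx_a (a);  -- lets the r1.a >= r2.a condition use a range scan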

4. Splitting Q3a and Q3b into two queries in a stored procedure, as also proposed in [3]:
(There is no reason to split Q3c; it is so fast that it would only be slowed down by the function calls.)

DELIMITER $$

DROP PROCEDURE IF EXISTS `test`.`fetchRandom` $$
CREATE PROCEDURE `test`.`fetchRandom` ()
BEGIN
  DECLARE maxId INT DEFAULT 0;
  DECLARE randID INT DEFAULT 0;

  SELECT MAX(`a`)
  INTO maxId
  FROM `test`;

  SET randID = FLOOR(1 + (RAND() * (maxId - 1)));

  SELECT *
  FROM `test`
  WHERE `a` >= randID
  ORDER BY `a` ASC
  LIMIT 1;

END $$

DELIMITER ;

Q4 = CALL fetchRandom();

The above call results in (more or less) the same execution time as Q3a and Q3b. That’s why I believe there is no reason to go to the trouble of building the stored procedure.

Fetching multiple random rows

So, is my approach in Q3a (using an attribute other than the primary key) that bad? Not if you are interested in fetching multiple random rows with non-consecutive IDs!

If you are using the ID, then extending the fastest query Q3c to “LIMIT 5”:

Q5a =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM test )) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 5

will result not in 5 completely random tuples, but in 5 consecutive tuples starting from a random ID. For example, a random call returned:

id        a          b          c
3737792   417407769  478633193  140942687
3737793   268813889  921241413  799765428
3737794   235102934  776218416  175783308
3737795   550261325  223956729  468999704
3737796   673128362  958642727  773828580

In contrast, extending query Q3a:

Q5b =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(a) FROM test )) AS a
) AS r2 ON r1.a >= r2.a
ORDER BY r1.a
LIMIT 5

will result in 5 completely random tuples (with respect to their IDs), with an execution time of ~2.3 seconds. The stored procedure in Q4 can also be altered accordingly, but will result in more or less the same execution time as Q5b.

Finally, even though I like the approach presented in [5] from an academic point of view, I don’t really think that it is a realistic solution: the overhead of the two triggers presented there would slow down update operations on a really large data set. In contrast, the solution presented in Q5b could be implemented in any relational schema with minimal overhead:

a) Extend the table with an additional attribute RAND_Attr.

b) Assign a random value expected_cardinality_of_table*RAND() to attribute RAND_Attr on every insert (you need not worry about duplicate values, as Q5b handles them).
This step could be implemented in the corresponding stored procedure that the model uses in order to hide the database schema (always a wise approach), or through a ‘light’ trigger (more expensive, but better than hard-coding the assignment in the application code); see the sketch after this list.

c) Use query Q5b with ‘RAND_Attr’ instead of ‘a’ in order to fetch multiple random rows.
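Here is a minimal sketch of steps (a) and (b) for the test table of this post, assuming an expected cardinality of 10,000,000 (the column, index, and trigger names are illustrative):

ALTER TABLE test
  ADD COLUMN RAND_Attr INT NOT NULL DEFAULT 0,
  ADD INDEX idx_rand_attr (RAND_Attr);

DELIMITER $$
CREATE TRIGGER test_rand_bi BEFORE INSERT ON test
FOR EACH ROW
BEGIN
  -- Random value in [0, expected cardinality); duplicates are fine,
  -- as the >= comparison in Q5b handles them.
  SET NEW.RAND_Attr = FLOOR(10000000 * RAND());
END $$
DELIMITER ;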

So, to summarize: to the best of my knowledge, the best approaches for fetching random rows from a table TABLE, with a primary key ID (and an index on it) and an attribute RAND_Attr holding random values, are:

A. Fetching a single random row:

SELECT *
FROM TABLE r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM TABLE)) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 1

B. Fetching multiple random rows:

SELECT *
FROM TABLE r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(RAND_Attr) FROM TABLE )) AS RAND_Attr
) AS r2 ON r1.RAND_Attr >= r2.RAND_Attr
ORDER BY r1.RAND_Attr
LIMIT N

Or even better, since you know the maximum value MAX_RANDA_VAL that attribute RAND_Attr can take:

SELECT *
FROM TABLE r1
JOIN (
SELECT CEIL(RAND() *  MAX_RANDA_VAL ) AS RAND_Attr
) AS r2 ON r1.RAND_Attr >= r2.RAND_Attr
ORDER BY r1.RAND_Attr
LIMIT N

This second query is virtually equivalent to the first for two reasons: (a) RAND() is used both in the assignment of the values and in the query itself, and (b) a good RAND() function is random (sic), so after a couple of thousand inserted tuples the real maximum value of RAND_Attr will be close enough to MAX_RANDA_VAL that no real difference can be noticed between the two queries (the expected maximum of n uniform draws from [0, M] is M*n/(n+1), already within 0.1% of M for n = 1,000).

PS. Note that in MySQL 5.1.1 there is a bug and you cannot use variables in LIMIT clauses. A workaround for extending Q4 with a prepared statement follows:

CREATE PROCEDURE `test`.`fetchRandom` (IN _LIMIT INT)
BEGIN
  …

  PREPARE STMT FROM
  "SELECT ID
  INTO … your var here …
  FROM `test`
  WHERE `a` >= @randID
  ORDER BY `a` ASC
  LIMIT ?";
  SET @LIMIT = _LIMIT;
  EXECUTE STMT USING @LIMIT;
  DEALLOCATE PREPARE STMT;

  …
END
