Archive

Posts Tagged ‘Databases’

(web design) Clients from hell

January 5, 2010

Lisa just sent me a blog with a very funny collection of anonymously contributed client horror stories from designers:

http://clientsfromhell.tumblr.com

The really funny part about those stories is that anyone who has worked with non-technical clients has heard more or less the same kind of silly comments:

  • Client: We like the design, but could you make the blues all the same?
    Me: It’s the same blue throughout the design.
    Client: It looks like different blues.
    Me: That’s because colors are perceived differently depending on neighboring colors.
    Client: That’s stupid.
  • Me: A basic web page will cost around $1000.
    Client: Oh my, that is more than what we want to pay. My nephew is in Vo-Tech and I can get him to do it for $100.
  • Can you make each page on our website fade in and fade out when you click on a link? You know…like PowerPoint?

And if you are working as a software engineer/architect, the stories get even funnier. I once worked as a freelancer for a large organization. Even though they had all the money in the world, they refused to install Oracle, so we had to develop their OLAP solutions over {a free DBMS with no support for anything that resembles a data warehouse}. The arranged payment was more than satisfactory, so I had no problem implementing everything from scratch. A few months later, our system was able to perform roll-ups and drill-downs, and produce complicated aggregation reports using numerous dimensions. The problem? A month after deployment, they complained because the (super heavy) yearly report (by week, product type, department and some other parameters) was taking a couple of minutes to generate! How can you explain to them that typical OLAP operations take more than 10 seconds to finish even on the best-tuned DBMS?
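For readers outside the OLAP world, here is a rough, purely hypothetical sketch of the kind of aggregation such a yearly report requires; the sales table and its columns are invented for illustration, and MySQL's GROUP BY … WITH ROLLUP is used only to show the idea of computing subtotals along several dimensions:

-- Hypothetical fact table: one row per sale.
-- WITH ROLLUP adds the subtotal and grand-total rows that a
-- roll-up / drill-down style report needs; over tens of millions of rows
-- this is exactly the kind of query that cannot finish instantly on an
-- untuned, general-purpose DBMS.
SELECT YEARWEEK(sale_date) AS week,
       product_type,
       department,
       SUM(amount) AS total_amount,
       COUNT(*)    AS num_sales
FROM sales
WHERE sale_date BETWEEN '2009-01-01' AND '2009-12-31'
GROUP BY YEARWEEK(sale_date), product_type, department WITH ROLLUP;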

If our daily encounters were anything like what we face in the “software development business realm”, the result would be hilarious.

The “NoSQL” dispute: A performance argument

December 10, 2009

NoSQL is a database movement that began in early-to-mid 2009 and promotes non-relational data stores that do not need a fixed schema and usually avoid join operations. From [1]:

Next Generation Databases mostly address some of the points: being non-relational, distributed, open-source and horizontal scalable. The movement began early 2009 and is growing rapidly. Often more characteristics apply as: schema-free, replication support, easy API, eventually consistency, and more. So the misleading term “nosql” (some call it “not only sql”) should be seen as an alias to something like the definition above …

I just read a very insightful article from Michael Stonebraker [2], one of the pioneers of modern database theory and Relational Database Management Systems (RDBMS) and “father” of systems like Ingres and Postgres [3, 4]: ‘The “NoSQL” Discussion has Nothing to Do With SQL’ [5]. In this article, published in the blog of the Communications of the ACM, Dr. Stonebraker responds to some of the arguments from the supporters of the NoSQL movement. There are two possible reasons to move away from structured Relational Database Management Systems and adopt an alternate DBMS technology: performance and flexibility:

The performance argument goes something like the following. I started with MySQL for my data storage needs and over time found performance to be inadequate. My options were: … 2. Abandon MySQL and pay big licensing fees for an enterprise SQL DBMS or move to something other than a SQL DBMS.

The flexibility argument goes something like the following. My data does not conform to a rigid relational schema. Hence, I can’t be bound by the structure of a RDBMS and need something more flexible. ….

Focusing on the performance argument, he explains what is more or less common knowledge in the database community: the major performance burden of a modern RDBMS comes from all the extra features like transaction processing (especially ensuring the ACID properties), logging, etc., and not from the core engine that executes an SQL query:

… However, the net-net is that the single-node performance of a NoSQL, disk-based, non-ACID, multithreaded system is limited to be a modest factor faster than a well-designed stored-procedure SQL OLTP engine. In essence, ACID transactions are jettisoned for a modest performance boost, and this performance boost has nothing to do with SQL.

In summary, blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context. …

I believe that anyone interested in the inner workings of modern RDBMS should read this short post from Dr. Stonebraker and continue with a very interesting paper from Harizopoulos et al., “OLTP through the looking glass, and what we found there” [6]. I am getting tired of arguing with people outside the database community who, after writing a couple of SQL queries, think they know all about modern RDBMS, insist that MySQL’s MyISAM engine is the best engine out there just because it is fast(er), or question my motives when I advise them that a “Lite SQL” (no reference to any real product) solution is just not enough.

PS. Dr. Stonebraker is not just an exceptional database researcher, but a visionary whose ideas have shaped the modern database landscape. A short excerpt from an article in SIGMOD Record on the occasion of his receiving the IEEE John von Neumann Medal [7]:

… The relational data model and its associated benefits of “data independence” and “non-procedural access” were first invented by Ted Codd. However, more than any other individual, Mike is responsible for making Codd’s vision of independence a reality through the architectures and algorithms embodied in the series of open-source prototypes and commercial systems that he has initiated and led. While many others have certainly made important contributions to the field, no one else comes close to his continuous sequence of landmark innovations over a period of almost 30 years.

… Mike has been the primary driving force behind major shifts in the research agenda of the database community, including two occasions where he launched entire new directions for the field to follow (the implementation of relational systems and object-relational systems).

Categories: Databases

Fetching random rows from a database table

August 24, 2009

Let’s assume that you have the following problem: there is a relation (table) in your database from which you want to fetch a random row.

Why care? It could be the product table, and you want to present random suggestions to your users. I can think of numerous other examples where fetching a random row is needed: maybe you want to fetch a random post from the blog_posts table, feature a random user, etc.

If you search the web for a solution, most people (e.g. [1], [2]) suggest a variant of the following query:

SELECT column FROM table
ORDER BY RAND()
LIMIT 1

My experience dictates that this solution cannot be fast. If you have a table with just 500 tuples, whichever solution you choose (even the worst option of all: bringing everything into memory and then choosing a random tuple) will not make a big difference in running time. But what happens when you have really large data sets? That’s when the problems begin. In fact, I am not the only one who believes that “ORDER BY RAND()” is the wrong approach to this problem ([3], [4], [5]).

I could just use the alternatives presented in the posts mentioned above, but I really like experimenting with this kind of problem, checking my own results and being 100% sure (you can never be sure with the numerous different implementations of DB engines out there). Also, in my case, I want to check generic solutions that (1) don’t assume that the selection of the random tuples is based on the primary key and (2) can use attributes that may have holes in their domain (i.e. not all values from 1 to max(ID) are present in the table).

I’ll use a MySQL server, version 5.1 (the database most people outside big corporations use nowadays), and create a sufficiently large data set for my experiments. I define “sufficiently large” as a table that will let me see the differences between the candidate queries, so a test table with 10,000,000 tuples and some dummy columns to increase the tuple size will be enough. Also, I’ll use a modified version of the code that was posted [here], as it is better presented than my initial tests.

This is also a good example of why I do not trust results found on the web: even though the solution proposed by Peter is really fast, it does not by any means access only one tuple! Running the code given by Peter (and not my modified version), one can easily see that, fast as it is, it accesses a lot more than one tuple.
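(A quick way to verify that kind of claim in MySQL – this is my own check, not something from the post above – is to reset the session handler counters, run the query under test and then see how many rows the storage engine actually touched:)

FLUSH STATUS;
-- … run the query under test here …
SHOW SESSION STATUS LIKE 'Handler_read%';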

Let’s build a test table:

DELIMITER $$

DROP PROCEDURE IF EXISTS `testp` $$
CREATE PROCEDURE `testp`( IN size INT )
BEGIN
  DECLARE iterator_v INT DEFAULT 0;

  DROP TABLE IF EXISTS test;
  CREATE TABLE test (
    id INT NOT NULL AUTO_INCREMENT,
    a  INT NOT NULL,
    b  INT NOT NULL,
    c  INT NOT NULL,
    PRIMARY KEY (id)
  ) ENGINE=MyISAM;

  -- insert `size` rows: id is auto-generated, a, b and c get random values
  REPEAT
    INSERT INTO test
    VALUES ( 0, ROUND(1000000000*RAND(),0), ROUND(1000000000*RAND(),0), ROUND(1000000000*RAND(),0));
    SET iterator_v = iterator_v + 1;
  UNTIL iterator_v = size
  END REPEAT;

END $$

DELIMITER ;

Table test is just a dummy table with an ID and three additional columns to make each row larger than 4 bytes. I also have an InnoDB version of the script, but I would never suggest inserting 10,000,000 tuples into an InnoDB table just for testing (just create a MyISAM table and then convert it to InnoDB, or turn off indexes, logging, etc. – but this is too much work for a simple test). Either way, I can assure you that the results are similar for both engines.
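(For completeness, converting the already-populated MyISAM table is a one-liner; a sketch, assuming the test table above:)

-- rebuilds the table under the new engine, so expect it to take a while on 10,000,000 rows
ALTER TABLE test ENGINE=InnoDB;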

Attribute a is filled with random values from 0 to 1,000,000,000 and does not have a unique constraint, so I am going to use ‘a’ in my examples.

Let’s fill table test with 10 million random rows:

CALL testp(10000000);

The resulting table is 162MB in size and has a 98MB index on id, so I think that it should be sufficient for my experiments.
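(If you want to check the sizes on your own installation, this is a quick way; the Data_length and Index_length columns are reported in bytes:)

SHOW TABLE STATUS LIKE 'test';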

Fetching a single random row

Let’s see which one of the following queries is better for fetching a random row:

1. Q1 =

SELECT * FROM test ORDER BY RAND() LIMIT 1;

In my system, it needs ~7.5 seconds to return the result.

Why? If you execute “EXPLAIN Q1”, you’ll see that all 10,000,000 rows of the table are accessed in order to add a temporary RAND() column and then sort them by this column.
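(This is what I mean; on my reading of the plan, the type ALL, the row estimate of roughly 10,000,000 and the “Using temporary; Using filesort” note in the Extra column are the smoking guns:)

EXPLAIN SELECT * FROM test ORDER BY RAND() LIMIT 1;
-- expect: type = ALL, rows ~ 10,000,000, Extra = Using temporary; Using filesort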

2.  Q2 =

SELECT * FROM test
WHERE a >= (SELECT FLOOR( MAX(a) * RAND()) FROM test )
ORDER BY a LIMIT 1;

This is a query that I have seen somewhere; that’s why I am presenting it here.

I didn’t even think for a moment about running it: if you execute “EXPLAIN Q2”, you’ll see that 10,000,000 * 10,000,000 rows must be accessed by the DB engine (the non-deterministic subquery cannot be cached, so it is re-evaluated for every row of the outer query).

3. Q3a =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(a) FROM test )) AS a
) AS r2 ON r1.a >= r2.a
ORDER BY r1.a
LIMIT 1

In my system, it needs ~2.2 seconds to return the result. This could be much faster, but I am paying here for my decision to use an attribute other than the primary key! If the ID was not defined, or if attribute ‘a’ were the primary key, then I could send the following query to the DB:

Q3b =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(a) FROM test )) AS a
) AS r2 ON r1.a >= r2.a
LIMIT 1

In my modified DB (without ID in table test), it only needs ~0.83 seconds in order to return the result.

And what if I use the primary key (ID) in the original DB?

Q3c =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM test )) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 1

The result is always returned in less than 0.001 seconds (actually, 0.0003 seconds, but don’t trust any metric below 0.001 seconds)!

So why is query Q3c so much faster than Q3b? Because the index on ID is used instead of a full table scan 😉

Lesson learned:

(a) Just use the primary key for the selection! Using another attribute offers nothing when fetching a single random row, while it significantly slows down the query…

(b) Repeat after me: Indexes are the second best thing ever implemented in DB systems. Don’t forget to use them!
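(If, for some reason, you are stuck with a secondary attribute like ‘a’, the obvious corollary of lesson (b) – untested here, so treat it as a sketch – is to give that attribute its own index, so that queries like Q3a/Q3b can use it instead of scanning the table:)

-- secondary index on a, so that the range condition r1.a >= r2.a can use it
CREATE INDEX idx_test_a ON test (a);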

4. Splitting Q3a and Q3b into two queries inside a stored procedure, as also proposed in [3]:
(There is no reason to split Q3c; it is already so fast that the extra calls would only slow it down.)

DELIMITER $$

DROP PROCEDURE IF EXISTS `test`.`fetchRandom` $$
CREATE PROCEDURE `test`.`fetchRandom` ()
BEGIN
  DECLARE maxId INT DEFAULT 0;
  DECLARE randID INT DEFAULT 0;

  -- pick a random value between 1 and MAX(a)
  SELECT MAX(`a`)
  INTO maxId
  FROM `test`;

  SET randID = FLOOR(1 + (RAND() * (maxId - 1)));

  -- fetch the first row with a >= the random value
  SELECT *
  FROM `test`
  WHERE `a` >= randID
  ORDER BY `a` ASC
  LIMIT 1;

END $$

DELIMITER ;

Q4 = CALL fetchRandom();

The above call results in (more or less) the same execution time as the two variants of Q3. That’s why I believe there is no reason to go to the trouble of building the stored procedure.

Fetching multiple random rows

So, is my approach in Q3a (using a different attribute than the primary key) that bad? Not if you are interested in fetching multiple random rows with non-consecutive IDs!

If you are using the ID, then extending the fastest query, Q3c, to “LIMIT 5”:

Q5a =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM test )) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 5

will result not in 5 completely random tuples, but in 5 consecutive tuples starting from a random ID. For example, a random call returned:

id        a          b          c
3737792   417407769  478633193  140942687
3737793   268813889  921241413  799765428
3737794   235102934  776218416  175783308
3737795   550261325  223956729  468999704
3737796   673128362  958642727  773828580

In contrast, extending query Q3a:

Q5b =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(a) FROM test )) AS a
) AS r2 ON r1.a >= r2.a
ORDER BY r1.a
LIMIT 5

will result in 5 completely random tuples (with respect to their ID), with an execution time of ~2.3 seconds. The stored procedure in Q4 can also be altered accordingly, but will result in more or less the same execution time as Q5b.

Finally, even though from an academic point of view I like the approach presented in [5], I don’t really think that it is a realistic solution: the overhead of the two triggers presented there would slow down update operations on a really large data set. In contrast, the solution presented in Q5b could be implemented in any relational schema with minimal overhead:

a) Extend the table with an additional attribute RAND_Attr.

b) Assign a random value expected_cardinality_of_table*RAND() to attribute RAND_Attr on every insert (you need not worry about duplicate values, as Q5b handles them).
This step could be implemented in the corresponding stored procedure that the model uses to hide the database schema (always a wise approach), or through a ‘light’ trigger (more expensive, but better than hard-coding the assignment in the application code).

c) Use query Q5b with RAND_Attr instead of ‘a’ in order to fetch multiple random rows.
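As a concrete (purely hypothetical) sketch of steps (a) and (b), assume a products table and an expected cardinality of about 10,000,000 rows; the ‘light’ trigger variant is shown, but the same assignment can just as well live in your insert stored procedure:

-- (a) extend the table with the extra attribute
ALTER TABLE products ADD COLUMN RAND_Attr INT NOT NULL DEFAULT 0;

-- back-fill the existing rows with random values (duplicates are fine, Q5b handles them)
UPDATE products SET RAND_Attr = FLOOR(10000000 * RAND());

-- (b) assign a random value on every new insert
DELIMITER $$
CREATE TRIGGER products_rand_attr BEFORE INSERT ON products
FOR EACH ROW
BEGIN
  SET NEW.RAND_Attr = FLOOR(10000000 * RAND());
END $$
DELIMITER ;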

So, to summarize: to the best of my knowledge, the best approaches for fetching random rows from a table TABLE, with a primary key ID (and an index on it) and an attribute RAND_Attr holding random values, are:

A. Fetching a single random row:

SELECT *
FROM TABLE r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM TABLE)) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 1

B. Fetching multiple random rows:

SELECT *
FROM TABLE r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(RAND_Attr) FROM TABLE )) AS RAND_Attr
) AS r2 ON r1.RAND_Attr >= r2.RAND_Attr
ORDER BY r1.RAND_Attr
LIMIT N

Or even better, since you know the maximum value MAX_RANDA_VAL that attribute RAND_Attr can have:

SELECT *
FROM TABLE r1
JOIN (
SELECT CEIL(RAND() *  MAX_RANDA_VAL ) AS RAND_Attr
) AS r2 ON r1.RAND_Attr >= r2.RAND_Attr
ORDER BY r1.RAND_Attr
LIMIT N

This second query is virtually the same as the first for two reasons: (a) RAND() is used both when assigning the values and in the query itself, and (b) a good RAND() function is random (sic), so after a couple of thousand tuples have been inserted, the real maximum value of RAND_Attr will be close enough to MAX_RANDA_VAL that no real difference can be noticed between the two queries.

PS. Note that in MySQL 5.1.1 there is a bug and you cannot use variables in LIMIT clauses. A workaround for extending Q4 follows:

CREATE PROCEDURE `test`.`fetchRandom` (IN _LIMIT INT)

PREPARE STMT FROM
"SELECT ID
INTO … your var here …
FROM `test`
WHERE `a` >= @randID
ORDER BY `a` ASC
LIMIT ?";
SET @LIMIT = _LIMIT;
EXECUTE STMT USING @LIMIT;

….

Categories: Databases, MySQL