New Technorati algorithm

October 15, 2009

If you are searching for new blogs to read, Technorati is one of the best places to start. Its top 100 blogs page reflects more or less what is happening on the web, and the calculated rank is, in my opinion, quite accurate.

Moreover, during its October revamp, the site updated the algorithm behind its main metric: Technorati Authority.

“…

  • Authority is calculated based on a site’s linking behavior, categorization and other associated data over a short, finite period of time. A site’s authority may rapidly rise and fall depending on what the blogosphere is discussing at the moment, and how often a site produces content being referenced by other sites.
  • The new Authority calculation differs from the past version, which measured linking behavior over a longer 6 month timeframe. Please note that links in blogrolls don’t count towards Authority, as they are not indicative of interest in relevant content; we stopped including blogroll links in August 2008.

…”

As Michael Arrington writes in his TechCrunch post:

“… Until today, …, the top list was fairly static. Now they are focusing much more on recent data within the last month and giving blogs an authority rank between 1 – 1,000. Scoring factors include posting frequency, context, linking behavior and “other inputs.” The result, says the company, is a lot more volatility in the lists as blogs surge up and down. …”

I think that this is one more (small) step in the direction of results that reflect the real-time and volatile nature of the web.

Categories: Tech, web

Randomness in game design

October 14, 2009

An interesting presentation on randomness in game design: “Randomness: Blight or Bane?”

http://playthisthing.com/randomness-blight-or-bane

Categories: Tech

Tech Podcasts

September 18, 2009

According to the New Oxford American Dictionary:

a podcast is “a digital recording of a radio broadcast or similar program, made available on the Internet for downloading to a personal audio player,” but the generally accepted definition has expanded to include video as well as audio. Originally derived from a combination of “broadcasting” and “iPod ™,” the word was declared “word of the year” when it was added to the dictionary at the end of 2005.

Podcasts have evolved in recent years from amateur recordings made with cheap microphones to programs that can easily be compared to the best radio or TV shows. They cover a wide range of genres and topics and, in my opinion, are a great (even though a little bit geeky) way to spend your time during long commutes. Most people will never understand this line of thought, but why lose 2-3 hours per day doing nothing when you can listen to this week’s tech news or a presentation about a new framework?

I have been listening to podcasts for a couple of years now and currently subscribe to about 10-15 (mostly tech-oriented) podcasts, so I think I have a pretty good idea of which are the best tech podcasts out there. The following list is not by any means complete; it just contains (in random order) the podcasts that I follow every week and that I think are worth mentioning:

Categories: Tech

Don’t Copy That Floppy

September 12, 2009

SIIA (The Software & Information Industry Association) released its new anti-piracy campaign [1], called “Don’t Copy That 2”.

You can find the video on YouTube:
http://www.youtube.com/watch?v=hUCyvw4w_yk

I was laughing out loud for 10 minutes after I saw the clip. I am sure that whoever is responsible for this campaign is either locked in a cage, or doesn’t have kids and thinks everybody is an idiot. Even the best comedian would never come up with such a lousy ad (in order to make fun of SIIA). As Nick Summers writes [2]: “Rap, Klingons, and Jailhouse-Rape-by-Broomstick Aren’t the Best Way To Teach Kids About Piracy … ”

In their next campaign, they should hire the guys from “The IT Crowd” or even use their anti-piracy spot [3],[4] (it is a must-see). I would say that it is the funniest computer-related ad (with tons of money invested in it – not just a parody), while the 1992 prequel of SIIA’s campaign, “Don’t Copy That Floppy”, would easily take the “Worst Advertisement” Emmy Award.

Categories: Tech History

Zend Framework Tutorials

September 2, 2009

Zend Framework is a PHP framework that allows PHP developers to structure their projects using the Model-View-Controller (MVC) architectural pattern. Together with the object-oriented features of PHP 5+, it allows PHP to stop being a “Personal Home Page” scripting tool and become a language in which you can really design big projects. I am not going to explain further why object-oriented programming, tiered/modular architectures and the separation of Model, View and Controller are critical for project design, development and maintenance, as thorough analyses can be found in any textbook. (I also believe that these concepts should be obvious to any senior software architect/engineer who plans to design and/or implement a medium/large project.)

Noteworthy Introductory Tutorials:

  • Zend framework’s quick start tutorial: This is a great starting tutorial in order to understand the fundamental concepts.
  • Akra’s tutorial: Somewhat more advanced than the quick start tutorial. It is a nice introduction to some further concepts, even though Akra’s approach does not always follow the MVC pattern.
  • Pádraic Brady’s tutorial and on-line book. I must note that his writing style is quite enjoyable and very easy to follow (for the geeks among us – exaggerating a little bit – he has a “Tanenbaum” style of writing):
    • Example Zend Framework Blog Application Tutorial: Even though it is written for Zend Framework version 1.5 (which differs in many ways from version 1.9.1), it is quite useful, as it introduces some new techniques, presents the basic concepts in depth (a good supplement to the first two tutorials) and adds authentication and authorization examples, modules, plugins and lots of other features. Just remember that it uses an older version of the framework, so I advise you to use Zend_Application and config files to set up your environment (paths, views, controllers, layouts) instead of the methodology presented in part 3 and some of part 4 (you can use my simple template app if you want a different approach).
      Edit: For some reason, I could not find Parts 9 and 10 of the tutorial in Brady’s blog. You can find a version of part 9 [here].
    • Zend framework – Surviving the deep end: This is a work in progress that can serve as a thorough introduction to the framework. It is partially based on the blog application tutorial, but covers more aspects, as the author has more space than a blog post in which to discuss design decisions, etc. I believe that many more advanced features are still to be added, but even now one can find some interesting chapters, like the ones on performance optimisation and unit tests.
  • A small tutorial on how to modularize the guest book application that was presented in the quick start tutorial: [1] [2]. I shouldn’t list it together with the general, overall tutorials that I mention in this post (which is also why I haven’t linked to many other very helpful, specific tutorials), but I think that adding modules to an application is (1) a very important concept and (2), even though it is trivial, not adequately explained in the online documentation of Zend Framework.

Video Tutorials (VidCasts):

  • I liked the video tutorials on the Zend Framework website. They are a great 20-30 minute way to understand the basic concepts of Zend Framework before diving into the tutorials.
  • Even better, Zendcasts has more than 40 in-depth video tutorials. Just skip the first 5-6 tutorials, which are covered by the official video tutorials. Also, the first vidcasts on the Zendcasts site use the older 1.7 version (a different approach is used for the bootstrap, etc.) and would be a wrong starting point for someone working with the newer versions of the framework.

Other than the few available tutorials, the best resources for all the available features are the extensive Programmer’s Reference Guide, which can be found on the Zend Framework website, and Zend’s Dev Zone. In order not to get lost trying to figure out where to begin (the chapters are ordered alphabetically by component name), you must have some basic understanding of the framework and how it works, so I would propose first reading a couple of the above-mentioned tutorials.
In my opinion, a nice set of starting chapters is:

Also, if you are going to build a lot of Zend apps, learn how to use the Zend command line tool and always create your projects and add controllers and actions by using it. If you are a Windows user and have installed Zend Framework 1.9 or 1.9.1 (the latest release at the time of writing this post), know that the Zend tool does not work properly on Windows. The solution is to download version “1.8.4 patch 1” from the Zend archives and use its command line tool instead. You will save a lot of time and frustration 😉

Finally, after building a couple of projects, I realized that there is always a “startup repetition” phase where I must update bootstrap.php, create a layout, add a dummy CSS, etc., so I created a simple template app, which I use when I want to do fast 2-minute tests. You can download it from here if you want (you must use Zend Framework 1.8 or above)!

Wikipedia

August 30, 2009

According to an article posted yesterday on the NY Times blog [1], Wikipedia is the fifth most popular web site, with more than 330 million visitors per month. That’s a really impressive accomplishment for a community-edited encyclopedia.

I am really happy that there are so many people who will take the time to check an article in an encyclopedia, even if we are talking about articles without editorial supervision from an expert [*]. Most of those visitors would never have opened an encyclopedia in the “pre-internet” era, or spent the money to own a good one. I believe that such practices are beneficial to the community as a whole, and Wikipedia is a great example of how the internet is helping knowledge spread faster.

But what happens when relying solely on Wikipedia articles for knowledge on a specific subject becomes the norm? In recent years, I have been seeing more and more students at my university, and people I work with in the industry, referencing Wikipedia as their sole source when writing a report or even making decisions. So, as I was reading the NY Times article, I remembered a very interesting piece that I had read in Communications of the ACM, entitled “Why you can’t cite Wikipedia in my class“:

The online encyclopedia’s method of adding information risks conflating facts with popular opinion….

[*] In the case of popular articles, the editing process reaches an equilibrium after some time, resulting in an article that is sufficiently close to what most people would want to know about the subject.

Edit 1: Some interesting articles on Wikipedia statistics and a different view of the problem with the editorial process: [2], [3], [4].

… The often mentioned reasons for decay are:

  1. the rules and guidelines for contributing become more and more elaborated, the learning curve worsens, occasional contributors are scared off, recruitment slows down
  2. contributions of newcomers and outsiders are often reverted by article’s “shepherds” or just random passers-by, thus contributing to despair; arguing with anonymouses on the Internet is generally a miserable activity busy people try to avoid
  3. all trivial topics are already covered; due to the “no original research” policy, significant coverage in the mainstream press is necessary to prove a contribution’s notability; thus, much less topics might be covered in the future …
Categories: web

Fetching random rows from a database table

August 24, 2009

Let’s assume that you have the following problem: there is a relation (table) in your database from which you want to fetch a random row.

Why care? It could be the products table, and you want to present random suggestions to your users. I can think of numerous other examples where fetching a random row is needed; maybe you want to fetch a random post from the blog_posts table, feature a random user, etc.

If you search the web for a solution, most people (e.g. [1], [2]) suggest a variant of the following code:

SELECT column FROM table
ORDER BY RAND()
LIMIT 1

My experience dictates that this solution cannot be fast. If you have a table with just 500 tuples, whichever solution you choose (even the worst option of all: bringing everything into memory and then choosing a random tuple) would not make a big difference in running time. But what happens when you have really large data sets? That’s when the problems begin. In fact, I am not the only one who believes that “ORDER BY RAND()” is the wrong approach to this problem ([3], [4], [5]).

I could just use the alternatives presented in the posts mentioned above, but I really like experimenting with this kind of problem, checking my own results and being 100% sure (you can never be sure with the numerous different implementations of DB engines out there). Also, in my case, I want to check generic solutions that (1) don’t assume that the selection of the random tuples is based on the primary key and (2) can use attributes that may have holes in their domain (i.e. not all values from 1 to max(ID) are in the table).

I’ll use a MySQL server, version 5.1 (the database most people outside big corporations use nowadays), and create a sufficiently large data set for my experiments. I define “sufficiently large” as a table that will allow me to check the differences between the possible queries, so a test table with 10,000,000 tuples and some dummy columns to increase the tuple size will be enough. Also, I’ll use a modified version of the code that was posted [here], as it is better presented than my initial tests.

This is a good example of why I do not trust results found on the web: running the code given by Peter (and not my modified version), one can easily see that even though his solution is really fast, it does not by any means access only one tuple!

Let’s build a test table:

DELIMITER $$

DROP PROCEDURE IF EXISTS `testp` $$
CREATE PROCEDURE `testp`( in size INT )
BEGIN
DECLARE iterator_v INT DEFAULT 0;

DROP TABLE IF EXISTS test;
CREATE TABLE test (
id INT NOT NULL auto_increment,
a  INT NOT NULL,
b  INT NOT NULL,
c  INT NOT NULL,

PRIMARY KEY (id)
) ENGINE=MyISAM;

REPEAT
INSERT INTO test
VALUES ( 0, ROUND(1000000000*RAND(),0), ROUND(1000000000*RAND(),0), ROUND(1000000000*RAND(),0));
SET iterator_v = iterator_v + 1;
UNTIL iterator_v = size
END REPEAT;

END $$

DELIMITER ;

Table test is just a dummy table with an ID and three additional columns to make the row size more than 4 bytes. I also have an InnoDB version of the script, but I would never suggest inserting 10,000,000 tuples into an InnoDB table just for testing (just create a MyISAM table and then convert it to InnoDB, or turn off indexes, logging, etc. – but this is too much work for a simple test). Either way, I can assure you that the results are similar for both engines.

Attribute ‘a’ is filled with random values from 0 to 1,000,000,000 and does not have a unique constraint, so I am going to use it in my examples.

Let’s fill table test with 10 million random rows:

CALL testp(10000000);

The resulting table is 162MB in size and has a 98MB index on id, so I think that it should be sufficient for my experiments.

Fetching a single random row

Let’s see which one of the following queries is better for fetching a random row:

1. Q1 =

SELECT * FROM test ORDER BY RAND() LIMIT 1;

In my system, it needs ~7.5 seconds in order to return the results.

Why? If you execute “EXPLAIN Q1”, you’ll see that all 10,000,000 rows of the table are accessed in order to add a temporary RAND() column and then sort them by this column.
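The same diagnosis is easy to reproduce on other engines. A minimal sketch using SQLite (its RANDOM() plays the role of MySQL’s RAND(); this is my illustration, not part of the original benchmark), where EXPLAIN QUERY PLAN shows the full scan that the query forces:

```python
import sqlite3

# Sketch: inspect the plan of the "ORDER BY random" query in SQLite.
# Even on an empty table the plan shows a scan of every row, because the
# engine must compute a random sort key per row before LIMIT 1 applies.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, a INTEGER NOT NULL)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM test ORDER BY RANDOM() LIMIT 1"
).fetchall()
for row in plan:
    print(row[-1])  # expect a SCAN step (plus a temp sort for the ORDER BY)
```

The exact plan text differs between engines and versions, but the full scan is always there; that is the fundamental reason this approach cannot scale.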

2.  Q2 =

SELECT * FROM test
WHERE a >= (SELECT FLOOR( MAX(a) * RAND()) FROM test )
ORDER BY a LIMIT 1;

This is a query that I have seen somewhere; that’s why I am presenting it here.

I didn’t even think for a moment to run it, as if you execute “EXPLAIN Q2”, you’ll see that 10,000,000 × 10,000,000 rows must be accessed by the DB engine.

3. Q3a =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(a) FROM test )) AS a
) AS r2 ON r1.a >= r2.a
ORDER BY r1.a
LIMIT 1

In my system, it needs ~2.2 seconds to return the result. This should be much faster, but here I am paying for my decision to use an attribute other than the primary key! If the ID was not defined, or if attribute ‘a’ was the primary key, I could send the following query to the DB:

Q3b =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(a) FROM test )) AS a
) AS r2 ON r1.a >= r2.a
LIMIT 1

In my modified DB (without the ID column in table test), it needs only ~0.83 seconds to return the result.

And what if I use the primary key (ID) in the original DB?

Q3c =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM test )) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 1

The result is always returned in less than 0.001 seconds (actually 0.0003 seconds, but don’t trust any metric below 0.001 seconds)!

So why is query Q3c so much faster than Q3b? The index on ID is used instead of a full table scan 😉

Lesson learned:

(a) Just use the primary key for the selection! Using another attribute offers nothing when fetching a single random row, while it significantly slows down the query…

(b) Repeat after me: Indexes are the second best thing ever implemented in DB systems. Don’t forget to use them!
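Both lessons fit in a few lines of code. The sketch below uses Python with SQLite as a stand-in for MySQL (the randomness is done in the application, so the logic is engine-neutral); the “>=” comparison is what makes the query tolerant of holes in the ID sequence, and the primary-key index turns it into a single range seek, the analogue of Q3c:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, a INTEGER NOT NULL)")
# Insert rows with deliberate holes in the ID sequence (every 3rd ID missing),
# to mimic a table where deletes have left gaps.
conn.executemany(
    "INSERT INTO test (id, a) VALUES (?, ?)",
    [(i, random.randrange(10**9)) for i in range(1, 1000) if i % 3 != 0],
)

def fetch_random_row(conn):
    max_id = conn.execute("SELECT MAX(id) FROM test").fetchone()[0]
    pivot = random.randint(1, max_id)
    # ">=" skips over holes; the primary-key index makes this a range seek,
    # not a full table scan.
    return conn.execute(
        "SELECT * FROM test WHERE id >= ? ORDER BY id LIMIT 1", (pivot,)
    ).fetchone()

row = fetch_random_row(conn)
print(row)
```

Note that rows right after a gap are picked slightly more often than their neighbours; for suggestion boxes and similar use cases this bias is harmless.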

4. Splitting Q3a and Q3b into two queries in a stored procedure, as also proposed in [3]:
(There is no reason to split Q3c; it is so fast that the procedure-call overhead would only slow it down.)

DELIMITER $$

DROP PROCEDURE IF EXISTS `test`.`fetchRandom` $$
CREATE PROCEDURE `test`.`fetchRandom` ()
BEGIN
DECLARE maxId INT DEFAULT 0;
DECLARE randID INT DEFAULT 0;

SELECT MAX(`a`)
INTO maxId
FROM `test`;

SET randID = FLOOR(1 + (RAND() * (maxId - 1)));

SELECT *
FROM `test`
WHERE `a` >= randID
ORDER BY `a` ASC
LIMIT 1;

END $$

DELIMITER ;

Q4 = CALL fetchRandom();

The above call results in (more or less) the same execution time as the two variants of Q3 (Q3a/Q3b). That’s why I believe there is no reason to go to the trouble of building the stored procedure.

Fetching multiple random rows

So, is my approach in Q3a (using an attribute other than the primary key) that bad? Not if you are interested in fetching multiple random rows with non-consecutive IDs!

If you are using the ID, then extending the fastest query, Q3c, to “LIMIT 5”:

Q5a =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM test )) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 5

will result not in 5 completely random tuples, but in 5 consecutive tuples starting from a random ID. For example, a random call returned:

id       a         b         c
3737792  417407769 478633193 140942687
3737793  268813889 921241413 799765428
3737794  235102934 776218416 175783308
3737795  550261325 223956729 468999704
3737796  673128362 958642727 773828580

In contrast, extending query Q3a:

Q5b =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(a) FROM test )) AS a
) AS r2 ON r1.a >= r2.a
ORDER BY r1.a
LIMIT 5

will result in 5 completely random tuples (with respect to their ID), with an execution time of ~2.3 seconds. The stored procedure in Q4 can also be altered accordingly, but will result in more or less the same execution time as Q5b.

Finally, even though from an academic point of view I like the approach presented in [5], I don’t really think it is a realistic solution: the overhead of the two triggers presented would slow down update operations on a really large data set. In contrast, the solution presented in Q5b could be implemented in any relational schema with minimal overhead:

a) Extend the table with an additional attribute RAND_Attr.

b) Assign a random value expected_cardinality_of_table*RAND() to attribute RAND_Attr on every insert (you need not worry about duplicate values, as Q5b handles them).
This step could be implemented in the corresponding stored procedure that the model uses to hide the database schema (always a wise approach), or through a ‘light’ trigger (more expensive, but better than hard-coding the assignment in the application code).

c) Use query Q5b with RAND_Attr instead of ‘a’ in order to fetch multiple random rows.
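The three steps above can be sketched end to end. This is again a hypothetical Python/SQLite illustration (the column name rand_attr and the cardinality constant are mine): the random value is assigned at insert time, and a Q5b-style query does the fetch.

```python
import random
import sqlite3

EXPECTED_CARDINALITY = 1000  # rough expected table size, per step (b)

conn = sqlite3.connect(":memory:")
# Step (a): the extra random attribute, plus an index so the fetch is a seek.
conn.execute(
    "CREATE TABLE test (id INTEGER PRIMARY KEY, a INTEGER, rand_attr INTEGER)"
)
conn.execute("CREATE INDEX idx_rand_attr ON test (rand_attr)")

# Step (b): assign a random value on every insert; duplicates are harmless
# because the ">=" query below handles them.
conn.executemany(
    "INSERT INTO test (id, a, rand_attr) VALUES (?, ?, ?)",
    [
        (i, random.randrange(10**9), random.randrange(EXPECTED_CARDINALITY))
        for i in range(1, EXPECTED_CARDINALITY + 1)
    ],
)

def fetch_random_rows(conn, n):
    # Step (c): one random pivot, then n rows in rand_attr order -- random
    # with respect to id, at the cost of a single index range seek.
    pivot = random.randrange(EXPECTED_CARDINALITY)
    return conn.execute(
        "SELECT * FROM test WHERE rand_attr >= ? ORDER BY rand_attr LIMIT ?",
        (pivot, n),
    ).fetchall()

rows = fetch_random_rows(conn, 5)
print([r[0] for r in rows])  # the IDs come out scattered, not consecutive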

So, to summarize: to the best of my knowledge, the best approaches to fetching random rows from a table TABLE, with a primary key (and index on) ID and an attribute RAND_Attr holding random values, are:

A. Fetching a single random row:

SELECT *
FROM TABLE r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM TABLE)) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 1

B. Fetching multiple random rows:

SELECT *
FROM TABLE r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(RAND_Attr) FROM TABLE )) AS RAND_Attr
) AS r2 ON r1.RAND_Attr >= r2.RAND_Attr
ORDER BY r1.RAND_Attr
LIMIT N

Or, even better, since you know the maximum value MAX_RANDA_VAL that attribute RAND_Attr can take:

SELECT *
FROM TABLE r1
JOIN (
SELECT CEIL(RAND() *  MAX_RANDA_VAL ) AS RAND_Attr
) AS r2 ON r1.RAND_Attr >= r2.RAND_Attr
ORDER BY r1.RAND_Attr
LIMIT N

This second query behaves virtually the same as the first, for two reasons: (a) RAND() is used both when assigning the values and in the query itself, and (b) a good RAND() function is random (sic), so after a couple of thousand inserted tuples the real maximum value of RAND_Attr will be close enough to MAX_RANDA_VAL that no real difference can be noticed between the two queries.

PS. Note that in MySQL 5.1.1 there is a bug and you cannot use variables in LIMIT clauses. A workaround for extending Q4 follows:

CREATE PROCEDURE `test`.`fetchRandom` (IN _LIMIT INT)
BEGIN

SET @LIMIT = _LIMIT;
PREPARE STMT FROM
"SELECT ID
INTO … your var here …
FROM `test`
WHERE `a` >= @randID
ORDER BY `a` ASC
LIMIT ?";
EXECUTE STMT USING @LIMIT;

….

Categories: Databases, MySQL