(web design) Clients from hell

Lisa just sent me a link to a blog with a very funny collection of anonymously contributed client horror stories from designers:

http://clientsfromhell.tumblr.com

The really funny part about these stories is that anyone who has worked with non-technical clients has heard more or less the same kind of silly comments:

  • Client: We like the design, but could you make the blues all the same.
    Me: It’s the same blue throughout the design.
    Client: It looks like different blues.
    Me: That’s because colors are perceived differently dependent on neighboring colors.
    Client: That’s stupid.
  • Me: A basic web page will cost around $1000.
    Client: Oh my, that is more than what we want to pay. My nephew is in Vo-Tech and I can get him to do it for $100.
  • Can you make each page on our website fade in and fade out when you click on a link? You know…like PowerPoint?

And if you are working as a software engineer/architect, the stories get even funnier. I once worked as a freelancer for a large organization. Even though they had all the money in the world, they refused to install Oracle, so we had to develop their OLAP solutions over {a free DBMS with no support for anything that resembles a data warehouse}. The agreed payment was more than satisfactory, so I had no problem implementing everything from scratch. A few months later, our system was able to perform roll-ups and drill-downs and produce complicated aggregation reports over numerous dimensions. The problem? A month after deployment, they complained because the (super heavy) yearly report (by week, product type, department and some other parameters) was taking a couple of minutes to generate! How do you explain to them that typical OLAP operations take more than 10 seconds to finish even on the best-tuned DBMS?
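For the curious, the query behind that kind of yearly report looks roughly like the sketch below. The star-schema tables and columns are made up for illustration, but they show why the DBMS has to scan and aggregate an entire year of fact rows:

-- Hypothetical star schema: one fact table plus date/product/department dimensions.
-- A yearly roll-up by week, product type and department has to aggregate every
-- fact row of the year, which is why it is slow without a tuned data warehouse.
SELECT d.week, p.product_type, dep.department_name,
       SUM(f.amount) AS total_amount
FROM sales_fact f
JOIN date_dim d         ON f.date_id = d.date_id
JOIN product_dim p      ON f.product_id = p.product_id
JOIN department_dim dep ON f.department_id = dep.department_id
WHERE d.year = 2009
GROUP BY d.week, p.product_type, dep.department_name WITH ROLLUP;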

If our daily encounters were similar to what we are facing in the “software development business realm”, the result would be hilarious:

The “NoSQL” dispute: A performance argument

December 10, 2009 1 comment

NoSQL is a database movement that began in early to mid 2009 and that promotes non-relational data stores which do not need a fixed schema and usually avoid join operations. From [1]:

Next Generation Databases mostly address some of the points: being non-relational, distributed, open-source and horizontal scalable. The movement began early 2009 and is growing rapidly. Often more characteristics apply as: schema-free, replication support, easy API, eventually consistency, and more. So the misleading term “nosql” (some call it “not only sql”) should be seen as an alias to something like the definition above …

I just read a very insightful article from Michael Stonebraker [2], one of the pioneers of modern database theory and Relational Database Management Systems (RDBMS) and “father” of systems like Ingres and Postgres [3, 4]: ‘The “NoSQL” Discussion has Nothing to Do With SQL’ [5]. In this article, published in the blog of the Communications of the ACM, Dr. Stonebraker responds to some of the arguments of the supporters of the NoSQL movement. There are two possible reasons to move away from structured Relational Database Management Systems and adopt an alternative DBMS technology: performance and flexibility:

The performance argument goes something like the following. I started with MySQL for my data storage needs and over time found performance to be inadequate. My options were: … 2. Abandon MySQL and pay big licensing fees for an enterprise SQL DBMS or move to something other than a SQL DBMS.

The flexibility argument goes something like the following. My data does not conform to a rigid relational schema. Hence, I can’t be bound by the structure of a RDBMS and need something more flexible. ….

Focusing on the performance argument, he explains what is more or less common knowledge in the database community: the major performance burden of a modern RDBMS comes from all those extra features like transaction processing (especially ensuring the ACID properties), logging, etc., and not from the core engine that executes an SQL query:

… However, the net-net is that the single-node performance of a NoSQL, disk-based, non-ACID, multithreaded system is limited to be a modest factor faster than a well-designed stored-procedure SQL OLTP engine. In essence, ACID transactions are jettisoned for a modest performance boost, and this performance boost has nothing to do with SQL.

In summary, blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context. …

I believe that anyone interested in the inner workings of modern RDBMS should read this short post from Dr. Stonebraker and continue with a very interesting paper from Harizopoulos et al., “OLTP Through the Looking Glass, and What We Found There” [6]. I am getting tired of arguing with people outside the database community who, after writing a couple of SQL queries, think they know all about modern RDBMS, insist that MySQL’s MyISAM engine is the best engine out there just because it is fast(er), or question my motives when I advise them that a “Lite SQL” (no reference to any real product) solution is just not enough.

PS. Dr. Stonebraker is not just an exceptional database researcher, but a visionary whose ideas have shaped the modern database landscape. A short excerpt from an article in SIGMOD Record when he received the IEEE John von Neumann Medal [7]:

… The relational data model and its associated benefits of “data independence” and “non-procedural access” were first invented by Ted Codd. However, more than any other individual, Mike is responsible for making Codd’s vision of independence a reality through the architectures and algorithms embodied in the series of open-source prototypes and commercial systems that he has initiated and led. While many others have certainly made important contributions to the field, no one else comes close to his continuous sequence of landmark innovations over a period of almost 30 years.

… Mike has been the primary driving force behind major shifts in the research agenda of the database community, including two occasions where he launched entire new directions for the field to follow (the implementation of relational systems and object-relational systems).

Categories: Databases Tags: ,

Preacquired Account Marketing

December 3, 2009 Leave a comment

Tens of millions of consumers have fallen prey to “free” trial offers and membership clubs offered by preacquired account marketers. These companies insert themselves into your everyday transactions, hoping to trick you into letting them charge your account. In simple words, Preacquired Account Marketing can be defined as [1]:

… The practice at issue is called post-transaction marketing, which involves the presentation of offers during the online checkout process. When done to deceive, these offers typically appear to be a required part of the checkout process, in order to trick consumers into accepting charges for unwanted products or services.

A particularly pernicious form of post-transaction marketing is known as “preacquired account marketing,” a process by which the third-party marketer acquires a customer’s credit card information from the online merchant where the customer is making a purchase.

Using this tactic, the third-party marketer only needs an e-mail address or a click as purchase authorization. The retailer, in effect, sells the customers’ credit card information because the retailer, as a partner, will get a cut of whatever extra charge the customer can be duped or pushed into accepting. …

As Michael Arrington explains in TechCrunch [2]:

… Background: hundreds of well known ecommerce companies add post transaction marketing offers to consumers immediately after something is purchased on the site. Consumers are usually offered cash back if they just hit a confirmation button. But when they do, their credit card information is automatically passed through to a marketing company that signs them up for a credit card subscription to a package of useless services. The “rebate” is rarely paid. …

What shocked me was the report from the U.S. Senate hearing a few weeks ago, which focused on the controversial marketing companies that allegedly dupe consumers into paying monthly fees to join online loyalty programs [3]:

… Vertrue, Webloyalty, and Affinion generated more than $1.4 billion by “misleading” Web shoppers, said members of the U.S. Senate Committee on Commerce, Science and Transportation, which called the hearing. Lawmakers saved their harshest rebuke for Web retailers that accepted big money–a combined sum of $792 million–to share their customers’ credit-card information with the marketers. …

… Many of those who complained say they don’t fear the ad because they aren’t being asked to turn over credit-card information, according to the Senate report. But buried in the ad’s fine print is notification that by entering their e-mail address, the shopper is agreeing to join a loyalty program and allowing the store to authorize marketers to charge their card each month, between $9 and $12. …

Fraud, even of this magnitude, is commonplace on the internet. These kinds of scams rely on the fact that most consumers don’t notice $10-20 charges in their monthly credit card statements. So, why bother? If you check the list of companies working with Affinion, Vertrue, and Webloyalty, you’ll find some highly reputable web sites that I would never have thought could be part of such a scam: Expedia, Hotels.com, US Airways, Classmates.com, MovieTickets.com, etc…

… Retailers doing business with the companies are also aware that customers are likely to be angered once they notice the charges but do it because they are paid big bucks. Classmates.com has pocketed $70 million from partnering with all three companies, according to the report. The government says that 19 retailers have made more than $10 million through the partnerships with e-loyalty programs, while 88 retailers have made more than $1 million ….

I have used Expedia and Hotels.com in the past and, even if they are not guilty, I’ll think twice before I use them again. It is a huge reputation hit for such businesses (in the end, they are not just a fast-growing gaming company [4], so they should be more careful), but a good lesson for web 2.0 startups. Unfortunately, many companies still treat electronic transactions with a ‘lighter’ moral standard. I am sure that, for example, a traditional hotel booking company would be far less willing to risk its reputation on that kind of deal. I believe that the transfer of credit card data in post-transaction offers must be restricted by law, along with regulation that improves disclosure requirements and makes charge reversals easier.

Categories: Business, web Tags: ,

Ranking of paid and non-paid advertisements

November 26, 2009 Leave a comment

Ranking paid advertisements

The solution to the problem of ranking paid advertisements is more or less trivial nowadays (after all the work from Google et al. in this field). I am not saying that it is easy, but you can make a good start just by using a fairly simple modeling scheme. Let’s look, for example, at a simplified version of the AdWords algorithm:

Ad Rank = {MAX CPC bid OR CPM bid OR just content bid} × Quality Score

OK, so if you pay more per click or per thousand impressions (M is the Roman numeral for “thousand” – CPM is a metric that we have inherited from “old-fashioned” media like TV or radio) and you have a good QS (Quality Score), then your ad will be ranked higher than other ads and (eventually) it will be presented to the users. As you can imagine, the tricky part of this algorithm is the calculation of the QS and, as with search algorithms, you can make it better the more users you have giving you indirect (and sometimes direct) feedback [1 – check end of post]. Of course, Google gives a rough sketch of how this QS works [2]:

… Quality Score helps ensure that only the most relevant ads appear to users on Google and the Google Network. …

Quality Score for Google and the Search Network:

  • The historical clickthrough rate (CTR) of the keyword and the matched ad on Google; note that CTR on the Google Network only ever impacts Quality Score on the Google Network — not on Google
  • Your account history, which is measured by the CTR of all the ads and keywords in your account
  • The historical CTR of the display URLs in the ad group
  • The quality of your landing page
  • The relevance of the keyword to the ads in its ad group
  • The relevance of the keyword and the matched ad to the search query
  • Your account’s performance in the geographical region where the ad will be shown
  • Other relevance factors

Quality Score for the Content Network

  • The ad’s past performance on this and similar sites
  • The relevance of the ads and keywords in the ad group to the site
  • The quality of your landing page
  • Other relevance factors

So, the more users click your ads (historical CTR, overall CTR, etc.), and the better the quality of your landing page, the better your QS and the more users will follow… Recursion [3] in its finest form 🙂 . Again, the tricky part here is the “quality of your landing page”, and this is how Google monetizes its powerful ranking algorithm and, of course, its position as the highest-traffic site on the web [again, check 1].
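To make the simplified model concrete, here is a minimal sketch of how such a ranking could be computed over a hypothetical ads table; the table, columns and numbers are my own assumptions for illustration, not Google’s actual implementation:

-- Hypothetical ads table: max_bid is the max CPC/CPM/content bid,
-- quality_score is whatever QS estimate the platform has computed.
CREATE TABLE ads (
  ad_id         INT PRIMARY KEY,
  max_bid       DECIMAL(8,2) NOT NULL,
  quality_score DECIMAL(4,2) NOT NULL
);

-- Simplified Ad Rank = bid x Quality Score; the highest-ranked ads win the slots.
SELECT ad_id,
       max_bid * quality_score AS ad_rank
FROM ads
ORDER BY ad_rank DESC
LIMIT 3;  -- e.g. three ad slots on a results page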

Also, for completeness, from a post on Google Blog [4]:

… You might also wonder: “But image ads usually take up multiple ad slots — does this limit the amount of money that I can make?” Good question, but no — for an image ad to appear on your site, it has to produce an effective CPM greater than the sum of the individual text ads. …

Ranking NON-paid advertisements

OK, that was easy, but what about non-paid advertisements? First of all, I must define what I am talking about: why would anybody want to support/offer non-paid ads? The core definition of advertisement has {payment, paid} in it! Non-paid advertisements are to a {enter a commodity/type of business/job here} listing or a local classifieds web site (like, for example, Yelp) what organic search results are to a search engine like Google, Yahoo or Bing. I know that you cannot call Yelp’s non-paid results advertisements, but that’s what they are! And most of the time they are more valuable than any form of advertisement and generate a higher clickthrough rate. I bet that if, for example, my restaurant ranked 1st in the Fast Food category for San Francisco (I have no affiliation with “In-N-Out Burger” 😉 ), I would never need to advertise it anywhere else on the web…

Definition of the problem: Assume a website like Yelp: people add their location and {business/job/offered commodity} (call it whatever you like – I’ll stick with the non-paid advertisement term) and users of the site can search for XXX “commodity” in YYY location. The most important factor for the success of such a website is that businesses cannot buy their ranking. They may be able to buy {sponsorships/ads/referred posts}, but they must not be able to buy a good ranking in the search results.

How can such a site rank results? Think a little bit about the problem: when the user has decided what he is searching for (let’s say fast-food restaurants) and the location(s) he is interested in (“near Union Square, San Francisco”), a successful website must decide how to order the hundreds of results that satisfy this simple categorical query. There are no keywords with which to find the best matches (using well-known information retrieval algorithms) and no CPC/CPM, as all posts/ads are free. Only the QS (Quality Score) is left from the AdWords algorithm. So, what are the options?

Results can be ranked according to their reviews, which generate some kind of rating. This is the best solution for businesses/restaurants/etc.: in the end, a restaurant must be good if thousands of people say that it is good. And this is the approach that most big players, like Yelp, choose to take. Note that in the base-case scenario (i.e. if the implementation is reasonably efficient), it is not that much different from Slashdot’s sophisticated algorithm for ranking news.

But not all listings/classifieds can be ranked like restaurants, bars or news stories, either because most people do not have the expertise to rate the posts (what do you think about your doctor?) or because not many people have an opinion about each post (how many apartments or doctors do you check each month?). How should a service presenting {doctors / lawyers / etc} or apartments in San Francisco rank the results?

The easy solution is by distance: you give your exact location and results are ordered by how far away the doctor’s office is. It is a nice visual representation, and one could think that it must be the clear winner, especially if the implementation uses a decent map representation (most sites I have seen don’t – they just use a super small Google map). I must say that I don’t like this solution at all, as I prefer to travel 2 more km in order to find a doctor that will not accidentally kill me (I live in a “not so reputable” area)… So, we can keep the map visual representation as an interesting component of a hybrid solution, but the problem of defining some kind of Quality Score is still present…
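For instance, a hybrid score could mix a rating-based Quality Score with distance, along these lines. This is a rough sketch with a made-up table, columns and weights, not what any real site does:

-- Hypothetical listings table: avg_rating comes from user reviews,
-- distance_km is assumed to be precomputed against the searcher's location.
SELECT listing_id,
       name,
       (0.7 * avg_rating) + (0.3 * (1 / (1 + distance_km))) AS hybrid_score
FROM listings
WHERE category = 'doctor'
ORDER BY hybrid_score DESC
LIMIT 20;

The weights (0.7/0.3) are the interesting part: they are exactly the kind of parameter that would need tuning, or learning from user behavior, rather than hard-coding.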

Another solution that old-fashioned media and most web listings/classifieds (like “the one to rule them all” – craigslist) offer is reverse chronological order. “ORDER BY date DESC” is the perfect solution for most classifieds (jobs, selling a kitty, searching for a brunette now, etc.), but how about doctors, plumbers or apartments? Is an apartment worse than the others if it was posted 2 weeks ago? I believe that this one can be used only for limited-time offers and marketplaces.

The solution? I don’t have a clear one right now, but in our new project saini.gr (still in alpha – nothing to see there + it is in Greek), which can be summarized as a Yelp for tutors, language teachers, etc., we are experimenting with some new forms of ranking algorithms that will not depend solely on human ratings/reviews or on spatial parameters. I hope that we’ll soon end up with an interesting, non-trivial algorithm that will be worth sharing 🙂

Below, I quote some interesting ideas from TeachStreet [5], one of the leading sites for finding tutors, classes, etc.:

  • Claim your profile — teachers who are registered with TeachStreet automatically rank higher!
  • Add more content to your Teacher profile and class pages
  • Include photos, location, time, and skill level for your classes — the more details you provide, the better the search filters work for you.
  • Tell us why you are an expert by adding an extra sentence on your expert list.
  • Write articles (including photos or videos) to show up on our Home Page or on the articles page for your subject.
  • Be active on the site — accept enrollments, update listings, respond to students quickly, and just be an all-around good TeachStreet neighbor — remember that it’s in our best interest to promote our most active teachers, because they’ll provide a great experience for our student visitors.

There are also actions that a Teacher can do on a regular basis to give their listings little, short-term boosts in results. These actions include:

  • Accept enrollments
  • Replying to messages
  • Creating and sending invoices
  • Adding/Updating sessions
  • Requesting reviews
  • Writing articles
  • Adding photos and other media

[1] That’s why it is so difficult for other companies to overthrow Google in this not so fair race.

[2] What is ‘Quality Score’ and how is it calculated?

[3] As we are talking about Google, have you ever tried to search for “Recursion” on Google? A suggestion is presented at the top of the results: “Did you mean: Recursion”. Click it and you can keep requesting the same results and getting the same suggestion over and over again. I love geeky engineer humor!

[4] Ad Rank explained!

[5] TeachStreet: http://www.teachstreet.com/
All About TeachStreet Search blog post
SEO & TeachStreet Search Rank

What is that?

November 26, 2009 Leave a comment

A very sweet short film from Constantin Pilavios:

http://www.cpil.info/2008/12/14/what-is-that-the-movie/

Categories: Life Tags:

Mobile Web 2.0: Opportunities and problems

November 8, 2009 Leave a comment

Today’s smartphones are promoted by their manufacturers as lifestyle tools that enable sharing experiences and social networking via Web 2.0 sites and mobile-friendly media portals. So, what is Mobile Web 2.0? Just like Web 2.0, it can be defined as the network as a platform, spanning all connected devices. Effectively, the definition of Mobile Web 2.0 boils down to three key verbs: ‘share’, ‘collaborate’ and ‘exploit’.

The emerging Mobile Web 2.0 opportunities can be summarized as:

  • (Context-aware) Advertising. In web environments, web sites have only your IP address with which to identify you, but on the mobile web your location, unique device ID or profile (don’t forget: most users own just one phone) are also always available.
  • Anywhere, anytime accessibility that will allow real-time social-networking and sharing of user-generated content (information, videos or photos).
  • Voice over IP and instant messaging. Why call your friends or send them a really expensive SMS (if you think about $$ per byte) when you can make a Skype conference call or send them a message through MSN, GTalk or any other platform available?
  • Off-portal services.
  • Location based services and context aware (mobile) search.

Numerous start-ups have entered the field, but even established social-networking companies are getting involved. “People want to take Facebook with them,” said Michael Sharon, a product manager with the company. “So we work with [device makers] to create applications for their phones.” As George Lawton writes in his very interesting article for the IEEE’s Computing Now Exclusive Content [1]:

eMarketer estimates that in the US alone, the number of mobile Internet users will rise from 59.5 million in 2008 to 134.3 million in 2013.

Juniper Research predicts revenue generated globally by Mobile Web 2.0 will grow from $5.5 billion in 2008 to $22.4 billion in 2013.

Social networking is the fastest growing type of mobile Web application, with US usage increasing 187 percent between July 2008 and July 2009, according to Elkin.

But there are also numerous challenges:

  • Bandwidth and device limitations
  • Platform compatibility
  • Viable business models

I believe that Mobile Web 2.0’s future will be in location-based services, supported by context-aware advertising.

What is missing? A (working) mobile platform a la Google AdWords/AdSense that will exploit not only keywords and user profiles, but also the rich context-aware information available about the users. I am talking about context-aware data management, presentation of information and (of course) advertising. What we need is a framework for modeling that kind of volatile information and the thousands of different views/viewpoints that we can use in order to present it. As this is a topic that needs some preparatory discussion, I’ll return in a following post to describe the state of the art in context-aware computing and my thoughts on context-aware advertising.

Categories: web Tags:

The Status of the P Versus NP Problem

October 17, 2009 Leave a comment

I just read a very insightful article on the status of the P versus NP problem, which was published in the Communications of the ACM:

http://cacm.acm.org/magazines/2009/9/38904-the-status-of-the-p-versus-np-problem/fulltext

The author not only explains the problem in detail and the implications of solving it in modern computer science, but also presents some of the research attempts to solve it during the past decades and a compact survey of what is happening right now in this field.

New Technorati algorithm

October 15, 2009 Leave a comment

If you are searching for new blogs to read, then Technorati is one of the best places to start. Its top 100 blogs page reflects more or less what is happening on the web, and the calculated rank is, in my opinion, quite accurate.

Moreover, during its October revamp, the site updated the algorithm for their main metric: Technorati Authority.

“…

  • Authority is calculated based on a site’s linking behavior, categorization and other associated data over a short, finite period of time. A site’s authority may rapidly rise and fall depending on what the blogosphere is discussing at the moment, and how often a site produces content being referenced by other sites.
  • The new Authority calculation differs from the past version, which measured linking behavior over a longer 6 month timeframe. Please note that links in blogrolls don’t count towards Authority, as they are not indicative of interest in relevant content; we stopped including blogroll links in August 2008.

…”

As Michael Arrington writes in his TechCrunch post:

“… Until today, …, the top list was fairly static. Now they are focusing much more on recent data within the last month and giving blogs an authority rank between 1 – 1,000. Scoring factors include posting frequency, context, linking behavior and “other inputs.” The result, says the company, is a lot more volatility in the lists as blogs surge up and down. …”

I think that this is one more (small) step in the direction of results that reflect the real-time and volatile nature of the web.

Categories: Tech, web Tags: , ,

Randomness in game design

October 14, 2009 Leave a comment

An interesting presentation on randomness in game design: “Randomness: Blight or Bane?”

http://playthisthing.com/randomness-blight-or-bane

Categories: Tech Tags: ,

Tech Podcasts

September 18, 2009 Leave a comment

According to the New Oxford American Dictionary:

a podcast is “a digital recording of a radio broadcast or similar program, made available on the Internet for downloading to a personal audio player,” but the generally accepted definition has expanded to include video as well as audio. Originally derived from a combination of “broadcasting” and “iPod ™,” the word was declared “word of the year” when it was added to the dictionary at the end of 2005.

Podcasts have evolved during recent years from amateur recordings made with cheap microphones to programs that can easily be compared to the best radio or TV shows. They cover a wide range of genres and topics and, in my opinion, are a great (even though a little bit geeky) way to spend your time during long commutes. Most people will never understand this line of thought, but why lose 2-3 hours per day doing nothing when you can listen to this week’s tech news or a presentation about a new framework?

I have been listening to podcasts for a couple of years now and I currently subscribe to about 10-15 (mostly tech-oriented) podcasts, so I think that I have a pretty good idea of which are the best tech podcasts out there. The following list is not by any means complete; it just contains (in random order) the podcasts that I follow every week and that I think are worth mentioning:

Categories: Tech Tags: ,

Don’t Copy That Floppy

September 12, 2009 Leave a comment

SIIA (The Software & Information Industry Association) released its new anti-piracy campaign [1], called “Don’t Copy That 2”.

You can find the video in youtube:
http://www.youtube.com/watch?v=hUCyvw4w_yk

I was laughing out loud for 10 minutes after I saw the clip. I am sure that whoever is responsible for this campaign either lives locked in a cage or doesn’t have kids / thinks everybody is an idiot. Even the best comedian would never come up with such a lousy ad (even when trying to make fun of SIIA). As Nick Summers writes [2]: “Rap, Klingons, and Jailhouse-Rape-by-Broomstick Aren’t the Best Way To Teach Kids About Piracy … ”

In their next campaign, they should hire the guys from “IT Crowd” or even use their anti-piracy spot [3],[4] (it is a must-see). I would say that “Don’t Copy That 2” is the funniest computer-related ad out there (with tons of money invested in it – not just a parody), but unfortunately its prequel from 1992, “Don’t Copy That Floppy”, would easily take the “Worst Advertisement” Emmy Award.

Categories: Tech History Tags: ,

Zend Framework Tutorials

September 2, 2009 Leave a comment

Zend Framework is a PHP framework that allows PHP developers to design their projects using the Model-View-Controller (MVC) architectural pattern. Together with the object-oriented features of PHP 5+, it allows PHP to stop being a “Personal Home Page” scripting tool and become a language in which you can really design big projects. I am not going to explain further why object-oriented programming, tiered/modular architectures and the separation of Model, View and Controller are critical for project design, development and maintenance, as thorough analyses can be found in any textbook. (I also believe that those concepts should be obvious to any senior software architect/engineer who plans to design and/or implement a medium/large project.)

Noteworthy Introductory Tutorials:

  • Zend framework’s quick start tutorial: This is a great starting tutorial in order to understand the fundamental concepts.
  • Akra’s tutorial: Somewhat more advanced than the quick start tutorial. It is a nice introduction to some more concepts, even though Akra’s approach does not always follow the MVC pattern.
  • Pádraic Brady’s tutorial and on-line book. I must note that his writing style is quite enjoyable and very easy to follow (for the geeks among us – exaggerating a little bit – he has a “Tanenbaum” style of writing):
    • Example Zend Framework Blog Application Tutorial: Even though it is written for Zend Framework version 1.5 (which has many differences with respect to version 1.9.1), it is quite useful as it introduces some new techniques, presents the basic concepts in depth (a good supplement to the first two tutorials) and adds authentication and authorization examples, modules, plugins and lots of other features. Just remember that he is using an older version of the framework, so I advise you to use Zend_Application and config files in order to set up your environment (paths, views, controllers, layouts) instead of the methodology presented in part 3 and some of part 4 (you can use my simple template app if you want a different approach).
      Edit: For some reason, I could not find Parts 9 and 10 of the tutorial in Brady’s blog. You can find a version of part 9 [here].
    • Zend framework – Surviving the deep end: This is a work in progress that can serve as a thorough introduction to the framework. It is partially based on the blog application tutorial, but covers more aspects, as the author has more space than a blog post in which to discuss design decisions, etc. I believe that there are many more advanced features to be added, but even now one can find some interesting chapters, like the ones about performance optimisation and unit tests.
  • A small tutorial on how to modularize the guest book application that was presented in the quick start tutorial: [1] [2]. I shouldn’t list it together with the general tutorials that I mention in this post (and that’s why I haven’t linked to many other, very helpful, specific tutorials), but I think that adding modules to an application is a very important concept that, even though it is trivial, is not adequately explained in the online documentation of Zend Framework.

Video Tutorials (VidCasts):

  • I liked the video tutorials in the Zend framework’s website. It’s a great 20-30 min way to understand the basic concepts of Zend framework before diving into the tutorials.
  • Even better, Zendcasts has more than 40 in-depth video tutorials. Just skip the first 5-6 tutorials, which are covered by the official video tutorials. Also, the first vidcasts on the Zendcasts site use the older 1.7 version (a different approach is used for the bootstrap, etc.) and would be a wrong starting point for someone working with the newer versions of the framework.

Other than the few available tutorials, the best resources for all the available features are the extended Programmer’s Reference Guide, which can be found on Zend Framework’s web site, and Zend’s Dev Zone. In order not to get lost trying to figure out where to begin (as the chapters are alphabetically ordered according to the names of the different components), you must have some basic understanding of the framework and how it works, so I would propose first reading a couple of the above-mentioned tutorials.
In my opinion, a nice set of starting chapters are:

Also, if you are going to build a lot of Zend apps, learn how to use the Zend command line tool and always create your projects and add controllers and actions by using it. If you are a Windows user and have installed Zend Framework 1.9 or 1.9.1 (the latest release at the time of writing this post), know that the Zend tool does not work properly in Windows. The solution is to download version “1.8.4 patch 1” from the Zend archives and use its command line tool instead. You will save a lot of time and frustration 😉

Finally, after building a couple of projects, I realized that there is always a “startup repetition” phase where I must update the bootstrap.php, create a layout, add a dummy CSS, etc., so I created a simple template app, which I use when I want to do fast 2-minute tests. You can download it from here if you want (you must use Zend Framework 1.8 and above)!

Wikipedia

According to an article that was posted yesterday in the NY Times blog [1], Wikipedia is the fifth most popular web site, with more than 330 million visitors per month. That’s a really impressive accomplishment for a community-edited encyclopedia.

I am really happy that there are so many people who will take the time to check an article in an encyclopedia, even if we are talking about articles without editorial supervision from an expert [*]. Most of those visitors would never have opened an encyclopedia in the “pre-internet” era, or even spent the money to own a good one. I believe that such practices are beneficial to the community as a whole, and Wikipedia is a great example of how the internet is helping knowledge spread faster.

But what happens when the habit of relying solely on Wikipedia articles for accumulating knowledge on a specific subject becomes the norm? During the last few years, I have been seeing more and more students at my university and people that I work with in the industry referencing Wikipedia as their sole source for writing a report or even making decisions. So, as I was reading the NY Times article, I remembered a very interesting article that I had read in Communications of the ACM, entitled “Why you can’t cite Wikipedia in my class”:

The online encyclopedia’s method of adding information risks conflating facts with popular opinion….

[*] In the case of popular articles, the editing process will reach an equilibrium after some time, resulting in an article which is sufficiently close to what most people would want to know about the subject.

Edit 1: Some interesting articles on Wikipedia statistics and a different view of the problem with the editorial process [2], [3], [4].

… The often mentioned reasons for decay are:

  1. the rules and guidelines for contributing become more and more elaborated, the learning curve worsens, occasional contributors are scared off, recruitment slows down
  2. contributions of newcomers and outsiders are often reverted by article’s “shepherds” or just random passers-by, thus contributing to despair; arguing with anonymouses on the Internet is generally a miserable activity busy people try to avoid
  3. all trivial topics are already covered; due to the “no original research” policy, significant coverage in the mainstream press is necessary to prove a contribution’s notability; thus, much less topics might be covered in the future …
Categories: web Tags:

Fetching random rows from a database table

August 24, 2009 1 comment

Let’s assume that you have the following problem: there is a relation (table) in your database from which you want to fetch a random row.

Why care? It could be the product table, and you want to present random suggestions to your users. I can think of numerous other examples where fetching a random row is needed; maybe you want to fetch a random post from the blog_posts table, feature a random user, etc.

If you search the web for a solution, most people (e.g. [1], [2]) suggest a variant of the following code:

SELECT column FROM table
ORDER BY RAND()
LIMIT 1

My experience dictates that this solution cannot be fast. If you have a table with just 500 tuples, whichever solution you choose (even the worst option of all: bringing everything into memory and then choosing a random tuple) would not make a big difference in your running times. But what happens when you have really large data sets? That’s when the problems begin. In fact, I am not the only one who believes that “ORDER BY RAND()” is the wrong approach to this problem ([3], [4], [5]).

I could just use the alternatives that are presented in the posts just mentioned, but I really like experimenting with this kind of problem, checking my own results and being 100% sure (you can never be sure with the numerous different implementations of DB engines out there). Also, in my case, I want to check generic solutions that (1) don’t assume that the selection of the random tuples is based on the primary key and (2) can use attributes that may have holes in their domain (i.e. not all values from 1 to max(ID) are in the table).

I’ll use a MySQL server, version 5.1 (the database most people outside big corporations use nowadays), and create a sufficiently large data set for my experiments. I define “sufficiently large” as a table that will allow me to check the differences between the possible queries, so a test table with 10,000,000 tuples and some dummy columns to increase the tuple size will be enough. Also, I’ll use a modified version of the code that was posted [here], as it is better presented than my initial tests.

This is a good example of why I do not trust results that can be found on the web: even though the solution proposed by Peter is really fast, it does not by any means access only one tuple! Running the code given by Peter (and not my modified version), one can easily see that it accesses a lot more than one tuple.

Let’s build a test table:

DELIMITER $$

DROP PROCEDURE IF EXISTS `testp` $$
CREATE PROCEDURE `testp`( IN size INT )
BEGIN
  DECLARE iterator_v INT DEFAULT 0;

  DROP TABLE IF EXISTS test;
  CREATE TABLE test (
    id INT NOT NULL AUTO_INCREMENT,
    a  INT NOT NULL,
    b  INT NOT NULL,
    c  INT NOT NULL,

    PRIMARY KEY (id)
  ) ENGINE=MyISAM;

  -- Insert `size` rows; a, b and c get random values in [0, 1,000,000,000].
  REPEAT
    INSERT INTO test
    VALUES ( 0, ROUND(1000000000*RAND(),0), ROUND(1000000000*RAND(),0), ROUND(1000000000*RAND(),0));
    SET iterator_v = iterator_v + 1;
  UNTIL iterator_v = size
  END REPEAT;

END $$

DELIMITER ;

Table test is just a dummy table with an ID and three additional columns that make the row size larger than 4 bytes. I also have an InnoDB version of the script, but I would never suggest inserting 10,000,000 tuples into an InnoDB table just for testing (just create a MyISAM table and then convert it to InnoDB, or turn off indexes, logging, etc. – but this is too much work for a simple test). Either way, I can assure you that the results are similar for both engines.

Attribute a is filled with random values from 0 to 1,000,000,000 and does not have a unique constraint, so I am going to use ‘a’ in my examples.

Let’s fill table test with 10 million random rows:

CALL testp(10000000);

The resulting table is 162MB in size and has a 98MB index on id, so I think that it should be sufficient for my experiments.

Fetching a single random row

Let’s see which one of the following queries is better for fetching a random row:

1. Q1 =

SELECT * FROM test ORDER BY RAND() LIMIT 1;

On my system, it needs ~7.5 seconds to return the result.

Why? If you execute “EXPLAIN Q1”, you’ll see that all 10,000,000 rows of the table are accessed in order to add a temporary RAND() column and then sort them by this column.

2.  Q2 =

SELECT * FROM test
WHERE a >= (SELECT FLOOR( MAX(a) * RAND()) FROM test )
ORDER BY a LIMIT 1;

This is a query that I have seen somewhere; that’s why I am presenting it here.

I didn’t even think for a moment about running it: if you execute “EXPLAIN Q2”, you’ll see that 10,000,000 * 10,000,000 rows must be accessed by the DB engine.

3. Q3a =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(a) FROM test )) AS a
) AS r2 ON r1.a >= r2.a
ORDER BY r1.a
LIMIT 1

On my system, it needs ~2.2 seconds to return the result. This should be much faster, but here I am paying for my decision to use an attribute other than the primary key! If the ID were not defined, or if attribute ‘a’ were the primary key, then I could send the following query to the DB:

Q3b =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(a) FROM test )) AS a
) AS r2 ON r1.a >= r2.a
LIMIT 1

In my modified DB (without ID in table test), it needs only ~0.83 seconds to return the result.

And what if I use the primary key (ID) in the original DB?

Q3c =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM test )) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 1

The result is always returned in less than 0.001 seconds (actually, 0.0003 seconds, but don’t trust any metric below 0.001 seconds)!

So why is query Q3c so much faster than Q3b? The index on ID is used instead of a full table scan 😉

Lesson learned:

(a) Just use the primary key for the selection! Using another attribute offers nothing when trying to fetch a single random row, while it significantly slows down the query…

(b) Repeat after me: Indexes are the second best thing ever implemented in DB systems. Don’t forget to use them!
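If you do want to keep selecting on a non-key attribute like ‘a’, the same lesson suggests indexing it first. A quick sketch (I have not re-run the benchmark with this index, so treat the expected speed-up as an assumption):

-- Add a secondary index on a, so that Q3a/Q3b can use an index range scan
-- on a instead of a full table scan (expected, not measured here).
ALTER TABLE test ADD INDEX idx_a (a);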

4. Splitting Q3a and Q3b into two queries inside a stored procedure, as also proposed in [3]:
(There is no reason to split Q3c; it is so fast that the procedure-call overhead would only slow it down.)

DELIMITER $$

DROP PROCEDURE IF EXISTS `test`.`fetchRandom` $$
CREATE PROCEDURE `test`.`fetchRandom` ()
BEGIN
  DECLARE maxId INT DEFAULT 0;
  DECLARE randID INT DEFAULT 0;

  -- Step 1: find the current maximum of attribute a.
  SELECT MAX(`a`)
  INTO maxId
  FROM `test`;

  -- Step 2: pick a random value in [1, maxId] ...
  SET randID = FLOOR(1 + (RAND() * (maxId - 1)));

  -- ... and return the first row at or above it.
  SELECT *
  FROM `test`
  WHERE `a` >= randID
  ORDER BY `a` ASC
  LIMIT 1;

END $$

DELIMITER ;

Q4 = CALL fetchRandom();

The above call results in (more or less) the same execution time as the two variants of Q3 that it splits. That’s why I believe that there is no reason to go to the trouble of building the stored procedure.

Fetching multiple random rows

So, is my approach in Q3a (using an attribute other than the primary key) that bad? Not if you are interested in fetching multiple random rows without consecutive IDs!

If you are using the ID, then extending the fastest query Q3c to “Limit 5”:

Q5a =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM test )) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 5

will result not in 5 completely random tuples, but in 5 consecutive tuples starting from a random ID. For example, a random call returned:

id        a          b          c
3737792   417407769  478633193  140942687
3737793   268813889  921241413  799765428
3737794   235102934  776218416  175783308
3737795   550261325  223956729  468999704
3737796   673128362  958642727  773828580

In contrast, extending query Q3a:

Q5b =

SELECT *
FROM test r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(a) FROM test )) AS a
) AS r2 ON r1.a >= r2.a
ORDER BY r1.a
LIMIT 5

will result in 5 completely random tuples (with respect to their ID) with an execution time of ~2.3 seconds. The stored procedure in Q4 can also be altered, but will result in more or less the same execution time as Q5b.

Finally, even though from an academic point of view I like the approach presented in [5], I don’t really think that it is a realistic solution: the overhead of the two triggers presented would slow down update operations on a really large data set. In contrast, the solution presented in Q5b could be implemented in any relational schema with minimal overhead:

a) Extend the table with an additional attribute RAND_Attr.

b) Assign a random value expected_cardinality_of_table*RAND() to attribute RAND_Attr on every insert (you should not worry about duplicate values, as Q5b handles them).
This step could be implemented in the corresponding stored procedure that is used by the model in order to hide the database schema (always a wise approach) or through a ‘light’ trigger (more expensive, but better than the option of a hard-coded assignment in the application code).

c) Use query Q5b with ‘RAND_Attr’ instead of ‘a’ in order to fetch multiple random rows.
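As an illustration of the ‘light’ trigger option in step (b), here is a minimal sketch. It assumes the test table above has been extended with a RAND_Attr column and that the expected cardinality is 10,000,000; both are assumptions for illustration:

DELIMITER $$

CREATE TRIGGER test_rand_attr_bi
BEFORE INSERT ON test
FOR EACH ROW
BEGIN
  -- Duplicate values are fine: Q5b uses >= and ORDER BY, so it handles them.
  SET NEW.RAND_Attr = FLOOR(10000000 * RAND());
END $$

DELIMITER ;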

So, in order to summarize, to the best of my knowledge, the best approaches to fetching random rows from a table TABLE, with primary key (and index on) ID and an attribute RAND_Attr with random values, are:

A. Fetching a single random row:

SELECT *
FROM TABLE r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(ID) FROM TABLE)) AS ID
) AS r2 ON r1.ID >= r2.ID
LIMIT 1

B. Fetching multiple random rows:

SELECT *
FROM TABLE r1
JOIN (
SELECT CEIL(RAND() * ( SELECT MAX(RAND_Attr) FROM TABLE )) AS RAND_Attr
) AS r2 ON r1.RAND_Attr >= r2.RAND_Attr
ORDER BY r1.RAND_Attr
LIMIT N

Or even better, as you know the maximum value MAX_RANDA_VAL that attribute RAND_Attr can have:

SELECT *
FROM TABLE r1
JOIN (
SELECT CEIL(RAND() *  MAX_RANDA_VAL ) AS RAND_Attr
) AS r2 ON r1.RAND_Attr >= r2.RAND_Attr
ORDER BY r1.RAND_Attr
LIMIT N

This second query is virtually equivalent to the first for two reasons: (a) RAND() is used both in the assignment of values and in the query itself, and (b) a good RAND() function is random (sic), so after a couple of thousand tuples have been inserted, the real maximum value of RAND_Attr will be close enough to MAX_RANDA_VAL that no real difference can be noticed between the two queries.

PS. Note that in MySQL 5.1.1 there is a bug and you cannot use variables in LIMIT clauses. A workaround for extending Q4 follows:

CREATE PROCEDURE `test`.`fetchRandom` (IN _LIMIT INT)
BEGIN

-- LIMIT does not accept a procedure variable here, so the statement is
-- prepared as a string and the limit is bound as a parameter instead.
PREPARE STMT FROM
"SELECT ID
INTO … your var here …
FROM `test`
WHERE `a` >= @randID
ORDER BY `a` ASC
LIMIT ?";
SET @LIMIT = _LIMIT;
EXECUTE STMT USING @LIMIT;

….

Categories: Databases, MySQL Tags: ,

On the meaning of Lorem ipsum

August 20, 2009 1 comment

Where have I seen this phrase before? I am sure that I remember it from somewhere…

Of course, now I remember! It is part of all that Latin “gibberish” that can be found in any web design template! You can even search Google for “lorem ipsum” and find all those (thousands of) sites that were uploaded in a hurry, where the creators forgot to replace everything in the template 🙂

But what does it mean? Why is it so widely used by every designer? Is it some secret inside joke?

I must admit that I had never taken the time to think thoroughly about those questions. That is, until today, when I accidentally found a web site dedicated to telling the story of “lorem ipsum” (and making some money while doing so, but who am I to judge?):

“… Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book … Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC …

Quite an interesting set of facts, and a nice story to tell at (geek) meals…

Categories: Web Design Tags: ,