Posts Tagged ‘Scaling’

Reasons to use a no-sql database like mongoDB

Saturday, December 26th, 2009

In this article I’ll talk about reasons to use mongoDB and other no-sql databases.

Relational databases are used in many web apps. But when the time comes to scale the app to a few million users, you have to make choices about your architecture.

For instance, a common practice to handle high load is to put one or more of the most used tables on separate servers. To use this technique, a developer has to resolve the issue of join queries. How do you query tables that aren’t in the same location and still get high performance?

One solution is to denormalize data, duplicating content in every table whose queries need to know about it. Think of 2 tables: contacts and users. The tables live on 2 different servers, and your app has a feature that shows all contacts of a specific user. You need a join query, but you can’t use one. So you duplicate some of the fields from the contacts table right inside your users table. For instance, you create a field called contact_names, in which you put only the names of all contacts of that user, separated by commas. It’s an easy way to solve the problem, but it comes at a cost: you have to keep the contacts in sync in every table that knows something about contacts.

Bottom line? You started developing your app with join queries, but at some point you had to give up on them.
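A minimal sketch of that syncing cost, assuming two plain Python dicts stand in for the two servers. The sample rows and the `rename_contact` helper are made up for illustration; only the `contact_names` field comes from the example above.

```python
# Two dicts stand in for tables living on two different servers.
contacts_server = {
    1: {"user_id": 10, "name": "Ana"},
    2: {"user_id": 10, "name": "Bruno"},
}
users_server = {
    10: {"login": "jdoe", "contact_names": "Ana,Bruno"},  # denormalized copy
}


def rename_contact(contact_id, new_name):
    """Rename a contact and refresh the duplicated names on the users side."""
    contact = contacts_server[contact_id]
    contact["name"] = new_name

    # The extra cost of denormalizing: every write must also update the copy.
    user_id = contact["user_id"]
    names = [
        row["name"]
        for _, row in sorted(contacts_server.items())
        if row["user_id"] == user_id
    ]
    users_server[user_id]["contact_names"] = ",".join(names)


rename_contact(2, "Bruna")
print(users_server[10]["contact_names"])  # Ana,Bruna
```

If any code path updates a contact without going through a helper like this one, the copy in the users table silently goes stale.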

So, if using a traditional database forces you to stop using some of its features somewhere down the road, why not start with a kind of database that avoids the things that are not scalable and sustainable in the long run?

In mongoDB, a solution for this problem would be to create a Contacts document and embed it inside the Users document. Each user then has its contacts right there, inside the User record. No need for join queries.
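Here is roughly what such a document could look like, written as a plain Python dict so it runs without a database. The field names (`login`, `contacts`, `email`) and the values are hypothetical.

```python
# A user document with its contacts embedded, written as a plain Python dict.
user = {
    "_id": 10,
    "login": "jdoe",
    "contacts": [
        {"name": "Ana", "email": "ana@example.com"},
        {"name": "Bruno", "email": "bruno@example.com"},
    ],
}

# One lookup of the user already returns every contact: no join required.
contact_names = [contact["name"] for contact in user["contacts"]]
print(contact_names)  # ['Ana', 'Bruno']
```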

However, there are times when you need to have a model that is connected to several others.

For example, let’s say you need to relate Contacts to several models such as Clients, Suppliers and Employees. So you create 4 collections: Clients, Suppliers, Employees and Contacts, and you connect them via a db reference, which acts like a foreign key. But this is not the mongoDB way of doing things, and performance will be penalized.
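To see where that penalty can come from, here is a small sketch that counts lookups when contacts are stored by reference. Plain dicts stand in for collections, and `find_one` is a hypothetical helper written for this example, not a driver call.

```python
# Plain dicts stand in for collections; each key is a document id.
contacts = {
    1: {"name": "Ana"},
    2: {"name": "Bruno"},
}
clients = {
    100: {"company": "Acme", "contact_ids": [1, 2]},
}

stats = {"lookups": 0}


def find_one(collection, doc_id):
    """Fetch one document by id, counting each fetch as one round trip."""
    stats["lookups"] += 1
    return collection[doc_id]


client = find_one(clients, 100)
names = [find_one(contacts, cid)["name"] for cid in client["contact_ids"]]

print(names)             # ['Ana', 'Bruno']
print(stats["lookups"])  # 3
```

One fetch for the client plus one per referenced contact: with embedding, the same data would have come back in a single fetch.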

So the general question should always be “Why can’t I embed this document?”. Or even better: “Does this object merit its own collection, or rather should it embed in objects in other collections?”.

There are some general rules on when to embed and when to reference (taken from the mongoDB website):

  • “First class” objects, that are at top level, typically have their own collection;
  • Line item detail objects typically are embedded;
  • Objects which follow an object modelling “contains” relationship should generally be embedded;
  • Many to many relationships are generally by reference;
  • Collections with only a few objects may safely exist as separate collections, as the whole collection is quickly cached in application server memory;
  • Embedded objects are harder to reference than “top level” objects in collections, as you cannot have a DBRef to an embedded object (at least not yet);
  • It is more difficult to get a system-level view for embedded objects. For example, it would be easier to query the top 100 scores across all students if Scores were not embedded;
  • If the amount of data to embed is huge (many megabytes), you may reach the limit on size of a single object;
  • If performance is an issue, embed;
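The “top scores across all students” rule is easier to picture with a small sketch. Plain Python lists stand in for collections, and all names and numbers are made up for illustration.

```python
import heapq

# Embedded layout: scores live inside each student document.
students = [
    {"name": "Ana", "scores": [91, 78]},
    {"name": "Bruno", "scores": [85, 99]},
]

# A system-level view needs every student document to be unpacked first.
all_scores = [
    (score, student["name"])
    for student in students
    for score in student["scores"]
]
top_embedded = heapq.nlargest(3, all_scores)

# Top-level layout: each score is its own document, so the same question
# is a single sort over one collection.
scores = [
    {"student": "Ana", "value": 91},
    {"student": "Ana", "value": 78},
    {"student": "Bruno", "value": 85},
    {"student": "Bruno", "value": 99},
]
top_separate = sorted(scores, key=lambda s: s["value"], reverse=True)[:3]

print(top_embedded)
# [(99, 'Bruno'), (91, 'Ana'), (85, 'Bruno')]
print([(s["value"], s["student"]) for s in top_separate])
# [(99, 'Bruno'), (91, 'Ana'), (85, 'Bruno')]
```

Both layouts give the same answer; the difference is that the embedded one has to walk through every parent document to get there.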

The way I see it, you can still have more or less the best of both worlds: the flexibility of documents and the performance of embedded documents. And you still have a way to emulate foreign keys, like a relational database – but not without a penalty on performance. I don’t know how mongoDB and MySQL compare to each other in the long run, for the usual web app. It’d be cool if someone did some benchmarks on this subject.

Read more about mongoDB database schema design.

Hidden Costs of Scaling Up vs. Scaling Out

Thursday, June 25th, 2009

The well-known dilemma: scale vertically (buy a bigger machine) or scale horizontally (add more machines)?

Here’s an interesting point of view from an article at Coding Horror:

It’s fair to conclude that scaling out is only frictionless when you use open source software. Otherwise, you’re in a bit of a conundrum: scaling up means paying less for licenses and a lot more for hardware, while scaling out means paying less for the hardware, and a whole lot more for licenses.

All in all I still prefer scaling out, so you don’t have a single point of failure.

Flickr engineers do it offline

Monday, June 15th, 2009


Here’s a very interesting post on how Flickr engineers handle the scaling challenges of this well-known photo website: using queues to run slow things offline.

Here’s an excerpt:

“It seems that using queuing systems in web apps is the new hottness. While the basic idea itself certainly isn’t new, its application to modern, large, scalable sites seems to be. At the very least, it’s something that deserves talking about — so here’s how Flickr does it, to the tune of 11 million tasks a day.”

Check the full article out.
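The basic idea of running slow things offline can be sketched with Python’s standard library. This is only an in-process illustration, not how Flickr actually does it: the `handle_upload` function and the “resize” task are hypothetical, and a real site would use a separate queue server.

```python
import queue
import threading

tasks = queue.Queue()
results = []


def worker():
    """Pull tasks off the queue and run them outside the request path."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: no more work
            tasks.task_done()
            break
        photo_id = task
        results.append(f"resized photo {photo_id}")  # stand-in for slow work
        tasks.task_done()


def handle_upload(photo_id):
    """The 'web request': enqueue the slow part and answer right away."""
    tasks.put(photo_id)
    return "upload accepted"


thread = threading.Thread(target=worker)
thread.start()

print(handle_upload(1))  # upload accepted
print(handle_upload(2))  # upload accepted

tasks.put(None)  # tell the worker to stop
tasks.join()     # wait until everything queued has been processed
thread.join()
print(results)   # ['resized photo 1', 'resized photo 2']
```

The request handler returns immediately, and the slow work happens whenever the worker gets to it.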

Boo Box web servers layout and application scaling tips

Saturday, May 30th, 2009

[Image: Boo Box web server infrastructure diagram]

Boo Box, the ad network, released a layout of their web servers’ infrastructure. It seems the beast is growing fast and this diagram shows how they’re coping with the challenge.

Here are some things I found very interesting:

  • Separate servers for reading and writing (MySQL). This way you can optimize each server for a specific purpose (read or write), since reads and writes no longer compete with each other for disk or memory;
  • Serving static files from a different domain to speed things up is well-known, but serving them right from RAM is new to me. However, some people disagree with caching files in memory beyond what the OS already does. The other good thing is that Nginx is a super fast web server, and it’s replacing Apache in many scenarios;
  • The use of a queue server for handling time-consuming tasks is paramount for horizontal scaling. Everything that takes more than a few milliseconds (or does some sort of processing) should be run asynchronously.

It’s very nice of them to share this layout. Thanks guys!
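The read/write split from the first point can be sketched as a tiny router. This is an assumption about how an app might pick a server, not Boo Box’s actual setup: the `ReadWriteRouter` class and the server names are hypothetical.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary and spread reads across replicas."""

    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE"}

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def server_for(self, sql):
        """Pick a server name based on the first keyword of the statement."""
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.primary
        return next(self._replicas)


router = ReadWriteRouter("db-write", ["db-read-1", "db-read-2"])
print(router.server_for("INSERT INTO ads (title) VALUES ('x')"))  # db-write
print(router.server_for("SELECT * FROM ads"))                     # db-read-1
print(router.server_for("SELECT * FROM ads WHERE id = 1"))        # db-read-2
```

A real setup also has to deal with replication lag: a read that immediately follows a write may not see it yet on a replica.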

Scaling Rails

Friday, February 6th, 2009

Gregg Pollack and New Relic just launched a Screencast series on Scaling Rails. And, it’s free! :)

“Scaling Rails Screencast Series” produced by Gregg Pollack and supported by New Relic.

“Learn everything you need to know about Scaling your Rails app through 13 informative Screencasts produced by Gregg Pollack with the support of New Relic. In the next few weeks we’re going to bring you 13 educational videos, teaching you just about everything you need to know to create a Rails application that can scale.”

Topics covered:

Scaling microblogging services such as Twitter

Friday, December 19th, 2008

There’s a joke in the Ruby and Rails world that says ‘Rails doesn’t scale’. What it really means is this: what scales is the way you build your app, and that does not depend on whether it’s written in Rails, Perl or even Bash.

So it’s important to know what matters in terms of building a scalable web app. As you know:

Scalability: ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged.

And how do you accomplish that in Ruby/Rails/whatever app you’re writing?

Well, one way to gain further knowledge on the subject is by reading this article. The author explains what is involved when you need to scale a microblogging service such as Twitter. He created Nouncer (a developer platform for building microblogs and similar services) and ended up learning quite a lot about the inner workings and challenges of such an application.

Please do read it. It’s a series of 3 articles:

My thoughts? Scaling is not a solved subject. It depends on the developer’s or architect’s expertise in identifying the fine-grained causes of bottlenecks and dealing with them creatively, in an endless cycle of monitoring and improvement.

Fun, isn’t it? :)