21 Jul 2010
MySpace is facing significant competition from rival social networking sites Facebook and Twitter.
Computing talked to MySpace's chief data architect Don Watters about the drive to position the company's data for the future.
Computing: Can MySpace survive the competition from Facebook?
Don Watters: Everybody knows we got wiped out by Facebook – we're basically done and we'll probably die off in about two weeks. They've been saying that for two years – but we're still here and we're still a very large site.
What's the total amount of data MySpace has to deal with?
We have a petabyte [one million gigabytes] of data stored on our hard drives.
We’re a Microsoft SQL Server database shop in general. We use it for our online transaction processing (OLTP) environment, running on clustered commodity HP servers.
For our data warehouse and advanced analytics, we use Aster Data, running on clustered commodity Dell servers.
Overall we have 310 clustered servers, with 226 of these in the main data warehouse.
The largest cluster is a 120-node cluster, for processing the 190TB [terabyte – 1,000 gigabytes] of main data. There are 90 nodes in the staging area dealing with 130TB, and 16 nodes used to handle about 10TB in our reporting environment.
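As a quick cross-check, the figures quoted above can be tallied in a few lines. This is only a sketch: the node and terabyte counts come from the interview, while the per-node average is a rough illustration that ignores replication, compression and free space.

```python
# Figures as quoted in the interview: (nodes, terabytes) per warehouse tier.
clusters = {
    "main": (120, 190),
    "staging": (90, 130),
    "reporting": (16, 10),
}

# Totals across the three tiers of the data warehouse.
total_nodes = sum(nodes for nodes, _ in clusters.values())
total_tb = sum(tb for _, tb in clusters.values())

# Rough average data per node (an illustration only; real placement
# depends on replication and compression, which the article does not give).
tb_per_node = {name: tb / nodes for name, (nodes, tb) in clusters.items()}

print(total_nodes, total_tb)
for name, value in tb_per_node.items():
    print(f"{name}: {value:.2f} TB per node")
```

The node total matches the 226 warehouse servers mentioned by Watters.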
How much data goes into the MySpace data warehouse daily?
We see about 2-3TB going into the data warehouse environment per day, but as soon as data is brought in, some is kicked out.
We keep a 35-day moving window of data allowing us to perform data analytics processing.
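A minimal sketch of how such a moving window behaves, assuming a steady daily load and one batch per day. The class and function names here are hypothetical and are not taken from MySpace's systems.

```python
from collections import deque

WINDOW_DAYS = 35  # window length quoted in the interview


def steady_state_tb(daily_tb, window_days=WINDOW_DAYS):
    """Data held once the window is full, assuming a constant daily load."""
    return daily_tb * window_days


class MovingWindow:
    """Keeps the most recent `window_days` daily loads and drops older ones."""

    def __init__(self, window_days=WINDOW_DAYS):
        self.days = deque(maxlen=window_days)

    def load_day(self, tb):
        """Add one day's load; return the volume that falls out of the window."""
        full = len(self.days) == self.days.maxlen
        evicted = self.days[0] if full else 0.0
        self.days.append(tb)  # deque discards the oldest entry when full
        return evicted

    def total_tb(self):
        return sum(self.days)


# At 2-3TB a day, a full 35-day window holds roughly this range of raw data.
print(steady_state_tb(2), steady_state_tb(3))
```

Under these assumptions the window would hold somewhere between 70TB and 105TB of raw input, before any compression or derived tables.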
Who did you choose for the analytics platform?
We started the initial discussion with Aster Data in 2007, and we used Dell as our hardware partner. In April 2008, we rolled out our first prototype, which took eight weeks.
In November 2008 we launched into full production using Aster’s nCluster data analytics system, which runs on Linux and is built on top of a Postgres database.
Aster’s data analytics server is a massively parallel database with an integrated analytics engine, and we launched with many more components in place than originally planned.
What does MySpace use Aster's analytics platform for?
Aster nCluster allows us to traverse social graphs embedded in our data, showing what users are doing in relation to their friends and also their friends' friends.
For example, think about a social graph made up of friends. Recursively traversing those graphs embedded in our data is not something most relational database management systems can do.
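To make the idea of traversing a friend graph concrete, here is a small breadth-first sketch in Python. It is not how nCluster does it; the graph, the user names and the function `within_hops` are all made up for illustration.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {user: hop_count} for everyone reachable within max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if seen[user] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(user, ()):
            if friend not in seen:
                seen[friend] = seen[user] + 1
                queue.append(friend)
    del seen[start]  # the starting user is not their own friend
    return seen


# Hypothetical friend lists keyed by user.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}

# Friends (1 hop) and friends of friends (2 hops) of "ann".
print(within_hops(friends, "ann", 2))
```

In plain SQL, each extra hop typically means another self-join on the friends table, which is part of why this kind of query gets awkward at scale.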
Aster gives us a faster turnaround, which is a really big deal for us, because we need information about all the entry and exit points on our sites, which we can then combine with data on where users are actually coming from.
We write our own SQL-MapReduce functions, and our people point them at the MySpace data cluster to analyse problems.
[SQL-MapReduce is a framework for writing queries enabling parallel computation of those queries across hundreds of servers which work together as a single relational database]
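The map and reduce steps can be illustrated with a single-process Python sketch. This is not Aster's SQL-MapReduce API; the record layout (`entry_page`) and the function names are assumptions chosen to echo the entry-point analysis described above, and the partitions stand in for data spread across servers.

```python
from collections import defaultdict


def map_phase(partition):
    """Emit (entry_page, 1) for every visit record in one partition."""
    for record in partition:
        yield record["entry_page"], 1


def reduce_phase(key, values):
    """Combine all counts emitted for one entry page."""
    return key, sum(values)


def map_reduce(partitions):
    """Run map on each partition, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for partition in partitions:  # on a cluster, each node would do this in parallel
        for key, value in map_phase(partition):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical visit records split across two partitions.
partitions = [
    [{"entry_page": "home"}, {"entry_page": "music"}],
    [{"entry_page": "home"}],
]

entry_counts = map_reduce(partitions)
print(entry_counts)
```

The point of the pattern is that the map step needs no coordination between partitions, so adding servers adds throughput.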
How important will the analytics platform be to the phased re-launch of MySpace scheduled to take place soon?
Most of the changes we're contemplating are front-facing ones concerned with user navigation and interaction.
The data warehouse has already provided incredible value, in terms of making decisions about what needs to change, and how the changes may affect our users.
In addition to that, the warehouse will provide functionality for the website that will be new and different for our users, but I can't say more about that just yet.
With smartphone usage rocketing, how does MySpace optimise mobile device access to its web site?
We have made considerable efforts to improve mobile access to MySpace. We have apps for Apple and Android products, and each device gets its own optimisation.
What type of network infrastructure does MySpace use to service the database and compute engines?
The network linkages between the racks in the couple of datacentres we have are 10Gb Ethernet [not Fibre Channel].
Communication between the rack-mounted nodes runs at 2Gbit/s in and out, and 4Gbit/s across the system itself [called the backplane].
The more difficult part for us was determining the network infrastructure we'd require. [We use] commodity network hardware. [This means we can] scale up the network at the same pace that our compute cluster scales.
With data warehousing vendors such as Teradata and Netezza, all the networking is built intrinsically into the platform, and I think we pay a lot less for our networking [because we buy commodity hardware].