With over 400 million check-ins in the last year, it’s safe to say that our servers log a lot of data. We use that data for all sorts of interesting analysis, from finding the most popular local bars in any city, to recommending people you might know, and even drawing pretty pictures. However, until recently, our data was only stored in production databases and log files. Most of the time this was fine, but whenever someone non-technical wanted to do some data exploration, it required knowing Scala and being able to query against production databases.
This has become a larger problem as of late, as many of our business development managers, venue specialists, and upper management eggheads need access to the data in order to inform some important decisions. For example, which venues are fakes or duplicates (so we can delete them), what areas of the country are drawn to which kinds of venues (so we can help them promote themselves), and what are the demographics of our users in Belgium (so we can surface useful information)?
In short, without easy access to this data, we are not able to make smart decisions at any level of the company.
Thus we needed two things:
- A set of high-level scheduled reports to inform general business decisions.
- A way for anyone in the company to do data exploration without hurting our production systems or learning about Scala, sbt, SSH, and Mongo.
The Solution
We decided to use Apache Hadoop and Apache Hive in combination with a custom data server (built in Ruby), all running on Amazon EC2.
For those who don’t know, Hadoop is an open-source MapReduce framework for parallel data processing, and Hive is a companion service that lets you interact with Hadoop by defining ‘virtual’ tables over your data and querying them with familiar SQL syntax.
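The MapReduce model mentioned above is easy to sketch in plain Ruby (a toy illustration of the idea, not Hadoop’s actual API): a map step emits key/value pairs, a shuffle groups them by key, and a reduce step folds each group.

```ruby
# Toy illustration of the MapReduce model (not Hadoop's API):
# count check-ins per city from tab-delimited records.

records = [
  "123\tNew York",
  "456\tChicago",
  "789\tNew York",
]

# Map: emit a [key, value] pair per record.
pairs = records.map do |line|
  _id, city = line.split("\t")
  [city, 1]
end

# Shuffle: group pairs by key (Hadoop does this between map and reduce).
grouped = pairs.group_by { |city, _| city }

# Reduce: fold each group's values into a single count.
counts = grouped.map do |city, group|
  [city, group.sum { |_, n| n }]
end.to_h

puts counts.inspect  # => {"New York"=>2, "Chicago"=>1}
```

Hadoop does the same thing, just with the map and reduce steps distributed across many machines and the shuffle done over the network.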
The data server is built using Rails, MongoDB, Redis, and Resque, and communicates with Hive using the Ruby Thrift client.
We all like pictures, so here is a diagram:
The idea is simple: we run our own ‘data server’ to act as a gateway to reports. This allows us to:
- Provide an easy-to-use end-point to run data exploration queries (using SQL and simple web-forms).
- Cache the results of queries (in a database) to power reports, so that the data is available to everyone, whenever it is needed.
- Allow our Hadoop cluster to be totally dynamic without having to move data around (we shut it down at night and on weekends).
- Add new data in a simple way (just put it in Amazon S3!).
- Analyse data from several data sources (MongoDB, Postgres, log files).
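The caching point above is the heart of the gateway idea: result sets are stored keyed by their query, so a report can be re-read without re-running Hive. A minimal sketch of that pattern (the class and names here are illustrative, not our actual code, and a Hash stands in for MongoDB and Resque):

```ruby
require 'digest'

# Minimal sketch of the data server's cache-the-results idea.
# A real version would store rows in a database and enqueue the
# Hive query as a background job; a Hash stands in for both here.
class QueryCache
  def initialize
    @store = {}  # digest of the query text => cached rows
  end

  # Run the block (the expensive 'Hive query') only on a cache miss;
  # repeat calls with the same SQL return the stored rows.
  def fetch(sql)
    key = Digest::SHA1.hexdigest(sql)
    @store[key] ||= yield
  end
end

cache = QueryCache.new
runs = 0
sql = "SELECT city, count(*) FROM checkins GROUP BY city"

2.times { cache.fetch(sql) { runs += 1; [["New York", 2]] } }
puts runs  # => 1 (second call served from cache)
```

Because the cache sits in front of the cluster, the cluster itself can come and go without affecting anyone reading reports.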
Importing Data
The last two points are very important. In fact, a large portion of the data server’s code base is dedicated to data cleaning and importing. We found it best to represent all data in tab-delimited flat files. To turn MongoDB, Postgres, log, or JSON data into this format, each ‘table’ has a specification written in Ruby. Here is a simple example:
class Checkin < Foursquare::MappedClass
  include Foursquare::LocationLookup
  mapped_attributes :id, :venue, :shout, :lat, :long
  mapped_attributes :country, :state, :timezone
end
data = "{
  id: '123',
  venue: '456',
  shout: 'ayup mum!',
  ll: '24.5,-50.4',
  something_else: 'boo!'
}"
checkin = Checkin.new(data)
So now:
=> 123   456   ayup mum!   24.5   -50.4   us   New York   America/New_York
The initialize method provided by Foursquare::MappedClass can interpret several data types; in this example it parses JSON. By including the LocationLookup module, country, state, and timezone are added automatically whenever a lat/long field exists (via a lookup against a local MongoDB database). For all such transformations, tabs, newlines, and excess whitespace are stripped from field values to ensure that each record occupies exactly one line.
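The whitespace-stripping rule just described might look something like this (an illustrative sketch; the helper names are made up, not the real Foursquare::MappedClass internals):

```ruby
# Illustrative version of the flat-file cleaning rule described above:
# collapse tabs, newlines, and runs of whitespace inside each field,
# then join fields with tabs so every record is exactly one line.
def clean_field(value)
  value.to_s.gsub(/\s+/, ' ').strip
end

def to_flat_line(fields)
  fields.map { |f| clean_field(f) }.join("\t")
end

fields = ["123", "456", "ayup\nmum!", "24.5", "-50.4"]
line = to_flat_line(fields)
# line == "123\t456\tayup mum!\t24.5\t-50.4" -- one record, one line
```

Collapsing embedded whitespace rather than escaping it loses a little fidelity, but it keeps the files trivially splittable by tab and newline, which is exactly what Hadoop Streaming and Hive expect.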
We have rake tasks to run this either as a simple script or as part of a Hadoop Streaming job.
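The streaming case works because a Hadoop Streaming mapper is just a process that reads records on stdin and writes tab-delimited lines on stdout. A hypothetical sketch of running a spec that way (the JSON parsing stands in for the real Checkin spec):

```ruby
# Sketch of a table spec running as a Hadoop Streaming mapper:
# read one raw record per line on stdin, emit one cleaned,
# tab-delimited line per record on stdout.
require 'json'

def map_stream(input, output)
  input.each_line do |line|
    record = JSON.parse(line)
    fields = [record['id'], record['venue'], record['shout']]
    # Same cleaning rule as the flat-file spec: no tabs or newlines
    # may survive inside a field.
    output.puts fields.map { |f| f.to_s.gsub(/\s+/, ' ') }.join("\t")
  end
end

# Exercise the mapper with StringIO standing in for stdin/stdout.
require 'stringio'
input  = StringIO.new(%({"id":123,"venue":456,"shout":"ayup mum!"}\n))
output = StringIO.new
map_stream(input, output)
# output.string == "123\t456\tayup mum!\n"
```

In production the same method would be wired to STDIN and STDOUT and shipped to the cluster as the `-mapper` of a streaming job.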
Running Queries
Because we’re storing data away from the production system, people can run queries that generate 1,000,000,000,000 records if they want to (I’m looking at you, @injust), and the system simply emails them a link to their results when the query has finished, so they don’t have to wait around. In fact we can run all sorts of cool stats.
A Fun Example
Let’s say we want to find the city with the rudest citizens, judged by how often a tip left in that city contains a curse word. We could run this query:
SELECT
  v.city,
  v.state,
  sum(curse) AS curses,
  sum(any) AS any_tip,
  sum(curse) / sum(any) AS percentage
FROM
  (
    SELECT
      venueid,
      IF(text LIKE '%curseword_here%', 1, 0) AS curse,
      1 AS any
    FROM tips
  ) tips
JOIN venues v ON tips.venueid = v.id
GROUP BY v.city, v.state
SORT BY percentage DESC
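For readers who don’t speak HiveQL, what the query computes can be mirrored in plain Ruby over in-memory rows (toy data and a stand-in curse word, not the real tables):

```ruby
# What the HiveQL query computes, mirrored over toy in-memory rows:
# per city, the fraction of tips that contain a curse word.
CURSE = /bloody/i  # stand-in for the elided curse word

tips = [
  { venueid: 1, text: "bloody brilliant chips" },
  { venueid: 1, text: "nice view" },
  { venueid: 2, text: "great coffee" },
]
venues = { 1 => { city: "Manchester" }, 2 => { city: "El Paso" } }

# JOIN tips to venues and GROUP BY city, then sum(curse)/sum(any).
stats = tips.group_by { |t| venues[t[:venueid]][:city] }
            .map do |city, group|
  curses = group.count { |t| t[:text] =~ CURSE }
  [city, curses.to_f / group.size]
end

# SORT BY percentage DESC
stats.sort_by! { |_, pct| -pct }
puts stats.inspect  # => [["Manchester", 0.5], ["El Paso", 0.0]]
```

The real version runs the same shape of computation, just over hundreds of millions of tips spread across the cluster.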
After 5 minutes of waiting, we have a list of the top 20 offenders (highest percentage of tips containing curse words):
(I’ve filtered out cities that had less than 1000 tips total.)
It’s good to see that the Mancunians truly are not only the rudest people in the UK, but the rudest people globally; only El Paso comes close. Although please keep in mind that this only evaluates the rudeness of English-speaking countries (like that would make a difference?).
In Summary
Amazon’s Elastic MapReduce plus a simple Ruby on Rails server makes a powerful (and cheap) data analysis tool. By lowering the barrier to data exploration we have been able to inform better business decisions, and even create a little fun.
- Matthew Rathbone, Foursquare Engineer (and a proud British midlander)