Wednesday, March 10, 2010

Thrift + Graphs = Strong, flexible schemas on Hadoop

There are a lot of misconceptions about what Hadoop is useful for and what kind of data you can put in it. A lot of people think that Hadoop is meant for unstructured data like log files. While Hadoop is great for log files, it's also fantastic for strongly typed, structured data.

In this post I'll discuss how you can use a tool like Thrift to store strongly typed data in Hadoop while retaining the flexibility to evolve your schema. We'll look at graph-based schemas and see why they are an ideal fit for many Hadoop-based applications.

OK, so what kind of "structured" data can you put in Hadoop?

Anything! At BackType we put data about news, conversations, and people into Hadoop as structured objects. You can easily push structured information about social graphs, financial information, or anything you want into Hadoop. 

That sounds all well and good, but why not just use JSON as the data format?

JSON doesn't give you a real schema and doesn't protect against data inconsistency. For example, if you're storing data about tweets, you can't enforce that every tweet structure contains an "id" attribute of type "long". If someone makes a mistake and fills in a tweet structure with a string for the id, you won't know until you get a type error down the road when trying to use the string as a long value in a job.
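
To make that concrete, here's a quick Python sketch of the kind of mistake plain JSON will happily let through:

import json

# JSON will happily accept a tweet whose id is a string instead of a long;
# nothing complains here. The mistake only surfaces later, when a job tries
# to treat "id" as a number.
bad_tweet = json.dumps({"text": "hello world!", "id": "12345"})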

A good schema protects you against these kinds of errors, keeps your data consistent, and gives you errors at the time of creating a bad object. JSON is also going to take up a lot more space than a binary representation of your data.

The problem with tightly schematized data is that it's hard to modify the schema. What happens if I need to add another attribute to the tweet structure that isn't present in the existing tweets?

This is one of the primary issues you have to keep in mind when designing a schema, and with a tool like Thrift, there are a number of ways to address this issue.

Let's take a quick detour and talk about Thrift. Thrift is an RPC/serialization framework that originated at Facebook. For creating schemas on top of Hadoop, we're only interested in the serialization features of Thrift.

Thrift lets you define language-neutral schemas, and Thrift will generate code to create objects of that schema in any language. For example, I could create objects in Java and read them in Python, or create objects in Ruby and read them in Erlang. Here's an example of a Thrift schema:

struct Person {
  1: required string name;
  2: required string twitter_username;
}

struct Tweet {
  1: required string text;
  2: required i64 id;
  3: required i64 timestamp;
  4: required Person person;
}

To create a Tweet object in Python, we would do the following:

# Imports from the Thrift runtime and from the code generated off the schema
# above (the generated module name depends on your .thrift file).
from thrift import TSerialization
from schema.ttypes import Tweet, Person

tweet = Tweet(text="hello world!",
              id=12345,
              timestamp=1293827,
              person=Person(name="John Smith",
                            twitter_username="jsmith"))
bytes = TSerialization.serialize(tweet)

To read that object in Java, we would:

// Requires the Thrift Java runtime and the Tweet class generated from the same schema.
import org.apache.thrift.TDeserializer;

TDeserializer des = new TDeserializer();
Tweet tweet = new Tweet();
des.deserialize(tweet, bytes); // throws TException if the bytes are malformed
System.out.println(tweet.getText());
// hello world!

Thrift allows you to mark fields as required or optional. If a field is marked as required and doesn't exist when serializing or deserializing (or is null), an error will be thrown. If a field is optional, you can safely omit the field or set it to null.

Thrift also supports creating "union" structures, like so:

union PersonID {
  1: string email;
  2: i64 facebook_id;
  3: i64 twitter_id;
}

A union is like a struct except only one field can be filled in. Unions are useful for representing polymorphic data like "PersonID" above, where PersonID is one of email, facebook id, or twitter id.
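
As a quick illustration, constructing one of these unions from the Python code Thrift generates might look like this (the module name is made up; you fill in exactly one of the fields):

# Hypothetical module name for the code generated from the union above.
from schema.ttypes import PersonID

# A union carries exactly one of its fields, so we set just one.
pid = PersonID(twitter_id=54321)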

Back to my question: how do we use Thrift to add an attribute to an existing schema?

There are a couple of ways to do this. The first approach is to make use of the optional fields I mentioned. If we want to add an attribute called "locationstring" to our Tweet structure, we modify it as follows:

struct Tweet {
  1: required string text;
  2: required i64 id;
  3: required i64 timestamp;
  4: required Person person;
  5: optional string locationstring;
}

You'll still be able to use objects serialized under the old schema, but new objects can optionally have the "locationstring" field. Additionally, whenever you use one of these Tweet structures, you need to account for the possibility of locationstring being null.
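
In the generated Python code, for instance, an unset optional field comes back as None, so the check is a one-liner (a sketch using the Tweet object from earlier):

# Tweets serialized under the old schema simply won't have the field set.
if tweet.locationstring is not None:
    print("location: %s" % tweet.locationstring)
else:
    print("no location recorded for this tweet")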

Another approach is to model your Tweet not as one object, but as many objects - one for each attribute - connected via a common identifier. To understand this fully, we're going to have to discuss graph-based schemas. Graph-based schemas can be extended very easily, and we'll see that graph-based schemas have their own distinct set of advantages.

Graphs, huh? You mean those things with nodes and edges they can't stop talking about in computer science theory classes?

Yes. Except that instead of just nodes and edges, we'll have nodes, edges, and properties. Let's consider an example to clarify this.

Let's say we're trying to model people involved in open source projects. We will have two kinds of nodes in our schema: Person and Project. Person will have properties such as "age" and "gender", and Project will have properties such as "start_date" and "language_used". We will have two kinds of edges: an edge between two people indicates a friendship, and an edge between a Person and a Project indicates that the Person is a committer on that Project.

Pictorially, our schema is a graph: Person and Project nodes, friendship edges between pairs of people, committer edges between people and projects, and properties hanging off of each node.

Let's model this with a Thrift schema. Let's start with the nodes:

union PersonID {
  1: string email;
}

union ProjectID {
  1: string apache_project_name;
  2: i64 github_id;
  3: i64 sourceforge_id;
}

Next, the properties for Person:

enum Gender {
  MALE = 1,
  FEMALE = 2
}

union PersonPropertyValue {
  1: string name;
  2: i16 age;
  3: Gender gender;
}

struct PersonProperty {
  1: PersonID id;
  2: PersonPropertyValue property;
}

Notice how PersonProperty attaches the property value to a node, and PersonPropertyValue is a union structure for the various kinds of properties allowed in the schema. We can similarly define properties for Project:

union ProjectPropertyValue {
  1: string name;
  2: string programming_language;
}

struct ProjectProperty {
  1: ProjectID id;
  2: ProjectPropertyValue property;
}

Next, let's write the schema for friendship edges and committer edges:

struct FriendshipEdge {
  1: required PersonID person1;
  2: required PersonID person2;
}

struct CommitterEdge {
  1: required PersonID person;
  2: required ProjectID project;
}

Awesome. Let's add some finishing touches. You won't want to have to manage each type of data separately, so let's unify the data model into one structure called "DataUnit":

union DataUnit {
  1: PersonProperty person_property;
  2: ProjectProperty project_property;
  3: FriendshipEdge friendship;
  4: CommitterEdge committer;
}

As you can see, this schema is very flexible. To add a new kind of property, you just add a new field into one of the property structures. New kinds of nodes and edges can be added just as easily. The schema is also very tight - PersonProperties can only be associated with People, and ProjectProperties can only be associated with Projects.

There are two major conceptual advantages to working with graph-based schemas. The first is the ability to add partial data about an entity in your system - for example, if I learn that a user is interested in "hadoop", I can easily add that one piece of data to the system independently of how other data is modeled.
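
Here's a minimal Python sketch of adding one such fact, using the age property from the schema above (the graph_schema.ttypes module name is made up):

from graph_schema.ttypes import (DataUnit, PersonProperty,
                                 PersonPropertyValue, PersonID)

# One independent fact: the person behind this email address is 25 years old.
age_unit = DataUnit(person_property=PersonProperty(
    id=PersonID(email="jsmith@example.com"),
    property=PersonPropertyValue(age=25)))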

Second, by splitting up data into a lot of small containers, it's easy to specify what data you want to work with. You'll find that you end up having to "check for null" a lot when working with objects that contain multiple properties. For example, let's say you want to get the average age of all the people in your system. With a graph-based schema, you just take the average of all the age DataUnits. With a more traditional schema where records contain all the attributes you know about someone (age, gender, etc.), not all the records will contain "age". In this case, you need to manually filter out the people who don't have ages from the computation.
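
For example, here's a rough Python sketch of the average-age computation over the graph-based layout - every DataUnit in the input is an age property, so there's nothing to filter:

def average_age(age_data_units):
    # age_data_units: deserialized DataUnits that are all age properties
    # (however you happen to read them out of HDFS).
    ages = [du.person_property.property.age for du in age_data_units]
    return sum(ages) / float(len(ages))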

Not all data is as simple as "age" or "gender". How do you model something more complex, like someone's address? Sometimes you may have the full street address, and sometimes you may only have a city.

Indeed, for this case I would recommend a structure like the following:

struct Address {
  1: optional string street_component;
  2: optional i32 zipcode;
  3: optional string city_component;
  4: optional string state_component;
  5: optional string country_component;
}

union PersonPropertyValue {
  ...
  4: Address address;
}

Ultimately all pieces of an address are very tightly related to each other, so they belong in the same structure. Since you won't always have all the components, it makes sense to make the fields optional.
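
Constructing one of these with partial information then looks like this (same made-up generated module as before):

from graph_schema.ttypes import (DataUnit, PersonProperty, PersonPropertyValue,
                                 PersonID, Address)

# We only know the city and country for this person; the other address
# components are simply left unset.
addr = Address(city_component="San Francisco", country_component="USA")
addr_unit = DataUnit(person_property=PersonProperty(
    id=PersonID(email="jsmith@example.com"),
    property=PersonPropertyValue(address=addr)))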

It still seems like the schema isn't that tight. There's nothing preventing a Person from having multiple ages or a Project from having multiple names! This schema isn't nearly as strong as what I can create in a MySQL database!

That's absolutely true. With this schema structure, we can't express things like "has one" constraints. Here's the thing though: it doesn't matter. The data processing model is radically different from MySQL's - so while you still need a strong schema around your types, you can be more flexible when it comes to relationships between types.

Here's why. Hadoop lends itself towards storing all your raw data and performing computations on the complete dataset to produce consumables for the application layer. So if you have multiple gender DataUnits for someone, but only want to show one gender at the application layer, you can resolve this during a "full recompute" when choosing the data to ship to the application.

For example, you may have multiple gender DataUnits for someone on your social networking site because the person has changed the gender in their profile. When selecting a gender for that person for the application layer, you would select the most recent gender DataUnit.
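
One hypothetical way to do that selection during the recompute - the schema above doesn't carry timestamps, so assume each gender DataUnit is stored alongside the time it was collected:

def current_gender(timestamped_genders):
    # timestamped_genders: list of (collection_time, gender DataUnit) pairs.
    # Keep whichever gender was recorded most recently.
    latest_time, latest_unit = max(timestamped_genders, key=lambda p: p[0])
    return latest_unit.person_property.property.gender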

OK, I'm starting to buy into the "idea" of graph-based schemas. It seems somewhat inefficient though - if I want to do a computation using the ages and genders of people I need to do a join because they're stored in separate objects - whereas if those were in the same object I could just do a map-only job.

This is a very good point. By emphasizing flexibility we have made certain kinds of jobs more expensive - in particular, simple queries which need to look at a set of attributes for a single node.

On the flip side, by splitting up the data into many properties and different kinds of edges, we've gained the ability to run many jobs much more efficiently. This is because you can partition your data in HDFS into different subdirectories, and jobs only have to read the attributes they need. So if FriendshipEdges dominate your dataset, a job that only needs age and gender never has to read those DataUnits. Here's an example of partitioning the dataset - in this example, /data is the root directory for all of the DataUnits:

/data/CommitterEdge/
/data/FriendshipEdge/
/data/PersonProperty/age/
/data/PersonProperty/gender/
/data/ProjectProperty/programming_language/
/data/ProjectProperty/date_started/

So all the age DataUnits will be in the "PersonProperty/age/" subdir, and all the FriendshipEdges will be in the "FriendshipEdge/" subdir. Being able to be this selective about what data you read, thanks to the graph-based schema, is a big win.

That's not to say the super-flexible approach is always the best one - there may be cases where performance is an absolute necessity. You still have the ability to make "super-properties" that contain multiple attributes.

What are situations in which you wouldn't store data in HDFS in a graph-based format?

The graph-based format is most appropriate for your "rawest" datastore - the datastore containing data that has been collected but not processed. You may be building other datastores from the raw datastore for particular applications, and in these cases it may be more appropriate to bundle lots of data into a single object.

Conclusion

Phew! That was a lot to cover. There's been a lot of talk about the virtues of schema-less databases and the evils of tight schemas lately, and I've tried to show in this article that you can have a strong schema while still retaining flexibility (on Hadoop, anyway). Ultimately the ability to do big computations on your dataset really changes the way you think about structuring your data, and I hope to have imparted some of those insights in this article.

You should follow me on Twitter here.

Reader Comments

We use Avro on Hadoop, and are incorporating it into all related subprojects. Worth checking out.

March 10, 2010 | Jeff Hammerbacher

I've only looked into Avro a little bit, and while storing the schema with the data is interesting, it wasn't compelling enough for me to switch. I'd be curious to hear more about the advantages of Avro over Thrift.

March 10, 2010 | nathanmarz

I'm still trying to understand which data models this type of solution would be good for. How well would something like this work for a dimensionally modeled data warehouse, where ad-hoc queries and analysis are necessary on large amounts of data?

March 10, 2010 | Brad

If you're looking for random access to the data, I recommend using a distributed datastore like Cassandra or HBase on top of your Hadoop data. I set things up so that my random access stores are "views" on top of my Hadoop dataset that can be regenerated from scratch.

Analyzing datasets on Hadoop is a pleasure - especially if you use a tool like Cascading. Being Hadoop-based means that you can scale to huge amounts of data. You're not going to have low latency on your queries, but for ad-hoc analysis that's generally not important. I've never had a query that I couldn't do on my dataset, and the only ones I can think of that are infeasible are things like "get the size of everyone's 4th degree friend graph", where the output is exponentially bigger than the input. Having a graph-based schema speeds things up as well since you can pick and choose what data to incorporate into your analysis.

March 10, 2010 | nathanmarz

