RDF – The Graph, the Truth (value) and the (sql) Lite


At the heart of Web 2.0, and indeed the new digital age, is data. Lots and lots of it.

Much of this data is data about OTHER data – metadata.

Turning data into information involves being able to make connections and inferences between disparate BITS of data, recognizing and analyzing emergent patterns.

Keep all of this in mind on the ride ahead…

Many years ago, the good folks at Mozilla (which, in the days before it became the organization which could build anything, was known as Netscape) had to make some choices about the data that lives inside a web browser. Roughly speaking, these choices were:

* How should we store it?
* What structure should we model it after?
* What interface (or metaphor) should it present to the rest of the platform?

The structure they chose was a directed graph. The interface was RDF – a simple reflection of that graph. Finally, they chose to store it in a variety of ways – as XML files in the general case, as HTML files in the case of bookmarks, as a hash map in memory, and, occasionally, in Mork.

The reasons for these decisions are, largely, lost in time. (Mork, especially, seems to be entirely lost as well – although I have the source code, I have no idea how it works or what it looks like internally. Fortunately, I don’t have to.) And many of today’s coders second-guess these choices, or reject them without review. But recently we have been faced with the same questions at Flock – what structure, what interface, and what storage mechanism? To talk about how we’ve answered that, let’s talk about the data again.

Flock is a browser about people. And when you talk about people and metadata in the same conversation, things can get very busy, very fast. “Jonnie wrote a new blog post.” “Sally commented on Jonnie’s blog post.” “Billy took a picture of Sally, and posted it on Flickr.” “Billy, Sally and Jonnie are now friends.”

One pattern immediately jumps out – lots of data. Lots of unique events. But very few unique ITEMS. A small number of people (hundreds, maybe thousands), doing a small number of things (friending, commenting, blogging, photoing), in a small number of ways. All of it, related to itself. Graph, anyone?
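To make the pattern concrete, here is a rough sketch of those four sentences written as subject, predicate, object statements, which is the shape RDF gives to every edge in the graph. The identifiers (`post1`, `photo1`) and predicate names (`wrote`, `friendOf` and so on) are made up for illustration; they are not Flock’s actual vocabulary.

```javascript
// Each statement is one directed edge: subject --predicate--> object.
// All names below are illustrative assumptions, not a real schema.
const triples = [
  ["Jonnie", "wrote", "post1"],
  ["Sally", "commentedOn", "post1"],
  ["Billy", "took", "photo1"],
  ["photo1", "depicts", "Sally"],
  ["photo1", "postedOn", "Flickr"],
  ["Billy", "friendOf", "Sally"],
  ["Billy", "friendOf", "Jonnie"],
  ["Sally", "friendOf", "Jonnie"],
];

// Collect the distinct items (nodes) that the events (edges) connect.
const nodes = new Set();
for (const [s, , o] of triples) {
  nodes.add(s);
  nodes.add(o);
}

// Everything a given node points at: just a filter over the edges.
function outgoing(subject) {
  return triples.filter(([s]) => s === subject);
}

console.log(`${triples.length} statements about ${nodes.size} items`);
console.log(outgoing("Billy"));
```

Even in this toy case there are more statements than items, and every new event adds an edge without necessarily adding a node.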

So Mozilla’s structure seems like a good fit. Next we come to the interface. Ah, RDF. That great, evil monstrosity. Kill it, bury it. Put it out of its misery. “It’s complicated,” they say. “It’s legacy, archaic,” they whine. “What is it good for?”

And that’s a funny question. Because, really, if you do a quick Google search for RDF -mozilla, you’ll find that the only major thing it’s been used for is… FOAF. Friend of a Friend. People data. Huh.

Simple factors to recommend RDF:

1. Directed graph is a good fit for the data. RDF is a good semantic model of a directed graph.
2. RDF code is FAST.

Now, immediately folks will start taking issue with this. “Fast, you say?” “Look, it takes a full 3 SECONDS to write to disk!” “Look at how it hangs the browser when we’re trying to add things to the RDF!”

Let’s back up a minute and talk about storage. Because here, I believe, is where Mozilla made an understandable mistake. You see, with Mork holding browser history, and HTML holding the bookmarks, the only thing left in RDF was the localstore, a simple collection of miscellaneous UI-related facts that were serialized only on shutdown. If they were lost in the event of a crash, no big deal. So the XML serialization code was slow, who cares, right?

Not so.

Skipping ahead a bit, let’s look at what Flock has added to the equation, and why I think it matters.

Step 1: RDF is a bitch to use.

It’s true – too many interfaces, too many services, and ASSERTING and UNASSERTING is a totally new grammar for most folks. Enter Ian McKellan, stage left, the author of Coop.js. (Don’t confuse this with Mozilla’s The Coop, which came out a year later and is a totally different beast.)

So what is it? Think ActiveRecord, for the directed graph. It’s a JavaScript ORM (object-relational mapper) that makes it easy to read from and write to an RDF datasource, with a surprisingly small overhead. (Ian will be surprised by that last part, but we’ve done a few things to coop since he last saw it.)
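For the curious, the general idea can be sketched in a few lines. This is not coop’s real API. `TripleStore` and `wrapNode` are hypothetical names, the store is a toy in-memory stand-in for an RDF datasource, and the sketch leans on the modern JavaScript `Proxy` object rather than whatever coop does internally.

```javascript
// Toy in-memory triple store standing in for an RDF datasource (assumption).
class TripleStore {
  constructor() {
    this.triples = [];
  }
  assert(s, p, o) {
    this.triples.push([s, p, o]);
  }
  unassert(s, p) {
    // Drop every statement with this subject and predicate.
    this.triples = this.triples.filter(([ts, tp]) => !(ts === s && tp === p));
  }
  getTarget(s, p) {
    const hit = this.triples.find(([ts, tp]) => ts === s && tp === p);
    return hit ? hit[2] : undefined;
  }
}

// Hypothetical helper: wrap one graph node so that property reads and
// writes turn into lookups, unasserts and asserts on the store.
function wrapNode(store, subject) {
  return new Proxy({}, {
    get(_target, predicate) {
      return store.getTarget(subject, predicate);
    },
    set(_target, predicate, value) {
      store.unassert(subject, predicate); // replace any previous value
      store.assert(subject, predicate, value);
      return true;
    },
  });
}

const store = new TripleStore();
const jonnie = wrapNode(store, "urn:person:jonnie");
jonnie.name = "Jonnie";    // assert
jonnie.name = "Jonnie B."; // unassert the old value, assert the new one
console.log(jonnie.name, store.triples);
```

The calling code never says “assert” or “unassert”; it only sees an object with properties.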

Great, now I can read and write to the graph just like getting and setting properties of a javascript object. But what about the SPEED?

Step 2: Get rid of the XML.

As part of his coop efforts, Ian had prototyped a SQL-backed RDF implementation, hoping to use SQL statements directly to work around some of the more expensive computations against a traditional graph (such as SUMS and COUNTS). We (Bruno, actually) ported that to C++ for speed, finished it off, and glued an in-memory hash map to the side of it as a cache. Voilà – now every change is written out immediately, there’s no periodic 3-4 second freeze while the entire graph is serialized, and we’ve got the framework upon which to hang further performance improvements via direct SQL query.
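The shape of that arrangement is a write-through cache: each change goes to the persistent store right away, and reads are answered from memory. Here is a minimal sketch of the idea in JavaScript. The real implementation is C++ on top of SQL, so everything here is an assumption for illustration, including the `FakeDiskBackend` class, which merely counts writes where a real backend would run INSERT statements.

```javascript
// Stand-in for the on-disk SQL table (assumption). A real backend would
// issue an INSERT per call instead of pushing onto an array.
class FakeDiskBackend {
  constructor() {
    this.rows = [];
    this.writes = 0;
  }
  insert(s, p, o) {
    this.rows.push([s, p, o]);
    this.writes++;
  }
  loadAll() {
    return this.rows.slice();
  }
}

class WriteThroughDataSource {
  constructor(backend) {
    this.backend = backend;
    this.cache = new Map(); // "subject\0predicate" -> Set of objects
    // Warm the cache once at startup from whatever is already on disk.
    for (const [s, p, o] of backend.loadAll()) this.remember(s, p, o);
  }
  key(s, p) {
    return s + "\u0000" + p;
  }
  remember(s, p, o) {
    const k = this.key(s, p);
    if (!this.cache.has(k)) this.cache.set(k, new Set());
    this.cache.get(k).add(o);
  }
  assert(s, p, o) {
    const known = this.cache.get(this.key(s, p));
    if (known && known.has(o)) return; // already stored, skip the disk write
    this.backend.insert(s, p, o);      // one small write per change
    this.remember(s, p, o);
  }
  getTargets(s, p) {
    // Reads are served from memory and never touch the backend.
    return [...(this.cache.get(this.key(s, p)) || [])];
  }
}

const disk = new FakeDiskBackend();
const ds = new WriteThroughDataSource(disk);
ds.assert("sally", "commentedOn", "post1");
ds.assert("sally", "commentedOn", "post1"); // duplicate, no write
ds.assert("sally", "commentedOn", "post2");
console.log(disk.writes, ds.getTargets("sally", "commentedOn"));
```

Because nothing is batched up, there is no moment at which the whole graph has to be serialized at once.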

But there’s still something missing.

Step 3: Split it up.

Take another look at the data we’re dealing with, here. People data, hmmm. Much of this data, like browser sessions, is transient. It really shouldn’t get written to disk at all. Supposing we had a separate datasource, purely in memory, into which we could stuff all this TRANSIENT data – where it could magically go away when the browser shuts down? Supposing that we could somehow COMBINE these datasources so that, to the UI layer, they would appear as a single, COMPOSITE datasource?

Those of you who know RDF know that I’m playing a bit tongue-in-cheek at the moment, since one of the beautiful parts of RDF is its ability to COMPOSITE various datasources together (although there were a half-dozen bugs in the Mozilla implementation of this that we had to iron out first).
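Compositing is simple enough to sketch. The classes below are hypothetical and far smaller than anything in Mozilla; they only show how a persistent datasource and a transient one can sit behind a single read interface. The `hasOpenTab` predicate is an invented example of session data.

```javascript
// Tiny in-memory datasource (assumption). Imagine one instance is backed
// by disk and the other simply vanishes at shutdown.
class MemoryDataSource {
  constructor() {
    this.triples = [];
  }
  assert(s, p, o) {
    this.triples.push([s, p, o]);
  }
  getTargets(s, p) {
    return this.triples
      .filter(([ts, tp]) => ts === s && tp === p)
      .map(([, , o]) => o);
  }
}

// Presents several datasources as one: a query is fanned out to each
// member and the answers are merged, with duplicates removed.
class CompositeDataSource {
  constructor(...sources) {
    this.sources = sources;
  }
  getTargets(s, p) {
    const merged = new Set();
    for (const source of this.sources) {
      for (const o of source.getTargets(s, p)) merged.add(o);
    }
    return [...merged];
  }
}

const persistent = new MemoryDataSource(); // stands in for the SQL-backed store
const transient = new MemoryDataSource();  // session-only facts
persistent.assert("billy", "friendOf", "sally");
transient.assert("billy", "hasOpenTab", "flickr.com");

const all = new CompositeDataSource(persistent, transient);
console.log(all.getTargets("billy", "friendOf"), all.getTargets("billy", "hasOpenTab"));
```

The UI layer only ever talks to `all`, so it neither knows nor cares which facts will survive a restart.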

Step 4: Watch it carefully.

A slight digression here for the particularly geeky – the new-and-improved observer. (Yet another Ian invention, executed this time by Mr. Yosh.) One of the most important parts of data-driven UI is that it should respond dynamically to changes in the data, and while the RDF Templates system does this quite well, there are cases where it’s not the right tool for the job. (Such as when you only want to show the first ‘n’ items of a list, or when you need pagination. A mammoth oversight requiring an equally herculean effort to resolve. Template code is not for the faint of heart.) The traditional approach to this was a simple nsIRDFObserver – with a catastrophic side effect. Calling into arbitrary JavaScript for EVERY CHANGE in the RDF can become prohibitively expensive almost immediately, and in almost every case you only care about a small subset of the changes that are occurring. The ArcObserver allows you to specify which patterns your code is interested in, and receive notifications only of RDF events matching that pattern.
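The filtering idea looks roughly like this. It is a sketch of pattern-matched notification in general, not the ArcObserver’s actual interface: `ObservableStore`, `observe` and the pattern object with optional `subject`, `predicate` and `object` fields are all names invented for this example.

```javascript
// Does a statement match a pattern? Fields left undefined are wildcards.
function matches(pattern, s, p, o) {
  return (pattern.subject === undefined || pattern.subject === s) &&
         (pattern.predicate === undefined || pattern.predicate === p) &&
         (pattern.object === undefined || pattern.object === o);
}

// Hypothetical store that only calls back observers whose pattern matches.
class ObservableStore {
  constructor() {
    this.triples = [];
    this.observers = [];
  }
  observe(pattern, callback) {
    this.observers.push({ pattern, callback });
  }
  assert(s, p, o) {
    this.triples.push([s, p, o]);
    for (const { pattern, callback } of this.observers) {
      // The cheap comparison happens here, so the (potentially expensive)
      // callback runs only for the changes it asked about.
      if (matches(pattern, s, p, o)) callback(s, p, o);
    }
  }
}

const store = new ObservableStore();
const seen = [];
store.observe({ predicate: "commentedOn" }, (s, p, o) => seen.push(`${s} -> ${o}`));

store.assert("jonnie", "wrote", "post1");       // ignored by the observer
store.assert("sally", "commentedOn", "post1");  // delivered
console.log(seen);
```

An observer that passes an empty pattern still hears about everything, which is the old behavior; the win comes from being able to say less.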

Step 5: Profit?

There are still bugs, of course. (Check out the Flock Bugzilla site, and do a search for RDF.) And vast opportunities for optimization. But here’s the state of the union:

1. We have a rich, well-matched data model.
2. We have a convenient set of tools for changing (coop) and observing (ArcObserver) that model.
3. We have rapid (Templates) and sophisticated (coop + E4X) ways of driving UI from that model.
4. We have a storage mechanism that is flexible (through compositing, we can use ANY data storage mechanism we’d like), and FAST (the current SQLite RDF datasource will accept 1000 asserts per second, and we expect to double that number with planned improvements).

So in the end, maybe Mozilla was right the first time. I’d like to think that Flock is proving out the promise of RDF – in a world of data and metadata, where much of it is homogeneous or intimately related, there is yet another truth – some of it never is. Some service, or some person, will want to store a fact about themselves or their relationships that no other service or person employs.

How would you cram that into a relational table structure?
How would you try and derive value from it?
Would it be any better than a graph?

Next up – best practices for Templates and Bindings, or, how to make the UI extensible as part of your open API.
