Saturday, March 31, 2007

The quality of data is not strained

“The quality of data is not strained; It droppeth as the gentle rain from heaven Upon the place beneath. It is twice blessed: It blesseth him that gives, and him that takes.” – a bastardization of William Shakespeare.

I am often faced with the question: “Why don’t we just do this with our Web Services?” Generally, when I’m asked that question, it’s in relation to what I call Dataweb technologies. When I’m asked it in other contexts, it makes even less sense.

There are many answers to this question, and different ones tend to resonate with different people. One of the main qualities of the Dataweb that I strive for is the richness of interaction you get when accessing data through ANSI SQL against a well-designed schema. This is a quality you only grok if you have spent time writing database reports or very data-intensive apps. Those of us who have been there know that extracting information from a well-written schema is a joy. In fact, given a little imagination and a reporting tool, you can learn stuff from a well-built data set that you didn’t know you knew. This phenomenon fueled a whole industry, starting back in the early ’90s when ODBC first hit our radar. We still build big data warehouses that we troll to derive new information and stats, but only inside closed systems.

Back in the early ’80s, all the data was locked up on the mainframes, and we started writing PC apps that needed to access that data. Each time we wrote an app, we wrote a data driver to access the data we needed off the mainframe. There was very little reusability and no appreciation of the ‘value of data’. Then along came ODBC, the first widely adopted manifestation of ANSI SQL, and everything changed. Now you built one ODBC driver that could access your mainframe and went to town; you never had to write another custom driver again. This was the inflection point where we discovered that a fluid, abstract data access mechanism let us learn new things from the data we had already collected. The difference between those custom data drivers and the ODBC data access paradigm was that the drivers tightly bound the purpose of accessing the data to the mechanism for accessing it, while ODBC (SQL) provided an abstract mechanism that didn’t care what the data was or how it was going to be used. These qualities were inherent in the way we thought about those custom data drivers: when we designed and built them, we built interface definitions such as getUser() and getInvoice(), and used method invocation to access the data we needed. SQL gave us a way to query any schema in any way and ‘try new stuff’ without having to re-program our data access layer.
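To make the contrast concrete, here is a minimal sketch in Python; the table names, the columns, and the sqlite3 stand-in for the mainframe are all my own invention, not anything from the era, but the shape of the two styles is the point:

```python
import sqlite3

# Style 1: a purpose-bound "data driver". Every new question about
# the data means writing and deploying a new method.
class MainframeDriver:
    def __init__(self, conn):
        self.conn = conn

    def get_user(self, user_id):
        # Purpose and access mechanism are welded together.
        return self.conn.execute(
            "SELECT * FROM users WHERE id = ?", (user_id,)).fetchone()

    def get_invoice(self, invoice_id):
        return self.conn.execute(
            "SELECT * FROM invoices WHERE id = ?", (invoice_id,)).fetchone()

# Style 2: a generic SQL layer. One entry point answers any question
# the schema can express; no new driver code per question.
def query(conn, sql, params=()):
    return conn.execute(sql, params).fetchall()
```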

Given my example of getUser() and getInvoice(), what happened if I wanted to find out whether there was any correlation between geographic region and annual total purchases? I was basically stuck waiting for the mainframe guys. With SQL in place, I could slice and dice my two-table schema (users and invoices) any way I wanted. I could look for patterns and play to my heart’s content… but it wasn’t really play; it was the birth of business intelligence. Now that I could work out the profile of my best customers, I could target other people with that profile to become my new customers. How’s that for an unexpected outcome from a higher level of data access abstraction?
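Here is what that question looks like as a single ad hoc query, sketched against an in-memory SQLite database; the column names and sample rows are hypothetical, just enough to make the example run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE invoices (id INTEGER PRIMARY KEY,
                           user_id INTEGER REFERENCES users(id),
                           total REAL);
    INSERT INTO users VALUES (1, 'West'), (2, 'East'), (3, 'West');
    INSERT INTO invoices VALUES (1, 1, 120.0), (2, 1, 80.0),
                                (3, 2, 40.0), (4, 3, 60.0);
""")

# Total purchases by region: the kind of question the getUser()/
# getInvoice() interface could never answer without new code.
for region, total in conn.execute("""
    SELECT u.region, SUM(i.total) AS annual_total
    FROM users AS u
    JOIN invoices AS i ON i.user_id = u.id
    GROUP BY u.region
    ORDER BY annual_total DESC
"""):
    print(region, total)
```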

The way we conventionally use Web Services today is not just akin to those old data drivers; it is the same thing. We know this; it’s inherent in the names of the protocols we use: XML-RPC, Remote Procedure Call, method invocation. getUser() and getInvoice() would be very reasonable methods to see defined in a WSDL.

Now, sometimes you need the qualities of RPC: you don’t want people trolling through your data and deriving all sorts of stuff; you want to keep them on a tight leash, so use conventional Web Services. I call this integration pattern ‘application integration’, not data integration.

The protocols that support the Dataweb (XRI, Higgins, XDI, SAML, OpenID, WS-*, etc.) provide mechanisms to access a distributed network of data with the same richness as if you were accessing a single data source via SQL, but with more control. Imagine doing a database join between two tables; now imagine doing a join between two heterogeneous, distributed systems… wouldn’t that be cool?
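To be clear, this is not how XDI or Higgins actually expresses it; the endpoints, record shapes, and helper below are all invented. It is just a sketch of the shape of a cross-system join: query each data authority, then match on a shared key as if both tables lived in one database.

```python
# Everything here is invented for illustration: the endpoints, the
# record shapes, and fetch_records itself. Real Dataweb protocols
# wrap this basic shape in identity, trust, and access control.
def fetch_records(endpoint):
    # Stand-in for a remote query against one data authority.
    canned = {
        "https://a.example/users":    [{"user": "alice", "region": "West"},
                                       {"user": "bob",   "region": "East"}],
        "https://b.example/invoices": [{"user": "alice", "total": 120.0},
                                       {"user": "alice", "total": 80.0}],
    }
    return canned[endpoint]

def distributed_join(left_endpoint, right_endpoint, key):
    left = fetch_records(left_endpoint)
    right = fetch_records(right_endpoint)
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Pair up records across the two systems on the shared key.
    return [(l, r) for l in left for r in index.get(l[key], [])]

for pair in distributed_join("https://a.example/users",
                             "https://b.example/invoices", "user"):
    print(pair)
```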

The qualities of an abstract data layer are: a well-defined query language that can be used to access a well-defined abstract data model, which in turn returns a persistence-schema-agnostic data representation. These qualities are shared by SQL, XDI and Higgins.
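In code, those qualities might look something like this small sketch (again against SQL, since that is the layer most of us know): the query language goes in, and what comes back is a plain, driver-neutral representation that says nothing about how the data is physically stored.

```python
import sqlite3

def run_query(conn, sql, params=()):
    """Execute a query and return plain dicts: callers see field names
    and values, not the physical schema or driver-specific row types."""
    cursor = conn.execute(sql, params)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```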

When contemplating a data abstraction for a distributed data network, there are some other things we have to add to the mix: trust frameworks, finer-grained security, social and legal agreements, network optimization and fault tolerance, to name but a few. And that is what I spend a lot of my time thinking about.

So I hope this goes some way toward explaining why Dataweb technology is different from conventional Web Services implementations, even though they run on the exact same infrastructure.

It is interesting to note (and I may be way off base here, so if you know better please correct me) that from what I’ve seen, SalesForce agrees with me. What I mean is that their new generation of Web Services are some of the most abstract interfaces you are likely to see in a system that derives so much of its value from its programmatic interfaces. (Along with Kintera, with whom we are working.) The only downside of the SalesForce approach is that it’s proprietary, which is a shame when there are open standards that appear, on the face of it, to satisfy their requirements. (SalesForce, I’d love to hear from you if you want to talk about this.)
