Wednesday, November 21, 2007

If only it were that simple...

Both Kim and Paul picked up this post by Francis Shanahan about the fragmentation of our online information. The centerpiece of his post is his diagram representing our information spheres... here it is:

[Francis Shanahan's information-spheres diagram]

I like the diagram inasmuch as it STARTS to show the problem we face. I dislike it because it implies a structure and solution that WAY oversimplifies the problem.

Consider these two points and you'll see what I mean...

Think about the next level out beyond the blue boxes... the attributes. You'll notice that there is massive duplication of information all around the circle. This diagram totally fails to represent the interconnectedness of the data. The DataWeb is NOT a set of nicely ordered hierarchies, and diagrams that lock us into that way of thinking do us, I think, a disservice.

While this diagram neatly implies that the blue boxes can be canonically categorized, that is simply not true. My guess is that if we gave each of you the job of categorizing the blue boxes you would come up with not only different groupings but also different semantics for those groupings. Don't get me wrong; Francis's diagram is as valid a projection of order onto the mess as any. My complaint is that we, the people trying to solve these problems, must not get lulled into seeing only one dimension of this problem.

I think about the problem like this... Each of you draws your version of Francis's diagram, but with all the lines between the blue boxes that have duplicated data included. Also, don't limit each blue box to being the child of only one green box... embrace the fact that World of Warcraft is a community, a social network and a gaming site. Once I have gathered all of your diagrams, I make them all semi-transparent and stack them on top of each other. That stack is a fair representation of the DataWeb.
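
To make that concrete, here is a minimal sketch of the overlay as a data structure. Everything here is my own invention for illustration; none of these names come from XDI or Higgins. The key point is that a blue box keeps a SET of parent green boxes rather than a single one.

    import java.util.*;

    // Illustrative sketch only; none of these names come from any spec.
    class Node {
        final String label;                                   // "World of Warcraft", "Gaming", ...
        final Set<Node> parents = new HashSet<>();            // a blue box may sit under MANY green boxes
        final Map<String, String> attributes = new LinkedHashMap<>(); // third-level data points
        Node(String label) { this.label = label; }
    }

    class OverlaySketch {
        public static void main(String[] args) {
            Node community = new Node("Community");
            Node socialNetwork = new Node("Social Network");
            Node gaming = new Node("Gaming");
            Node wow = new Node("World of Warcraft");
            // Embrace the multi-parent reality: one blue box, three green boxes.
            wow.parents.addAll(List.of(community, socialNetwork, gaming));
            wow.attributes.put("home city", "Oakland");
        }
    }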

An interesting thing to notice is that the lines that go around the outer rim of the diagram are not, on the whole, subjective. We can build a 'rule' that says: if two data points have the same value and the same update rules, then they should be linked. In other words, the lines at the third level should line up fairly well from one person's diagram to the next. Notice that the linking rule is based on values, not on labels, as the semantic issues in comparing labels add another level of complexity.
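
A minimal sketch of that rule, under the simplifying assumption that an 'update rule' can be compared as a string (the names are mine, not from any spec). Note that the label never enters into the grouping key, which is exactly the point: two services may call the same value "email" and "e-mail address" and still get linked.

    import java.util.*;

    // Hypothetical names; the rule is the point: link by value, not by label.
    class AttributeInstance {
        final String service, label, value, updateRule;
        AttributeInstance(String service, String label, String value, String updateRule) {
            this.service = service; this.label = label; this.value = value; this.updateRule = updateRule;
        }
    }

    class LinkingRule {
        // Group instances by (value, update rule); every group of two or more
        // becomes a linked set in the diagram. Labels are deliberately ignored.
        static Collection<List<AttributeInstance>> linkedSets(List<AttributeInstance> all) {
            Map<String, List<AttributeInstance>> groups = new HashMap<>();
            for (AttributeInstance a : all)
                groups.computeIfAbsent(a.value + "|" + a.updateRule, k -> new ArrayList<>()).add(a);
            groups.values().removeIf(g -> g.size() < 2);
            return groups.values();
        }
    }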

So.... to the point.... What Francis described is exactly what I have been talking about for the last 3 years. If you take Francis's diagram, with my radial additions, and lay it out in a linear form instead of a radial one, you get exactly the graphs that I have been drawing for years. Three levels, lines going up and down the levels and lines going across the levels. This is no coincidence; there is a fundamental 'truth' about that representation that is, in my mind, much like Kim's laws. This truth is not me, or anyone else, saying 'you must do it this way'; it is us trying to point out "this is the nature of the DataWeb".

What we have done, and I would be happy to spend time with any of you showing you this in detail, is specify a syntax for describing, precisely, the relationships in that web. Relational databases NEEDED ERD in order to get wide adoption; if people couldn't simply communicate, capture and represent the data models they were working with, how could they ever build large complex systems? We need not just an abstract data model but a clear way to graphically represent that model.


Thursday, November 01, 2007

XDI Update

What a year... I just looked back and saw that the last time I posted something that was really about XDI, on this XDI blog, was in March… That’s crazy!! Now, in my defense, I have posted quite a bit on XRI and XRDS, and these are necessary building blocks for the realization of the XDI DataWeb.

So, here’s some of the news and my current thinking…

First and foremost… we have cut the 1.0 version of our DataWeb Server!! This is the server that we have deployed as part of the Kintera Project (which you can read about in earlier posts). This feat is doubly amazing because of the magnitude of the problem we are trying to solve and the fact that this year Steve Churchill has been working solo on this project. Steve has performed a Herculean task in building, deploying, supporting and documenting this project… He is a one-man team of 20. THANKS STEVE!!

NEXT…

We implemented a plugin framework in our DataWeb server that lets anyone build plugins to access legacy data stores. It works great, BUT it is something we made up. We are looking at replacing our plugin framework with Higgins IDAS (Identity Attribute Service). IDAS provides a ‘standards-based’ interface definition for ‘Context Providers’… plugins to access legacy systems.
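
For concreteness, here is roughly the shape such a framework takes. This is a made-up sketch, emphatically NOT the IDAS interface definition; it just shows the general pattern of one provider per legacy store, behind a common contract, discovered through a registry.

    import java.util.*;

    // A made-up sketch of a 'Context Provider' style plugin interface.
    interface ContextProvider {
        String contextId();                              // which legacy store this fronts
        Set<String> exposedAttributes();                 // e.g. "home city", "email"
        Map<String, String> getAttributes(String subjectId, Set<String> wanted);
    }

    // A DataWeb server would discover providers and route requests to them.
    class PluginRegistry {
        private final Map<String, ContextProvider> providers = new HashMap<>();
        void register(ContextProvider p) { providers.put(p.contextId(), p); }
        ContextProvider lookup(String contextId) { return providers.get(contextId); }
    }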

I have started thinking about the qualities that are 'lacking' in IDAS for it to be able to replace our plugin framework.... not that it doesn’t do what it does well... just that there are other things that it 'could' be made to do... that it doesn't now.

With the assumption that IDAS implementations sit 'close' to the underlying systems, on the same LAN, caching should not be needed, at least for the classic network-latency optimization reasons. Caching could be used for fault tolerance and system-failure scenarios, but that's a whole other issue. Caching can reduce IO, but the problems of keeping that cache in sync far outweigh that benefit if we solve the other problems that I talk about here. In theory, the data is 'right there', so duplicating it SHOULD not be necessary.

What we do need is the ability to 'find' stuff.... Find all of the Digital Subjects whose home city is 'Oakland'. What you DON'T want to have to do is:

1) Traverse all Contexts to see which expose 'Home City' attributes about their subjects

2) Traverse all Subjects in the identified Contexts to query and test the Home City attributes

While caching would mitigate this problem, it is far from a good solution... we don't want to be doing mass traversals, ever, at query time.
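
To make the cost concrete, here is that naive traversal in sketch form, reusing the hypothetical ContextProvider shape from the earlier sketch. The cost is O(contexts x subjects) on EVERY query, which is exactly what we want to avoid.

    import java.util.*;

    // The approach we DON'T want: a full traversal at query time.
    class NaiveSearch {
        static List<String> subjectsWithHomeCity(Collection<ContextProvider> contexts,
                                                 Collection<String> allSubjectIds,
                                                 String city) {
            List<String> hits = new ArrayList<>();
            for (ContextProvider ctx : contexts) {               // 1) every Context...
                if (!ctx.exposedAttributes().contains("home city")) continue;
                for (String subjectId : allSubjectIds) {         // 2) ...every Subject
                    Map<String, String> attrs = ctx.getAttributes(subjectId, Set.of("home city"));
                    if (city.equals(attrs.get("home city"))) hits.add(subjectId);
                }
            }
            return hits;
        }
    }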

What we want to do is pre-determine which attributes are going to be 'search criteria'.... yes.... you MIGHT want to search on any criteria, in which case you have to take the hit of searching without an index... they haven't even solved this in the RDBMS world... you can build SQL queries that take days to run, then add a couple of indexes and run them in minutes. Once you have determined the use cases… add the indexes. (Compound keys and simple ‘set’ math across multiple indexes can give you significant flexibility and power.)
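
Here is a minimal sketch of what those indexes and that 'set' math could look like at this level. The index names in the usage comment (homeCityIndex, donorFlagIndex) are assumptions of mine, and a compound key would simply concatenate the indexed values into the lookup key.

    import java.util.*;

    // One inverted index per pre-determined search attribute:
    // value -> set of Subject pointers. Illustrative names only.
    class AttributeIndex {
        private final Map<String, Set<String>> byValue = new HashMap<>();
        void add(String value, String subjectId) {
            byValue.computeIfAbsent(value, k -> new HashSet<>()).add(subjectId);
        }
        Set<String> lookup(String value) {
            return byValue.getOrDefault(value, Set.of());
        }
    }

    class IndexedSearch {
        // The simple 'set math': AND two criteria by intersecting result sets;
        // OR would be a union. No traversal of Contexts or Subjects happens here.
        static Set<String> and(Set<String> a, Set<String> b) {
            Set<String> out = new HashSet<>(a);
            out.retainAll(b);
            return out;
        }
    }

    // Usage, e.g.: IndexedSearch.and(homeCityIndex.lookup("Oakland"),
    //                                donorFlagIndex.lookup("true"))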

Executing a search against an index results in a list of pointers to Subjects that meet the search criteria. It should NOT result in a list of pointers to the attributes themselves… remember, you are unlikely to query "get me all of the home cities of all of the people whose home city is Oakland"… You probably want to query something like "get me the email addresses and names of everyone whose home city is Oakland". (We do NEED to support ‘complex’ matching logic… startsWith, endsWith, greaterThan, beforeDate, etc…)
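
A sketch of how a sorted index could support some of that matching logic while still returning Subject pointers, never attribute values (again, the names and structure are mine, not from any spec). A sorted map hands us range views without scanning the whole index.

    import java.util.*;

    // Hypothetical, like the other sketches: a sorted index makes the
    // 'complex' matchers cheap via TreeMap range views.
    class SortedAttributeIndex {
        private final TreeMap<String, Set<String>> byValue = new TreeMap<>();
        void add(String value, String subjectId) {
            byValue.computeIfAbsent(value, k -> new HashSet<>()).add(subjectId);
        }
        // startsWith: every value in the half-open range [prefix, prefix + MAX_CHAR)
        Set<String> startsWith(String prefix) {
            Set<String> out = new HashSet<>();
            byValue.subMap(prefix, prefix + Character.MAX_VALUE).values().forEach(out::addAll);
            return out;
        }
        // greaterThan: the strict tail of the sorted map.
        Set<String> greaterThan(String value) {
            Set<String> out = new HashSet<>();
            byValue.tailMap(value, false).values().forEach(out::addAll);
            return out;
        }
    }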

The next problem is optimizing the data access… The ‘easy’ way to process the results is to iterate over the list, dereferencing the pointers… our experience has shown that this royally pisses off the DBAs…. What I mean is, if the ‘Context’ is an RDBMS, then the iteration approach results in executing “SELECT email, fullname FROM people WHERE userID = ‘XXXX’” as many times as there are results in the set. This is slow and, as I said, not popular with the DBAs. You need to be able to package your query into “SELECT email, fullname FROM people WHERE userID IN (‘XXX’,‘YYY’,‘ZZZ’,‘ABC’)” and then parse the results back in your ‘client code’. I put ‘client code’ in quotes because I don’t mean that this is done by the application coder; it should be done in the IDAS implementation. As an application developer I want to be able to say to IDAS “get me all of the emails for people that live in Oakland” and get back a list of emails, never having to care that half of the emails were in Oracle and half were in PeopleSoft. BUT, I want to know that only 2 calls were made across the network (I have had to PROVE this to our customers in order for them to accept our DataWeb Server; they really care about this).
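
In sketch form, using plain JDBC: the table and column names come from the example above; everything else, including where this code lives (inside the IDAS implementation, not the application), is my assumption.

    import java.sql.*;
    import java.util.*;

    // One round trip per Context instead of one per subject.
    class BatchedFetch {
        static Map<String, String[]> emailsAndNames(Connection db, List<String> userIds)
                throws SQLException {
            // Build "?,?,?" with one placeholder per id (bind values; never concatenate them).
            String placeholders = String.join(",", Collections.nCopies(userIds.size(), "?"));
            String sql = "SELECT userID, email, fullname FROM people"
                       + " WHERE userID IN (" + placeholders + ")";
            Map<String, String[]> out = new HashMap<>();
            try (PreparedStatement ps = db.prepareStatement(sql)) {
                for (int i = 0; i < userIds.size(); i++) ps.setString(i + 1, userIds.get(i));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next())
                        out.put(rs.getString("userID"),
                                new String[] { rs.getString("email"), rs.getString("fullname") });
                }
            }
            return out;
        }
    }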

That’s my first pass at ‘what’ we need to do…. Next we have to work out ‘how’ within the existing IDAS spec :-)

Ohhh… and robust distributed transaction management. I will add others as I think of them.