Thursday, November 01, 2007

XDI Update

What a year... I just looked back and saw that the last time I posted something that was really about XDI, on this XDI blog, was in March… That's crazy!! In my defense, I have posted quite a bit on XRI and XRDS, and these are necessary building blocks to the realization of the XDI DataWeb.

So, here’s some of the news and my current thinking…

First and foremost… we have cut the 1.0 version of our DataWeb Server!! This is the server that we have deployed as part of the Kintera Project (which you can read about in earlier posts). This feat is doubly amazing given the magnitude of the problem we are trying to solve and the fact that Steve Churchill has been working solo on this project all year. Steve has performed a Herculean task in building, deploying, supporting and documenting this project… he is a one-man team of 20. THANKS STEVE!!

NEXT…

We implemented a plugin framework in our DataWeb Server that lets anyone build plugins to access legacy data stores. It works great, BUT it is something we made up. We are looking at replacing our plugin framework with Higgins IDAS (Identity Attribute Service). IDAS provides a 'standards-based' interface definition for 'Context Providers'… plugins that access legacy systems.
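To make the idea concrete, here is roughly the shape of such a plugin interface. This is purely a hypothetical sketch (ContextProvider, Context and DigitalSubject are my illustrative names, not the actual IDAS or DataWeb Server API):

import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch only -- illustrative names, not the real IDAS API.
// One provider per kind of legacy store (RDBMS, LDAP, PeopleSoft, ...).
interface ContextProvider {
    // Open a context over the underlying store, e.g. given a JDBC URL.
    Context open(String contextRef) throws Exception;
}

interface Context {
    // Look up a single subject by its identifier.
    DigitalSubject getSubject(String subjectId);
    // Enumerate every subject this context knows about.
    Iterator<DigitalSubject> getSubjects();
    void close();
}

interface DigitalSubject {
    String getId();
    // Attribute name -> value, e.g. "homeCity" -> "Oakland".
    Map<String, Object> getAttributes();
}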

I have started thinking about the qualities that are 'lacking' in IDAS in order for it to be able to replace our plugin framework... not that it doesn't do what it does well... just that there are other things that it 'could' be made to do that it doesn't do now.

With the assumption that IDAS implementations sit 'close' to the underlying systems, on the same LAN, caching should not be needed, at least not for the classic network-latency optimization reasons. Caching could be used for fault tolerance and system-failure scenarios, but that's a whole other issue. Caching can reduce IO, but the problems of keeping that cache in sync far outweigh that consideration if we solve the other problems that I talk about here. In theory the data is 'right there', so duplicating it SHOULD not be necessary.

What we do need is the ability to 'find' stuff.... Find all of the Digital Subjects whose home city is 'Oakland'. What you DON'T want to have to do is:

1) Traverse all Contexts to see which expose 'Home City' attributes about their subjects

2) Traverse all Subjects in the identified Contexts to query and test the Home City attributes

While caching would mitigate this problem, it is far from a good solution... we don't want to be doing mass traversals, ever, at query time. The sketch below shows exactly the pattern we are trying to avoid.
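Using the hypothetical interfaces from the earlier sketch, the naive approach looks something like this (findSubjectsByCity is an illustrative name, not a real API):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// The approach we DON'T want: a full scan of every context and every
// subject on each and every query. Hypothetical sketch, not a real API.
class NaiveQuery {
    static List<DigitalSubject> findSubjectsByCity(
            List<Context> allContexts, String city) {
        List<DigitalSubject> hits = new ArrayList<DigitalSubject>();
        for (Context ctx : allContexts) {          // step 1: every context
            Iterator<DigitalSubject> it = ctx.getSubjects();
            while (it.hasNext()) {                 // step 2: every subject
                DigitalSubject s = it.next();
                if (city.equals(s.getAttributes().get("homeCity"))) {
                    hits.add(s);
                }
            }
        }
        return hits;  // cost is O(total subjects) on every single query
    }
}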

What we want to do is pre-determine which attributes are going to be 'search criteria'.... yes.... you MIGHT want to search on any criteria, in which case you have to take the hit of searching without an index... they haven't even solved this in the RDBMS world... you can build SQL queries that take days to run, then add a couple of indexes and run them in minutes. Once you have determined the use cases… add the indexes. (Compound keys and simple 'set' math across multiple indexes can give you significant flexibility and power.)
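Here is a sketch of what I mean by index-plus-set-math: a hypothetical inverted index from (attribute, value) to sets of subject pointers, where intersection gives you AND across multiple criteria (AttributeIndex and its methods are my illustrative names):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical inverted index: attribute -> (value -> subject IDs).
class AttributeIndex {
    private final Map<String, Map<Object, Set<String>>> index =
            new HashMap<String, Map<Object, Set<String>>>();

    // Called whenever a subject's indexed attribute is written or changed.
    void put(String attr, Object value, String subjectId) {
        Map<Object, Set<String>> byValue = index.get(attr);
        if (byValue == null) {
            byValue = new HashMap<Object, Set<String>>();
            index.put(attr, byValue);
        }
        Set<String> ids = byValue.get(value);
        if (ids == null) {
            ids = new HashSet<String>();
            byValue.put(value, ids);
        }
        ids.add(subjectId);
    }

    // Exact-match lookup: the subject IDs where attr == value.
    Set<String> lookup(String attr, Object value) {
        Map<Object, Set<String>> byValue = index.get(attr);
        Set<String> ids = (byValue == null) ? null : byValue.get(value);
        return (ids == null) ? new HashSet<String>() : new HashSet<String>(ids);
    }

    // 'Set' math for compound criteria, e.g. homeCity AND state.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<String>(a);
        result.retainAll(b);  // intersection
        return result;
    }
}

A compound query is then just AttributeIndex.and(idx.lookup("homeCity", "Oakland"), idx.lookup("state", "CA"))… no traversal of contexts or subjects at query time.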

Executing a search against an index results in a list of pointers to Subjects that meet the search criteria. It should NOT result in a list of pointers to the attributes themselves… remember, you are unlikely to query "get me all of the home cities for all of the people whose home city is Oakland"… you probably want to query something like "get me the email addresses and names of everyone whose home city is Oakland". (We do NEED to support 'complex' matching logic… startsWith, endsWith, greaterThan, beforeDate, etc…)
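The complex matching could be layered on top of the same index by scanning its keys with a predicate. Again a hypothetical sketch (Matcher, StartsWith and IndexScan are my illustrative names); note that what comes back is always subject IDs, never the attribute values themselves:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A match operator applied to an index's keys (startsWith shown here;
// endsWith, greaterThan, beforeDate, etc. would follow the same pattern).
interface Matcher {
    boolean matches(Object indexedValue);
}

class StartsWith implements Matcher {
    private final String prefix;
    StartsWith(String prefix) { this.prefix = prefix; }
    public boolean matches(Object v) {
        return (v instanceof String) && ((String) v).startsWith(prefix);
    }
}

class IndexScan {
    // Walk one attribute's (value -> subject IDs) entries, collecting the
    // IDs whose indexed value satisfies the matcher.
    static Set<String> scan(Map<Object, Set<String>> byValue, Matcher m) {
        Set<String> hits = new HashSet<String>();
        for (Map.Entry<Object, Set<String>> e : byValue.entrySet()) {
            if (m.matches(e.getKey())) {
                hits.addAll(e.getValue());  // pointers to Subjects, not values
            }
        }
        return hits;
    }
}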

The next problem is optimizing the data access… The 'easy' way to process the results is to iterate over the list, dereferencing the pointers one at a time… our experience has shown that this royally pisses off the DBAs…. What I mean is, if the 'Context' is an RDBMS, then the iteration approach results in executing "SELECT email, fullname FROM people WHERE userID = 'XXXX'" as many times as there are results in the set. This is slow and, as I said, not popular with the DBAs. You need to be able to package your query into "SELECT email, fullname FROM people WHERE userID IN ('XXX', 'YYY', 'ZZZ', 'ABC')" and then parse the results back in your 'client code'. I put 'client code' in quotes because I don't mean that this is done by the application coder; it should be done by the IDAS implementation. As an application developer I want to be able to say to IDAS, "get me all of the emails for people that live in Oakland", get back a list of emails, and never have to care that half of the emails were in Oracle and half were in PeopleSoft. BUT, I want to know that only 2 calls were made across the network (I have had to PROVE this to our customers in order for them to accept our DataWeb Server; they really care about this).
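In plain JDBC, the batched version looks roughly like this (a sketch; the table and column names come from the example above, the rest is illustrative — in practice you would also chunk very long ID lists, since databases cap the size of an IN list):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class BatchedFetch {
    // One round trip for N subject IDs instead of N round trips.
    static List<String[]> fetchEmailsAndNames(
            Connection conn, List<String> userIds) throws SQLException {
        // Build "?, ?, ?" -- one placeholder per ID, keeping values parameterized.
        StringBuilder in = new StringBuilder();
        for (int i = 0; i < userIds.size(); i++) {
            in.append(i == 0 ? "?" : ", ?");
        }
        String sql = "SELECT email, fullname FROM people WHERE userID IN ("
                + in + ")";
        PreparedStatement ps = conn.prepareStatement(sql);
        try {
            for (int i = 0; i < userIds.size(); i++) {
                ps.setString(i + 1, userIds.get(i));
            }
            ResultSet rs = ps.executeQuery();
            List<String[]> rows = new ArrayList<String[]>();
            while (rs.next()) {
                rows.add(new String[] {
                        rs.getString("email"), rs.getString("fullname") });
            }
            return rows;
        } finally {
            ps.close();
        }
    }
}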

That’s my first pass at ‘what’ we need to do…. Next we have to work out ‘how’ within the existing IDAS spec :-)

Ohhh… and robust distributed transaction management. I will add others as I think of them.
