The quest for a new standardized Search API: CMIS, a new alternative to OpenSearch?
Strangely, search software vendors have never agreed on a common search API. No Java Community Process was ever initiated on this subject in the Java world either. The only real initiative came from Jeff Bezos at Amazon, who unveiled the first version of OpenSearch. But we cannot say that Amazon is well known for its enterprise search servers.
This is quite a shame, as search is often the neglected feature in content-enabled applications. Yet end users today tend to first "googlize" the content of a web site, a knowledge base or a collaboration space rather than use the classical navigation menus to find the content they are looking for. The need for unified search is increasing: that much is certain.
Unified search or federated search servers will therefore become more and more important for companies. Each of the "enterprise search vendors" is currently fighting the others to offer the largest number of "connectors" to all the underlying company systems that could store, in one way or another, valuable content or documents. For instance, Autonomy IDOL Server now ships with more than 400 enterprise connectors ( http://www.autonomy.com/content/Products/idol-modules-connectors/index.en.html ). The same applies to other enterprise search alternatives (and several of them OEM part or all of the Autonomy connectors). Google took another path by open sourcing some of its Google Search Appliance connectors ( http://code.google.com/hosting/search?q=label:searchappliance ). The same problem remains: this is custom code written for their respective search servers.
The content management industry has been widely criticized over the last couple of years for its lack of content interoperability standards. Things now seem to be evolving rapidly (JCR, CMIS, ...). However, it is strange to see an industry such as enterprise search stay so proprietary while having to deal with so many OEM agreements with content management vendors or content-enabled applications. It looks like vendor lock-in is still the key strategy in this industry.
However, while digging into the new CMIS standard, I noticed an interesting and perhaps not sufficiently publicly discussed CMIS extension, the Unified Search Proposal: http://www.oasis-open.org/committees/document.php?document_id=31136 submitted by Gregory Melahn from IBM.
As mentioned by Florent Guillaume from Nuxeo in his CMIS meeting notes:
"The use cases of Federated Search (an engine that, when queried, delegates the search to many repositories and then aggregates the results) and Unified Search (an engine that somehow crawls many repositories to build a database of what's in them, and can then be directly queried) have been discussed a lot, especially unified search as it impacts a number of other features. One feature needed is something allowing the discovery of permissions, to be able to serve search results without having to check with the repository for each document if access can be granted; this will presumably involve some kind of ACLs. Even if such permission discovery does not reflect the full security policy applicable to a document, it can still be useful to weed out some of the documents and improve the efficiency of the search. Another feature needed is something allowing the discovery of what has changed in the repository since a previous crawl; this can be done either through push/events (but as mentioned above this would be out of scope for CMIS 1.0), or through pull/polling/querying to retrieve some kind of journal of the last changes, including deleted documents; this feature is sometimes called an Event Journal or a Transaction Log, and the problem is to make it available efficiently outside the repository for the benefit of search engines."
And this is exactly the point. Search standards such as OpenSearch, which are oriented towards federated search, are too lightweight to build any serious, tightly coupled integration with a dedicated search server. The lack of ACL or history management has been an issue for years. The OpenSearch specification addresses only the needs of federated search, not those of unified search. Federated search also faces several challenges, which are summarized in this article from Autonomy: http://www.autonomy.com/content/Technology/autonomys-technology-limitations-of-other-approaches-federation/index.en.html
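To make the distinction in the quoted meeting notes more concrete, here is a minimal in-memory sketch in Python. It is not the CMIS API and is not taken from the Unified Search Proposal: every class, method and field name below (Repository, changes_since, UnifiedIndex, readers, and so on) is a hypothetical name chosen for illustration. It only assumes what the notes describe: a repository that exposes a simplified ACL per document and a pull-style journal of changes, including deletions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    readers: frozenset  # simplified ACL: principals allowed to read


@dataclass(frozen=True)
class ChangeEvent:
    seq: int      # position in the repository's change journal
    kind: str     # "upsert" or "delete"
    doc_id: str


class Repository:
    """Hypothetical repository with a local search and a change journal."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.journal = []

    def save(self, doc):
        self.docs[doc.doc_id] = doc
        self.journal.append(ChangeEvent(len(self.journal) + 1, "upsert", doc.doc_id))

    def delete(self, doc_id):
        del self.docs[doc_id]
        self.journal.append(ChangeEvent(len(self.journal) + 1, "delete", doc_id))

    def changes_since(self, seq):
        # Pull/polling model: return every journal entry after the given position.
        return [event for event in self.journal if event.seq > seq]

    def search(self, term, user):
        term = term.lower()
        return [d.doc_id for d in self.docs.values()
                if term in d.text.lower() and user in d.readers]


def federated_search(repos, term, user):
    """Federated search: delegate to each repository at query time, then aggregate."""
    hits = []
    for repo in repos:
        hits.extend((repo.name, doc_id) for doc_id in repo.search(term, user))
    return sorted(hits)


class UnifiedIndex:
    """Unified search: crawl repositories into one index, then query it directly."""

    def __init__(self):
        self.entries = {}  # (repo name, doc id) -> (lowercased text, readers)
        self.cursor = {}   # repo name -> last journal position already processed

    def crawl(self, repo):
        last = self.cursor.get(repo.name, 0)
        for event in repo.changes_since(last):
            key = (repo.name, event.doc_id)
            doc = repo.docs.get(event.doc_id)
            if event.kind == "delete" or doc is None:
                self.entries.pop(key, None)
            else:
                self.entries[key] = (doc.text.lower(), doc.readers)
            last = event.seq
        self.cursor[repo.name] = last

    def search(self, term, user):
        term = term.lower()
        # The ACL copied at crawl time filters results without calling the repository.
        return sorted(key for key, (text, readers) in self.entries.items()
                      if term in text and user in readers)
```

In this sketch the federated path calls every repository on each query, while the unified path only contacts a repository during crawl and relies on the journal position to pick up changes and deletions incrementally. As the notes point out, an ACL copied at crawl time may not reflect the full security policy of the repository, so a real integration would probably still need a final check for sensitive content.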
So the CMIS Unified Search extension looks like a fresh and interesting standardization idea alongside the main purpose of the CMIS standard.
It is difficult to guess the impact CMIS will have on the search industry. Will software vendors such as Google, Autonomy and others start to replace their proprietary connectors with CMIS plugs? Why are search vendors not more present in the CMIS TC expert groups? Would it be interesting to fork the CMIS Unified Search Proposal to make it a distinct, independent sub-initiative driven mainly by all the major search vendors?
What is certain is that CM vendors and other content-centric application providers could really benefit from an easier, standardized way to plug their proprietary applications into search servers and expose their data to them. How many times have we received RFPs here at Jahia asking whether we could easily remove the embedded Apache Lucene and instead integrate our WCM with an already acquired Google Appliance or whatever other enterprise search solution is already in place?
So on the one hand, it looks like more and more CMSs now natively embed the free and open source Apache Lucene solution ( http://www.cmswatch.com/About/Press/2009-CMS-Search/ ), but on the other hand, companies are investing more money in centralized and unified search solutions.
Making it easier to plug a content-centric application into all these proprietary enterprise search solutions, while letting it manage its own local search index to address its own custom needs, definitely looks like the way to go. Here the CMIS Unified Search extension could really bring something interesting to the industry. Let's hope it now gets wide and fast adoption.