High-Level Idea of Integrated Storage
Historically, one of the most complex projects ever undertaken at Microsoft was the so-called integrated storage system. The idea was to create a kind of object-relational storage system in which higher-level information concepts, rather than files & folders, are first-class citizens. That would allow applications & the OS to share information with each other, to re-use information (much as code is re-used in software development), and to offer additional analysis services (BI, semantic, etc.) on top of that information, for example by discovering relationships between information entities. Such a solution also makes sense because the amount of information grows exponentially each year, which leads to information overload.
Brief History of Integrated Storage Projects at Microsoft
First introduced in Cairo (1993-94), the idea then morphed into different forms (Storage+, RFS, WinFS) and has had two more incarnations at Microsoft since. The most advanced to date is the Microsoft Semantic Engine, an incubation project in Ray Ozzie’s organization that focused on building not only the integrated storage but also a rich set of semantic processing & analysis services over that information, so that new applications could be developed quickly on top of the data. After two years of incubation & an introduction at Microsoft PDC 2010, the project morphed into the SQL Server Semantic Platform and will ship as part of SQL Server in the future. BTW, an interesting customer look at the Semantic Engine is available here. The latest frontier is project Zentity, which is being built at MSR by a great team of experts in object-relational technologies, RDF and the Semantic Web. Zentity is a research output repository platform that provides a suite of building blocks, tools, and services to create and maintain an organization’s digital library ecosystem. And the good news is that Zentity will use the SQL Server Semantic Platform (which provides semantic concept extraction from documents) once that SQL Server feature ships. So, as you can see, Microsoft is making good progress on solving the problem.
Problem That Feeds the Need for Integrated Storage
While we are here, let’s dig deeper into the problem itself. My perspective (and I have been working in this area for the last five years) is that an integrated storage platform is not just an engineering problem but a philosophical one. What do I mean?
The promise of integrated storage is:
- break the paradigm of separate data silos by bringing all information into one place
- break the paradigm of “closed” file formats by shedding light on how data is represented and accessed
- introduce higher-level information entity concepts (like people, documents, customers)
- introduce relationship-based concepts of information organization
- provide a platform for applications
Approaches to Information Storage
Currently, the use of separate data silos is dictated by several reasons:
- Unstructured data (documents, music, photos, videos, etc.) is stored in the file system because it is quite easy to transfer such information between computers via flash drives and over the Internet
- Structured data (customers, products, people, companies, etc.) is usually stored in relational databases because it requires a high level of optimization for efficient retrieval, as well as support for BI etc.
Let’s look deeper into these two approaches.
File Systems – Unstructured Data
It’s very important to note that a file system is actually a “passive” system, as opposed to a database. What do I mean? The simplest file system commonly used across devices, FATxx, lets you store files and folders, and that’s all, period. You can’t even store metadata for items whose formats do not support it natively. Certainly, there are much more advanced file systems like NTFS, as well as those used in Linux and Mac OS X, which can store metadata in some form. But when you move information between systems and the file format does not support metadata natively, you lose that metadata.
Advantages:
- Simplicity – you can build a very complex tree-like structure of folders & files
- Wide Availability
- Support – even the simplest devices support the file system concept
- Speed – requesting a file from a file system by its path is quite a cheap operation
Disadvantages:
- You Can’t Store Metadata in Any File Regardless of File Format
- Not All File Systems Support Journaling
- No Integrated Search Support – separate programs have to be written to index the information stored in the file system
  - Because of “closed” file formats, a search application is only as good as its support for those formats
  - Because search is not an integral part of the file system, its index is only as good as the search application’s file system watchers are at noticing file changes, which makes it practically impossible to provide a complete index for fast search that is always up to date
- Information Duplication – because closed file formats require read/write plug-ins, and because applications want to depend less on 3rd-party applications, applications keep their own information silos, so some information is duplicated, which leads to
  - High Cost of Maintaining Information Updates
- Information Separation – again, because of file formats, it’s almost impossible to build references to information entities stored in other files
- Information Representation – different applications represent the same information differently
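The point about search indexes living outside the file system can be made concrete with a small sketch. It is only an illustration, not how any real desktop search product works: the `build_index`, `search` and `demo` functions are hypothetical names, and the "watcher" is deliberately absent, so the index stays stale until someone rebuilds it.

```python
import tempfile
from pathlib import Path


def build_index(root: Path) -> dict[str, set[str]]:
    """Map each lowercase word to the names of the .txt files containing it."""
    index: dict[str, set[str]] = {}
    for path in sorted(root.rglob("*.txt")):
        for word in path.read_text(encoding="utf-8").lower().split():
            index.setdefault(word, set()).add(path.name)
    return index


def search(index: dict[str, set[str]], word: str) -> list[str]:
    """Return the sorted file names the index associates with a word."""
    return sorted(index.get(word.lower(), set()))


def demo() -> tuple[list[str], list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        note = root / "note.txt"
        note.write_text("meeting with customer", encoding="utf-8")

        index = build_index(root)
        before = search(index, "customer")

        # The file changes, but nothing tells the external index about it.
        note.write_text("meeting with supplier", encoding="utf-8")
        stale = search(index, "customer")  # the old snapshot still lists the file

        # Only an explicit re-index brings the answer back in sync.
        fresh = search(build_index(root), "customer")
        return before, stale, fresh


if __name__ == "__main__":
    print(demo())
```

Between the file change and the re-index, the search answer is wrong, and the file system itself has no way to know or care.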
These and other problems led to the requirement for the solution that integrated storage would provide. Let’s look at the other side: relational databases.
Relational Databases – Structured Data
It’s important to note that, in general, a relational database is an “active” system because it is a higher-level concept for dealing with information. The database itself is usually stored as a file, and a relational engine has to be running to maintain the system, its persistence and its consistency, and to provide round-the-clock online access to the information.
Advantages:
- Information Model – it’s possible to build a detailed representation of the information you deal with
- Optimization – a detailed data model helps to highly optimize the storage of information for fast retrieval and analysis
- Built-in Support for fast retrieval of information based on queries
  - typically, integrated full-text search
- Information Sharing – for all apps that use the same data sets
- First-Class Support of Information Entities (concepts)
- No Information Duplication
  - lower cost of maintaining information updates
- Information Is Metadata
Disadvantages:
- Information Model – because we use a relational data model, we are tightly bound to the information representation we defined, so different applications that need different representations cannot easily share one model (say, the different schemas for typical information entities like “People” and “Message” used by Outlook & Hotmail, a canonical example of the failure of the WinFS attempt at information integration)
- Complexity – it’s much harder to build a free tree-like model of relationships between information entities than in a file system, because it requires changes to the base data model, and typically we are restricted to the relationships that the data model defines
- Even Higher Cost of Maintaining Updates – yes, I’ve just said it’s easier to maintain information updates because the relational data model makes it possible to avoid data duplication, but as the number of relationships between entities grows, supporting all changes becomes a nightmare
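The restriction to relationships defined by the data model can be shown with a minimal sketch using Python's built-in `sqlite3` module. The `person`, `message` and `mentions` tables are made up for illustration and are not taken from WinFS or any real product schema.

```python
import sqlite3


def demo() -> tuple[bool, list[tuple[str, str]]]:
    con = sqlite3.connect(":memory:")
    # The original data model knows only people and the messages they send.
    con.executescript(
        """
        CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE message (id INTEGER PRIMARY KEY,
                              sender_id INTEGER NOT NULL REFERENCES person(id),
                              body TEXT NOT NULL);
        INSERT INTO person  VALUES (1, 'Alice');
        INSERT INTO message VALUES (1, 1, 'Quarterly report attached');
        """
    )

    # A second application wants to record "message 1 mentions person 1".
    # The model has no such relationship, so the database rejects it.
    try:
        con.execute("INSERT INTO mentions (message_id, person_id) VALUES (1, 1)")
        rejected = False
    except sqlite3.OperationalError:  # "no such table: mentions"
        rejected = True

    # Only after a schema change can the new relationship be stored and queried.
    con.execute(
        "CREATE TABLE mentions ("
        " message_id INTEGER NOT NULL REFERENCES message(id),"
        " person_id  INTEGER NOT NULL REFERENCES person(id))"
    )
    con.execute("INSERT INTO mentions VALUES (1, 1)")
    rows = con.execute(
        "SELECT m.body, p.name FROM mentions x"
        " JOIN message m ON m.id = x.message_id"
        " JOIN person  p ON p.id = x.person_id"
    ).fetchall()
    con.close()
    return rejected, rows


if __name__ == "__main__":
    print(demo())
```

Every new kind of relationship means another migration like this one, and every application sharing the database has to agree to it.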
It’s clear that neither solution delivers on the promise of Integrated Storage. Let’s now try to define Integrated Storage from a requirements perspective.
Requirements for Integrated Storage
From a high-level perspective, I suppose it’s clear that we could define Integrated Storage with two words:
Simplicity and Power.
That means: let’s take the simplicity of the first approach and combine it with the power of the second. That is easy to say, but delivering the solution is too complicated to be done in one go. To date, I haven’t seen any successful integrated storage system. For example, an interesting project, “Base4.NET”, died when its author, Alex James, moved to Microsoft. Gnome’s Storage project died as well. An interesting project, M-Files, a document management solution, tries to solve the problem of integrated storage with a highly complicated relational database-driven approach, using Windows Explorer (with a plug-in) as the explorer into the data, plus complicated data modeling software. Another, more recent project in this area, Tabbles, tries to solve the problem using the powerful concept of tagging, but it also solves only part of the problem. One of the most interesting approaches to integrated storage, as well as its most advanced application, Digital Memories, was introduced in the well-known project “MyLifeBits” by Gordon Bell, a principal researcher at MSR and a pioneer of the early computer industry. Still, the problem of integrated storage remains an open question.
Let’s formulate the requirements for an ideal integrated storage:
- Flexible & Open Information Model – allow data to be created & updated w/o strict dependence on a data model, like the triples idea in RDF
- Information Entities – first-class citizens, with existing files as a basic example of such information entities (think of the descriptors used to characterize an information entity)
- Simplicity of Manual Data Organization – an easy way to add metadata & relationships to information
- No Information Duplication – avoid unnecessary duplication of information entities
- Information Sharing Between Apps – any application can re-use information and augment it with its own data
- Information Sharing Between People – sharing information across the boundaries of user machines should be a first-class operation of such a system
- Information Updates – it should be easy to extend and change the model of information representation to support the needs of multiple data consumers
- Information Synchronization – it should provide a transparent mechanism for synchronizing information across the boundaries of user machines
- Machine Learning Support – it should provide infrastructure for machine learning to extract more value from captured information
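To make the "triples" requirement a little more tangible, here is a minimal in-memory sketch of a subject-predicate-object store. It is an assumption-laden toy, not RDF itself and not the design of any of the projects mentioned above: the `TripleStore` class, its methods and identifiers such as `person:alice` are all hypothetical.

```python
from typing import Optional

Triple = tuple[str, str, str]


class TripleStore:
    """A toy store of (subject, predicate, object) facts with no fixed schema."""

    def __init__(self) -> None:
        self._triples: set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # A set keeps each fact once, so repeating a fact does not duplicate it.
        self._triples.add((subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> list[Triple]:
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t
            for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )

    def __len__(self) -> int:
        return len(self._triples)


store = TripleStore()
# One application describes a person and a document.
store.add("person:alice", "name", "Alice")
store.add("person:alice", "name", "Alice")  # same fact again, stored once
store.add("file:report.docx", "author", "person:alice")
# Another application augments the same entity with its own predicate,
# without any schema change.
store.add("file:report.docx", "tag", "quarterly")

if __name__ == "__main__":
    print(store.match(subject="file:report.docx"))
```

The flexibility comes at a price that the open questions below touch on: nothing in such a store stops two applications from using different predicates for the same idea.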
As I said at the beginning of this post, the problem of Integrated Storage is not hidden in engineering; otherwise Microsoft or open-source engineers would have solved it easily. Sooner or later, but solved it. The problem is of a higher level than just engineering.
There are, however, several potential problems that follow from these requirements. Let’s call them open questions.
Open Questions for Integrated Storage
- Access Control – how to make sure that information won’t be accidentally deleted by some misbehaving application?
- Privacy – how to make sure that other people won’t get access to your information?
- Maintaining Information Updates – how can we make sure that an update of information won’t make 3rd-party applications crash? How can we make sure the user has to deal with this manually as rarely as possible?
- Wide Availability – how to make sure that information is easily transferable across boundaries and won’t lose its metadata?
- Support of Simple Devices – how to make sure that information captured by low-level, simple devices (like sensors, cameras etc.) is easy to bring into integrated storage? How to make the process automatic?
- Semantic Concepts Extraction – how to make sure that entities that haven’t been generated automatically by software can still be extracted from existing information and used by integrated storage consumers?
- Information Unification – how to make sure that different applications and people who see the same information in different ways can still deal with the same information from the system’s perspective? One possible answer lies in applying the Null-A Principle and the General Semantics approach to separating information entity extraction from information environment analysis
- Fast Access – how to make sure that the system is powerful enough to support queries for information entities at almost file system-like speed?
Looking at this problem from a higher-level perspective, I want to note the following:
Integrated Storage is actually a balance between structured & unstructured data storage approaches.
As such, a real-world Integrated Storage solution won’t be able to support the beauty of a strong relational data model, but it will make semantically richer information entities first-class citizens. Because of exponential information growth, I assume exact extraction of information won’t be possible, and General Semantics, as well as philosophy in general, reminds us that a single strict representation of information is impossible, because its complexity is many levels higher than any data model we can build with computers today.
Still, I believe that having an Integrated Storage would provide a lot of benefits while solving the problem of information overload and the exponential growth of information.