The goal of the Context Storage is to store files and non-file items represented as graphs, and to enable both simple and sophisticated queries across this semi-structured information. Let's take a brief look at the first features the new Storage provides.
Despite the complexity of the underlying technologies (SQL Server 2012 features and Semantic Web concepts), the outcome from the user's perspective is simple. In this post I'll show the first two scenarios:
- Full-text Queries
- Related Items Queries
Full-text Queries
Users can issue full-text queries across all the information in the system. In the screenshot above you can see the results of a search for the word "semantics" across the system.
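As a sketch, such a query could be written in T-SQL roughly as follows; the Documents table, its columns, and the full-text index on Content are assumptions for illustration, not the actual schema:

```sql
-- Hypothetical schema: Documents(DocID, Name, Content),
-- with a full-text index created on the Content column.

-- Exact-term full-text match:
SELECT DocID, Name
FROM Documents
WHERE CONTAINS(Content, 'semantics');

-- FREETEXT also matches inflectional forms and thesaurus expansions:
SELECT DocID, Name
FROM Documents
WHERE FREETEXT(Content, 'semantics');
```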
Related Items Queries
These queries are powered by the new Semantic Search functionality introduced in SQL Server 2012.
It's worth noting that this functionality is itself a complex one; my colleague from the SQL Server Group at Microsoft, Kunal Mukerjee (Architect, Principal Dev Lead for Full-Text Search and Semantic Extractions), and his team worked really hard to enable this and other scenarios.
Let me give you a brief introduction on how the process of finding similar objects in the system works.
- For each object in the system, a full-text index is created: files are indexed using the appropriate IFilters, and text columns are indexed by the built-in filter
- When the user chooses a file, it is compared with the other files in the index and the top results are chosen
- A query for the most relevant keywords and phrases is issued against these top results
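The key phrases that come out of this process are exposed in SQL Server 2012 through the SEMANTICKEYPHRASETABLE rowset function. A minimal sketch, assuming a hypothetical Documents table whose Content column has full-text and semantic indexing enabled (the table name, column name, and document key are assumptions):

```sql
-- Top 10 key phrases for one document, ranked by relevance score.
DECLARE @DocID int = 42;  -- hypothetical full-text key of the selected document

SELECT TOP(10) keyphrase, score
FROM SEMANTICKEYPHRASETABLE(Documents, Content, @DocID)
ORDER BY score DESC;
```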
When the query for the most relevant keywords and phrases is made, the system builds an n-dimensional vector space of weighted keywords.
SQL Server Semantic Search bootstraps itself with fairly large and complete statistical language models (LMs); the system works with n-grams over LM words that are output by the wordbreakers of SQL Server, and injects high-level structural and contextual constraints into the feature space (Kunal Mukerjee, Todd Porter, Sorin Gherman, Microsoft; ACM SIGKDD 2011).
In other words, the word distribution from the LM (which contains the expected distributions) is compared to the word distribution of the selected document. The system scores the n-grams extracted from the document based on how much more frequently they appear in the document than their expected frequency as computed from the LM. This can also be seen as an approximation of TF-IDF (Term Frequency-Inverse Document Frequency), in which the inverse document frequency is replaced with the frequency from the LM. Two important points here:
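One way to sketch this scoring (my own notation, not the exact formula from the paper): for an n-gram g in a document d,

```
score(g, d) ≈ tf(g, d) × log(1 / p_LM(g))
```

where tf(g, d) is the frequency of g in d and p_LM(g) is its expected frequency under the language model; classic TF-IDF would use log(N / df(g)) as the second factor instead.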
1) This algorithm (Tags Index, TI) is corpus-independent:
since the Context Storage is meant to serve as the back-end for the Universe project, which is to be used by Knowledge Workers daily, adding new documents to the system should not require the Tags Index (TI) to be rebuilt every day.
2) This algorithm doesn't use machine learning, so it doesn't need a large amount of data to start giving good results. This is also useful for a new user who has just started building the Universe of their information (a typical scenario in our case).
The next step after building the Tags Index is to mine the Document Similarity Index (DSI).
The DSI algorithm takes a populated and stable TI as its input. The output is an index that can be queried by any given document id and returns the top N other documents that share highly ranked key phrases with it, along with a connection weight for each related document.
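SQL Server 2012 exposes this kind of lookup through the SEMANTICSIMILARITYTABLE rowset function. A hedged sketch, again assuming a hypothetical Documents table with semantic indexing on Content:

```sql
-- Top 10 documents most similar to the given document, with similarity scores.
DECLARE @DocID int = 42;  -- hypothetical full-text key of the source document

SELECT TOP(10) matched_document_key, score
FROM SEMANTICSIMILARITYTABLE(Documents, Content, @DocID)
ORDER BY score DESC;
```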
It's worth noting that the approach briefly described above is designed to scale linearly in order to keep working under tight performance constraints.
The end result the user sees, though, is a list of the N most similar documents for any given document.
I'll cover the end-user scenarios that this technology enables for the Universe in upcoming blog posts.
To learn more about the algorithms behind SQL Server Semantic Search, please see Kunal's paper.