What I Learned from the Text Analytics Market

I was not really lost for the last nine months; I was diving into a new and very interesting space. Here are my thoughts from the Text Analytics market…

A quick glance at my LinkedIn profile will show you that my career primarily grew out of the Content Management space. The reason I’m well suited to represent you, the CloudShare user, is that I have deep experience working with the platforms that you are experts in building solutions on top of, or using internally. I got deep into Optical Character Recognition (OCR) because of its ability to make physical content digital, SharePoint/ECM because of its ability to put a friendly UI on mission-critical line-of-business applications for managing content, and most recently Text Analytics for its ability to feed the content pipeline of Big Data.

There is always the concept of CloudShare as a tool, and then what you do in that tool. Text Analytics, like SharePoint, is something you do in that tool, because it’s also typically delivered as a platform. So while I took my brief break from CloudShare, I used Pro daily, and what I did inside this tool was explore and get a better handle on Text Analytics technologies. Text Analytics, more primitively referred to as Natural Language Processing (NLP), is a group of technologies that analyze unstructured text to produce metadata. Metadata, data about data, is the stuff that makes content useful once it is saved in a repository and allows you to do more advanced things with the content. Structured metadata, which comes directly from the system (last modified, version, creation date, author, etc.), is more or less automatic. But metadata that comes from the unstructured body of documents is usually entered by individuals, making it prone to error and inconsistency. Text Analytics is the stuff that consistently pulls out metadata such as keywords, people, locations, etc., from the body of documents, automating a process users hate to do manually and making it more reliable.
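To make the idea of pulling metadata out of a document body concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: the stopword list, the function names and the sample text are all made up for this post, and a plain word-frequency count is a stand-in for what a real Text Analytics engine does with far more sophistication.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
             "is", "was", "for", "on", "with", "by"}

def extract_keywords(body, top_n=2):
    """Return the top_n most frequent non-stopword terms in the body text."""
    words = re.findall(r"[a-z]+", body.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]

def build_metadata(body):
    """Produce a metadata record from unstructured text, with no manual entry."""
    return {"keywords": extract_keywords(body)}

# Made-up sample document body.
doc = ("The invoice was sent to the supplier. The supplier approved the "
       "invoice and the invoice was paid in March.")
print(build_metadata(doc))
```

The point is consistency: the same text always yields the same keywords, which is exactly what manual tagging struggles to deliver.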

To build Text Analytics tools, you have to be both a computer scientist and a fan of words. I call these people “word nerds”. The initial engines started appearing in 2003 and have evolved dramatically. They started out purely static and moved to more advanced machine learning that can pick up subtle nuances in language. They can also deal with common problems in language such as disambiguation: recognizing that Mr. Riley and Chris Riley are actually the same person.
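As a rough picture of the Mr. Riley / Chris Riley case, here is a naive Python sketch that groups name mentions by surname after dropping honorifics. The honorific list and function names are hypothetical, and this is not how a production engine works: it would wrongly merge two different people named Riley, which is precisely the kind of case where real engines lean on surrounding context and machine learning.

```python
from collections import defaultdict

# Hypothetical honorific list (illustration only).
HONORIFICS = {"mr", "mrs", "ms", "dr"}

def surname_key(mention):
    """Lowercase the mention, drop honorifics, and return the last token."""
    tokens = [t.strip(".").lower() for t in mention.split()]
    tokens = [t for t in tokens if t not in HONORIFICS]
    return tokens[-1] if tokens else ""

def group_mentions(mentions):
    """Group mentions that share a surname into one candidate person."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[surname_key(mention)].append(mention)
    return dict(groups)

mentions = ["Mr. Riley", "Chris Riley", "Dr. Smith"]
print(group_mentions(mentions))
```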

I joined a Text Analytics vendor, compelled by the technology, and I learned a lot. The first aha moment I had was realizing that the Text Analytics market of 2013 is very similar to where OCR was in the late 90’s, when there were a lot of engines all trying to find a home. If the trend holds, we will see massive consolidation, with a lot of client-side solutions and only two or three general-purpose engines embedded in them.

The engines today fall into two categories: APIs and vertical solutions. The general-purpose APIs are meant for development shops to integrate the technology. The vertical solutions, while customizable, solve specific problems. The most popular area, although perhaps not the most practical, is sentiment analysis. Companies want to use Text Analytics to understand the sentiment of the market relating to their product or their competitor’s product. Sentiment is usually measured on a positive, negative, or neutral scale, but it gets more interesting when you derive intent or action as well as sentiment. This is an area that still needs a lot of work, and it’s very sensitive to the forum and domain. However, if you look at where the most money is being spent in the Text Analytics market, it is here.

There is not much low-hanging fruit for general Text Analytics. Besides the specific use cases, there is a small need for broad, untrained Text Analytics in publishing, both internal and public, and in discovery, including eDiscovery and content forensics. More finely tuned engines are winning in the area of content classification and routing. As a matter of fact, there is growth in all of these types of solutions, with players like ABBYY and Kodak coming onto the scene with client- and server-level solutions that include a full UI.

These are neat, but they are not areas I feel will be high growth in Text Analytics. Rather, it gets very interesting where Text Analytics meets Big Data and Social. IDC analyst David Schubmehl calls this space Unified Information Access (I learned a lot from David), and Gartner is toying with the idea of BigContent. I think the creativity will come in the area where Text Analytics meets volumes of truth data. This is called Linked Data, and it’s an extremely powerful iterative tool for getting not only entities from documents, but entities with social and real context.

It was a fun ride in the Text Analytics market. As with OCR, I suspect the core of this technology will just become one of those things that is embedded and assumed. It’s one of those technologies that a lot of people need but nobody wants, as is true with all bespoke technologies. The power will also be teased out when the technologies mature from purely publishing use cases into more interesting, creative ways of linking and suggesting content to users. Users today do not want to curate content, and they don’t want to search for it; they want to be told where and what it is.

CloudShare is proud to have several Text Analytics engine vendors using our product for demos in Pro and full development in Labs, and I hope to get a few of them in the Solution Showcase.