Have you ever tried to cut a board with a hammer? Given enough time and force, it can be done, but the task is much easier and ends better when you use a saw.
Dealing with unstructured data can sometimes feel like cutting a board with a hammer.
It takes up more space.
It usually all gets stuck into some blob column in a database or in one giant folder on a drive.
There are some structured attributes like date and size. If you’re lucky, you might even know where it came from and what might be in it.
But for the most part, unstructured data is neglected, sentenced to sit in storage forever, never again to see the light of day.
Why pay to keep data around if you’re not going to do anything with it?
“I might need it.”
True, you might need it in the future. But will you be able to find it when you need it?
More importantly, is there something you could do right now to make that data more useful AND easier to find when you need it?
Using the right tools for the job is essential when dealing with unstructured data. I like to think of it in terms of 3 main areas: text, images, and audio.
Text is deceiving because you may think you have added some structure, but does the structure have any meaning?
For example, think of an email. It has some structured elements like to, from, subject, and time sent. We use these structured elements of email to organize and often search our email.
But what about the text?
You could simply group the text by words. That would give you something that highlights things like of, an, the, a, and other words that occur most frequently.
The problem is the most frequently occurring words don’t really have any meaning by themselves.
You could try limiting what you look at by only using words longer than a certain length or come up with a list of words to ignore.
The data may appear to be a little cleaner but these approaches are more like using a hammer to cut the board than using a saw.
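To make the hammer approach concrete, here is a minimal sketch of plain word counting, with and without a length cutoff and an ignore list. The sample text, the `STOP_WORDS` set, and the `word_counts` helper are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical sample text and ignore list, for illustration only.
EMAIL_BODY = (
    "The invoice for the pumpkin order is attached. "
    "Please review the invoice and confirm the delivery of the order."
)
STOP_WORDS = {"the", "of", "an", "a", "and", "for", "is"}


def word_counts(text, min_length=1, ignore=frozenset()):
    """Count lowercase words, skipping short words and ignored words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(
        w for w in words if len(w) >= min_length and w not in ignore
    )


# Raw counts: dominated by filler words such as "the".
raw = word_counts(EMAIL_BODY)

# Filtered counts: cleaner, but still only counts, with no notion of meaning.
filtered = word_counts(EMAIL_BODY, min_length=4, ignore=STOP_WORDS)

print(raw.most_common(3))
print(filtered.most_common(3))
```

The filtered list looks tidier, but it still cannot tell you who ordered what or whether the writer is happy about it.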
Text that has been structured into sentences and paragraphs has meaning. The meaning is dictated by the language of the text. Written language is used to communicate ideas and information from one human brain to another.
When we start looking at the linguistic elements of the text (nouns, verbs, etc.), it becomes easier to add structure that also captures meaning.
Good text analytics software is worth its weight in gold (how do you weigh software?) when it comes to adding structure to your text. It can help you find main ideas, entities, categories, and sentiment. All of these elements help capture the true meaning of the text and not just the words by themselves.
Image recognition has been a difficult problem for decades. Over the past several years advances in this field have become almost commonplace.
Image search is one example. Need to find a picture of a pumpkin in a red wagon? No problem.
Some of these images match because they have textual metadata that tells the search engine that a “pumpkin” and a “red wagon” are in the image.
The problem with that method is that someone had to tag the image, and those tags aren’t always 100% correct. It’s kind of like YouTube thumbnails: the thumbnail image isn’t necessarily a frame from the video. It may be something chosen just to get your attention.
Looking at an image and recognizing colors, shapes, and objects is one of those tasks that, for quite some time, a 2-year-old could do better than a computer.
However, this technology has finally gotten out of the toddler phase and is walking on two feet. Amazon’s Rekognition service is one example.
In addition to objects in the picture, facial recognition and matching have become much more robust to the point that they can provide a percentage of match between faces in photos. They also provide facial analysis to show things like age range and sentiment.
You can scale out to process thousands or even millions of images in a short period of time so applying this to historical data is entirely feasible. These metrics come with a confidence level so you can decide what information to save and what to throw out.
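As a rough sketch of deciding what to save and what to throw out, the snippet below filters detected labels by confidence. The sample response is made up; it only mimics the general shape of a label detection result (a list of labels, each with a name and a 0 to 100 confidence), and the `confident_labels` helper and its threshold are assumptions for illustration.

```python
# Hypothetical response, shaped like a list of detected labels where each
# label has a Name and a Confidence from 0 to 100. Values are made up.
SAMPLE_RESPONSE = {
    "Labels": [
        {"Name": "Pumpkin", "Confidence": 98.7},
        {"Name": "Wagon", "Confidence": 91.2},
        {"Name": "Toy", "Confidence": 62.4},
        {"Name": "Bicycle", "Confidence": 31.0},
    ]
}


def confident_labels(response, min_confidence=80.0):
    """Return the names of labels at or above the confidence threshold."""
    return [
        label["Name"]
        for label in response.get("Labels", [])
        if label["Confidence"] >= min_confidence
    ]


# Keep only the tags you trust enough to store alongside the image.
tags_to_save = confident_labels(SAMPLE_RESPONSE)
print(tags_to_save)
```

Where you set the threshold is a judgment call: a lower value saves more tags but lets more wrong ones through.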
There are a couple of approaches to audio: search for known patterns in the audio, or convert the audio to text and then apply text analytics tools.
Phonetic search uses the wave patterns in the sound to match with known patterns of pre-defined words you are looking for and can tell you with some confidence level what is a match. This allows you to monitor for known points of information then tag your audio with those data points.
Professionally produced songs are a relatively easy sound match because every copy of a recording is exactly the same. A small wave pattern from a song will match that song and only that song, because so much variety exists in sound that even two live performances of the same song by the same artist will differ.
Conversations, on the other hand, are entirely different. How many ways do people pronounce the word “tomato” or “mountain”? Accents, emotion, and personality all affect the actual wave pattern of the words people speak.
Speech to text has been another hard problem to solve. It uses the same idea of sound matching but then converts the audio into words, each with a certain confidence level. One advantage of converting the audio to text is that you can use your same text analytics tools to process the text instead of setting up a totally different analytics path.
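Here is a small sketch of that hand-off, assuming the speech to text step gives you a list of words, each paired with a confidence between 0 and 1. The sample words, the `build_transcript` helper, and the review threshold are all hypothetical.

```python
# Hypothetical speech-to-text output: (word, confidence from 0 to 1).
WORDS = [
    ("please", 0.97),
    ("send", 0.95),
    ("the", 0.99),
    ("tomato", 0.58),
    ("order", 0.93),
]


def build_transcript(words, review_below=0.7):
    """Join words into text and list the words that scored below the threshold."""
    text = " ".join(word for word, _ in words)
    needs_review = [word for word, conf in words if conf < review_below]
    return text, needs_review


# The text can go straight into the same text analytics tools;
# the low-confidence words can be flagged for a second look.
transcript, uncertain = build_transcript(WORDS)
print(transcript)
print(uncertain)
```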
Good speech to text software can also capture some items unique to the audio and provide metadata such as the gender and age of the speaker, the sentiment of the speaker, and, in a conversation, when people were talking over each other and when they were listening to one person.
What about video?
Well, if you think about it, video is just a bunch of images that change rapidly (often around 30 times per second) with accompanying audio.
So you can separate the audio and deal with it as audio, then take a sample of images from the video (maybe 1-2 per second) and analyze those using image recognition.
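To sketch the sampling step, the snippet below works out which frame numbers to pull from a clip, given its frame rate and how many samples per second you want. The `sample_frame_indices` helper is hypothetical, and the 30 frames per second default is only an assumption; extracting the actual frames would be done with whatever video tool you already use.

```python
def sample_frame_indices(total_frames, fps=30.0, samples_per_second=1.0):
    """Return evenly spaced frame indices to send to image recognition."""
    if fps <= 0 or samples_per_second <= 0:
        raise ValueError("fps and samples_per_second must be positive")
    # Number of frames to skip between samples, never less than one.
    step = max(1, round(fps / samples_per_second))
    return list(range(0, total_frames, step))


# A 3-second clip at 30 frames per second has 90 frames.
one_per_second = sample_frame_indices(90, fps=30.0, samples_per_second=1.0)
two_per_second = sample_frame_indices(90, fps=30.0, samples_per_second=2.0)
print(one_per_second)
print(two_per_second)
```

Sampling 1 to 2 frames per second instead of every frame cuts the image recognition work by a factor of 15 to 30 at that frame rate, at the cost of possibly missing something that flashes by.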
300 hours of new video is uploaded to YouTube each minute! It is not humanly possible to keep up with that kind of data without using the right tools.
What unstructured data are you holding on to? Which of these tools could help you add more structure and get more value from that data?