Accuracy vs Consistency


Which is better? Accuracy or Consistency?

This is an age-old debate when it comes to working with unstructured data tools.

People see one mistake in the output of a text analytics, speech-to-text, or image recognition tool and cry, “See?! Machines just can’t do this.”

But humans can’t either.


One person isn’t fast enough to read through thousands or millions of items that need “human judgment” to be “accurate.”


So we invented crowdsourcing – services like Amazon Mechanical Turk and CrowdFlower, where you define a specific task and then have an army of human “crowd workers” perform these human-judgment tasks.

Crowdsourcing gives you access to hundreds or even thousands of workers at once, which solved some of the speed issues. But did it really solve accuracy?

Kevin Cocco commented on and presented a project in which 120k tweets about weather were each classified for sentiment by 5 crowd workers.

Here are the 5 possible answers they could select for each tweet presented:

[Image: the 5 answer options shown to crowd workers]

Guess what percent of the time all 5 workers chose the same answer for a single tweet?




[Image: the percentage of tweets on which all 5 workers chose the same answer]
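
If you want to run the same check on your own crowdsourced labels, here is a minimal sketch. The label names and the toy data are made up for illustration (they are not the project's actual answer options or results), and the function names are hypothetical.

```python
from collections import Counter


def unanimous_rate(labels_per_item):
    """Fraction of items on which every worker chose the same label."""
    if not labels_per_item:
        raise ValueError("need at least one item")
    unanimous = sum(1 for labels in labels_per_item if len(set(labels)) == 1)
    return unanimous / len(labels_per_item)


def majority_label(labels):
    """Most common label for one item, plus the share of workers who chose it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Toy data: 4 tweets, 5 hypothetical worker labels each.
toy = [
    ["positive"] * 5,
    ["positive", "positive", "neutral", "positive", "negative"],
    ["negative", "negative", "negative", "negative", "neutral"],
    ["neutral"] * 5,
]

print(unanimous_rate(toy))     # 0.5
print(majority_label(toy[1]))  # ('positive', 0.6)
```

A low unanimous rate does not mean the workers are careless. It usually means the task itself is subjective, which is the point of this section.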


There is a perception that humans are more accurate than machines for some tasks.

The problem is that “accuracy” can be very subjective.

Often, the data set used to train a machine learning model isn’t 100% accurate to begin with.

So if the training data isn’t 100% accurate, how can a model be 100% accurate?

 Accuracy < 100% = FAIL    

If you buy into that equation, or that is your measuring stick for unstructured data tools, chances are you will just give up on using the tools.

What if you change the equation?

 Value > 0% = WIN

Instead of focusing on the mistakes, look at how much value you are getting from the tool: the speed at which you can process massive amounts of data and the consistency of the results.

If you are dealing with a large data set and are trying to refine a model, it is going to be very helpful to run your data through more than once. By using machines you can quickly and easily run all of your tests again as you make refinements.

This is much easier and less expensive than getting an army of crowd workers to process those 120,000 tweets over and over if you need to add a sixth option, or if you want to run a different set of 100,000 tweets through for testing or evaluation.

The cost savings become significant with audio transcription, such as call center recordings. “Rush” transcription by native-language speakers will cost you ~$3 / minute. Speech-to-text services can give you real-time results for a fraction of the cost.
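
To put rough numbers on that, here is a small back-of-the-envelope sketch. The $3 / minute human rate is the figure quoted above. The machine rate is a hypothetical placeholder, not a real vendor price, so substitute your own quote.

```python
def transcription_cost(hours_of_audio, rate_per_minute):
    """Total cost in dollars for a given amount of audio at a per-minute rate."""
    return hours_of_audio * 60 * rate_per_minute


HUMAN_RUSH_RATE = 3.00  # dollars per minute, the figure quoted in the text
MACHINE_RATE = 0.03     # hypothetical placeholder, NOT a real price

hours = 1000  # hypothetical volume of call center recordings
human = transcription_cost(hours, HUMAN_RUSH_RATE)
machine = transcription_cost(hours, MACHINE_RATE)

print(f"human:   ${human:,.0f}")
print(f"machine: ${machine:,.0f}")
```

At 1,000 hours of audio, the human rate alone works out to $180,000, which is why even an imperfect transcript can be worth having.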

Do you need 100% Accuracy?


Ever tried a voicemail transcription service? They are rarely 100% accurate, but when you get that email or text of the voicemail someone just left, can you tell what the general message or intent is, even though not every word is right?

As humans, we are very good at dealing with noise. Although the veracity of this text is questionable, we’re not worried about that for this illustration. Try reading the passage below:

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae.

Even though most of the words are misspelled, were you able to understand what it said?

According to a research at Cambridge University, it doesn’t matter in what order the letters in a word are, the only important thing is that the first and last letter be at the right place.
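
If you want to try this on your own text, here is a minimal sketch that produces the same kind of noise: it shuffles the interior letters of each word and leaves the first and last letters in place. The function names and the fixed seed are illustrative choices, not anything from the original passage.

```python
import random
import re


def scramble_word(word, rng):
    """Shuffle the interior letters of a word, keeping first and last in place."""
    if len(word) <= 3:
        return word  # at most one interior letter, so nothing to shuffle
    interior = list(word[1:-1])
    rng.shuffle(interior)
    return word[0] + "".join(interior) + word[-1]


def scramble_text(text, seed=0):
    """Scramble every run of letters in the text; spaces and punctuation stay put."""
    rng = random.Random(seed)  # seeded so the output is repeatable
    return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(0), rng), text)


original = "The only important thing is that the first and last letter be at the right place."
noisy = scramble_text(original)
print(noisy)
```

Short words come through untouched, which is part of why the scrambled passage stays readable.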

The same goes for that voicemail transcript. You get the value of knowing what the message said without listening to it – maybe you were in a meeting and got the message sooner than you otherwise would have, or maybe you just saved some time because reading the message is faster than listening to it.

Another approach is measuring changes over time.

Suppose you are processing some content in real-time and calculating an average sentiment for each item then plotting that score over time:

[Image: average sentiment score plotted over time]

When establishing a baseline with unstructured data tools, even if they aren’t 100% accurate, at least they are consistently inaccurate.

Over time you can measure variations from that baseline and derive some good insights even though your absolute classification isn’t 100% accurate.
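
Here is a small worked example of why consistent inaccuracy is tolerable. It assumes the tool's error is a constant offset, which is a simplification (real tools also have random noise), and all the numbers are made up for illustration.

```python
# Hypothetical "true" daily sentiment scores.
true_scores = [0.25, 0.25, 0.5, 0.0, 0.25]

# Assumed constant error of the tool: it always scores a bit too low.
BIAS = -0.125
measured = [s + BIAS for s in true_scores]


def deviations_from_baseline(scores):
    """Difference between each score and the first score, used as the baseline."""
    baseline = scores[0]
    return [s - baseline for s in scores]


print(deviations_from_baseline(true_scores))
print(deviations_from_baseline(measured))
```

Both lines print the same list. The absolute scores are wrong, but the movement relative to the baseline is identical, so the trend survives.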


When processing unstructured data it is often the outliers that contain the interesting information – those indicate items that may require more immediate attention than others.
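
One simple way to surface those outliers is to flag any new score that sits far from the baseline average. The sketch below uses a z-score with a threshold of 2, which is a common rule of thumb rather than anything this article prescribes. The scores are hypothetical.

```python
from statistics import mean, stdev


def flag_outliers(baseline_scores, new_scores, threshold=2.0):
    """Return (index, score, z) for new scores far from the baseline mean."""
    mu = mean(baseline_scores)
    sigma = stdev(baseline_scores)  # sample standard deviation
    if sigma == 0:
        raise ValueError("baseline has no variation; cannot compute z-scores")
    flagged = []
    for i, score in enumerate(new_scores):
        z = (score - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, score, z))
    return flagged


baseline = [0.10, 0.12, 0.08, 0.11, 0.09, 0.10]  # hypothetical past sentiment scores
incoming = [0.11, 0.09, -0.40, 0.10]             # hypothetical new scores

for index, score, z in flag_outliers(baseline, incoming):
    print(f"item {index}: score={score:+.2f}, z={z:.1f}")
```

Only the third incoming item is flagged here. That is the one a human should look at first.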

80/20 Rule


Some tasks like medical coding demand as close to 100% accuracy as you can get. But again, different human coders will often disagree on what the correct codes should be for a given medical record. This is due to the complexity of the domain and the subjective nature of some classifications in the medical coding system.

In these situations, where you’re not necessarily shooting for 100% automation, you can still get value from unstructured data tools by pre-processing your data, suggesting answers to the human reviewers, and surfacing the data from the record that backs up each suggestion.

You may not achieve 100% accuracy, but you can probably hit 80% accuracy and speed up the overall processing time of each record. This increase in throughput per worker can help lower costs and open up business opportunities that were not viable before using unstructured data tools.

Another way to look at this is that the tools can do 80% of the grunt work while humans clean up the remaining 20%.
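
One common way to make that split, offered here as an illustration and not as the only approach, is to route on the tool's confidence: accept high-confidence suggestions and queue the rest for a human. The sketch assumes your tool reports a confidence score between 0 and 1. The `Suggestion` class, the 0.9 cutoff, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    record_id: str
    label: str
    confidence: float  # assumed to be in [0, 1]


def route(suggestions, auto_accept_at=0.9):
    """Split suggestions into auto-accepted ones and ones needing human review."""
    accepted, review = [], []
    for s in suggestions:
        if s.confidence >= auto_accept_at:
            accepted.append(s)
        else:
            review.append(s)
    return accepted, review


# Hypothetical tool output for five records.
batch = [
    Suggestion("r1", "A", 0.97),
    Suggestion("r2", "B", 0.95),
    Suggestion("r3", "A", 0.55),
    Suggestion("r4", "C", 0.92),
    Suggestion("r5", "B", 0.99),
]

accepted, review = route(batch)
print(f"auto-accepted: {len(accepted) / len(batch):.0%}")
print("needs review:", [s.record_id for s in review])
```

For a high-stakes task like medical coding you would likely set the cutoff so high that a human still confirms nearly everything. The tool then saves time by pre-filling the answer, not by replacing the reviewer.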

Easier to Edit or Create?


Which is easier: proofreading a 1,000-word article or writing a 1,000-word article? In most cases, proofreading is easier because you already have something to start from rather than creating it from scratch.

The same is true with most classification tasks. Starting with something allows the human to focus their efforts on verifying that the information is correct and looking for those inaccuracies vs trying to do everything from scratch.

You do need to be careful not to fall into the trap of always agreeing with the computer. Some trials show a decrease in overall accuracy when humans are presented with suggestions compared with when they come up with the solution from scratch.

One way to combat this is to send known “gold standard” tasks through with normal jobs to check that the humans are paying attention. If you’re testing for “agreeance laziness” you could send something through with a subtle error to make sure they aren’t always taking the computer’s suggestion.
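
Both checks can be sketched in a few lines. The data structures here are assumptions for illustration: `answers` maps task IDs to one worker's labels (gold and normal tasks mixed together), `gold` maps gold task IDs to the known correct label, and `planted` maps task IDs to a deliberately wrong label that was shown as the computer's suggestion.

```python
def gold_accuracy(answers, gold):
    """Share of gold-standard tasks the worker answered correctly."""
    scored = [tid for tid in gold if tid in answers]
    if not scored:
        raise ValueError("worker has not answered any gold tasks")
    correct = sum(1 for tid in scored if answers[tid] == gold[tid])
    return correct / len(scored)


def rubber_stamp_rate(answers, planted):
    """Share of deliberately wrong suggestions the worker accepted unchanged."""
    seen = [tid for tid in planted if tid in answers]
    if not seen:
        raise ValueError("worker has not seen any planted suggestions")
    accepted = sum(1 for tid in seen if answers[tid] == planted[tid])
    return accepted / len(seen)


# Hypothetical data for one worker.
gold = {"g1": "positive", "g2": "negative", "g3": "neutral", "g4": "positive"}
planted = {"g2": "positive", "g4": "neutral"}  # wrong suggestions shown on g2 and g4
answers = {
    "t1": "neutral", "g1": "positive", "g2": "positive",
    "g3": "neutral", "t2": "negative", "g4": "positive",
}

print(gold_accuracy(answers, gold))         # 0.75
print(rubber_stamp_rate(answers, planted))  # 0.5
```

A worker with high gold accuracy but a high rubber-stamp rate is probably accurate only when the computer happens to be right.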

What’s your application?


Bottom line, it depends on your application. For monitoring someone’s heartbeat or launching a missile, 100% accuracy is essential. However, for monitoring Twitter feeds or mining some data for insights, lower accuracy can be an acceptable trade-off for real-time results in processing that data.

Can you get the general idea even though it’s not 100% accurate? Can you get value from that? Then don’t throw out the tools. Use them for what they are good at so your human experts can focus on where they can have the greatest impact.