Comparing the Top Computer Vision APIs for OCR

Harish Choudhary
July 5, 2017

Computer Vision APIs can identify objects in an image, recognize faces, extract text, and even analyze the emotions people express, all almost instantly. There is a service offering for each of these capabilities. Businesses can weed out offensive photographs to protect their users, improve search (say, on a stock image website, with more relevant tags for pictures), or perform sentiment analysis on photos to market better to their audience. The use cases are many, which is why behemoths like Google, Microsoft and IBM have stepped into the fray. Today these giants are wooing customers with off-the-shelf Computer Vision APIs for anybody who needs them.

However, a one-size-fits-all approach may not always be useful. For instance, you cannot depend on an out-of-the-box solution to identify the breed of a dog in a picture or to tag all embroidered dresses on your e-commerce website. This is where a hybrid solution, like the one offered by Playment, comes in. Playment offers an on-demand, crowdsourced solution that combines the power of technology with human intelligence to analyse images and extract highly accurate, reliable data for projects of any scale.

Companies use Playment’s services to create and enhance meta tags for a database of images or to train their algorithms to automate higher-order AI tasks. For this post, we compared the performance of a few industry-grade solutions against our own offering at Playment on one task: extracting text from images. We compared Google’s Cloud Vision API, Microsoft Cognitive Services’ Computer Vision API, Free OCR API (open-source) and Playment’s workforce. The task was to extract the number plate text from 3001 images of cars.

The results generated were as follows:

Provider: Google Cloud Vision API; Microsoft Vision API; Free OCR API; Playment

Total Images Processed: 3001; 3001; 3001; 3001

Extracted Data: 2762; 1328; 1531; 2882

Correctly Extracted: 1839; 463; 542; 2850

Incorrect Extraction: 923; 865; 989; 32

Not Extracted Anything: 239; 1673; 1470; 119*

Recall %: 92%; 44%; 51%; 96%*

Precision %: 66%; 34%; 35%; 98%*

Recall % = Extracted data / Total Images Processed

*Precision % = Correctly extracted / Total Extracted Data
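The two definitions above can be checked against the published counts with a few lines of Python. This is a minimal sketch, not the script used for the experiment: the function names are hypothetical, and rounding down to a whole percent is an assumption chosen because it reproduces the percentages in the table.

```python
# Counts copied from the comparison table in this post.
# Each tuple is (total images, images with any extraction, correct extractions).
COUNTS = {
    "Google Cloud Vision API": (3001, 2762, 1839),
    "Microsoft Vision API": (3001, 1328, 463),
    "Free OCR API": (3001, 1531, 542),
    "Playment": (3001, 2882, 2850),
}


def recall_pct(total: int, extracted: int) -> int:
    """Recall % as defined in this post: extracted / total images processed.

    Rounded down to a whole percent (an assumption that matches the table).
    """
    return 100 * extracted // total


def precision_pct(extracted: int, correct: int) -> int:
    """Precision % as defined in this post: correct / total extracted.

    Rounded down to a whole percent (an assumption that matches the table).
    """
    return 100 * correct // extracted


if __name__ == "__main__":
    for provider, (total, extracted, correct) in COUNTS.items():
        print(provider, recall_pct(total, extracted), precision_pct(extracted, correct))
```

Note that recall here measures how often a provider returned any text at all, whether or not it was right. Precision is what captures correctness.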

You can access the entire dataset used for the above comparison, along with the results, here.

As you can see, the out-of-the-box solutions did not fare too well: Google extracted data from 2762 images, and Microsoft from only 1328. The task force at Playment extracted text from 2882 images. The corresponding recall for each provider stands at 92% (Google), 44% (Microsoft) and 96% (Playment).

Playment employed the services of its super users to push the recall percentage up to a full 100% over one iteration. However, the key differentiating factor is the precision of this data. Playment topped the charts with 98% precision, whereas Google lagged significantly behind at 66%. Microsoft fared the worst with a precision of just 34%, incorrectly extracting information from 865 pictures and failing to extract any data in 1673 cases. As for the open-source tool (Free OCR API), its results were unsatisfactory for operations on a large scale. Here is one such example:

An instance where both Google and Microsoft extracted incorrect data, stumped by the shape of the numerals:

Service Provider: Results

Google API: {"text": ["AB00 PUG"]}

Microsoft API: {"text": ["PUG"]}

Free OCR API: {"text": ["AB00 PUG"]}

Playment: {"text": ["A800 PUG"]}

Google Cloud Vision API Results

Microsoft Computer Vision API Results
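To show how a single result like the ones above maps onto the rows of the comparison table, here is a minimal Python sketch. It assumes the `{"text": [...]}` shape shown in the example, treats an empty list as "nothing extracted", and counts a result as correct only on an exact string match. The `classify` function and the choice of the Playment reading as the reference value are illustrative assumptions, not the scoring procedure used in the experiment.

```python
import json

# Reference reading for the example plate (assumption: the Playment result
# from the example above is taken as the ground truth).
EXPECTED = "A800 PUG"

# Raw results copied from the example above.
RESULTS = {
    "Google API": '{"text": ["AB00 PUG"]}',
    "Microsoft API": '{"text": ["PUG"]}',
    "Free OCR API": '{"text": ["AB00 PUG"]}',
    "Playment": '{"text": ["A800 PUG"]}',
}


def classify(raw_result: str, expected: str) -> str:
    """Label one result as 'correct', 'incorrect' or 'not_extracted'."""
    texts = json.loads(raw_result).get("text", [])
    # Ignore empty or whitespace-only strings.
    candidates = [t.strip() for t in texts if t.strip()]
    if not candidates:
        return "not_extracted"
    # Exact match against any returned string (an illustrative choice).
    return "correct" if expected in candidates else "incorrect"


if __name__ == "__main__":
    for provider, raw in RESULTS.items():
        print(provider, classify(raw, EXPECTED))
```

Under these assumptions, the Google, Microsoft and Free OCR results are all labelled "incorrect" and the Playment result "correct". Microsoft's partial reading counts as an extraction, so it lowers precision rather than recall.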

Factors that led to poor Recall by Google and Microsoft:

The Google API failed to extract any data when the number plate was inclined. Poor image resolution also contributed to poor recall: the API struggled to extract information from out-of-focus images or from pictures taken at a distance, where the text was slightly blurry.

With Microsoft, however, we could not identify image resolution or orientation as the cause of poor recall: their solution failed to read even the simplest of images.

Playment’s model, however, ensures that clients get 100% recall even when the image is not sharp, is of low quality, or is otherwise compromised.

Factors that led to poor precision by Google and Microsoft:

Instances that tripped up Google’s Vision API almost always contained similar-looking characters; the machine had trouble telling "5" and "S" apart. It also had trouble differentiating "1" from "|", "8" from "B", "A" from "4", "M" from "W" or "N", "C" from "G", and "D" from "O". Precision took a beating when more than one number plate had to be identified in a single image. B&W images and poor resolution also hurt accuracy. However, we noticed that in many cases the Google Vision API misidentified just one character, and completely irrelevant answers were few and far between. The Microsoft service, by contrast, frequently returned incomplete data, got confused when similar-looking characters were present, and very often extracted unusable, highly irrelevant data.
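The look-alike pairs listed above can be used to tell a near miss apart from an irrelevant answer. The Python sketch below is a hypothetical helper, not part of any provider's API. It compares a predicted plate with a reference reading of the same length and reports the mismatched positions only when every mismatch is one of the confusable pairs named in this post.

```python
from typing import List, Optional, Tuple

# Look-alike character pairs named in this post.
CONFUSABLE_PAIRS = {
    frozenset(pair)
    for pair in [("5", "S"), ("1", "|"), ("8", "B"), ("A", "4"),
                 ("M", "W"), ("M", "N"), ("C", "G"), ("D", "O")]
}


def confusable_mismatches(predicted: str, expected: str) -> Optional[List[Tuple[int, str, str]]]:
    """Return (index, predicted_char, expected_char) for each mismatch.

    Returns None when the strings differ in length or when any mismatch
    is not a known look-alike pair, i.e. the error is not a near miss.
    An empty list means the two strings are identical.
    """
    if len(predicted) != len(expected):
        return None
    mismatches = []
    for index, (p, e) in enumerate(zip(predicted, expected)):
        if p == e:
            continue
        if frozenset((p, e)) not in CONFUSABLE_PAIRS:
            return None
        mismatches.append((index, p, e))
    return mismatches


if __name__ == "__main__":
    # Google's reading from the example in this post vs. the Playment reading.
    print(confusable_mismatches("AB00 PUG", "A800 PUG"))
```

For the example plate, the only difference between "AB00 PUG" and "A800 PUG" is a "B" read in place of an "8" at index 1, which is the kind of one-character error described above. A result like "PUG" returns None because it is incomplete, not a look-alike confusion.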

Did results vary when images were zoomed in?

Tilting images did not lead to more recall, and final precision dropped to less than 50% in these cases. If precision has to be boosted, algorithms will need to be specifically trained to do so. Results were not significantly more accurate when pictures were zoomed in, as the resolution of the picture remained unchanged.

Additional Service Possibilities

Image annotation

Apart from extracting data from images, Playment also helps the client create bounding boxes around objects to clearly identify regions of interest. For instance, training the self-driving car of the future requires that it properly identify different objects: tell a tree apart from a human, recognize a speed-breaker, and so on.

Another use case would be training self-guided drones to differentiate buildings from birds as they fly the course from destination A to B. In these cases APIs will have to extract ALL information present in the image or document; Playment, however, can provide a much more usable solution right off the bat by extracting only pertinent information.

The Playment Advantage:

This experiment shows that off-the-shelf computer vision APIs by industry giants cannot be relied upon to provide highly accurate, error-free results for large datasets. Nor can these solutions be customized to suit the specific needs of a niche service provider. One could argue that both of these objectives could be met by building in-house capabilities.

However, the decision to hire and build a development team has its own pitfalls. Finding and hiring the right talent to build a customized solution is a time- and resource-intensive decision with significant cost implications. Many companies prefer the ease of outsourcing image annotation and data extraction to us so they can hit the ground running and focus on their core business. Even if image annotation is a recurring need in your line of work, you need to consider whether you can continue incurring the fixed costs to maintain and grow your development team.

The pay-per-use model offered by Playment, on the other hand, offers a high degree of flexibility and transparency. Whether you need to extract data from 10,000 or 100,000 images per day, you can rest easy knowing that Playment will provide a scalable solution to keep pace with the growing needs of your business. You can enjoy the agility of tapping into a distributed workforce as and when the need arises, and meet tight deadlines. We also ensure a hassle-free deployment that integrates seamlessly with your existing workflow. In this context, Playment emerges as a clear winner when compared to a technology-only solution, harnessing people to provide 100% recall while maintaining accuracy and delivering business-critical data on time, every time.