Viewing entries by Stephanie Williams

Digital Collections OCR: What it is, and what it isn’t.

  • “I can see the word on the page, but when I search for it, no matches are found.”
  • “This item is searchable. Why can’t I read it with a screen reader?”

We get a lot of great questions like the ones above, and the answer to all of them, in some way, is “OCR.”

What OCR Is

Optical Character Recognition (OCR) is amazing technology; with OCR software we are able to search image files for groups of pixels that look like text, guess what that text might be, and save the output in a way that we can feed into our search indexing systems. Even better, we’re sometimes able to overlay that text output on top of an image so that we can show you where we think a word might appear.
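To make the "save the output" and "overlay" steps a little more concrete, here is a minimal sketch in Python. It assumes the OCR engine hands back each guessed word along with a page number and a pixel box; the `OcrWord` class, the `build_index` function, and the sample words are all made up for illustration and are not a description of the software NCDHC actually runs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """One word guessed by an OCR engine, with its location on the page image."""
    text: str
    page: int
    box: tuple  # (left, top, width, height) in pixels


def build_index(words):
    """Map each lowercased word to every place the OCR engine thinks it appears."""
    index = defaultdict(list)
    for word in words:
        index[word.text.lower()].append(word)
    return index


# Hypothetical OCR output for a two-page item.
words = [
    OcrWord("Durham", page=1, box=(120, 80, 210, 40)),
    OcrWord("County", page=1, box=(340, 80, 190, 40)),
    OcrWord("Durham", page=2, box=(60, 400, 205, 38)),
]

index = build_index(words)
hits = index.get("durham", [])
for hit in hits:
    # The stored box is what lets a viewer draw a highlight over the image.
    print(f"page {hit.page}: highlight box {hit.box}")
```

A search only ever consults the saved words, never the image itself, so anything the OCR engine guessed incorrectly is effectively invisible to the search box.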

At the North Carolina Digital Heritage Center, we scan and store digital heritage materials as images. When we notice that an image contains printed text (as in documents, posters, ledgers, scrapbooks, and more), we also run it through OCR software. Without OCR, text shown in images is “locked” inside them; with OCR we can leverage the power of full text search to help people discover relevant images a little better than before.

What OCR Isn’t

No OCR method is without limitations. Whether OCR software can correctly “read” the text in an image depends on a few things:

The longer OCR takes, the better it is

The longer the OCR engine is allowed to puzzle over the pixels in an image, the better its output can be. At NCDHC we try to find the right balance between giving the OCR software enough time to produce useful results and scanning more materials; letting OCR take too long would significantly reduce the number of materials we’re able to add to DigitalNC each day.

OCR is less accurate with historic materials

Most of the materials we work with are difficult for OCR engines to interpret: compared with more modern materials, historic documents use fuzzier printing methods, display a lot of variation in letter forms, are deteriorating, or contain a mixture of printed and handwritten text. All of these things are likely to confuse even the best OCR software, producing text output that can differ from what’s visible on the screen.
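This is also why a word you can plainly see on the page may return no matches. The sketch below, in Python, compares an exact search against a looser "close enough" comparison using the standard library's `difflib`. The sample sentence and its OCR misreadings are invented for illustration, and the fuzzy comparison is only a way to show the gap; it is not a claim about how search on DigitalNC works.

```python
import difflib

# What a person reads on the page versus a hypothetical OCR guess.
# The letter-for-digit substitutions below are made up for illustration.
printed = "The Greensboro Patriot reported the harvest"
ocr_text = "The Greensbor0 Patri0t rep0rted the harvest"

ocr_words = ocr_text.lower().split()

# Pairs of (printed word, OCR guess) where the two disagree.
mismatches = [
    (p, o) for p, o in zip(printed.split(), ocr_text.split()) if p != o
]


def exact_search(term):
    """Return True only if the term appears verbatim in the OCR output."""
    return term.lower() in ocr_words


def fuzzy_search(term, cutoff=0.8):
    """Return OCR words that are merely similar to the term."""
    return difflib.get_close_matches(term.lower(), ocr_words, n=3, cutoff=cutoff)


print(mismatches)
print(exact_search("Patriot"))  # the printed word is not in the OCR output
print(fuzzy_search("Patriot"))  # but a near miss is
```

In this made-up case the exact search for "Patriot" finds nothing, even though the word is clearly printed, because the text that was saved is "Patri0t".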

OCR isn’t the same as a transcription

Without human intervention, it can be difficult for OCR software to interpret the layout of a document. By default, OCR software attempts to “read” an image from left to right. Even if it’s able to recognize all of the words on a page, it may not recognize the order in which the words were intended to be read; for example, the software might not be able to differentiate where one column ends and another begins in a newspaper clipping, or it might include the text of an advertisement in the middle of an article:

Example of OCR text challenges

In contrast, transcriptions represent the text in an image as it’s meant to be read, and require some amount of human labor to produce.
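The column problem can be sketched in a few lines of Python. The example assumes an OCR engine has correctly recognized every word in a two-column clipping and reported an x and y position for each; the words, coordinates, and the `column_break` value are all hypothetical. Reading straight across the page scrambles the two columns, while telling the code where the gutter falls (the kind of knowledge a person brings) restores the intended order.

```python
# Hypothetical OCR word positions: (text, x, y) with the origin at the top left.
# Two newspaper columns: an article on the left, an advertisement on the right.
words = [
    ("The", 10, 10), ("mayor", 60, 10), ("Fresh", 310, 10), ("eggs", 370, 10),
    ("spoke", 10, 30), ("at", 70, 30), ("for", 310, 30), ("sale", 350, 30),
    ("noon.", 10, 50), ("daily.", 310, 50),
]


def naive_order(words):
    """Read line by line across the full page width, ignoring columns."""
    return [text for text, x, y in sorted(words, key=lambda w: (w[2], w[1]))]


def column_order(words, column_break):
    """Read the left column fully, then the right column.

    column_break is an x position supplied by a person (or a separate
    layout analysis step) that knows where the gutter between columns falls.
    """
    left = [w for w in words if w[1] < column_break]
    right = [w for w in words if w[1] >= column_break]
    return naive_order(left) + naive_order(right)


print(" ".join(naive_order(words)))
print(" ".join(column_order(words, column_break=300)))
```

Every word is spelled correctly in both outputs; only the second one reads the way the page was meant to be read.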

Summary, and a look ahead

OCR is a fantastic tool that enhances the way users are able to interact with the images available in DigitalNC collections, but its limitations prevent it from producing full, traditionally-readable transcriptions of image materials.

Even so, NCDHC looks forward to next-generation tools and methods for recognizing and searching for text within images. OCR software is constantly improving; the software we use today is faster and more accurate than it was five years ago, and OCR technology benefits from recent advances in machine learning and artificial intelligence.

If you have questions or concerns about searchable content on DigitalNC, or would like information on obtaining a copy of materials that is accessible to screen readers, please don’t hesitate to contact us.


Technical Issues with Yearbooks, Campus Publications

We are currently experiencing technical issues with items that use the page-flip/“book reader” viewer on DigitalNC.org (yearbooks, campus publications, and assorted others). We host the images for these items externally at the Internet Archive (Archive.org), and our systems are currently having trouble fetching images from their servers.

We apologize for any inconvenience! As we work to find a solution, please try searching for affected items directly through the Internet Archive and feel free to contact us with any questions.

UPDATE 10/24/2016, 3:40 pm:

We are still working to restore functionality to yearbooks, campus publications, and the other materials on DigitalNC.org that use the embedded page-flip or “book reader” viewer. In the meantime, we’ve replaced the nonworking viewer on item pages with a link to view the item on Archive.org. This is expected to be a temporary change. Thank you for your patience!

UPDATE 10/25/2016, 3:15 pm:

We are still waiting for a response from Archive.org and are hopeful that we will be able to restore our original book viewer. In the meantime, we have enabled a replacement viewer for affected items. This viewer is similar to the original, but does not integrate as seamlessly with the rest of DigitalNC.org; if you have any questions, please feel free to contact us.


Browse Search Results on an Image Wall

We’re interrupting regular blog programming to bring you news about a new site feature! Now, in addition to viewing search results in a list or as a grid of thumbnails, you can interact with your search results via a Cooliris-powered Image Wall. We’ve added a third button to the view options at the top right corner of search result listings (see the image below). Clicking on the Image Wall button will allow you to click and drag your way through your entire result set, zoom in for a closer view, and search image titles.

Image

Cooliris works best with photographs and images. Give it a try, and let us know what you think.


About

This blog is maintained by the staff of the North Carolina Digital Heritage Center and features the latest news and highlights from the collections at DigitalNC, an online library of primary sources from organizations across North Carolina.
