North Carolina Newspaper Digitization Part 3: This is How We Do It

Greensboro Daily News Ad, March 2, 1934

Like “Jeopardy!,” I want to tell you the answer before I get to the question.

Following a newspaper digitization and markup standard helps us plan for the future and makes it easier for us to work with vendors, open-source software, and other libraries and archives.

I say this up front, because when we explain how we digitize and share newspapers the frequent response is to ask why we do it the way we do. I think this is because our process is more labor intensive than people expect. It’s definitely not the only way, but we’re committed to this path for right now because it accommodates multiple formats (microfilm, print, born-digital), fits our current digitization capacity, and results in a system we think is flexible and extensible.

That standard I mentioned above comes out of the Library of Congress’ National Digital Newspaper Program (NDNP). All of our newspaper work is NDNP compliant, which means we follow that project’s recommendations for how to structure files, the type of metadata to assign to those files, and also the markup language that tells the computer where words are situated on each page (very helpful for full-text search).

I’ll give you a broad outline of our workflow and the tools we use. However, if you want more specific technical details, head over to our account on GitHub.

Screenshot of PaperBoy!

Let’s say one of our partners is interested in having us digitize a print newspaper. We’ll start by scanning each page separately on whichever machine works for the paper’s size. Because the NDNP standard requires page-level metadata, we’ve created a lightweight piece of software that helps us take care of some of that while we scan. Affectionately dubbed “PaperBoy,” this program allows the scanning technician to track page number, date, volume, issue, and edition for each shot. While it slows down scanning a little bit, it speeds up post-processing metadata work quite a lot.

Once the scanning’s complete, we process the files to create derivatives that serve different needs. We use ABBYY Recognition Server to get those multiple formats:

  1. a JPEG2000 image that’s excellent quality yet small in file size
  2. an XML file that includes computer-recognized text from the image along with coordinates that indicate the location of each word on that image
  3. a PDF file that includes both the image and searchable text
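
To make the second derivative more concrete, here is a minimal Python sketch of how word coordinates can be read out of such a file. It assumes ALTO-style markup, where each word is a String element with CONTENT, HPOS, VPOS, WIDTH, and HEIGHT attributes. The sample fragment is hand-written for illustration, not real ABBYY output, and the function names are made up for this example.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written ALTO-style fragment (illustrative only, not real OCR output).
SAMPLE_ALTO = """<alto>
  <Layout>
    <Page ID="P1" WIDTH="5000" HEIGHT="7000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Greensboro" HPOS="120" VPOS="300" WIDTH="410" HEIGHT="60"/>
            <String CONTENT="Daily" HPOS="550" VPOS="300" WIDTH="200" HEIGHT="60"/>
            <String CONTENT="News" HPOS="770" VPOS="300" WIDTH="190" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def word_boxes(alto_xml):
    """Return (text, x, y, width, height) for every String element."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for elem in root.iter():
        # Strip any XML namespace so the sketch works with or without one.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag != "String":
            continue
        boxes.append((
            elem.get("CONTENT"),
            float(elem.get("HPOS")),
            float(elem.get("VPOS")),
            float(elem.get("WIDTH")),
            float(elem.get("HEIGHT")),
        ))
    return boxes


def find_word(boxes, term):
    """Case-insensitive lookup of the boxes a viewer could highlight."""
    return [box for box in boxes if box[0].lower() == term.lower()]


if __name__ == "__main__":
    for box in find_word(word_boxes(SAMPLE_ALTO), "daily"):
        print(box)
```

A viewer that knows these boxes can draw a highlight over each matching word, which is roughly what makes search-term highlighting possible on a scanned page.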

Now that we have the derivatives, we begin filling out a spreadsheet with page-level metadata. We first add the metadata created using PaperBoy and then we run through the scans page by page, correcting any mistakes found in the PaperBoy output and adding further metadata. This also helps us quality control the scans and gives us a chance to find skipped pages.
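
As an illustration of the skipped-page check, the following Python sketch looks for gaps in the page numbers recorded for each issue. It is not our actual tooling: the row layout and the field names (date, edition, page) are assumptions made for the example.

```python
from collections import defaultdict


def find_skipped_pages(rows):
    """Map each (date, edition) issue to page numbers missing from its scans."""
    pages_by_issue = defaultdict(set)
    for row in rows:
        pages_by_issue[(row["date"], row["edition"])].add(int(row["page"]))

    gaps = {}
    for issue, pages in pages_by_issue.items():
        # Every page from 1 up to the highest scanned page should be present.
        missing = sorted(set(range(1, max(pages) + 1)) - pages)
        if missing:
            gaps[issue] = missing
    return gaps


if __name__ == "__main__":
    # Hypothetical page-level rows, as they might appear in the spreadsheet.
    rows = [
        {"date": "1934-03-02", "edition": 1, "page": 1},
        {"date": "1934-03-02", "edition": 1, "page": 2},
        {"date": "1934-03-02", "edition": 1, "page": 4},
        {"date": "1934-03-03", "edition": 1, "page": 1},
    ]
    print(find_skipped_pages(rows))  # {('1934-03-02', 1): [3]}
```

A check like this only catches gaps in the middle of an issue. A missing final page still has to be caught by looking at the scans themselves.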

How much metadata do we do? You can download a sample batch spreadsheet from GitHub, if you’re interested in the specifics, but it includes the PaperBoy output as well as fields like Title, our name (Digital Heritage Center) as batch-creators, and information about the print paper’s physical location. A lot of those fields stay the same across numerous scans or can be programmatically populated with a spreadsheet formula, to help make things go faster.

Once we have the spreadsheet and scans complete, scripts developed by our programmer (also available on GitHub) use those spreadsheets to figure out how to rearrange the files and metadata into packages structured just the way the NDNP standard likes them. The script breaks out each newspaper issue’s files into their own file folder, renaming and reorganizing the pages (if needed). The script also creates issue-level XML files, which tag along inside each folder. These XML files describe the issue and its relation to the batch, and include some administrative metadata about who created the files, etc.
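
The regrouping step can be sketched in a few lines of Python. This is not the script from our GitHub account, and the folder and file naming shown here is a simplified assumption rather than the exact NDNP layout. The title_id value, the column names, and batch_example are all hypothetical. The sketch only plans where each page would go, and it leaves out the issue-level XML.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_issue_folders(rows, batch_root="batch_example"):
    """Return {source filename: destination path}, one folder per issue."""
    issues = defaultdict(list)
    for row in rows:
        key = (row["title_id"], row["date"], int(row["edition"]))
        issues[key].append(row)

    plan = {}
    for (title_id, date, edition), pages in sorted(issues.items()):
        # Hypothetical folder name: date digits plus a two-digit edition number.
        folder = f"{date.replace('-', '')}{edition:02d}"
        # Renumber pages in reading order, whatever order they were scanned in.
        pages.sort(key=lambda r: int(r["page"]))
        for seq, row in enumerate(pages, start=1):
            suffix = PurePosixPath(row["file"]).suffix
            dest = PurePosixPath(batch_root, title_id, folder, f"{seq:04d}{suffix}")
            plan[row["file"]] = str(dest)
    return plan


if __name__ == "__main__":
    # Hypothetical spreadsheet rows; page 2 was scanned before page 1.
    rows = [
        {"file": "scan_0007.jp2", "title_id": "title0001",
         "date": "1934-03-02", "edition": 1, "page": 2},
        {"file": "scan_0008.jp2", "title_id": "title0001",
         "date": "1934-03-02", "edition": 1, "page": 1},
        {"file": "scan_0009.jp2", "title_id": "title0001",
         "date": "1934-03-03", "edition": 1, "page": 1},
    ]
    for src, dest in plan_issue_folders(rows).items():
        print(src, "->", dest)
```

Planning the moves as a plain mapping first makes it possible to review the result before any file is renamed or copied.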

Newspaper files before processing (left) and after (right).

The final steps are to load our NDNP-compliant batches into the software we use to present them online, and to quality control the metadata and scans.

If you think about it, newspapers have a helpfully consistent structure: date-driven volumes, issues, and editions. But there isn’t much else in the digital library world quite like them, so more common content management systems can leave something to be desired for both searching and viewing newspapers. Because of this, and because there’s just so MUCH newspaper content, we use a standalone system for our newspapers: the Library of Congress’ open-source newspaper viewer, ChronAm. The name comes from the site it was built for, the NDNP’s online presence: the Chronicling America website.

While not perfect, this viewer does really well exploiting newspaper structure. It also allows you to zoom in and out while you skim and read, and it highlights your search terms (courtesy of those XML files created by ABBYY). Try it out on the North Carolina Newspapers portion of our site.

“Can’t you just scan the newspaper and put it online as a bunch of TIFs or JPGs?” Sure. That happens. But that brings me back around to the why question. We love newspapers (most of the time) and love making it as easy and intuitive to use them as we can. We think it’s important to exploit their newspapery-ness, because that’s how users think of and search them.

We also believe that standards like the one from NDNP are kind of like the rules of the road. While off-roading can be fun, driving en masse enables us to be interoperable and sustainable. Standards mean we have a baseline of shared understanding that gives us a boost when we decide we want to drive somewhere together.

This post’s bird’s-eye view (perhaps a low-flying bird) doesn’t include more specific questions you may be asking (“What resolution do you use when you scan?” “You didn’t explain METS/ALTO!”). I also just tackled our print newspaper procedure, because it’s the most labor intensive. When we work with digitized microfilm and born-digital papers the procedure is truncated but similar.

I hope this post as well as part 1 and part 2 of this series give you a sense of what’s involved in our newspaper digitization process and why we do it the way we do. As always, we’re happy to talk more. Just drop us a line.


Looking Back at DigitalNC.org in 2014

Title page from the 1956 Buccaneer, from East Carolina College, the most popular item on DigitalNC.org in 2014.

The North Carolina Digital Heritage Center had a great year in 2014. We continued to work with partners around the state on digitization projects and added a wide variety of material to DigitalNC.org, making it easier than ever for users to discover and access rare and unique materials from communities all over North Carolina.

As we look back on our work over the past year, I wanted to share some of what we’ve learned by looking at our website usage statistics. Like many libraries, the Digital Heritage Center uses Google Analytics to capture information about what’s being used on our website, who’s using it, and how they got there. While there are still lots of questions remaining about usage of DigitalNC, these stats do give us a lot of valuable information.

In 2014, more than 250,000 users visited DigitalNC.org, resulting in more than 1.8 million pageviews. While people visited our website from computers located all over the world, the greatest number by far came from North Carolina. That’s what we expected and hoped to see. More than 200,000 sessions originated in North Carolina, with the users coming from 388 different locations, ranging from over 18,000 sessions in Raleigh and Charlotte to a single visit from the town of Bolivia in Brunswick County. User location is determined by the location of the user’s internet service provider, so this may not tell us exactly where our users are located, but it’s going to be close in most cases.

What did people use on DigitalNC? We were not surprised to find that the most popular collection remains our still-growing library of yearbooks. The North Carolina Yearbooks collection received more than 125,000 pageviews alone, followed by newspapers (44,000) and city directories (11,000). And we were pleased to learn that at least somebody is reading this blog, which received nearly 2,500 pageviews last year. The most popular blog post was our announcement about the digitization of a large collection of Wake County high school yearbooks.

We were also curious to see what single items were the most popular over the past year. The winner, with 438 pageviews, was the 1956 yearbook from East Carolina University. The second most popular was also from East Carolina, the 1930 Tecoan, followed by the 1961 yearbook from the Palmer Memorial Institute and the 1922 yearbook from Appalachian State University.

Lake Hideaway, ca. 1950s, the most popular photo on DigitalNC.org in 2014.

The most popular image on our site was from the Davie County Public Library: a black-and-white photo from the 1950s showing swimmers at Lake Hideaway in Mocksville. Other popular photos included a postcard showing the American Tobacco Company plant in Reidsville, N.C., a group of Stanly County students from 1912, and a portrait of Charles McCartney, the infamous “Goat Man” from the 1950s.

The variety of subjects, locations, and time periods in these photos is representative of the wide-ranging content available in North Carolina’s cultural heritage institutions and on DigitalNC.org. We are honored and excited to have a role in making this content accessible to everyone and look forward to sharing even more of North Carolina’s history and culture online in 2015.


Moving Image Digitization Project, 2014

The North Carolina Digital Heritage Center is launching a pilot project to help preserve and improve access to historic films and videos in North Carolina’s libraries, archives, and museums. Working with its partners around the state, the Center will select a small number of films and videos, which will then be sent to a vendor to be digitized. The resulting digital files will be published online at DigitalNC.org where they will be made freely available to all users. The original films or videos will be returned to the institutions that contributed them.

We are inviting our existing partners, as well as cultural heritage organizations that have not yet worked with the Center, to nominate moving images from their collections. (See http://www.digitalnc.org/about/participate/ to determine if your organization is eligible.) The Center will evaluate all of the nominations (see evaluation criteria) in an effort to select a variety of content that comes in different formats and represents the cultural and geographic diversity of North Carolina.

Contact the Digital Heritage Center at digitalnc@unc.edu or (919) 962-4836 if you are interested in suggesting material to digitize or if you have any questions.

Why Is this Just a Pilot Project?

Digitization and online streaming of historic films and videos is complicated and expensive. This project is an effort to determine the cost and viability of providing moving image digitization services to North Carolina Digital Heritage Center partners.

Why Is Everything Being Digitized by a Vendor?

Right now, the Digital Heritage Center has neither the equipment nor the expertise necessary to handle and digitize historic moving images. Working with an experienced vendor will be the most efficient and most affordable way for us to make this content available to users.

How Will the Vendor Be Chosen?

State laws require that we open up this project to a bidding process. While we do not know which vendors will bid or what prices they will offer, we will require that the work be done by a vendor that has experience working with rare and fragile materials.

What If I’m Not Comfortable Sending Materials From My Collection to a Vendor?

We understand that not every institution will want to send unique and fragile materials off site. However, for this project, we have decided that working with an experienced vendor is the best way for us to provide access to this content. Materials that cannot be sent to a vendor will not be selected for digitization as part of this project.

I’ve Got Films That Are in Pretty Bad Shape. Can I Still Suggest Those?

Yes. We understand that many of the historic films in libraries and archives are in poor condition. That’s part of why we want to provide a service like this. We will make sure that we work with a digitization vendor that has experience evaluating the condition of historic films and we will not proceed with digitization if the conversion process is going to harm the original.

What About Copyright?

We will work with each institution to help determine the copyright status of the items nominated for digitization. For films that were created by individuals or companies, we will ask the nominating institution to make an effort to get permission to have the film digitized and shared online.

How Long Will This Take?

We don’t know. That’s part of what we are going to determine as we work on this project. You should expect your materials to be off site for at least a few months.

How Many Films or Videos Will Be Digitized?

It depends. Format, condition, and length are all factors that will contribute to the cost of digitizing historic moving images. We will prioritize the films and videos we’ve selected and digitize as many as we can with what we’ve budgeted for this project.

Selection Criteria for the Moving Image Digitization Project, 2014

  • Is the film or video believed to be unique to your collection, or are there copies at other institutions?
  • Do you have equipment available to play the film or video?
  • Is the media believed to be at least 40 years old?
  • Are you willing to have the film or video sent to a vendor to be digitized?
  • Is there a catalog record or anything describing the content of the film or video?
  • Does the media cover a time period of historical significance? (For example: Civil War, Great Depression, World War II.)
  • Was the film or video created by, or does it contain significant content by or about one of North Carolina’s historically underrepresented communities?
  • Is the media from a county or region that is already represented on DigitalNC.org or other digital library projects?
  • Is there a demonstrated demand for online access to the film or video? If so, are there examples, such as requests from users or community members?
  • If this media is digitized, is the contributing institution willing to promote the media through press releases and other announcements or programs?


About

This blog is maintained by the staff of the North Carolina Digital Heritage Center and features the latest news and highlights from the collections at DigitalNC, an online library of primary sources from organizations across North Carolina.
