2021-11-17 Rob Underwood

Lessons Learned from Capturing XML

XML may be the most basic content tagging language, but there is a lot to discuss around what it is and how Naviga solves challenges associated with it.

XML may be a basic content language, but there is a lot to say about its relationship with the publishing industry. In this blog post, I’ll take a deep dive into XML, my interactions with it, and how Zinio and Naviga have solved the challenges that come with it.

What is XML?

I guess the proper way to start this topic is to not assume you know what XML is… 

XML stands for Extensible Markup Language. A markup language is a set of codes, or tags, that describes the text in a digital document. The most famous markup language is Hypertext Markup Language (HTML), which is used to build web pages.
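To make that concrete, here is a small sketch in Python that reads a scrap of XML with the standard library's xml.etree.ElementTree module. The tag names (article, headline, byline, body) are made up for illustration. XML lets you define whatever tags suit your content.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up article. The tag names are illustrative only.
SAMPLE = """
<article>
  <headline>Lessons Learned from Capturing XML</headline>
  <byline>Rob Underwood</byline>
  <body>XML describes what each piece of text is.</body>
</article>
"""

root = ET.fromstring(SAMPLE)

# Each tag says what its text is, not how it should look.
parts = {child.tag: child.text for child in root}

for tag, text in parts.items():
    print(f"{tag}: {text}")
```

The point is that the tags describe what each piece of text is, which leaves any channel free to decide how to present it.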

In the publishing industry, distributing content across its two channels, digital and print, has always been a challenge.

Once a magazine or newspaper is printed, the fully edited content lives in InDesign. The challenge is: how do we get that content from finished pages onto our web pages? There had to be a better solution than copying and pasting.

The answer to the question is that we either need to get the XML out of InDesign or we need to have it in the XML format prior to placing it in InDesign – the chicken or the egg conundrum of our time. 

It just so happens that I now work for Naviga, which has been solving that problem for a very long time! Naviga has created systems where writers create content in a web portal and enter information into fields for headline, dek, byline, body, sidebar, pull quote, photo credit, and so on. Unbeknownst to the writers, they have just created a fully tagged XML structure.
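Here is a minimal sketch of that idea in Python. It is not Naviga's actual code: the function name story_to_xml, the story wrapper tag, and the sample values are all assumptions for illustration. It simply wraps each filled-in field in a tag named after the field.

```python
import xml.etree.ElementTree as ET

def story_to_xml(fields):
    """Wrap each non-empty form field in a tag named after the field."""
    story = ET.Element("story")
    for name, value in fields.items():
        if value:  # skip fields the writer left blank
            ET.SubElement(story, name).text = value
    return ET.tostring(story, encoding="unicode")

# Hypothetical form input; field names mirror the ones listed above.
fields = {
    "headline": "Lessons Learned from Capturing XML",
    "dek": "What XML is and why publishers care",
    "byline": "Rob Underwood",
    "body": "XML may be the most basic content tagging language.",
    "pullquote": "",
}

xml_text = story_to_xml(fields)
print(xml_text)
```

The writer only ever sees a form, yet what comes out is structured content that any downstream channel can read.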

These stories can then flow into other channels, such as a web page, InDesign, apps, Apple News Plus, Amazon, or any other digital channel that exists now or will exist in the future!

 

My relationship to XML

I never could have guessed how important XML would be in my career.

In 2009, I was handed a book: A Designer’s Guide to Adobe InDesign and XML: Harness the Power of XML to Automate Your Print and Web Workflows. It still sits on the bookshelf in my office.

Fun fact: I actually ended up working with Cathy Palmer, one of the book’s authors, several times over the years!

I was intrigued by the possibilities of XML as a language that could give the publishing industry a solution to its most basic problem: repurposing content.

This one book took hold of my imagination and has guided my career. I have taught a course on XML, and at every job I have had over the years, working with XML has been part of my role in publishing.

Zinio and XML

When I interviewed for my current role at Zinio in 2015, I saw what Zinio was doing for its publishers and knew this was where I needed to be.

First, Zinio was taking the finished PDFs and creating XML for the publishers.

This was the most ingenious idea! Build a whole tagged database of stories for our publishers and take the worry of distributing to multiple digital platforms right off their plate.

Second, Adobe DPS was nearing the end of its shelf life. Publishers wanted a low-touch solution for apps on Google Play, iOS, and other digital platforms, both those they could foresee and those they could not, such as Apple News Plus. Zinio offered this solution while even feeding the tagged data back to them via an API!


XML challenges and how we solved them  

Turning a PDF into XML can be a challenge. Anyone who has ever tried to extract copy from a PDF will understand this.

What we have done over the years is write algorithms that capture content based on how certain content types, such as body and headline, appear on the page. We can base these algorithms on font, color, size, and similar attributes.
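A rule-based approach like this can be sketched in a few lines of Python. This is only an illustration, not our production algorithm: the TextRun class, the font names, and the size thresholds are all hypothetical, and a real rule set would also look at color and position.

```python
from dataclasses import dataclass

@dataclass
class TextRun:
    text: str
    font: str
    size: float  # point size

# Hypothetical rules for one publication: (tag, font, min_size, max_size).
# Rules are checked in order, so the first match wins.
RULES = [
    ("headline", "Garamond-Bold", 24.0, 200.0),
    ("dek", "Garamond-Italic", 12.0, 24.0),
    ("body", "Garamond", 8.0, 12.0),
]

def classify(run, rules=RULES):
    """Return the tag of the first matching rule, or 'unknown'."""
    for tag, font, min_size, max_size in rules:
        if run.font == font and min_size <= run.size < max_size:
            return tag
    return "unknown"

runs = [
    TextRun("Lessons Learned", "Garamond-Bold", 36.0),
    TextRun("A deep dive into XML", "Garamond-Italic", 14.0),
    TextRun("XML stands for...", "Garamond", 9.5),
    TextRun("ADVERTISEMENT", "Helvetica", 7.0),
]

tags = [classify(r) for r in runs]
print(tags)
```

Anything that matches no rule comes back as unknown, which is a reasonable signal that a person should take a look.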

To make the XML capture process much cleaner, we allow our publishers to deliver InDesign documents to us along with the PDF. We capture the cover and all full-page ads from the PDF, and get the stories right out of InDesign. This makes for a cleaner capture, especially where hyphenation is concerned. Additionally, the images have not been washed through the PDF process.

Because of advances in file transfer over the years, uploading 5 GB of packaged InDesign documents can be handled in less than 20 minutes, not the 8 hours it took a decade ago.

If paragraph styles are used consistently within InDesign, we can map styles to tags, which makes the process even easier on our end.
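A style-to-tag mapping can be as simple as a lookup table. The Python sketch below assumes the paragraphs have already been exported from InDesign as (style name, text) pairs. The style names, the tag names, and the tag_paragraphs helper are hypothetical.

```python
# Hypothetical InDesign paragraph style names mapped to XML tags.
STYLE_TO_TAG = {
    "Head-Feature": "headline",
    "Dek": "dek",
    "Byline": "byline",
    "Body-First": "body",
    "Body": "body",
    "Sidebar-Body": "sidebar",
}

def tag_paragraphs(paragraphs, style_map=STYLE_TO_TAG):
    """Turn (style, text) pairs into (tag, text) pairs.

    Styles missing from the map are collected so someone can review them.
    """
    tagged, unmapped = [], set()
    for style, text in paragraphs:
        tag = style_map.get(style)
        if tag is None:
            unmapped.add(style)
            continue
        tagged.append((tag, text))
    return tagged, unmapped

paragraphs = [
    ("Head-Feature", "Lessons Learned from Capturing XML"),
    ("Byline", "Rob Underwood"),
    ("Body-First", "XML stands for extensible markup language."),
    ("Body", "A markup language is a set of tags."),
    ("Normal", "Stray text in an unstyled frame"),
]

tagged, unmapped = tag_paragraphs(paragraphs)
print([tag for tag, _ in tagged])
print(unmapped)
```

This is also why consistency matters: a paragraph left in a default style has nothing to map to and ends up on the review pile.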

We also accept a flat plan, which can correct a number of things:

  1. We can mark ads so that advertisements don’t accidentally get captured as editorial. 
  2. We can make a better interactive TOC if all section names are listed. 
  3. We can control how sidebars are handled, and whether or not they become their own story. 
  4. We can separate two stories on the same page into separate stories. 
  5. We can pass other notes along to the team for a better capture. 
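To show how a flat plan can drive the first two items, here is a rough Python sketch. The data layout is an assumption: a real flat plan usually arrives as a spreadsheet or PDF, and the kind and section keys are invented for this example. Sidebar handling and story splitting would need further fields.

```python
# A hypothetical flat plan: one entry per page.
FLAT_PLAN = {
    1: {"kind": "cover"},
    2: {"kind": "ad"},
    3: {"kind": "editorial", "section": "Features"},
    4: {"kind": "editorial", "section": "Features"},
    5: {"kind": "ad"},
    6: {"kind": "editorial", "section": "Reviews"},
}

def editorial_pages(plan):
    """Pages that should be captured as stories (ads and cover excluded)."""
    return [page for page, info in sorted(plan.items())
            if info["kind"] == "editorial"]

def toc_sections(plan):
    """Section name -> first page of that section, for an interactive TOC."""
    toc = {}
    for page, info in sorted(plan.items()):
        section = info.get("section")
        if section and section not in toc:
            toc[section] = page
    return toc

pages = editorial_pages(FLAT_PLAN)
toc = toc_sections(FLAT_PLAN)
print(pages)
print(toc)
```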

Templates: Because all of your content is tagged, none of it needs to be styled by hand after the capture. We create templates that automatically apply your styling.
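The idea behind a template can be sketched in Python as a mapping from each tag to the markup that styles it. This is an illustration only: the TEMPLATE dictionary, the CSS class names, and the render function are assumptions, and real templates are considerably richer.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical template: each tag maps to an HTML wrapper with a CSS class.
TEMPLATE = {
    "headline": '<h1 class="headline">{}</h1>',
    "byline": '<p class="byline">{}</p>',
    "body": '<p class="body">{}</p>',
}

def render(story_xml, template=TEMPLATE):
    """Render each tagged element with the wrapper for its tag."""
    root = ET.fromstring(story_xml)
    lines = []
    for child in root:
        wrapper = template.get(child.tag)
        if wrapper:  # tags without a template entry are skipped
            lines.append(wrapper.format(escape(child.text or "")))
    return "\n".join(lines)

story = (
    "<story>"
    "<headline>XML &amp; Publishing</headline>"
    "<byline>Rob Underwood</byline>"
    "<body>Tag once, style anywhere.</body>"
    "</story>"
)

html_out = render(story)
print(html_out)
```

Swapping in a different template restyles every story without touching the content, which is what makes applying multiple templates to different sections practical.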

What all of this amounts to is little to no work for the publisher. If the publisher does wish to create a more immersive digital experience, they can now spend their time adding video content and applying multiple templates to different stories or sections.

In conclusion, XML has saved the publishing world countless hours of labor.

XML has changed my life, and the lives of many others in the publishing industry, for the better! Zinio has used XML ambitiously to speed up digital delivery for publishers, now and in the future. Our commitment to you is that we will continue to streamline that workflow and make digital publishing an automated part of your delivery.

Please contact us if you are interested in the many ways that we can transform your content and deliver it digitally on your behalf!