Converting Content to XML: the Truth You Should Know

Recently, we finished several conversion projects. These projects involved migrating legacy content from unstructured FrameMaker to DITA and to clients' proprietary XML standards.

Although there were some challenges, it was fun. We found interesting new solutions and obtained good results. We used Mif2Go and FrameMaker conversion tables, along with our own scripts, and achieved a fast-paced conversion. The feedback that we received from the customers indicates they are as pleased with the results as we were.

So, it’s a good time to summarize and share our experience with you. The most frequently asked questions are about the effort required to make a conversion project successful. The conversations at our latest Techshoret conference confirmed that this is one of the main issues people weigh when considering a move to DITA.

To understand whether or not a conversion will be painless, you have to realize that every conversion of unstructured content to XML is style based. Whether you convert Word to DocBook, or FrameMaker to DITA, whether you use Mif2Go or FrameMaker conversion tables, you should map styles used in legacy documents to XML elements. For example, you define content formatted with the Heading1 style to be wrapped into the title element, text formatted with the Note style goes to the note element, and so on.
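To make the idea concrete, here is a minimal sketch of style-based mapping in Python. It is not how Mif2Go or FrameMaker conversion tables work internally; the style names, the element names, and the (style, text) input format are assumptions chosen for illustration.

```python
from xml.sax.saxutils import escape

# Hypothetical style-to-element map; adjust names to your own template.
STYLE_MAP = {
    "Heading1": "title",
    "Body": "p",
    "Note": "note",
}

def convert(paragraphs):
    """Wrap each (style, text) pair in the element mapped to its style.

    Returns the XML lines plus the list of styles that had no mapping,
    so that unmapped content is reported instead of silently guessed.
    """
    lines, unmapped = [], []
    for style, text in paragraphs:
        element = STYLE_MAP.get(style)
        if element is None:
            unmapped.append(style)
            continue
        lines.append(f"<{element}>{escape(text)}</{element}>")
    return lines, unmapped
```

Note that the function never looks at what the text says. A paragraph tagged Body that happens to be a note comes out as a plain paragraph.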

So, the success of a conversion project mainly depends on whether or not you used styles in your legacy documents properly and consistently. Remember, computers do not understand semantics on their own. They rely on styles instead. Computers cannot recognize a portion of content as a note, unless you apply the note format to the content.

To make a conversion as painless as possible, I recommend the following:

1. Get your legacy content into order. XML standards usually impose certain rules on how information should be structured and organized. For example, an XML standard can require that you provide some background information for a procedure before the actual steps. If the structure of your legacy content doesn’t meet this content model, you’ll have to rewrite your procedures appropriately.

If you are converting content to a topic-based model, such as DITA, define the boundaries of a topic. The simplest way is to treat every heading style as the start of a new topic. (Be aware, however, that this doesn’t work 100% of the time.) Alternatively, you can split legacy content into individual topics at first-level headings only, and treat nested section headings as subsections of a topic. In any event, make sure that headings are distinctively and consistently formatted.
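The first-level-heading approach can be sketched roughly as follows. The Heading1 style name and the (style, text) input format are assumptions, and a real tool would also need to handle nested headings.

```python
def split_into_topics(paragraphs, topic_style="Heading1"):
    """Group (style, text) pairs into topics, starting a new topic
    at every paragraph tagged with topic_style."""
    topics = []
    for style, text in paragraphs:
        if style == topic_style:
            topics.append({"title": text, "body": []})
        elif topics:
            topics[-1]["body"].append((style, text))
        else:
            # Content that appears before the first heading goes into
            # an untitled topic so that nothing is dropped.
            topics.append({"title": None, "body": [(style, text)]})
    return topics
```

If a heading was typed in Body and bolded by hand, this function will not see it, and two topics will be merged into one.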

2. Get rid of redundant and repetitive content. This is especially important when you migrate to DITA or another XML standard that provides effective mechanisms for content reuse. There is no need to waste time converting multiple pieces of content that provide the same information and are merely phrased differently.

You need to convert only one version of the content. You will probably want to rewrite it in a unified way so that you can reuse it in multiple contexts.

3. Mark up information types. The XML standard to which you convert the documentation may require different ways for structuring information, depending on the type of content. (For example, DITA provides different content models for procedures and reference materials.) Alternatively, an XML standard may require that the information type should be declared as a value of a special attribute.

In any event, in your legacy content, you have to manually mark up the content to indicate the information type for each portion of the content. In FrameMaker, for example, you can do it using markers. A conversion tool will use these markers to automatically wrap the content into the upper level elements that are valid for the information type.

4. Use formatting styles consistently. For example, if you italicize words using a style called Italicize, keep using it throughout all your documents. Don’t create a duplicate style with another name, say, Emphasis, which has the same format.

Similarly, if you use a special format for “to do” statements, format all “to do” statements the same way. Don’t leave some of them formatted with Body or other styles.
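A quick audit can reveal duplicate styles before conversion. The sketch below assumes you have already exported your style catalog into a dictionary of style names and format properties; the property names are hypothetical.

```python
from collections import defaultdict

def find_duplicate_styles(styles):
    """Return groups of style names whose format properties are identical.

    styles maps a style name to a dict of format properties.
    """
    groups = defaultdict(list)
    for name, props in styles.items():
        # Sort the properties so that key order does not matter.
        groups[tuple(sorted(props.items()))].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

For a catalog in which Italicize and Emphasis both have only an italic angle, the function reports the two names as one group, which tells you to merge them before mapping styles to elements.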

5. Don’t use manual or ad hoc formatting. Again, computers are not that smart; they rely on styles to identify the XML element to which a portion of content should be converted. If you use CTRL+B to bold text rather than the format specifically designed for bolding, this text won’t be wrapped in an appropriate XML element and thus won’t be bolded in your output.

6. Distinctively format every piece of content that must be translated into a specific XML element. For example:

  • If a note in the original document must be converted to the note element, format it with a style created specifically for notes. You can’t use the regular Body format and expect a conversion tool to somehow recognize the text as a note.
  • If you use numbered lists both in procedures and in conceptual descriptions, and the target XML standard has different elements for numbered lists in procedures (for example, steps) and in other contexts (for example, li), use two different styles.
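Here is a rough sketch of why two list styles matter. The style names StepNumbered and ListNumbered are hypothetical, and the output is simplified DITA-like markup rather than a complete, valid topic.

```python
from itertools import groupby
from xml.sax.saxutils import escape

# Hypothetical list styles, each mapped to a wrapper element and an item template.
LIST_MAP = {
    "StepNumbered": ("steps", "<step><cmd>{}</cmd></step>"),
    "ListNumbered": ("ol", "<li>{}</li>"),
}

def wrap_lists(paragraphs):
    """Wrap runs of consecutive list paragraphs in the wrapper element
    that their style maps to; everything else becomes a plain <p>."""
    out = []
    for style, group in groupby(paragraphs, key=lambda p: p[0]):
        texts = [escape(text) for _, text in group]
        if style in LIST_MAP:
            wrapper, item = LIST_MAP[style]
            items = "".join(item.format(t) for t in texts)
            out.append(f"<{wrapper}>{items}</{wrapper}>")
        else:
            out.extend(f"<p>{t}</p>" for t in texts)
    return out
```

If both kinds of lists shared a single style, the tool would have no way to choose between the two wrappers.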

7. Use conditional tags consistently. Most likely, you want to preserve conditional text. In XML, content can be conditionalized using XML attributes. This means that you have to map conditional tags to the corresponding attributes and their values. For example, you can define that content marked up with the conditional tag ProductA must be wrapped in an element with the Product attribute set to “ProductA”. To ensure that all conditional text is converted to XML, make sure you use conditional tags consistently.
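Such a mapping might look like the sketch below. The AdminOnly tag and the attribute values are assumptions; I use the lowercase attribute names product and audience, as in DITA, but your target standard may name them differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from conditional tags to (attribute, value) pairs.
CONDITION_MAP = {
    "ProductA": ("product", "ProductA"),
    "AdminOnly": ("audience", "administrator"),
}

def to_element(tag, text, conditions=()):
    """Build an element and set one attribute per mapped conditional tag.

    Returns the element plus the list of conditional tags with no mapping.
    """
    elem = ET.Element(tag)
    elem.text = text
    unmapped = []
    for cond in conditions:
        if cond in CONDITION_MAP:
            attr, value = CONDITION_MAP[cond]
            existing = elem.get(attr)
            # Several tags may feed the same attribute; join values with spaces.
            elem.set(attr, f"{existing} {value}" if existing else value)
        else:
            unmapped.append(cond)
    return elem, unmapped
```

A tag that is missing from the map, such as a misspelled variant of ProductA, ends up in the unmapped list, which is exactly the inconsistency you want to catch before conversion.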

8. Define mapping for markers, if necessary. Sometimes, you may want to preserve some markers (for example, index markers) in XML and store them as attributes or XML elements. The specific implementation of marker mapping differs depending on the conversion tools you use, but you’ll have to map the markers anyway. Don’t expect a computer to do it for you if you haven’t given it specific instructions.

Of course, these are general considerations. Each XML standard has its own requirements, and your content model may be less restrictive, but if you take these tips into account, you will be off to a good start.

If you want to see how conversion works, come to the FrameMaker user group meeting that will be held on Sunday, March 15th at 6pm at Valor in Yavne. Adam Dales from Siemens and I (WritePoint) will be presenting a case study on the conversion project we completed for Siemens. We’ll tell you about the challenges we faced, the solutions we found, and the tools we used. You’ll learn a lot of new things, I promise.