Processing MS Word Documents

The continued use of Word documents in authoring and content-generation processes has long been a headache for publishers and anyone else who wants to reuse the content of these documents in a clean, easy and structured way, or to repurpose it in other processes and situations.

Content extracted from Word is commonly used to produce high-quality typeset documents and to drive correction-cycle workflows that are better managed outside of Word. Many customers also require a Word file as a post-typeset deliverable, along with structured data files (such as XML/HTML), formatted PDF/PS and other formats such as EPUB or EDGAR filings.

Common pain points when handling Word documents, and the ways we can help overcome them, include:

  • Multi-source, mixed formats, inconsistent styling.
  • Incorporate Word styles, or map content to styles based on matched formatting properties.
  • Avoid multiple post-conversion clean-up operations with consistent, first-time conversions.
  • Configurable clean-up operations to merge currency columns and delete empty rows.
  • Intelligent conversion handling to detect table headings that may not have been set in Word and to rotate larger tables to landscape.
  • A rationalised conversion to clean and consistent outputs.
  • Word files generated from the Composition typeset file.
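The table clean-up operations mentioned above (merging currency columns and deleting empty rows) can be illustrated with a small Python sketch. This is not CT's implementation: the function names, the list-of-rows table model and the rule that a column counts as a "currency column" when its non-empty cells are all lone currency symbols are assumptions made purely for illustration.

```python
import re

# Assumed rule: a cell that holds only a currency symbol belongs with the
# value in the column to its right.
CURRENCY_ONLY = re.compile(r"^[$£€¥]$")


def drop_empty_rows(rows):
    """Remove rows in which every cell is empty or whitespace."""
    return [row for row in rows if any(cell.strip() for cell in row)]


def merge_currency_columns(rows):
    """Fold currency-symbol columns into the column to their right."""
    if not rows:
        return rows
    width = max(len(row) for row in rows)
    # Pad ragged rows so every row has the same number of cells.
    padded = [row + [""] * (width - len(row)) for row in rows]

    # A column qualifies when it has at least one non-empty cell and all of
    # its non-empty cells are lone currency symbols. The last column is
    # never merged because it has no right-hand neighbour.
    currency_cols = set()
    for col in range(width - 1):
        cells = [row[col].strip() for row in padded if row[col].strip()]
        if cells and all(CURRENCY_ONLY.match(c) for c in cells):
            currency_cols.add(col)

    merged = []
    for row in padded:
        out, prefix = [], ""
        for col, cell in enumerate(row):
            if col in currency_cols:
                prefix += cell.strip()  # carry the symbol forward
                continue
            out.append(prefix + cell)
            prefix = ""
        merged.append(out)
    return merged


table = [
    ["Item", "", "2023"],
    ["Revenue", "$", "1,200"],
    ["", "", ""],
    ["Costs", "$", "800"],
]
cleaned = merge_currency_columns(drop_empty_rows(table))
print(cleaned)
```

Running the empty-row pass first keeps blank rows from influencing the column check; in a real conversion both rules would be switched on or off through configuration.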

Introducing CT Conversion Software and Editor Functionality

CT includes an MS Word document processing tool for the extraction of data and embedded content and re-generation of Word documents.

The content is extracted into clean, structured and validated XML or SML* tagged data formats for use in typesetting or data processing. These same formats can then be taken back into CT to regenerate the Word file: the ‘round trip’.

The CT software provides many configurable options for how the content is extracted and which elements of the document are preserved or ignored. This helps to produce clean and consistent output, whatever the condition of your source documents and however much they vary. It also provides an editing environment combined with an integrated XML and SML validation engine and tag set reference documentation (for SML), giving you a single conversion and clean-up environment before structured data is fed into your production workflow.

The XML output format is a minimal structure we call SMX. It provides a streamlined yet rich XML-compliant format that can be used directly, or easily transformed to fit a customer’s existing schema or DTD via a built-in XSLT pipeline.
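As a rough illustration of the kind of mapping such a transform performs, the Python sketch below renames elements from one tag set to another. Python's standard library has no XSLT processor, so ElementTree stands in for a stylesheet here. The element names (`doc`, `head`, `para`) and the target names are invented for the example; they are not the real SMX tag set or any particular customer schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source element names to a customer's schema.
TAG_MAP = {"doc": "article", "head": "title", "para": "p"}


def remap_tags(xml_text, tag_map):
    """Rename elements according to tag_map, leaving attributes and text intact."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Elements without a mapping keep their original name.
        elem.tag = tag_map.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


source = '<doc><head>Report</head><para style="Body">Hello</para></doc>'
result = remap_tags(source, TAG_MAP)
print(result)
```

A real XSLT stylesheet can also restructure, reorder and enrich content; simple renaming is only the most basic case.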

Main Feature Summary

  • Converts documents to structured tagged formats and extracts content and embedded objects.
  • Converts tagged formats and objects back to Word format.
  • Simple drag and drop GUI or command line driven options.
  • Based on the latest .NET and Java technologies.
  • XML and SML output formats.
  • Regex, XSLT, and Perl post-processing output customisation options.
  • Multiple configuration options for cleaning, standardising and enriching outputs from multiple, mixed MS Word documents.
  • Ability to bring through all Word paragraph styles, where a Word document has been styled correctly, so that they can be used within the typesetting workflow.
  • Integrated with a Validation and Comparison Engine for post-transform tag set validation and document comparison options.
  • Scalable and customisable docx processing solution that can integrate into many composition and typesetting environments and workflows.
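To make the idea of bringing paragraph styles through from a docx file more concrete, here is a minimal Python sketch. It relies only on the published file format: a .docx file is a ZIP archive whose main text lives in `word/document.xml`, and a paragraph's style is recorded in `w:pPr/w:pStyle`. It is not how CT works internally. The function names are hypothetical, and the demo archive is a bare-bones stand-in containing only the document part, not a file Word itself would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def paragraph_styles(docx_file):
    """Return (style_id, text) for each paragraph in the main document part."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    results = []
    for para in root.iter(f"{{{W_NS}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        # w:val holds the style ID; unstyled paragraphs have no w:pStyle.
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        results.append((style_id, text))
    return results


def demo_docx():
    """Build an in-memory archive holding only word/document.xml (demo only)."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        "<w:r><w:t>Results</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Plain text</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


styles = paragraph_styles(demo_docx())
print(styles)
```

The value read from `w:pStyle` is the style ID rather than the display name shown in Word; mapping IDs to names would require reading `word/styles.xml` as well.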

Free Trial Download
Request a no-obligation, fully functional 14-day free trial: Evaluations

Pricing and Purchase
See our Products page for more information about the different CT product options and pricing, and to buy online.

More about SML
SQUAREMOONS SML is a structured tagged data format developed specifically for our solution platform and workflow. It is a non-XML format designed as an intuitive data tagging and coding markup language. It can be fully validated and documented by our purpose-built, integrated SML data validation and comparison engine, which works seamlessly inside and outside of a composition environment.

SML can be utilised for typesetting within different applications such as Arbortext APP and Adobe InDesign; sample screenshots are shown below.