Generating Documents with the Open XML SDK – Part 1

Recently I was asked to create a document generation engine for a loan application and quote system at work. Our customers needed to enter some basic information about a loan applicant and, at a later time, receive a URL to the bundle of PDFs representing the data-rich documents they needed to send to their client.

The problem outlined:

Generally speaking, the templates were already in .rtf, .doc or .docx form. Many included old mail merge and form fields, as well as embedded formulas and the like. Additionally, the following cases needed to be catered for:

Repeating sections
Data-bound tables
Template composition
Turning sections on and off based on data
Image injection
Data-bound fields

Several documents need to be created at once, then zipped up to form a downloadable bundle for the end user to consume. The killer was that the end users wanted to make templates using familiar tools.

We had a variety of options at our disposal: XSL-FO, HTML via a view engine like Spark, InfoPath, or even Adobe Forms. The determining factor for us was that when end users want to design documents, 90% of the time they want to do it in MS Word. Generally speaking, our users were familiar with defining ‘what they want’ in Word. As a design tool, Word’s typographic capabilities lie somewhere between XSL-FO and Adobe InDesign or QuarkXPress.

Unfortunately, using Word has its drawbacks. Given source documents that contain various conflicting methods of data binding, domain logic in the form of formulas and form fields, and file creation dates in the early 90s, the potential for mandelbugs is extremely high.

In the past, automating Word server-side with .NET meant hacking about with COM Interop or VBA or both. While it was a practical approach, it often meant a number of things:

Server side installation and licensing of Office products
Difficulties in managing instances of winword.exe, excel.exe etc.
COM Interop libraries were designed with Visual Basic in mind, making development in C# difficult.
Generally high levels of excruciating, eye-popping, pain.

A new solution

The Open XML SDK, now in its second version, is a suite of tools including a flexible API to generate documents, and a Reflector-esque tool that shows how an Office document is constructed. This suite is designed to cater for document creation; it will not automate user interactions with PowerPoint, but it will make awesome documents from scratch, and it will do it faster than you can say “No more PIAs!”.

This comprehensive API gives you the flexibility to inject content as XML directly, or to create content using typed classes. Finally, LINQ to XML works brilliantly, and VB developers could even take full advantage of XML literals and IntelliSense for XML Schemas if desired.

Developer prerequisites:

Open XML SDK v2.0 (get the productivity tool here too)
Content Control Toolkit
Visual Studio 2008 or above

Benefits

No need to have licensed products installed on servers to generate documents from templates.
Templating in a familiar editor
Can convert various formats of documents into templates
Super awesome fast
Verifiable output – output can be schema verified
Extensible
Testable, maintainable code

Drawbacks

No ability to automate Word itself or to inspect paginated output.
Requires a basic understanding of XML & XPath queries
Code required.

A design emerges

User interaction

The process begins when a customer interacts with the user interface to enter relevant information about the documents to be created. Additional information about the request is sourced from any existing data available. The interface stores these requests for documents as jobs in a queue.

User-interface-driven document generation jobs: Our customers interact with our software to request a bundle of generated, data-rich documents. These are stored as jobs in a queue.

Job processing

A separate job service polls this collection of jobs for new work to perform, fetches any required data, flattens the data into presentation models, and delegates to relevant ‘DocumentBuilders’ to create the documents themselves.

The last stage involves converting the documents to PDF, moving the resultant documents into a folder structure which is then zipped, moved and linked to.

Job processing: The job processor polls the job queue for docgen jobs, and chooses the required document builder(s) to execute the job.
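The job-processing flow described above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the actual service: the `DocGenJob` shape, the builder functions and the `BUILDERS` registry are all hypothetical names, the builders return placeholder bytes rather than real Word documents, and the PDF conversion step is left out entirely.

```python
import io
import zipfile
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DocGenJob:
    """A queued request for a bundle of documents (hypothetical shape)."""
    job_id: str
    document_types: List[str]  # which builders to run for this job
    data: Dict[str, str]       # data already flattened for presentation


# A builder takes presentation data and returns the bytes of one document.
Builder = Callable[[Dict[str, str]], bytes]


def build_quote(data: Dict[str, str]) -> bytes:
    # Stand-in for a real document builder; returns placeholder bytes.
    return f"Quote for {data['applicant']}".encode("utf-8")


def build_loan_application(data: Dict[str, str]) -> bytes:
    # Stand-in for a second, unrelated document type.
    return f"Loan application: {data['applicant']}, {data['amount']}".encode("utf-8")


# Each builder is responsible for exactly one type of document.
BUILDERS: Dict[str, Builder] = {
    "quote": build_quote,
    "loan_application": build_loan_application,
}


def process_job(job: DocGenJob) -> bytes:
    """Run each required builder and zip the results into one bundle."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        for doc_type in job.document_types:
            builder = BUILDERS[doc_type]
            # A real pipeline would convert to PDF before adding the file.
            bundle.writestr(f"{job.job_id}/{doc_type}.txt", builder(job.data))
    return buffer.getvalue()


def drain(queue: deque) -> Dict[str, bytes]:
    """Process every job currently in the queue, oldest first."""
    bundles: Dict[str, bytes] = {}
    while queue:
        job = queue.popleft()
        bundles[job.job_id] = process_job(job)
    return bundles
```

A polling service would call something like `drain` on a timer and then move each resulting bundle to wherever the download URL points.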

Document Building

The document builders create a Word document based on a template document and an XML representation of the data to be injected. They do this following an MVC pattern of sorts: the template is just a view; it has knowledge of data bindings and that’s about it. The document builder is the controller; it initializes the process and passes data to the template, as well as orchestrating post-data-binding manipulations of the template. The model comes in the form of a POCO, which is ultimately serialized to XML and injected into the view by the controller.

To clarify, each document builder is responsible for generating one type of document. They may have intimate knowledge of the view and the model; they are by no means generic. However, there are generic patterns we can apply to common design issues and I will get to those in a later post.
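To make the "model serialized to XML" step concrete, here is a small sketch. It uses a Python dataclass as a stand-in for the POCO and the standard library's `xml.etree.ElementTree` as the serializer; the `QuoteModel` class, its fields and the `to_xml` helper are assumptions made for illustration, not the models used in the project.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields


@dataclass
class QuoteModel:
    """A flat presentation model for one document (hypothetical fields)."""
    applicant_name: str
    loan_amount: str
    interest_rate: str


def to_xml(model, root_name: str) -> str:
    """Serialize a flat dataclass to XML, one child element per field."""
    root = ET.Element(root_name)
    for field in fields(model):
        child = ET.SubElement(root, field.name)
        child.text = str(getattr(model, field.name))
    return ET.tostring(root, encoding="unicode")


xml_payload = to_xml(QuoteModel("A. Example", "250000", "6.5"), "quote")
```

Keeping the model flat means every value the template needs sits at a short, predictable path, for example `/quote/applicant_name`.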

Content Controls and CustomXmlParts

At the heart of the design is the concept of content controls: these are a feature of MS Word that allows us to use placeholders in a document and bind data to them. I also use them to allow manipulations to the document beyond simple data-binding.

CustomXmlParts are equally integral; these are the buckets into which we pour our view models. Once a CustomXmlPart is hydrated, the content controls in a Word document can data bind to nodes in it via XPath queries.
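The relationship between a content control and a CustomXmlPart can be illustrated with a short sketch. In the document markup, a content control is a `w:sdt` element, and its binding is a `w:dataBinding` element carrying an XPath and the ID of the custom XML store. Word resolves these bindings itself when it opens the file; the Python below only mimics that resolution by hand to show how the pieces relate. The fragment, the placeholder GUID, the sample data and the `resolve` and `bind` helpers are all made up for illustration, and `resolve` handles only simple `/root/child` paths.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A content control (w:sdt) whose w:dataBinding points at a node in the
# custom XML part. The storeItemID GUID is a placeholder for illustration.
DOCUMENT_FRAGMENT = """<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:sdt>
    <w:sdtPr>
      <w:dataBinding w:xpath="/quote/applicant_name"
                     w:storeItemID="{00000000-0000-0000-0000-000000000000}"/>
    </w:sdtPr>
    <w:sdtContent>
      <w:p><w:r><w:t>[applicant name]</w:t></w:r></w:p>
    </w:sdtContent>
  </w:sdt>
</w:body>"""

# The data that would live in the custom XML part of the package.
CUSTOM_XML_PART = "<quote><applicant_name>A. Example</applicant_name></quote>"


def q(name: str) -> str:
    """Return the namespace-qualified tag or attribute name ElementTree uses."""
    return "{" + W + "}" + name


def resolve(xpath: str, data_root: ET.Element) -> str:
    """Resolve a simple absolute path such as /quote/applicant_name.

    ElementTree supports only a subset of XPath and no absolute paths,
    so this sketch walks the path one step at a time.
    """
    parts = xpath.strip("/").split("/")
    if parts[0] != data_root.tag:
        raise ValueError(f"path {xpath!r} does not start at <{data_root.tag}>")
    node = data_root
    for name in parts[1:]:
        node = node.find(name)
        if node is None:
            raise ValueError(f"no node found for {xpath!r}")
    return node.text or ""


def bind(body: ET.Element, data_root: ET.Element) -> None:
    """Copy bound values from the data into each content control's text."""
    for sdt in body.iter(q("sdt")):
        binding = sdt.find(q("sdtPr") + "/" + q("dataBinding"))
        content = sdt.find(q("sdtContent"))
        if binding is None or content is None:
            continue
        value = resolve(binding.get(q("xpath")), data_root)
        for text_node in content.iter(q("t")):
            text_node.text = value  # replace the placeholder text
            break
```

Calling `bind(ET.fromstring(DOCUMENT_FRAGMENT), ET.fromstring(CUSTOM_XML_PART))` replaces the placeholder text with the applicant's name, which is roughly the effect a user sees when Word opens a bound template.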

Where to go from here

In my next few posts, I’ll dive deeper into the preparation of templates, data binding them to XML, various tools I use, and the Document Builders themselves. Along the way I’ll be solving some common issues like tables and composing templates. Finally, I’ll broach the topic of automated testing and the potential for a TDD-like approach.

UPDATE: Part two – Databinding with ContentControls is now published.

In the meantime, I’d like to direct you to the sources of information I used to become familiar with Open XML:

While I’m here, I’ll just make a quick shout out to my new colleagues on this project, George and Paul, whose hard work underpins a lot of the ideas you see here. A special thanks to Darren for encouraging me to get this stuff out in the form of a blog, and challenging my thinking every step of the way. Thanks guys, this is a direct result of your hard work and advice.

9 thoughts on “Generating Documents with the Open XML SDK – Part 1”

Phil M says:

2010/08/07 at 1:16 am

Good start. Did you ever get back to building the Document Builder?
Where’s part 2, 3 etc????
I want to do server-side mail-merge from a Word template using the OpenXML SDK, replacing content controls with data from a SQL Server query and producing repeated pages as many as the query needs.
This allows the user to create word templates with embedded content controls, and thus manage the design.
There is much about this on the net. But nothing showing how to actually write a class that allows one to find a content control and replace it with data, and do it page after page.

1. jburger79 says:
  
  2010/08/07 at 8:56 am
  
  Thanks Phil,
  Part 2 & 3 aren’t far off – I’m hoping to polish 2 off very soon. Stay tuned!
  Rest assured you can achieve server side document building and I’ll show you how it can be done.
  
  Cheers,
  JB
  
Adam says:

2010/08/20 at 12:14 am

Yes, interesting – Looking forward to reading part 2 & 3 also.

My company is working on an in-house MS Access application (client front end and server back end) that currently uses bookmarks to export data to a variety of Word 2007 templates, depending on the user requirement. It now also looks like additional documents will be required to create a final document package.

We also have users complain that they cannot adjust the templates without causing issues. I think somewhere in what you describe there is something allowing us to be much more flexible.

I’m not an expert or developer but I am awaiting the next steps to see if this is something we can use. If so I will be forwarding this info to my colleagues for review.

Thanks for the info so far, keep up the good work.

Tim H says:

2010/08/26 at 6:13 am

Excellent article Jim. We’re attempting to do something very similar but from a Java angle. We’re having good early results using the DocX4J toolkit instead of the Open XML SDK. The problem area seems to be achieving an accurate conversion of the Docx file to Pdf.

Have you reached the ‘convert to pdf’ stage of your project yet? What conversion solution are you planning to use? I really hope that SharePoint 2010 Word Automation Services is not the only practical solution here – having to licence expensive SharePoint servers just to do a Pdf conversion seems ridiculous.

Thanks,
Tim H

1. jburger79 says:
  
  2010/08/26 at 8:47 am
  
  Thanks Tim!
  
  We are getting some great results using a 3rd party server utility called Ecrion XF Rendering Server. Two main benefits: it was fastest in our internal trials against a few other major competitors, and it doesn’t rely on an instance of Word being installed on the server. I believe it uses XSL-FO under the covers. Additionally, we have encountered a few issues with the use of afChunks, font rendering and the like, and their support has been fantastic. It is one of the more expensive players in the market (AUD$1500), however it fares favourably when put up against a SharePoint 2010 instance.
  
  Hope that helps, and good luck 🙂
  JB
  
Tim H says:

2010/08/26 at 10:08 am

Jim,

Thanks for your prompt reply. I’ll have a closer look at the Ecrion solution. Interesting that you quoted licence costs in Australian dollars!

Regards,
Grateful Kiwi.
