Converting Word Document to HTML

Microsoft Word happens to be the text editing application of choice for many of my clients. They would like to post that content to the web, so I still find myself needing to export Word documents to HTML.

Word has a “Save document as Html” option, but it produces a big, convoluted HTML file full of errors. It also carries a fair bit of inline formatting, where I would rather have the content conform to the website’s own formatting.

So, a good result would have little or no formatting, and markup that is as simple as possible.

Many paths… Here’s the one I walk

I have tried many ways to achieve that, and even wrote a cleanup script, but in the end I settled on exporting from Word and running Tidy on the resulting HTML file.

The screenshots were made on Windows XP running Office 2003, but it’s basically the same with Windows Vista or Windows 7, and Office 2007.

1. Choose the “Save as” item from the “File” or “Office Button” drop-down.

2. A dialog with a file selector will appear. Under Save as type, select “Web Page, Filtered (*.htm; *.html)”.

3. Go to Tools and select “Web Options…”

4. Go to the “Encoding” tab and select Save this document as: “Unicode (UTF-8)”.

5. Hit OK to close Web Options and then Save to save the document.

You can close Word now; it has done its part.

Time to run Tidy. Here is the command line with the options. Update: If you prefer to do this outside of the command line, there’s a simple way now.

tidy.exe -asxml -utf8 -q -b -c --word-2000 y --drop-empty-paras y --fix-backslash y --drop-proprietary-attributes y -output <path to output file> <path to input file>

Huh, wow. It’s not as hard as it looks! I’ll explain in more detail, and you can always learn more by running the help command, like so:

tidy.exe -?

Parameters

Here are the explanations and my comments for the parameters and configuration settings I used here.

I have copied the initial explanations for the parameters from the command line help, and those for the configuration settings from the Quick Reference page, and added my comments (in italic) to some of them.

So, let’s start at the beginning.

-asxml, -asxhtml = convert HTML to well formed XHTML; Probably not required for HTML5, but just a matter of preference – I like things neat.

-utf8 = use UTF-8 for both input and output; The document was exported as UTF-8, for the best support for international characters.

-quiet, -q = suppress nonessential output; There will be a lot of warnings because of the exported mess, and the export will fail in that case. So, hush.

-bare, -b = strip out smart quotes and em dashes, etc.

-clean, -c = replace FONT, NOBR and CENTER tags with CSS; This is the -c in the command line above.

--word-2000 y = This option specifies if Tidy should go to great pains to strip out all the surplus stuff Microsoft Word 2000 inserts when you save Word documents as “Web pages”. Doesn’t handle embedded images or VML. You should consider using Word’s “Save As: Web Page, Filtered”. I guess this was the current version when they created tidy. :)

--drop-empty-paras y = Drop empty paragraphs.

--fix-backslash y = This option specifies if Tidy should replace backslash characters “\” in URLs by forward slashes “/”.

--drop-proprietary-attributes y = This option specifies if Tidy should strip out proprietary attributes, such as MS data binding attributes.

-output <path to output file> = Full path where you would like the produced file to be saved.

<path to input file> = Full path to the input file (that was exported from Word).
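
If you need to run this on more than one document, the same call can be wrapped in a small script. Here is a minimal Python sketch that assembles exactly the switches explained above and hands them to Tidy. The function names (build_tidy_command, tidy_word_export) are made up for illustration, and it assumes tidy.exe can be found on your PATH.

```python
import subprocess

# Assumption: tidy.exe is on PATH; otherwise pass a full path such as C:\tools\tidy.exe.
TIDY_EXE = "tidy.exe"


def build_tidy_command(input_path, output_path, tidy_exe=TIDY_EXE):
    """Return the argument list for the Tidy call described above."""
    return [
        tidy_exe,
        "-asxml",                               # convert to well-formed XHTML
        "-utf8",                                # UTF-8 for both input and output
        "-q",                                   # suppress nonessential output
        "-b",                                   # strip smart quotes, em dashes, etc.
        "-c",                                   # the -c switch from the command line above
        "--word-2000", "y",                     # strip the surplus Word markup
        "--drop-empty-paras", "y",              # drop empty paragraphs
        "--fix-backslash", "y",                 # "\" in URLs becomes "/"
        "--drop-proprietary-attributes", "y",   # strip proprietary attributes
        "-output", str(output_path),            # where the cleaned file goes
        str(input_path),                        # the file exported from Word
    ]


def tidy_word_export(input_path, output_path, tidy_exe=TIDY_EXE):
    """Run Tidy on a Word export and return Tidy's exit code."""
    command = build_tidy_command(input_path, output_path, tidy_exe)
    # Tidy exits with 1 when it only had warnings, so the exit code is
    # returned to the caller instead of being treated as a failure here.
    return subprocess.run(command).returncode
```

With hypothetical paths, a call would look like tidy_word_export(r"C:\docs\export.htm", r"C:\docs\clean.html"). Passing the arguments as a list keeps paths with spaces intact without any extra quoting.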

The results

After this you’ll get a decently marked-up (X)HTML document that is about three times smaller.

What’s more important, the errors will be gone, and the document will be valid.

The document I tested with had 60 errors after the export, and none after tidying. Note that some errors will probably remain if you have images in your document, since their ids will contain a space character, and that can’t be fixed with this method.
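
Since Tidy leaves those ids alone, it can help to at least list them before publishing, so you know what to fix by hand. A minimal sketch using Python’s standard html.parser module follows; the names SpacedIdFinder and find_spaced_ids are mine, purely for illustration, and it only reports the ids, it doesn’t change the file.

```python
from html.parser import HTMLParser


class SpacedIdFinder(HTMLParser):
    """Collect (tag, id) pairs whose id attribute contains whitespace."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == "id" and value and any(ch.isspace() for ch in value):
                self.found.append((tag, value))


def find_spaced_ids(html_text):
    """Return every (tag, id) in html_text where the id contains a space."""
    finder = SpacedIdFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found
```

To check a tidied file, read it as UTF-8 and pass its text to find_spaced_ids; an empty list means there is nothing of this kind left to fix.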

Update – simple version

If you happen to hate the command line, here’s a drag and drop version for Windows systems. No configuration either if you follow the instructions.

Here’s a link to a zip file to download, it contains just the tidy executable and a shortcut.
Here are links if you would like to download the shortcut and tidy separately.

When you download everything, follow these instructions…

  1. Create a tools folder on your hard drive “C:” if you don’t have one.
  2. Extract the files in the zip to the “C:\tools” folder.
  3. Copy the shortcut to your Desktop, or “Send to” folder, or wherever you would like to use it.

That’s it, the “installation” is over.

Now, you’ll still want to export the Word file as previously described, but now you can simply drag and drop it on the new doc2html shortcut, and it will fix the given htm file. Please note that now the original export file will be overwritten, but you don’t need it anyway.

Notes:
If you would prefer to keep the tool elsewhere, you’ll need to change the shortcut, but it’s pretty straightforward, just change the “C:\tools\tidy.exe” path.
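
The shortcut itself isn’t reproduced here, but if you would rather have a script you can read and adjust, something doing the same job could look roughly like the Python sketch below. It is written under a few assumptions: tidy.exe sits in “C:\tools” as in the instructions, Tidy’s -m switch (modify the input file in place) is what does the overwriting, and the names doc2html.py, build_in_place_command and main are made up for illustration.

```python
import subprocess

# Assumption: tidy.exe was extracted to C:\tools as described above.
TIDY_EXE = r"C:\tools\tidy.exe"


def build_in_place_command(html_path, tidy_exe=TIDY_EXE):
    """Build a Tidy call that overwrites html_path with the cleaned markup."""
    return [
        tidy_exe,
        "-asxml", "-utf8", "-q", "-b", "-c",
        "--word-2000", "y",
        "--drop-empty-paras", "y",
        "--fix-backslash", "y",
        "--drop-proprietary-attributes", "y",
        "-m",              # modify the input file in place, no separate output file
        str(html_path),
    ]


def main(argv):
    """argv[1] is the file path Windows passes when a file is dropped on the script."""
    if len(argv) != 2:
        print("usage: doc2html.py <exported .htm file>")
        return 2
    return subprocess.run(build_in_place_command(argv[1])).returncode
```

To use it as a script, save it as doc2html.py and add a final line that calls sys.exit(main(sys.argv)) after importing sys. If the tool lives elsewhere, only TIDY_EXE needs to change.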
