Microsoft Word happens to be the text editing application of choice for many of my clients. They would like to post their content to the web, so I still find myself needing to export Word documents to HTML.
There is a “Save document as HTML” option in Word, but it produces a big, convoluted HTML file full of errors. It also carries a fair bit of inline formatting, where I would rather have the content conform to the website’s formatting.
So, a good result would have little or no formatting, and markup that is as simple as possible.
Many paths… Here’s the one I walk
I have tried many ways to achieve that, and even wrote a cleanup script, but in the end I settled on exporting from Word and running Tidy on the resulting HTML file.
1. Choose the “Save as” item from the “File” or “Office Button” drop-down.
2. A dialog with a file selector will appear. Under Save as type, select “Web Page, Filtered (*.htm; *.html)”.
3. Go to Tools and select “Web Options…”
4. Go to the “Encoding” tab and, under Save this document as, select “Unicode (UTF-8)”.
5. Hit OK to close Web Options, and then Save to save the document.
You can close Word now; it’s done its part.
Time to run Tidy. Here is the command line with the options. Update: if you prefer to do this outside of the command line, there’s a simple way now.
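Before moving on, it can be worth a quick check that the export really came out as UTF-8, since the Tidy run assumes it. Here is a minimal sketch in Python; the function names are my own, made up for illustration, and the check only proves that the bytes decode as UTF-8 (a file containing nothing but plain ASCII will pass whichever encoding was picked in Word).

```python
def is_valid_utf8(data):
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def export_is_utf8(path):
    """Read the exported file as bytes and check that it is valid UTF-8."""
    with open(path, "rb") as handle:
        return is_valid_utf8(handle.read())
```

If export_is_utf8 returns False for your file, the likely cause is that the encoding step in Web Options was skipped.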
tidy.exe -asxml -utf8 -q -b -c --word-2000 y --drop-empty-paras y --fix-backslash y --drop-proprietary-attributes y -output <path to output file> <path to input file>
Huh, wow. It’s not as hard as it looks! I’ll explain in more detail, and you can always learn more by running Tidy’s help command, tidy.exe -help.
Here are the explanations and my comments for the parameters and configuration settings I used here.
I copied the initial explanations for the parameters from the command-line help, and those for the configuration settings from the Quick Reference page, and added my own comments (in italic) to some of them.
So, let’s start at the beginning.
-asxml, -asxhtml = convert HTML to well-formed XHTML; Probably not required for HTML5, but just a matter of preference. I like things neat.
-utf8 = use UTF-8 for both input and output; The document was exported as UTF-8, for the best support for international characters.
-quiet, -q = suppress nonessential output; There will be a lot of warnings because of the exported mess, and the export will fail in that case. So, hush.
-bare, -b = strip out smart quotes and em dashes, etc.
-clean, -c = replace FONT, NOBR and CENTER tags by CSS.
--word-2000 y = This option specifies if Tidy should go to great pains to strip out all the surplus stuff Microsoft Word 2000 inserts when you save Word documents as “Web pages”. Doesn’t handle embedded images or VML. You should consider using Word’s “Save As: Web Page, Filtered”. I guess this was the current version when they created Tidy. :)
--drop-empty-paras y = Drop empty paragraphs.
--fix-backslash y = This option specifies if Tidy should replace backslash characters “\” in URLs by forward slashes “/”.
--drop-proprietary-attributes y = This option specifies if Tidy should strip out proprietary attributes, such as MS data binding attributes.
-output <path to output file> = Full path where you would like the produced file to be saved.
<path to input file> = Full path to the input file (that was exported from Word).
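If you would rather call Tidy from a script than type all of that out, the same command can be put together as an argument list. This is only a sketch under a few assumptions: build_tidy_command and run_tidy are names I made up, tidy.exe is assumed to be on your PATH, and the paths in the commented usage line are hypothetical. The options themselves are exactly the ones explained above.

```python
import subprocess

# Options taken from the command line in this post.
TIDY_OPTIONS = [
    "-asxml", "-utf8", "-q", "-b", "-c",
    "--word-2000", "y",
    "--drop-empty-paras", "y",
    "--fix-backslash", "y",
    "--drop-proprietary-attributes", "y",
]


def build_tidy_command(input_path, output_path, tidy_exe="tidy.exe"):
    """Return the argument list: options, then -output <file>, then the input file."""
    return [tidy_exe, *TIDY_OPTIONS, "-output", output_path, input_path]


def run_tidy(input_path, output_path, tidy_exe="tidy.exe"):
    """Run Tidy and return its exit code (0 = fine, 1 = warnings, 2 = errors)."""
    completed = subprocess.run(build_tidy_command(input_path, output_path, tidy_exe))
    return completed.returncode


# Example with hypothetical paths:
# run_tidy(r"C:\docs\export.htm", r"C:\docs\clean.htm")
```

Passing the arguments as a list means paths with spaces need no extra quoting, which is handy since Word documents often live in such folders.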
After this you’ll get a decently marked-up (X)HTML document about three times smaller than the export.
More importantly, the errors will be gone and the document will be valid.
The document I tested with had 60 errors after the export and none after tidying. Note that some errors will probably remain if you have images in your document, since their ids will contain a space character, and that can’t be fixed with this method.
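Tidy won’t fix those ids, but you can at least find them before running a validator. Here is a rough sketch using Python’s standard html.parser module; SpacedIdFinder and find_spaced_ids are my own hypothetical names, and the sketch only reports the offending ids, it does not rewrite them.

```python
from html.parser import HTMLParser


class SpacedIdFinder(HTMLParser):
    """Collect (tag, id) pairs whose id value contains whitespace."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        for name, value in attrs:
            if name == "id" and value and any(ch.isspace() for ch in value):
                self.found.append((tag, value))


def find_spaced_ids(html_text):
    """Return every (tag, id) pair in the markup whose id contains whitespace."""
    finder = SpacedIdFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found
```

Feed it the contents of the tidied file, and fix whatever it reports by hand or with a search and replace.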
Update – simple version
If you happen to hate the command line, here’s a drag and drop version for Windows systems. No configuration either if you follow the instructions.
Once you have downloaded everything, follow these instructions…
- Create a “tools” folder on your “C:” drive if you don’t have one.
- Extract the files in the zip to the “C:\tools” folder.
- Copy the shortcut to your Desktop, or “Send to” folder, or wherever you would like to use it.
That’s it, the “installation” is over.
Now, you’ll still want to export the Word file as previously described, but then you can simply drag and drop it on the new doc2html shortcut, and it will fix the given htm file. Please note that the original export file will now be overwritten, but you don’t need it anyway.
If you would prefer to keep the tool elsewhere, you’ll need to change the shortcut, but it’s pretty straightforward: just change the “C:\tools\tidy.exe” path.
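For the curious, here is roughly how an overwrite-in-place run could look as a Python script. To be clear, this is a guess at the idea and not the contents of the doc2html shortcut: I’m assuming Tidy’s -m (modify) option, which rewrites the input file instead of writing to -output, and build_inplace_command is a made-up name.

```python
import subprocess
import sys

# Path from the post; change it if the tool lives elsewhere.
TIDY_EXE = r"C:\tools\tidy.exe"


def build_inplace_command(html_path, tidy_exe=TIDY_EXE):
    """Same options as the command line in this post, with -m instead of -output."""
    return [
        tidy_exe, "-asxml", "-utf8", "-q", "-b", "-c", "-m",
        "--word-2000", "y",
        "--drop-empty-paras", "y",
        "--fix-backslash", "y",
        "--drop-proprietary-attributes", "y",
        html_path,
    ]


def main(argv):
    # Files dropped on a script or shortcut arrive as command-line arguments.
    for dropped in argv[1:]:
        subprocess.run(build_inplace_command(dropped))


if __name__ == "__main__":
    main(sys.argv)
```

As with the shortcut, the exported file is overwritten, so keep a copy if you think you might want the raw export later.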