Disgusting Word-formatted HTML and how to fix it
While working on a new website for the MIT Working Papers in Linguistics, I recently inherited a collection of HTML files with all of our books’ abstracts. To my dismay (but not surprise), the markup in these files was horrendous. Here are some of the cardinal sins of markup that I saw committed in these files:
- Confusing ids and classes. ids should be unique on the page… but here’s an instance of the same id being reused on multiple elements in order to format them together.
<div id="indent">
  <div id="number">4.2.1</div>
  <div id="page">161</div>
  <div id="section">Old French (Adams 1987)</div>
</div>
<div id="indent">
  <div id="number">4.2.2</div>
  <div id="page">164</div>
  <div id="section">The evolution of the dialects of northern Italy</div>
- Putting a class on every instance of something. Every paragraph should be formatted equivalently. We get the point.
<p class=MsoNormal><b>The English Noun Phrase in Its Sentential Aspect</b></p>
<p class=MsoNormal>Steven Paul Abney</p>
<p class=MsoNormal>May 1987</p>
- Using blank space for formatting.
<p class=MsoNormal><o:p>&nbsp;</o:p></p>
- CSS styles that don’t exist. Browsers just ignore these anyway…
<p class=MsoNormal>One factor in determining which worlds a modal quantifies over is the temporal argument of the modal’s accessibility relation.<span style='mso-spacerun:yes'> </span>It is well-known that a higher tense affects the accessibility relation of modals.<span style='mso-spacerun:yes'> </span>What is not well-known is that there are aspectual operators high enough to affect the accessibility relation of modals.<span style='mso-spacerun:yes'> </span>
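To gauge how common these problems are in a given file before cleaning it, something like the following could help. This is only a minimal sketch using Python's standard `html.parser` module; it is not part of the script described in this post, and the names `MarkupAudit` and `audit` are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class MarkupAudit(HTMLParser):
    """Count the Word-isms listed above in a chunk of HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.id_counts = Counter()  # how many times each id value appears
        self.mso_classes = 0        # elements whose class starts with "Mso"
        self.mso_styles = 0         # style attributes that mention mso-*

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            value = value or ""  # attributes without a value arrive as None
            if name == "id":
                self.id_counts[value] += 1
            elif name == "class" and value.startswith("Mso"):
                self.mso_classes += 1
            elif name == "style" and "mso-" in value:
                self.mso_styles += 1

    def duplicate_ids(self):
        """Return the id values that are used more than once, sorted."""
        return sorted(i for i, n in self.id_counts.items() if n > 1)


def audit(html):
    """Parse the HTML string and return the populated MarkupAudit."""
    parser = MarkupAudit()
    parser.feed(html)
    parser.close()
    return parser
```

Calling `audit(text).duplicate_ids()` on the table-of-contents snippet above would flag `indent`, `number`, `page` and `section` as reused ids.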
The solution
My solution was to write a Perl script that takes care of a number of these issues. It’s not foolproof and doesn’t involve any voodoo (for example, it can’t retypeset things that were formatted using whitespace), but it does a good job as a first pass.
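The Perl script itself is not reproduced in this post. As a rough illustration of what this kind of first-pass cleanup can look like, here is a small Python sketch based on regular expressions. It is an assumption about the general approach, not a port of cleanwordhtml.pl, and the function name `clean_word_html` and the pattern names are hypothetical. It handles three of the problems listed above (spacer paragraphs, `mso-*` spans, and `Mso*` classes) and leaves duplicate ids alone.

```python
import re

# Spacer paragraphs such as <p class=MsoNormal><o:p>&nbsp;</o:p></p>
EMPTY_PARAGRAPH = re.compile(
    r"<p\b[^>]*>(?:\s|&nbsp;|</?o:p>)*</p>\s*", re.IGNORECASE
)
# Spans whose style attribute mentions a Word-only mso-* property
MSO_SPAN = re.compile(
    r"<span\s+style=(['\"])[^'\"]*mso-[^'\"]*\1\s*>(.*?)</span>",
    re.IGNORECASE | re.DOTALL,
)
# class=MsoNormal, class="MsoNormal", class='MsoBodyText', ...
MSO_CLASS = re.compile(r"\s+class=(['\"]?)Mso\w+\1", re.IGNORECASE)
# Leftover Office namespace tags such as <o:p> and </o:p>
OFFICE_TAG = re.compile(r"</?o:\w+[^>]*>", re.IGNORECASE)


def clean_word_html(html):
    """Apply the cleanups in order and return the cleaned string."""
    html = EMPTY_PARAGRAPH.sub("", html)  # drop whitespace-only paragraphs
    html = MSO_SPAN.sub(r"\2", html)      # keep the span's text, drop the tag
    html = MSO_CLASS.sub("", html)        # remove the Mso* class attribute
    html = OFFICE_TAG.sub("", html)       # remove stray <o:...> tags
    return html
```

Regular expressions are a blunt tool for HTML: this sketch would also unwrap a span that mixes `mso-*` properties with real CSS, and it does not cope with nested spans, so the output still deserves a manual look.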
You can run the script by making it executable (chmod +x cleanwordhtml.pl) then specifying a target filename as an argument. For example,
./cleanwordhtml.pl source.html > clean.html
I used this with a simple bash for loop to run over all my files:
for f in */*.html; do ./cleanwordhtml.pl "$f" > "${f%.html}-clean.html"; done
Hopefully someone else can benefit from my experience.
Tags: code, HTML, markup, microsoft, MITWPL, Office, perl, word
December 30th, 2009 at 10:41 pm
Wow. What a pain. That HTML thing in word is horrible.
Thanks for sharing the script! Did you try an HTML cleaning tool like HTML Tidy?
January 8th, 2010 at 12:24 pm
Thanks a lot for that script. It’s not only Word that misformats: even my own hand-written code is sometimes disgusting. So that really helps. Thanks a lot!
January 9th, 2010 at 4:40 am
I used to run into the same problem too; your solution is quite innovative.
March 2nd, 2010 at 1:21 am
This is indeed a great tool for those of us who constantly use MS Word for web authoring. Thanks a bunch.
Chimpu
March 2nd, 2010 at 1:22 am
I did, and it is better than HTML Tidy in many ways.
April 16th, 2010 at 5:05 am
I have a Ruby script that does something similar. It removes crummy Word formatting and lots of extra, unnecessary tags (tweaked for my job). I use it this way: copy HTML, run script (from keyboard quick launcher) on HTML that's in the clipboard, paste cleaned code. http://gist.github.com/291783
December 27th, 2011 at 6:04 pm
Absolutely great work, thanks so much for the script. Same for me: it’s not only Word that misformats, even my own hand-written code is sometimes disgusting. So that really does help.