mitcho Michael 芳貴 Erlewine

Postdoctoral fellow, McGill Linguistics.


Disgusting Word-formatted HTML and how to fix it

While working on a new website for the MIT Working Papers in Linguistics, I recently inherited a collection of HTML files containing all of our books’ abstracts. To my dismay (but not surprise), the markup in these files was horrendous. Here are some of the cardinal sins of markup that I saw committed in these files:

  1. Confusing ids and classes. An id should be unique on the page, but here the same id is reused on multiple elements just so they can be formatted together.
<div id="indent"> <div id="number">4.2.1</div> <div id="page">161</div> <div id="section">Old French (Adams 1987)</div> </div>
<div id="indent"> <div id="number">4.2.2</div> <div id="page">164</div> <div id="section">The evolution of the dialects of northern Italy</div>
  2. Putting a class on every instance of something. Every paragraph should be formatted the same way. We get the point.
<p class=MsoNormal><b>The English Noun Phrase in Its Sentential Aspect</b></p>
<p class=MsoNormal>Steven Paul Abney</p>
<p class=MsoNormal>May 1987</p>
  3. Using blank space for formatting.
<p class=MsoNormal><o:p>&nbsp;</o:p></p>
  4. CSS styles that don’t exist. Browsers just ignore these anyway…
<p class=MsoNormal>One factor in determining which worlds a modal quantifies
over is the temporal argument of the modal’s accessibility relation.<span
style='mso-spacerun:yes'>  </span>It is well-known that a higher tense affects
the accessibility relation of modals.<span style='mso-spacerun:yes'> 
</span>What is not well-known is that there are aspectual operators high enough
to affect the accessibility relation of modals.<span style='mso-spacerun:yes'> 
</span>
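
The first sin in the list, reused ids, is easy to check for mechanically. Here is a minimal Python sketch using only the standard library. The names IdCollector and duplicate_ids are made up for this illustration, and the sample string is a shortened version of the excerpt above.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects every id attribute value seen in start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


def duplicate_ids(html):
    """Return a dict mapping each repeated id to the number of times it occurs."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    counts = Counter(collector.ids)
    return {name: n for name, n in counts.items() if n > 1}


# Shortened version of the table-of-contents excerpt shown earlier.
sample = (
    '<div id="indent"> <div id="number">4.2.1</div> <div id="page">161</div> </div> '
    '<div id="indent"> <div id="number">4.2.2</div> <div id="page">164</div> </div>'
)
print(duplicate_ids(sample))  # {'indent': 2, 'number': 2, 'page': 2}
```

Any id that shows up in the result is probably a class in disguise.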

The solution

My solution was to write a Perl script that takes care of a number of these issues. It’s not foolproof and doesn’t involve any voodoo (for example, it can’t re-typeset things that were formatted using whitespace), but it does a good job as a first pass.
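
To give a flavor of what such a first pass can look like, here is a minimal Python sketch. It is not the Perl script itself: the function name clean_word_html and the three regular expressions are assumptions, chosen only to cover the examples in this post.

```python
import re


def clean_word_html(html):
    """Apply a few regex passes for the Word-isms listed in this post (illustrative only)."""
    # Drop Word's per-element classes, e.g. class=MsoNormal or class="MsoNormal".
    html = re.sub(r'\s+class=(["\']?)Mso\w+\1', "", html)
    # Replace whitespace-only mso-spacerun spans with a single ordinary space.
    html = re.sub(
        r"<span\s+style=(['\"])mso-spacerun:\s*yes\1>\s*</span>",
        " ",
        html,
    )
    # Remove paragraphs that contain nothing but a non-breaking space.
    html = re.sub(
        r"<p\b[^>]*>\s*(?:<o:p>)?\s*&nbsp;\s*(?:</o:p>)?\s*</p>\s*",
        "",
        html,
    )
    return html


# Sample input assembled from the snippets above (shortened).
sample = (
    "<p class=MsoNormal>May 1987</p>\n"
    "<p class=MsoNormal><o:p>&nbsp;</o:p></p>\n"
    "<p class=MsoNormal>of the modal.<span\n"
    "style='mso-spacerun:yes'>  </span>It is well-known.</p>\n"
)
print(clean_word_html(sample))
```

The sketch deliberately leaves duplicate ids alone, since deciding which of them should become classes needs a human. Regular expressions are also a blunt tool for HTML in general, so treat the output as something to review, not something to trust.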

You can run the script by making it executable (chmod +x cleanwordhtml.pl) and then passing the target filename as an argument. For example:

./cleanwordhtml.pl source.html > clean.html

I used this with a simple bash for loop to run over all my files:

for f in */*.html; do ./cleanwordhtml.pl "$f" > "${f%.html}-clean.html"; done
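
One thing to watch with a loop like this: if you run it a second time, the glob also matches the -clean.html files from the first run. Below is a rough Python equivalent that skips them. The helper names output_path, pending_sources, and run_all are made up for this sketch, and run_all assumes cleanwordhtml.pl sits in the current directory and writes the cleaned HTML to standard output, as in the command above.

```python
import subprocess
from pathlib import Path


def output_path(source):
    """Map foo.html to foo-clean.html, mirroring ${f%.html}-clean.html."""
    source = Path(source)
    return source.with_name(source.stem + "-clean.html")


def pending_sources(paths):
    """Keep only .html inputs that are not themselves cleaned outputs."""
    return [
        Path(p)
        for p in paths
        if Path(p).suffix == ".html" and not Path(p).stem.endswith("-clean")
    ]


def run_all(root="."):
    """Run the cleaner on every pending */*.html file under root."""
    for source in pending_sources(sorted(Path(root).glob("*/*.html"))):
        with open(output_path(source), "w") as out:
            # Assumption: the script prints the cleaned HTML on stdout.
            subprocess.run(["./cleanwordhtml.pl", str(source)], stdout=out, check=True)


print(output_path("abstracts/abney.html"))
```

Calling run_all() from the directory that holds the script and the subfolders does the same job as the bash loop, minus the reprocessing.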

Hopefully someone else can benefit from my experience.


  • http://intensedebate.com/people/georgeu2000 georgeu2000

    Wow. What a pain. That HTML thing in word is horrible.

    Thanks for sharing the script! Did you try an HTML cleaning tool like HTML Tidy?

    • http://ezyresell.com/about.html Chimpu Sharma

      I did and it is better than HTML Tidy in many ways :)

  • http://www.urlaubspartner.net Reisepartner

    Thanks a lot for that script. It's not only Word's misformatted output: even my own hand-written code is sometimes disgusting. So that really helps. Thanks a lot!

  • http://www.abercrombieuk.net abercrombie

    I used to run into the same problem too. Your solution is quite innovative.

  • chimpushrm9

    This is indeed a great tool for those of us who constantly use MS Word for web authoring. Thanks a bunch. :)

    Chimpu

  • http://mazuhl.tumblr.com Mazuhl

    I have a Ruby script that does something similar. It removes crummy Word formatting and lots of extra, unnecessary tags (tweaked for my job). I use it this way: copy HTML, run script (from keyboard quick launcher) on HTML that's in the clipboard, paste cleaned code. http://gist.github.com/291783

  • http://www.newscharts.de/ Nachrichten

    Absolutely great work, thanks so much for the script. Same for me: it's not only Word's misformatted output, even my own hand-written code is sometimes disgusting. So that really does help.