Disgusting Word-formatted HTML and how to fix it
While working on a new website for the MIT Working Papers in Linguistics, I recently inherited a collection of HTML files with all of our books’ abstracts. To my dismay (but not surprise), the markup in these files was horrendous. Here are some of the cardinal sins of markup that I saw committed in these files:
- Confusing ids and classes. ids should be unique on the page… but here’s a case of the same id being used multiple times in order to format the elements together.
- Putting a class on every instance of something. Every paragraph should be formatted equivalently. We get the point.
- Using blank space for formatting.
- CSS styles that don’t exist. Browsers just ignore these anyway…
The solution
My solution was to write a Perl script that takes care of a number of these issues. It’s not foolproof and doesn’t involve any voodoo (for example, it can’t retypeset things that were formatted using whitespace), but it does a good job as a first pass.
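To give a feel for this kind of first-pass cleanup, here is a minimal sketch of one step: stripping the class attribute that Word puts on every paragraph. It uses sed rather than Perl, and the function name strip_p_classes is hypothetical. It is an illustration under those assumptions, not the contents of cleanwordhtml.pl.

```shell
# Hypothetical helper (not part of cleanwordhtml.pl): read HTML on stdin and
# write it to stdout with the class attribute removed from every <p> tag.
strip_p_classes() {
    # First expression: quoted values, e.g. <p class="MsoNormal">.
    # Second expression: unquoted values, e.g. <p class=MsoNormal>.
    sed -e 's/<p class="[^"]*"/<p/g' \
        -e 's/<p class=[^ >]*/<p/g'
}
```

Something like `strip_p_classes < input.html > output.html` would then write a cleaned copy and leave the original alone. A regex pass like this only covers the simple case where class is the first attribute on the tag, which fits the first-pass spirit described above.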
You can run the script by making it executable (chmod +x cleanwordhtml.pl), then specifying a target filename as an argument: ./cleanwordhtml.pl followed by the name of the file to clean.
I used this with a simple bash for loop to run over all my files:
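A loop of this kind might look like the following sketch. It assumes the abstracts are .html files sitting in a single directory. The clean_all helper name and its two arguments are hypothetical. Whether the script rewrites files in place or prints to standard output isn't stated here, so the loop only invokes it once per file.

```shell
# Hypothetical helper: run a cleaner command on every .html file in a directory.
# Usage: clean_all DIR CLEANER
clean_all() {
    for f in "$1"/*.html; do
        # If the directory has no .html files the pattern stays unexpanded; skip it.
        [ -e "$f" ] || continue
        "$2" "$f"
    done
}

# Example (assumes cleanwordhtml.pl is executable and in the current directory):
#   clean_all . ./cleanwordhtml.pl
```

Quoting "$f" keeps filenames with spaces intact, which is worth the extra characters when the files come from Word users.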
Hopefully someone else can benefit from my experience.