
Archive for December, 2009

Disgusting Word-formatted HTML and how to fix it

Wednesday, December 30th, 2009

In working on a new website for the MIT Working Papers in Linguistics, I recently inherited a collection of HTML files with all of our books’ abstracts. To my dismay (but not surprise), the markup in these files was horrendous. Here are some of the cardinal sins of markup that I saw committed in these files:

  1. Confusing ids and classes. ids should be unique on the page… but here the same id is reused on several elements so that they can be styled together.
<div id="indent"> <div id="number">4.2.1</div> <div id="page">161</div> <div id="section">Old French (Adams 1987)</div> </div>
<div id="indent"> <div id="number">4.2.2</div> <div id="page">164</div> <div id="section">The evolution of the dialects of northern Italy</div> </div>
  2. Putting a class on every instance of something. Every paragraph should be formatted the same way. We get the point.
<p class=MsoNormal><b>The English Noun Phrase in Its Sentential Aspect</b></p>
<p class=MsoNormal>Steven Paul Abney</p>
<p class=MsoNormal>May 1987</p>
  3. Using blank space for formatting.
<p class=MsoNormal><o:p>&nbsp;</o:p></p>
  4. CSS properties that don’t exist. Browsers just ignore these anyway…
<p class=MsoNormal>One factor in determining which worlds a modal quantifies
over is the temporal argument of the modal’s accessibility relation.<span
style='mso-spacerun:yes'>  </span>It is well-known that a higher tense affects
the accessibility relation of modals.<span style='mso-spacerun:yes'> 
</span>What is not well-known is that there are aspectual operators high enough
to affect the accessibility relation of modals.<span style='mso-spacerun:yes'> 
</span>
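
The first sin is easy to check for mechanically. Here is a minimal sketch in Python (not the language of the script described in this post) that counts id values with the standard library's `html.parser` and reports any that repeat. The names `IdCollector` and `duplicate_ids` are made up for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Count every id attribute value seen on a start tag."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1


def duplicate_ids(html):
    """Return a dict mapping each repeated id to the number of times it appears."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return {name: count for name, count in collector.ids.items() if count > 1}


sample = (
    '<div id="indent"> <div id="number">4.2.1</div> <div id="page">161</div> </div>'
    '<div id="indent"> <div id="number">4.2.2</div> <div id="page">164</div> </div>'
)
print(duplicate_ids(sample))  # {'indent': 2, 'number': 2, 'page': 2}
```

Any id that shows up in the result is a candidate for becoming a class instead.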

The solution

My solution was to write a Perl script that takes care of a number of these issues. It’s not foolproof and doesn’t involve any voodoo (for example, it can’t re-typeset things that were formatted using whitespace), but it does a good job as a first pass.
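
To give a flavor of what such a first pass can look like, here is a rough sketch in Python. It is not the Perl script itself, and every name in it (`clean_word_html`, the regex constants) is invented for this example. It assumes the input resembles the snippets above: `mso-*` declarations inside `style` attributes, `Mso...` class names, whitespace-only spans, and spacer paragraphs built from `<o:p>`. Being regex-based, it will miss plenty of edge cases that a real HTML parser would handle.

```python
import re

# A style attribute, single- or double-quoted (assumption: styles are always quoted).
STYLE_ATTR = re.compile(r"""\sstyle=(['"])(.*?)\1""", re.DOTALL)
# One mso-* declaration inside a style value, with or without a trailing semicolon.
MSO_DECL = re.compile(r"(?<![\w-])mso-[^:;]+:[^;]*;?\s*")
# class=MsoNormal, class="MsoNormal", and similar Word-generated class names.
MSO_CLASS = re.compile(r"""\sclass=(['"]?)Mso\w+\1""")
# A span left with no attributes and nothing but whitespace inside.
EMPTY_SPAN = re.compile(r"<span>(?:\s|&nbsp;)*</span>")
# A paragraph that exists only to add vertical space.
SPACER_PARA = re.compile(r"<p>\s*<o:p>(?:&nbsp;|\s)*</o:p>\s*</p>\n?")


def _clean_style(match):
    """Drop mso-* declarations; remove the whole attribute if nothing is left."""
    quote, value = match.group(1), match.group(2)
    kept = MSO_DECL.sub("", value).strip()
    return f" style={quote}{kept}{quote}" if kept else ""


def clean_word_html(html):
    html = STYLE_ATTR.sub(_clean_style, html)  # strip nonexistent CSS properties
    html = MSO_CLASS.sub("", html)             # strip per-element Word classes
    html = EMPTY_SPAN.sub(" ", html)           # collapse spacer spans to one space
    html = SPACER_PARA.sub("", html)           # drop blank-space paragraphs
    return html


sample = (
    "<p class=MsoNormal>One.<span style='mso-spacerun:yes'>  </span>Two.</p>\n"
    "<p class=MsoNormal><o:p>&nbsp;</o:p></p>\n"
)
print(clean_word_html(sample), end="")  # prints: <p>One. Two.</p>
```

The order of the substitutions matters: the spacer-paragraph pattern only matches a bare `<p>`, so the class has to be stripped first.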

You can run the script by making it executable (chmod +x cleanwordhtml.pl) and then passing the source filename as an argument. For example,

./cleanwordhtml.pl source.html > clean.html

I used this with a simple bash for loop to run over all my files:

for f in */*.html; do ./cleanwordhtml.pl "$f" > "${f%.html}-clean.html"; done
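
One thing to watch with a loop like this: on a second run, the `*/*.html` glob also matches the `-clean.html` files written by the first run. Below is a small Python sketch of the same file selection that skips those outputs. The helper names `clean_name` and `pending_sources` are hypothetical, and the script only prints the commands it would run rather than running them.

```python
from pathlib import Path


def clean_name(src):
    """Mirror bash's ${f%.html}-clean.html: foo.html -> foo-clean.html."""
    src = Path(src)
    return src.with_name(src.stem + "-clean.html")


def pending_sources(paths):
    """Yield (source, destination) pairs, skipping earlier -clean.html outputs."""
    for path in sorted(Path(p) for p in paths):
        if path.suffix == ".html" and not path.name.endswith("-clean.html"):
            yield path, clean_name(path)


if __name__ == "__main__":
    # Same one-directory-deep glob as the bash loop.
    for src, dst in pending_sources(Path(".").glob("*/*.html")):
        print(f"./cleanwordhtml.pl {src} > {dst}")
```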

Hopefully someone else can benefit from my experience.

Our TV makes “PG13” look like “PG43.” I’m afraid of what that show could be. #ithinkitwaskingarthur #thekeiraknightlyone

— December 28th, 2009 8:02 pm

Seriously, #wtf is this!? http://twitpic.com/vmwsf cc @theunfocused @mayleesa @ChristianSonne

— December 28th, 2009 4:53 pm

I hate Word-exported HTML. s/mso-[^:]+:\s*[^;]+;//;

— December 28th, 2009 4:40 pm

Ever been unsure if you’re looking at a post preview or a published post? Try Distinct Preview for #wordpress. http://tinyurl.com/yg7xzdj

— December 28th, 2009 1:47 pm

My sister met the guy who does the voice of the Aflac duck as well as “drive your dreams” in Japan.

— December 28th, 2009 11:25 am

Announcing Ignite WordCamp at WordCamp Boston http://tinyurl.com/yh93qb7 #ignite #wordpress #wordcamp #boston #wcbos

— December 27th, 2009 11:16 pm

What is this!? It’s beautiful out and feels like spring! http://twitpic.com/vhiae

— December 27th, 2009 3:07 pm

Bacon spinach red potato mushroom cheese omelette. Good morning America! http://twitpic.com/vgtwf

— December 27th, 2009 12:24 pm

Whoops. Wrong account.

— December 26th, 2009 11:40 pm

Released YARPP 3.1.3b3—should fix #localizations http://tinyurl.com/ykwfmnp (dl link) Codestyling mo files had a problem with #wordpress 2.9

— December 26th, 2009 11:39 pm

Trying the free trial/diagnosis at OtherInbox.com. Anyone else use it?

— December 26th, 2009 9:05 pm

Trying the free trial/diagnosis at OtherInbox.com.

— December 26th, 2009 9:04 pm

I’ve been here for months… How did I just today discover the Bloc 11 cafe? http://twitpic.com/vc1op

— December 26th, 2009 1:24 pm

Christmas really does call for dim sum! http://twitpic.com/v691s http://twitpic.com/v691t

— December 25th, 2009 12:48 pm

© 2006-2010 mitcho (Michael 芳貴 Erlewine).
The views expressed on these pages are mine alone and do not
reflect those of my employers and clients, past and present.