<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>mitcho.com &#187; markup</title>
	<atom:link href="http://mitcho.com/blog/tag/markup/feed/" rel="self" type="application/rss+xml" />
	<link>http://mitcho.com</link>
	<description></description>
	<lastBuildDate>Fri, 10 Feb 2012 23:24:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4-alpha-19719</generator>
		<item>
		<title>Disgusting Word-formatted HTML and how to fix it</title>
		<link>http://mitcho.com/blog/projects/disgusting-word-formatted-html-and-how-to-fix-it/</link>
		<comments>http://mitcho.com/blog/projects/disgusting-word-formatted-html-and-how-to-fix-it/#comments</comments>
		<pubDate>Wed, 30 Dec 2009 21:29:44 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[markup]]></category>
		<category><![CDATA[microsoft]]></category>
		<category><![CDATA[MITWPL]]></category>
		<category><![CDATA[Office]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[word]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=3287</guid>
		<description><![CDATA[In working on a new website for the MIT Working Papers in Linguistics, I recently inherited a collection of HTML files with all of our books&#8217; abstracts. To my dismay (but not surprise) the markup in these files was horrendous. Here are some of the cardinal sins of markup that I saw committed in these [...]
]]></description>
			<content:encoded><![CDATA[<p>In working on a new website for the MIT Working Papers in Linguistics, I recently inherited a collection of HTML files with all of our books&#8217; abstracts. To my dismay (but not surprise) the markup in these files was horrendous. Here are some of the cardinal sins of markup that I saw committed in these files:</p>

<ol>
<li><strong>Confusing <code>id</code>s and <code>class</code>es.</strong> An <code>id</code> should be unique on the page&#8230; but here the same <code>id</code> is repeated on several elements so that they can be styled together.<br/></li>
</ol>


<div class="wp_syntax"><div class="code"><pre class="html" style="font-family:monospace;">&lt;div id=&quot;indent&quot;&gt; &lt;div id=&quot;number&quot;&gt;4.2.1&lt;/div&gt; &lt;div id=&quot;page&quot;&gt;161&lt;/div&gt; &lt;div id=&quot;section&quot;&gt;Old French (Adams 1987)&lt;/div&gt;
&lt;/div&gt; &lt;div id=&quot;indent&quot;&gt; &lt;div id=&quot;number&quot;&gt;4.2.2&lt;/div&gt; &lt;div id=&quot;page&quot;&gt;164&lt;/div&gt; &lt;div id=&quot;section&quot;&gt;The evolution of the dialects of northern Italy&lt;/div&gt;</pre></div></div>
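
<p>A quick way to spot this problem is to list every <code>id</code> value that appears more than once in a file. The snippet below is only a rough sketch: <code>duplicate_ids</code> is a name made up for this example, and it assumes the attribute values are double-quoted as in the sample above.</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"># Hypothetical helper: print each double-quoted id attribute that occurs
# more than once in the HTML file given as $1.
duplicate_ids() {
  grep -o ' id="[^"]*"' "$1" | sort | uniq -d
}</pre></div></div>

<p>Any output at all means those <code>id</code>s really ought to be <code>class</code>es.</p>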


<ol start="2">
<li><strong>Putting a class on every instance of something.</strong> Every paragraph should be formatted the same way. We get the point.<br/></li>
</ol>


<div class="wp_syntax"><div class="code"><pre class="html" style="font-family:monospace;">&lt;p class=MsoNormal&gt;&lt;b&gt;The English Noun Phrase in Its Sentential Aspect&lt;/b&gt;&lt;/p&gt;
&lt;p class=MsoNormal&gt;Steven Paul Abney&lt;/p&gt;
&lt;p class=MsoNormal&gt;May 1987&lt;/p&gt;</pre></div></div>


<ol start="3">
<li><strong>Using blank space for formatting.</strong>  <br/></li>
</ol>


<div class="wp_syntax"><div class="code"><pre class="html" style="font-family:monospace;">&lt;p class=MsoNormal&gt;&lt;o:p&gt;&amp;amp;nbsp;&lt;/o:p&gt;&lt;/p&gt;</pre></div></div>


<ol start="4">
<li><strong>CSS styles that don&#8217;t exist.</strong> Browsers just ignore these anyway&#8230; <br/></li>
</ol>


<div class="wp_syntax"><div class="code"><pre class="html" style="font-family:monospace;">&lt;p class=MsoNormal&gt;One factor in determining which worlds a modal quantifies
over is the temporal argument of the modal’s accessibility relation.&lt;span
style='mso-spacerun:yes'&gt;  &lt;/span&gt;It is well-known that a higher tense affects
the accessibility relation of modals.&lt;span style='mso-spacerun:yes'&gt; 
&lt;/span&gt;What is not well-known is that there are aspectual operators high enough
to affect the accessibility relation of modals.&lt;span style='mso-spacerun:yes'&gt; 
&lt;/span&gt;</pre></div></div>


<h3>The solution</h3>

<p>My solution was to write a Perl script which takes care of a number of these issues. It&#8217;s not foolproof and doesn&#8217;t involve any voodoo (for example, it can&#8217;t re-typeset things which were formatted using whitespace), but it does a good job as a first pass.</p>

<div class="files">
<div class="file">
<a href="http://mitcho.com/blog/wp-content/uploads/2009/12/cleanwordhtml.pl_.txt">cleanwordhtml.pl</a><br/>
<span class="specs">perl</span>
</div>
</div>
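
<p>To give a flavor of the kind of substitutions involved, here is a rough sketch using <code>sed</code>. It is not the Perl script linked above, just an illustration under simplifying assumptions: the function name <code>strip_word_cruft</code> is made up, and because <code>sed</code> works line by line it will miss any tag that wraps across lines.</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"># Illustrative sketch only, not the linked Perl script.
# Reads the HTML file given as $1 and writes a cleaned copy to standard output.
# In order, the expressions:
#   1. drop Word's class=MsoSomething attributes
#   2. collapse single-line mso-spacerun spans into one space
#   3. remove &lt;o:p&gt; and &lt;/o:p&gt; tags
#   4. delete paragraphs that are empty or hold only a non-breaking space
strip_word_cruft() {
  sed -E \
    -e 's| class="?Mso[A-Za-z]+"?||g' \
    -e "s|&lt;span style='mso-spacerun:yes'&gt;[[:space:]]*&lt;/span&gt;| |g" \
    -e 's|&lt;/?o:p&gt;||g' \
    -e 's|&lt;p&gt;(&amp;nbsp;)?&lt;/p&gt;||g' \
    "$1"
}</pre></div></div>

<p>It is called the same way as the real script, for example <code>strip_word_cruft source.html &gt; clean.html</code>.</p>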

<p>You can run the script by making it executable (<code>chmod +x cleanwordhtml.pl</code>) and then passing the file to clean as an argument; the cleaned markup is written to standard output. For example,</p>


<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">.<span style="color: #000000; font-weight: bold;">/</span>cleanwordhtml.pl source.html <span style="color: #000000; font-weight: bold;">&gt;</span> clean.html</pre></div></div>


<p>I used this with a simple bash for loop to run over all my files:</p>


<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">for</span> f <span style="color: #000000; font-weight: bold;">in</span> <span style="color: #000000; font-weight: bold;">*/*</span>.html; <span style="color: #000000; font-weight: bold;">do</span> .<span style="color: #000000; font-weight: bold;">/</span>cleanwordhtml.pl &quot;<span style="color: #007800;">$f</span>&quot; <span style="color: #000000; font-weight: bold;">&gt;</span> &quot;<span style="color: #800000;">${f%.html}</span>-clean.html&quot;; <span style="color: #000000; font-weight: bold;">done</span>;</pre></div></div>
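
<p>One thing to watch out for: on a second run the <code>*/*.html</code> glob also matches the <code>-clean.html</code> files from the first run. A slightly more defensive variant might look like the following sketch, where <code>clean_all</code> is a hypothetical wrapper that takes the path to the cleaning script as its argument.</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"># Hypothetical wrapper: run the cleaner given as $1 over every */*.html file,
# skipping output left behind by an earlier run.
clean_all() {
  cleaner=$1
  for f in */*.html; do
    # If the glob matched nothing, $f is the literal pattern; skip it.
    [ -e "$f" ] || continue
    # Do not re-clean files that are themselves cleaned output.
    case "$f" in *-clean.html) continue ;; esac
    "$cleaner" "$f" &gt; "${f%.html}-clean.html"
  done
}</pre></div></div>

<p>Run it from the same directory as the loop above: <code>clean_all ./cleanwordhtml.pl</code>.</p>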


<p>Hopefully someone else can benefit from my experience.</p>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/disgusting-word-formatted-html-and-how-to-fix-it/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

