blog

Posts Tagged ‘HTML’

Disgusting Word-formatted HTML and how to fix it

水曜日, 12月 30th, 2009

In working on a new website for the MIT Working Papers in Linguistics, I recently inherited a collection of HTML files with all of our books’ abstracts. To my dismay (but not surprise) the markup in these files were horrendous. Here are some of the cardinal sins of markup that I saw committed in these files:

  1. Confusing ids and classes. ids should be unique on the page… but here’s an instance of using multiple instances of the same id in order to format them together.
<div id="indent"> <div id="number">4.2.1</div> <div id="page">161</div> <div id="section">Old French (Adams 1987)</div>
</div> <div id="indent"> <div id="number">4.2.2</div> <div id="page">164</div> <div id="section">The evolution of the dialects of northern Italy</div>
  1. Putting a class on every instance of something. Everything paragraph should be formatted equivalently. We get the point.
<p class=MsoNormal><b>The English Noun Phrase in Its Sentential Aspect</b></p>
<p class=MsoNormal>Steven Paul Abney</p>
<p class=MsoNormal>May 1987</p>
  1. Using blank space for formatting.
<p class=MsoNormal><o:p>&amp;nbsp;</o:p></p>
  1. CSS styles that don’t exist. Browsers just ignore these anyway…
<p class=MsoNormal>One factor in determining which worlds a modal quantifies
over is the temporal argument of the modal’s accessibility relation.<span
style='mso-spacerun:yes'>  </span>It is well-known that a higher tense affects
the accessibility relation of modals.<span style='mso-spacerun:yes'> 
</span>What is not well-known is that there are aspectual operators high enough
to affect the accessibility relation of modals.<span style='mso-spacerun:yes'> 
</span&gt

The solution

My solution was to write a perl script which takes care of a number of these issues. It’s not foolproof and doesn’t involve any voodoo—for example, it can’t retypeset things which were formatted using whitespace—but it does a good job as a first pass.

You can run the script by making it executable (chmod +x cleanwordhtml.pl) then specifying a target filename as an argument. For example,

./cleanwordhtml.pl source.html > clean.html

I used this with a simple bash for loop to run over all my files:

for f in */*.html; do ./cleanwordhtml.pl $f > ${f%.html}-clean.html; done;

Hopefully someone else can benefit from my experience.

Using Templates with YARPP 3

水曜日, 1月 14th, 2009

Post updated January 2012 to reflect cleaner template code available with YARPP 3.4.

If you have a YARPP support question not directly related to the templating feature, please use the YARPP support forums.

Version 3 of Yet Another Related Posts Plugin is a major rewrite which adds two new powerful features: caching and templating. Today I’m going to show you how you can use templates to customize the look of your related posts output.1

Previously with YARPP you were relatively limited in the ways you could present related posts. You were able to set some HTML tags to wrap your posts in and choose how much of an excerpt (if any) to display. This limited interface worked great for many users—indeed, these options still exists in YARPP 3.0. However, there’s also a new option for those of you who want to put your PHP skills to work and have complete control over your related posts display. The option will let you choose any files in the templates subdirectory of YARPP.

templates interface

(続きを読む…)


  1. For those of you interested in the WP and SQL voodoo used to make this happen, I’ve posted a more technical article


© 2006–2013 mitcho (Michael 芳貴 Erlewine).
Proudly powered by WordPress on Media Temple.
The views expressed on these pages are mine alone and do not
reflect those of my employers and clients, past and present.