Exporting Emacs org-mode HTML, ain’t sed grand.

I’ve recently begun using Emacs org-mode, and I quite like it.  I do a lot of my writing, some of which might finally end up here on my blog, in org-mode these days.  For a lot of my applications it’s a perfect overlay on plain text documents.  I can export what I’ve written as HTML or LaTeX, which is fantastic.

What’s less fantastic is when I want to cut and paste the HTML output here to my WordPress blog.  Unfortunately org-mode puts linefeeds inside paragraph elements, and for some reason WordPress maintains these, resulting in incorrect word wrapping.  So I want a way to remove the linefeeds inside paragraph elements.

This was just a little bit beyond my capabilities with sed, and I’m often telling myself, “Self, you should really learn how to use sed and regular expressions better.” So I thought bugger it, let’s figure out how to do this. So I whipped out the excellent book “sed & awk” from O’Reilly.

As someone who has only used sed for banal substitutions, I had to learn the following:

  • “:whatever” can be used to create a label.  There are two commands that allow you to utilize these labels: “b” branches unconditionally to a label, while “t” branches to a label only if a successful substitution has been made on the currently addressed line.
  • “N” is needed to join two lines, since sed normally works in a one-line-at-a-time fashion.
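Both pieces can be tried in isolation.  Here is a minimal sketch, assuming a POSIX-style sed; the input strings and the label name “again” are made up for illustration:

```shell
# N appends the next input line to the pattern space, separated by \n,
# so the s command can see (and replace) the embedded newline.
pair=$(printf '%s\n' 'first' 'second' | sed 'N;s/\n/ /')
echo "$pair"        # first second

# A label plus "t": each pass turns one pair of adjacent spaces into a
# single space, and "t again" jumps back only if the s command changed
# something, so the loop stops once no double space is left.
squeezed=$(printf '%s\n' 'a    b' | sed -e ':again' -e 's/  / /' -e 't again')
echo "$squeezed"    # a b
```

Giving the label and the branch their own “-e” keeps this independent of how a particular sed parses a label followed by a semicolon.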

With these two tidbits, and a basic understanding of how sed operates, we can construct the desired script.

:top
/<p>/ {
:loop
N
s/\n/ /
/<\/p>/{P;D;btop}
bloop
}

In one line the command looks like this:

 sed ':top;/<p>/{:loop;N;s/\n/ /;/<\/p>/{P;D;btop};bloop}'

If you’re like me, it’s not immediately clear what’s going on here, so let’s break it down:

  • First we create a label with “:top”.
  • “/<p>/” is an address: it tells sed to look for the opening paragraph tag.  The command group that follows runs only on lines where this tag is found.
  • The curly braces “{}” group a set of commands, so upon encountering the paragraph tag, sed executes the contents of these braces.   In the braces:
    • Create a new label “:loop”.
    • “N” creates a multiline pattern space by reading the next line of input, and appending it to the contents of the pattern space.
    • “s/\n/ /”: substitute a space for the line feed.
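    • “/<\/p>/{P;D;btop}”:  If the pattern space now contains an end of paragraph tag, sed executes “P;D;btop”.  (P) prints the pattern space up to the first newline, which here is all of it, since the newlines have already been replaced.  (D) deletes what was printed and starts the next cycle; with no newline left, that empties the pattern space and sed moves on to the next input line.  The (btop) that follows, a branch (b) to the label (top), is therefore never actually reached, though it does no harm.  The net effect is a little like “if (</p>) goto top”.
    • “bloop”: branch (b) to the label (loop).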
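One detail worth checking is what “D” does when the pattern space has no newline left in it, which is the situation right after “s/\n/ /”.  Both GNU sed and POSIX describe D as starting the next cycle, so commands placed after it in the same group should not run.  A small sketch of that behaviour; the input lines and the sentinel substitution are made up and are not part of the script above:

```shell
# On line 1: join it with line 2, print the result, then D.
# If D simply fell through, the sentinel s command would rewrite
# the pattern space and "never reached" would show up in the output.
out=$(printf '%s\n' 'one' 'two' 'three' | sed '
1 {
  N
  s/\n/ /
  P
  D
  s/.*/never reached/
}')
printf '%s\n' "$out"
# one two
# three
```

The joined line comes from P, and “three” is printed by the normal end-of-cycle output on the following cycle.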

So as long as no closing tag (</p>) is found, we have a loop that keeps appending new lines to the multiline pattern space and substituting spaces for linefeeds.  When the closing tag (</p>) is found, the joined paragraph is printed and sed starts again from the top of the script with the next line of input.  That makes sure all of the paragraph sections get handled.
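To see the whole thing in action, here is a sketch that runs the same commands on a made-up fragment (the sample HTML is an assumption about what the export looks like, not copied from a real file).  The commands are written one per line rather than as a one-liner, since not every sed accepts a label followed directly by “;” or “}”:

```shell
# Hypothetical sample: <p> and </p> each sit on a line of their own.
html='<h2>Title</h2>
<p>
First line of the paragraph,
continued on a second line.
</p>
<p>
Another paragraph.
</p>'

# Same commands as the one-liner, one per line.
script=':top
/<p>/ {
:loop
N
s/\n/ /
/<\/p>/ {
P
D
b top
}
b loop
}'

joined=$(printf '%s\n' "$html" | sed "$script")
printf '%s\n' "$joined"
# <h2>Title</h2>
# <p> First line of the paragraph, continued on a second line. </p>
# <p> Another paragraph. </p>
```

Note that “N” runs before the test for the closing tag, so a paragraph that opens and closes on a single line would pull the following line into it.  The sketch assumes that does not happen in the exported file.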

So that’s it.  If anyone knows a more elegant solution to this, I’d be glad to hear about it.
