I’ve recently begun using Emacs org-mode, and I quite like it. I do a lot of my writing, some of which might finally end up here on my blog, in org-mode these days. For a lot of my applications it’s a perfect overlay on plain text documents. I can export what I’ve written as HTML or LaTeX, which is fantastic.
What’s less fantastic is when I want to cut and paste the HTML output here to my WordPress blog. Unfortunately org-mode puts linefeeds inside paragraph elements, and for some reason WordPress maintains these, resulting in incorrect word wrapping. So I want a way to remove the linefeeds inside paragraph elements.
This was just a little bit beyond my capabilities with sed, and I’m often telling myself, “Self, you should really learn how to use sed and regular expressions better.” So I thought, bugger it, let’s figure out how to do this, and I whipped out the excellent book “sed & awk” from O’Reilly.
As someone who has only used sed for banal substitutions, I had to learn the following:
- “:whatever” can be used to create a label. There are two commands that allow you to utilize these labels: “b” branches to a label unconditionally, while “t” branches to a label only if a substitution has succeeded since the current input line was read (or since the last “t”).
- “N” is needed to join two lines, since sed normally works in a one-line-at-a-time fashion.
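To get a feel for these commands in isolation, here is a small sketch that is not part of the final script. The sample input strings and the variable names are made up for illustration. The label is passed with separate “-e” options so that it doesn’t depend on how a given sed treats “;” after a label.

```shell
# N: append the next input line to the pattern space (joined by "\n"),
# then replace that embedded newline with a space.
joined=$(printf 'one\ntwo\nthree\nfour\n' | sed 'N;s/\n/ /')
printf '%s\n' "$joined"

# t: jump back to label "a" for as long as the substitution keeps succeeding,
# which collapses a run of spaces into a single space.
squeezed=$(printf 'a    b\n' | sed -e ':a' -e 's/  / /' -e 'ta')
printf '%s\n' "$squeezed"
```

The first command pairs the lines up (“one two”, then “three four”), and the second prints “a b”.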
With these two tidbits, and a basic understanding of how sed operates, we can construct the desired script.
:top
/<p>/
{:loop
N
s/\n/ /
/<\/p>/{P;D;btop}
bloop}
In one line the command looks like this:
sed ':top;/<p>/{:loop;N;s/\n/ /;/<\/p>/{P;D;btop};bloop}'
If you’re like me, it’s not immediately clear what’s going on here, so let’s break it down:
- First we create a label with “:top”.
- “/<p>/” tells sed to look for the paragraph block tag. The next ‘line’ of the script will be called after this tag is found.
- The curly braces “{}” group a set of commands, so upon encountering the paragraph tag, it executes the contents of these brackets.  In the brackets:
- Create a new label “:loop”.
- “N” creates a multiline pattern space by reading the next line of input, and appending it to the contents of the pattern space.
- “s/\n/ /”: substitute a space for the line feed.
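- “/<\/p>/{P;D;btop}”: If the pattern space contains an end of paragraph tag, sed executes “P;D;btop”. (P) prints the pattern space up to the first newline, which here is all of it, since the newlines have already been replaced by spaces. (D) then deletes it and starts the next cycle with the next line of input, so the script begins again from the top by itself. That means the “btop” after it is never actually reached, though it does no harm and spells out the intent: a little like “if (</p>) goto top”.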
- “bloop” (b) branch and goto label(loop).
So as long as no closing tag (</p>) is found, we have a loop that keeps adding new lines to the multiline pattern space, and substituting spaces for linefeeds. When the closing tag (</p>) is found, the joined paragraph is printed and the script starts again from the top with the next line of input. That makes sure all of the paragraph sections get handled.
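To see the whole thing in action, here is a quick test run. The sample input is something I made up to resemble exported HTML, not real org-mode output, and I’m assuming GNU sed, since other seds may not accept “;” or “}” directly after a label.

```shell
# Hypothetical sample resembling exported HTML (not real org-mode output).
sample='<p>
First line of
a paragraph.
</p>
<h2>Title</h2>
<p>
Second one.
</p>'

# The one-line command from above, unchanged (GNU sed assumed).
result=$(printf '%s\n' "$sample" | sed ':top;/<p>/{:loop;N;s/\n/ /;/<\/p>/{P;D;btop};bloop}')
printf '%s\n' "$result"
```

Each paragraph comes out on a single line (“<p> First line of a paragraph. </p>”), while the heading line, which never matches “/<p>/”, passes through untouched.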
So that’s it. If anyone knows a more elegant solution to this, I’d be glad to hear about it.