Weekly Project Update 13
Yet another week of nothing really exciting to report. Right smack bang in the middle of some work that is incremental in nature and is therefore not that interesting to write about. So like last week, I’ll keep this update short.
Work is progressing on the import feature for Feed Journaler. I’m currently mired in adding support for reading WordPress Export format (WXR) files. I’m hoping to leverage the existing data model currently used to import blog posts coming in from an RSS feed, but in order to do this, some augmentation work is needed.
For one thing, the FeedItem class, which is responsible for holding information about a feed item (title, date, guid, etc.), assumes that the body of the imported post is in HTML. Since Day One prefers journal entries to be in Markdown, this class also does an HTML-to-Markdown translation. However, the WXR exports produced by Micro.blog already have posts in Markdown, making this translation unnecessary. So in order to use FeedItem, the representation of a post body needs to be abstracted away and broken up into two implementations, one for HTML and one for Markdown. This sort of activity is pretty much routine, but it does involve moving code around, fixing tests, and other things that are not super exciting.
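To make the split concrete, here is a minimal sketch of the idea. It is written in Python purely for illustration, not in the app's own language, and everything except the FeedItem name is hypothetical: PostBody, MarkdownBody, HTMLBody, and journal_text are names invented for this example. The HTML-to-Markdown step is a toy stand-in that only handles links and paragraphs.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


class PostBody(ABC):
    """Hypothetical abstraction over the body of an imported post."""

    @abstractmethod
    def as_markdown(self) -> str:
        """Return the body in the Markdown form the journal expects."""


@dataclass
class MarkdownBody(PostBody):
    text: str

    def as_markdown(self) -> str:
        # Already Markdown (e.g. a Micro.blog export): pass it through untouched.
        return self.text


@dataclass
class HTMLBody(PostBody):
    html: str

    def as_markdown(self) -> str:
        # Toy stand-in for a real HTML-to-Markdown translator. It handles
        # only simple links and paragraphs, to show where translation plugs in.
        text = re.sub(r'<a href="([^"]*)">(.*?)</a>', r"[\2](\1)", self.html)
        text = re.sub(r"</p>\s*<p>", "\n\n", text)
        return re.sub(r"</?p>", "", text).strip()


@dataclass
class FeedItem:
    title: str
    guid: str
    body: PostBody  # the item no longer assumes HTML

    def journal_text(self) -> str:
        # The item does not care which kind of body it holds.
        return self.body.as_markdown()


rss_item = FeedItem("From RSS", "guid-1", HTMLBody("<p>Hello</p><p>Bye</p>"))
wxr_item = FeedItem("From WXR", "guid-2", MarkdownBody("Hello *there*"))
print(rss_item.journal_text())
print(wxr_item.journal_text())
```

With this shape, the RSS importer would wrap bodies in the HTML variant and the WXR importer in the Markdown one, while everything downstream of FeedItem stays the same.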
One other wrinkle in this is the RSS parser library I’m using. To explain why, let me first describe the WXR file format. If you open a WXR file in a text editor, you will see that it is, in essence, an RSS feed. But it’s an RSS feed that includes a fair bit of information about what the export actually contains. This is because the export includes, along with blog posts, everything else that makes up a site, like HTML pages and references to assets like images. In order for an application to make sense of it all, each item needs to state what it actually is: whether it’s a blog post or an HTML page. This is not something the RSS standard does out of the box, but thanks to the magic of XML and namespaces, WXR can include this information within the base RSS format by defining a separate namespace and including a bunch of elements that use it. [1]
Unfortunately, the RSS parser library I’m using does not know about these namespaces, so I can’t use it to read the WXR file. This means I will need to parse the file myself, using the XMLDocument APIs provided by Cocoa. Again, this is not hard, but it involves more moving things around, more tests to write, etc.
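As a rough illustration of what namespace-aware parsing involves, here is a small sketch using Python's standard library rather than Cocoa's XMLDocument. The sample document is made up and heavily trimmed. The wp namespace URI shown is the one used by WXR 1.2 exports and is an assumption here, since other export versions use a different version number in the URI. The read_posts function is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used for lookups. The wp URI is assumed to be the
# WXR 1.2 one; check the xmlns declarations of the actual export.
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Made-up, heavily trimmed sample: plain RSS plus namespaced extras.
SAMPLE_WXR = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <title>Example site</title>
    <item>
      <title>Hello</title>
      <content:encoded><![CDATA[First *post*.]]></content:encoded>
      <wp:post_type>post</wp:post_type>
    </item>
    <item>
      <title>About</title>
      <content:encoded><![CDATA[<p>About me.</p>]]></content:encoded>
      <wp:post_type>page</wp:post_type>
    </item>
  </channel>
</rss>
"""


def read_posts(wxr_text):
    """Return title and body for every item whose wp:post_type is 'post'."""
    root = ET.fromstring(wxr_text)
    posts = []
    for item in root.iterfind("./channel/item"):
        # Each item states what it is through a namespaced element.
        if item.findtext("wp:post_type", namespaces=NS) != "post":
            continue  # skip pages, attachments, and so on
        posts.append({
            "title": item.findtext("title", default=""),
            "body": item.findtext("content:encoded", default="", namespaces=NS),
        })
    return posts


for post in read_posts(SAMPLE_WXR):
    print(post["title"], "->", post["body"])
```

The same pattern should carry over to any namespace-aware XML API: register the namespace URIs, then query the namespaced elements alongside the plain RSS ones.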
Such is the nature of software development sometimes. Peaks and lulls, swings and roundabouts, aphorisms and platitudes 😉 . But, yeah, not much more to say on this project this week.
But one rant about the WXR format before we leave. Of all the information they chose to include in an export, one thing they left out is the post’s content type. I had a look at how posts appear in a WXR export from WordPress. As I expected, and unlike Micro.blog, the posts are exported in HTML. But there’s nothing in the file saying that it’s HTML. Nothing like the Content-Type: text/html header that you’d see in HTTP responses. So how is Feed Journaler to know the post body type? How could it tell, with absolute certainty, a post in HTML from one in Markdown? Without this information, it can’t.
So how I’m going to deal with exports from both WordPress and Micro.blog will be interesting. I’ve considered adding support for detecting the type based on the post contents itself. For example, I could count the number of times the substring <p> appears in the post, something that isn’t often seen in Markdown. Or I could look for the substring ](http and guess that to be part of a Markdown link. Of course, this isn’t perfect: a typical micro post from Micro.blog may not contain either one, so the guess will be ambiguous.
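A minimal sketch of that heuristic might look like the following, again in Python purely for illustration. The function name and the three return values are assumptions, and the tie-breaking rule (whichever marker appears more often wins) is just one possible choice.

```python
def guess_body_type(body: str) -> str:
    """Guess whether a post body is HTML or Markdown from two markers.

    Returns "html", "markdown", or "ambiguous". This is only a guess:
    the WXR file itself carries no content type to confirm it.
    """
    html_hits = body.count("<p>")         # paragraph tags are rare in Markdown
    markdown_hits = body.count("](http")  # tail end of a Markdown link

    if html_hits > markdown_hits:
        return "html"
    if markdown_hits > html_hits:
        return "markdown"
    # Neither marker present (or an exact tie): cannot tell.
    return "ambiguous"


print(guess_body_type("<p>Posted from WordPress.</p>"))
print(guess_body_type("See [this post](https://example.com/post)."))
print(guess_body_type("Just a short micro post."))
```

The third call shows the weak spot: a short post with no links and no tags falls through to "ambiguous", so the importer would still need a default.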
I guess for the first cut of this, I’ll just assume that the exports are from Micro.blog, and treat the body as Markdown. But this seems like a glaring omission from the format in my opinion.
There was another project that I spent some time working on. I’ve mentioned it before in passing, but I think to really talk about it, it will need some context. I think I will leave this for another post though.
[1] Fun fact: podcast feeds do something similar, with most of them building atop the base RSS format with some additional elements identified by an iTunes namespace.