Notes on migrating this blog from Wordpress to Plone
I'm proud to now be running this blog on Plone 4, having just completed a migration from Wordpress. Here are a few notes on the migration.
Initial steps that I won't go into detail on, at least for now:
- Obtained a Linode VPS with enough RAM to feed Plone (it's looking like my small Plone 4 site will need about 100MB to itself ... a greedy baseline, but that does include the database).
- Created a Plone 4 buildout with ZEO and one Zope instance being run under supervisord.
- Installed plone.app.caching, the new caching framework Martin Aspeli and Ric Newbery have been working on. This still requires a number of svn checkouts at this point, but it's coming along nicely.
- Installed Scrawl so that I can manage blog entries as a separate content type.
- Installed plone.app.discussion, the new comment and discussion add-on created by Timo Stollenwerk, and its recaptcha add-on. Made sure comments were enabled for the Blog Entry type, and that comment moderation was not turned on for the duration of the migration.
- Turned off much of Plone's default HTML filtering. Security is not a huge concern since I'm going to be the only one editing the site, and I tend to want to use fancy stuff in posts sometimes.
Data Export
Moving the posts and comments from my old blog's MySQL database was easier than I feared, though I did have to do a bit of coding.
I decided up front to do this by dumping the data to CSV files, rather than writing import code that read directly from MySQL. That was mostly a visceral reaction to the memory of once having a hard time getting MySQL-python installed and working properly, and may have been an irrational fear. But dumping the data from MySQL to CSV was easy enough, using the following two queries, which grabbed just the data I needed:
SELECT ID, post_date, CONVERT(post_content USING latin1), post_title
  INTO OUTFILE '/tmp/musings.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n'
  FROM wp_posts, wp_post2cat
  WHERE wp_posts.ID=post_id AND category_id=48
  AND post_status='publish';

SELECT comment_post_ID, comment_author, comment_author_email, comment_author_url, comment_date, CONVERT(comment_content USING latin1), user_id
  INTO OUTFILE '/tmp/comments.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n'
  FROM wp_comments
  WHERE comment_approved = '1' AND comment_type='';
Each query grabs a handful of fields and dumps them into a CSV file with the given CSV dialect parameters. The only tricky bit here is the call to CONVERT, which was needed because my raw data in MySQL had been improperly encoded. A normal connection to MySQL defaults to the latin1 encoding (which is what MySQL calls windows-1252). But Wordpress had been sending it utf8-encoded data, and the MySQL table had been configured to store things as utf8. So when I stored data, MySQL was decoding the utf8 input as windows-1252, and then re-encoding it as utf8. On retrieval via a normal connection the reverse transformation was applied, so the problem never showed. But SELECT INTO OUTFILE just copies the raw data from the table, which was effectively gobbledygook. So I had to explicitly make MySQL convert the stored value to latin1 (read: decode it as utf8 and then encode as windows-1252) in order to end up with the utf8 I wanted. This would have been needed for the other fields as well, except I wasn't using non-ASCII characters in them.
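To make that round trip concrete, here is a minimal Python 3 sketch of the same double encoding and its reversal. It is an illustration only, not part of the migration: the names mangle and unmangle are made up, and Python's cp1252 codec stands in for MySQL's latin1 (the two differ on a few undefined bytes, which this sample avoids).

```python
def mangle(text):
    """Simulate what ended up in the table: UTF-8 bytes misread as windows-1252."""
    return text.encode('utf-8').decode('cp1252')


def unmangle(stored):
    """Reverse it: encode as windows-1252 (the CONVERT ... USING latin1 step),
    then read the resulting bytes as the UTF-8 they always were."""
    return stored.encode('cp1252').decode('utf-8')


original = 'café'
stored = mangle(original)      # the gobbledygook form of the text
recovered = unmangle(stored)   # the original text again

print(repr(stored), repr(recovered))
```

The final decode('utf-8') corresponds to the decode call in the import code further down, which is where the bytes from the CSV file become unicode text.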
The category restriction on the first query makes sure that I only got the Plone-related posts from the old blog (I had been using it for personal blogging as well). The comment type restriction on the second query excludes pingbacks.
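The FIELDS and LINES clauses in those queries describe a CSV dialect that Python's csv module can read back directly. As a rough sketch (Python 3, with two made-up sample rows rather than real data from the blog), the mapping looks like this:

```python
import csv
import io

# Two hypothetical rows shaped like the posts query output: the numeric ID is
# unquoted, string fields are quoted, and quotes and backslashes inside a
# field are escaped with a backslash.
sample = (
    '1,"2010-01-01 12:00:00","She said \\"hi\\", then left","First post"\n'
    '2,"2010-01-02 12:00:00","Back\\\\slash","Second post"\n'
)

reader = csv.reader(
    io.StringIO(sample),
    delimiter=',',       # FIELDS TERMINATED BY ','
    quotechar='"',       # OPTIONALLY ENCLOSED BY '"'
    doublequote=False,   # quotes are escaped, not doubled
    escapechar='\\',     # ESCAPED BY '\\'
)
rows = list(reader)

for row in rows:
    print(row)
```

Setting doublequote=False matters here: without it the reader would expect embedded quotes to be written as two quote characters instead of a backslash followed by a quote.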
Data Import
I ended up writing this custom External Method to import the CSV data into Plone. (I looked at transmogrifier, csvreplicata, and ArcheCSV, but for all of these it looked like I would have ended up writing a significant amount of code anyway to get them to do what I wanted. And I knew I could do that "from scratch" in just a page or two of Python...)
import csv
import re
from DateTime import DateTime
from zope.component import queryUtility, createObject
from plone.i18n.normalizer.interfaces import IIDNormalizer
from plone.app.discussion.interfaces import IConversation

PRE_RE = re.compile(r'(<pre>.*?</pre>)', re.IGNORECASE | re.DOTALL)

def cleanup_wordpress_text(text):
    text = PRE_RE.sub(lambda x: x.group(1).replace('\r\n\r\n', '\n\n'), text)
    return text.replace('\r\n\r\n', '<p>').replace('\r\n','\n').decode('utf-8')

def importmusings(self):
    context = self
    reader = csv.reader(open('/tmp/musings.csv'), delimiter=',', quotechar='"', doublequote=False, escapechar='\\')
    posts = {}
    for row in reader:
        id, date, text, title = row
        short = queryUtility(IIDNormalizer).normalize(title)
        if short in context:
            del context[short]
        post = context[context.invokeFactory('Blog Entry', short)]
        post.setCreators(['davisagli'])
        post.setEffectiveDate(DateTime(date))
        post.setTitle(title)
        text = cleanup_wordpress_text(text)
        post.setText(text, mimetype='text/html')
        post.reindexObject()
        context.portal_workflow.doActionFor(post, 'publish')
        posts[id] = post

    reader = csv.reader(open('/tmp/comments.csv'), delimiter=',', quotechar='"', doublequote=False, escapechar='\\')
    for row in reader:
        post_id, author, email, url, date, text, uid = row
        try:
            post = posts[post_id]
        except KeyError:
            continue
        conversation = IConversation(post)
        comment = createObject('plone.Comment')
        comment.text = cleanup_wordpress_text(text)
        if uid == '1':
            comment.creator = comment.author_username = 'davisagli'
            comment.author_name = 'David Glick'
            comment.author_email = 'dglick@gmail.com'
        else:
            comment.creator = None
            comment.author_name = author
            comment.author_email = email
        date = DateTime(date).asdatetime()
        comment.creation_date = comment.modification_date = date
        conversation.addComment(comment)
    return 'Done.'
That cleanup_wordpress_text function turns double newlines from Wordpress into proper paragraph tags -- unless they're within a PRE tag. The rest of the code is pretty readable -- yay, Python.
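Here is a small standalone sketch of that cleanup step so you can see what it does to a sample string. It is a Python 3 variant with the final decode('utf-8') dropped (the input is already text), the name cleanup_text is made up to keep it distinct from the real function, and the sample string is invented for illustration.

```python
import re

PRE_RE = re.compile(r'(<pre>.*?</pre>)', re.IGNORECASE | re.DOTALL)


def cleanup_text(text):
    # Inside <pre> blocks, rewrite CRLF blank lines as plain LF blank lines
    # so that the paragraph replacement below no longer matches them.
    text = PRE_RE.sub(lambda m: m.group(1).replace('\r\n\r\n', '\n\n'), text)
    # Everywhere else, a CRLF blank line becomes a paragraph tag, and any
    # remaining single CRLF becomes a plain newline.
    return text.replace('\r\n\r\n', '<p>').replace('\r\n', '\n')


sample = 'First.\r\n\r\nSecond.\r\n<pre>a\r\n\r\nb</pre>'
result = cleanup_text(sample)
print(repr(result))
# prints 'First.<p>Second.\n<pre>a\n\nb</pre>'
```

The blank line between the two sentences turns into a paragraph tag, while the blank line inside the PRE block survives as a real blank line.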
Syntax Highlighting
You probably noticed one of the new site features -- syntax highlighting for blocks of code. This is provided by the Pygments module, applied as a transformation to the entire response just before the Zope publisher returns it. I achieved that via a plugin (called collective.pygmentstransform, available in the collective, and not released so far or probably ever) for Martin's plone.transformchain (also unreleased so far). It's imperfect (not least because it guesses the language heuristically), but good enough for now, I think. Yes, I should probably be doing this as WSGI middleware, but I haven't spent the time to figure out how to run Zope 2.12 under WSGI yet.
w00T