Changing My Blog to Jekyll: How I Did It
For those who want to make a Jekyll blog.
This is the second of two posts about redoing my blog. This one is mainly for other people who’d like to make a Jekyll blog and are interested in how I did some of the things I did. If you’re migrating from Blogger you’ll find it especially relevant.
This year I finished a long project: moving my entire twelve-year-long Blogger blog over to Jekyll. Probably not a lot of people, especially programmers, have stuck with Blogger as long as I did. I thought it was kinda charming, until I didn’t. But if you’re doing this with a Blogger blog of any kind—or if you’ve just seen some of the stuff I’ve got going on and think it’s kinda neat—here’s how I got all the functionality of Blogger that I cared about with the excellent duo of Jekyll and GitHub.
Step 1: Get my old posts.
Blogger gives you a button to get a copy of everything you’ve ever written to it, under Settings >
Other > Import & Back Up > Back up Content
. It’s all there in one big XML dump.
Step 2: Translate to Jekyllese (YAML + Markdown).
While I was researching how to do this, I discovered A.J. O’Neal’s blogger2jekyll. It looked quite promising, but apparently something in the middle of my massive (8MB) XML file made it choke, and I could never get it to go past the first hundred posts or so out of over 300.
I’d just recently been doing some WordPress work, so my solution was to set up a localhost with a basic WordPress installation on it, and use it as a middleman. Blogger sucks and WordPress is popular, so I figured a Blogger-to-WordPress migrator would be a well-used and well-refined bit of code. And indeed, there’s a plugin named Blogger Importer, made by wordpressdotorg itself. It worked beautifully.
Then I installed another plugin, Wordpress to Jekyll Exporter, which also worked a treat. Except it didn’t export any comments. More on that later.
Step 3: Clean up, clean up.
I opened up my brand-new .md files and found that I had, here, a very bad combination. On the one
hand I had Blogger’s terrible awful formatting. For all of Blogger’s history, for example, it has
used <div>
s for paragraphs, not <p>
s. The “p” stands for “paragraph”!! It insists on wrapping
every post in a pointless <div>
. Most styles are inline. The WYSIWYG editor peppers extra opening
and closing tags everywhere. Its formatting of images is some table-based stuff that may have made
perfect sense in the early ’00s but that developers now find horrifying. It’s also changed in subtle
ways over the years. On the other hand, I had my idiosyncratic insistence that paragraphs should be
indented, not spaced out from each other, compounded with how I didn’t know or care at all about W3C
standards.
The WordPress-to-Jekyll exporter seems to have tried its best to make it all Markdown, but it was ugly in those files. So I spent a long time clearing out the crap and getting them down to bare Markdown. It took a long time, but it felt good.
If you find yourself facing a similar task, my advice is: Use Vim. Make sure you install surround.vim. Then you can make up whatever macros you need, right on the spot. Here are some that were helpful to me:
" (p)aragraphs from (h)yphens - because that's how I paragraphed all my old
" blog posts
nnoremap <leader>ph :%s/^-\(-\)\@!/\r/gc<CR>
" (p)aragraphs from double line (b)reaks - because that's how I paragraphed
" all my even older blog posts
nnoremap <leader>pb :%s/<br \/> <br \/>/\r\r/gc<CR>
" (p)aragraphs from (n)ewlines - same as above but for single <br> tag
nnoremap <leader>pn :%s/<br \/>/\r\r/gc<CR>
" (p)aragraphs from non-breaking (s)paces
nnoremap <leader>ps :%s/\ \ \n \ /\r\r/gc
" (p)aragraphs from (i)nline line breaks
nnoremap <leader>pi :%s/<br\ \?\n\?\/>\ \? \ \?\n\? \ \?/\r\r/gc
" (s)ection (b)reak
inoremap \sb <CR>* * *<CR>
nnoremap <leader>sb i<CR>* * *<esc>
" pretty up linebreaks [run g(q)] on the body of the (p)ost
nnoremap <leader>qp ?^---<CR>jvGgq
" (d)elete (e)mpty divs
nnoremap <leader>de :%s/<div>\n<\/div>//g<CR>
" (d)elete extra (n)ewlines
nnoremap <leader>dn :%s/\n\n\+/\r\r/g<CR>
" (r)eplace non-breaking (s)paces with spaces
nnoremap <leader>rs :%s/ /\ /g
" (r)eplace (i)talic tags <i>
nnoremap <leader>ri :%s/<\(\/\)\?i>/*/g<CR>
" (r)eplace e(m) tags <em>
nnoremap <leader>rm :%s/<\(\/\)\?em>/*/g<CR>
" (r)eplace (b)old tags <b>
nnoremap <leader>rb :%s/<\(\/\)\?b>/**/g<CR>
Step 4: Get my images.
Blogger uploads your images to Google Photos, but then it hides them from you in a secret area just
for Blogger images. They don’t seem to have mentioned this to anyone. The only way I found them was
by going to
the Blogger menu: Help
, then typed in “images”, and clicked on “Add images and videos to your
blog”. It says there, “Images in your blog are stored in a Google Album Archive…”, with a link.
Apparently if you’re logged in to Google, though, you can just go to
get.google.com/albumarchive and it’ll take you right to your
own.
Once you find the damn thing, you can click through to your blog’s image album, and in the top
corner there’ll be a “…” button. From that you can download all your photos in a big giant zip
folder. (As I recall, I didn’t actually do this, because I realized that I still had copies of all
the Blogger photos on my hard drive. So as I was going through all the posts to insert figure
includes for the images, I would make sure each image got moved over to the site’s images folder.
For some reason I was under the belief that this saved me work. I’m not sure it did.)
Step 5: Figure out image displaying.
Bare Markdown images work, but I like adding captions, and there’s not a great way to do that
without getting outside of Markdown. After I established that bare Markdown wouldn’t work, the best
solution was a Jekyll include
. I named mine fig.html
.
With this tool available to me, I decided I’d also shorten all my image names! No more typing out
full URLs. Just IMG_0408.jpg
or whatever my camera decided to name it. Then I’d chuck them all in
a single location, and set the name of that location in my _config.yml
. I’ve liked this a lot.
When I want an image, I just write,
{% include fig.html src="IMG_0408.jpg" caption="This photo gots a caption." %}
And fig.html
does everything else for me.
At first I put all my images under /assets/images/
. Then I realized I really should have small
versions of them, because some of my pictures are pretty heavy on load time, so I made fig.html
render an <img>
tag that displays the small version, inside an <a>
tag that links to the large
version. And I made folders full of large and small versions of all the images at
/assets/images/lg/
and /assets/images/sm/
.
This strategy also worked excellently with Magnific Popup. I knew I wanted a lightbox-style image displayer, and after much looking I concluded Magnific was the best. I futzed with its CSS until it displayed the snazzy captions I wanted, and I’m a happy user.
Finally I had one more realization, and that was that with all those images in the directory, doing stuff to my site was now taking forever, especially the Git operations.
I decided this was a problem worth bringing in an extra dependency for. So off of some other Jekyller’s tip, I got set up with Cloudinary. It’s a cloud-based image hosting site. If you’re an obscure, little-read blogger like me, their free plan will be all the bandwidth you could ever need. And they have the added benefit that you can have them serve your image at different sizes just by feeding them a different URL.
To put it all together, my _config.yml
ended up with the following in it:
image_dirs:
local:
root: "/assets/images"
large: "/lg/"
small: "/sm/"
cloudinary:
root: "http://res.cloudinary.com/chuckmasterson/image/upload"
large: "/"
small: "/c_limit,h_450,w_450/"
I’ve defined two places to look for the images. This way I can build and preview the site on my own
computer before pushing, using an environment
variable with $
JEKYLL_ENV=local bundle exec jekyll build
, and I can see the post with photos even if I’m away from
an internet connection. The bit in cloudinary.small
tells Cloudinary to shrink the image to
450×450px before serving it.
fig.html
looks like this, then:
{% if jekyll.environment == "local" or include.elsewhere %}
{% assign src_to_use = include.src %}
{% assign img_root = site.image_dirs.local.root %}
{% assign lg_dir = site.image_dirs.local.large %}
{% assign sm_dir = site.image_dirs.local.small %}
{% else %}
{% capture src_to_use %}{{ include.src
| replace: "'", "_"
| replace: " ", "_"
}}{% endcapture %}
{% assign img_root = site.image_dirs.cloudinary.root %}
{% assign lg_dir = site.image_dirs.cloudinary.large %}
{% assign sm_dir = site.image_dirs.cloudinary.small %}
{% endif %}
<figure class="image-fig {{ include.class }}">
<a class="mfp-trigger" href="{% if include.elsewhere %}{{ include.src }}
{% else %}{{ img_root }}{{ lg_dir }}{{ src_to_use }}
{% endif %}">
<img
{% if include.elsewhere %}
src="{{ include.src | uri_escape }}"
{% else %}
src="{{ img_root }}{{ sm_dir }}{{ src_to_use }}"
{% endif %}
/>
</a>
{% if include.caption %}
<figcaption>
{{ include.caption | markdownify }}
</figcaption>
{% endif %}
</figure>
All of this does make the process of adding images slightly more obtuse:
- Copy all images for the post into
/assets/images/staging/
. - Copy again into
/assets/images/lg/
. - Upload everything from the staging folder onto Cloudinary (simple drag-and-drop).
-
Resize everything in the staging folder with
$ mogrify -resize 640x640 ./*.jpg
(Which I keep saved in that folder as
resize.sh
.) - Move those into
/assets/images/sm/
. - All done.
But I like the flexibility I get from it.
Step 6: Bring in the comments.
Most Jekyll setups kind of assume that you either don’t care what your readers have to say, or you want to use Disqus or some other commenting thing that works via a database from another site. If I wanted that, why make a static site in the first place?
Somehow or other I got tipped off to the existence of Eduardo Bouças’s excellent alternative Staticman. This is a program that runs on a server somewhere else, and uses GitHub’s API in a neat, creative way: it accepts a comment that’s submitted to it and processes it into a YAML file. Then, with the program’s own GitHub account (@staticmanapp), it commits the YAML file to your repo with the right name. And there you go. The comment appears without you having to touch it. Or you can moderate it: then Staticman opens a pull request with the YAML file.
So I was like, “Sweet! Sign me up!”
But before I could do that, I had to make files out of my comments.
Recall from earlier that my comments were all still in a local WordPress installation, sitting there in the database. Well, I brushed up a little on my MySQL, and eventually figured out the following query, which I ran in phpMyAdmin, to get them out of the database and into one big file:
SELECT CONCAT("author: \"", REPLACE(c.comment_author, '"', '\\"'), "\"
authorurl: ", c.comment_author_url, "
_id: ", c.comment_ID - 2, "
legacyslug: ", m.meta_value, "
timestamp: ", c.comment_date, "
text: \"", REPLACE(c.comment_content, '"', '\\"' ), "\"
DELIMITER"
)
FROM wp_comments c
INNER JOIN wp_posts p ON c.comment_post_ID = p.ID
INNER JOIN wp_postmeta m ON p.ID = m.post_id
WHERE c.comment_approved = 1
AND m.meta_key = 'blogger_permalink'
INTO OUTFILE "/var/lib/mysql-files/commentdump.txt";
This file then needs a bunch of processing. First, the backslashes. Urgh, no matter what escape sequence I used, MySQL would give me either too many backslashes or none, wherever there was a backslash in my comments. So, replace all the double backslashes:
sed --in-place 's:\\":\":g' commentdump.txt
Now, the file is a bunch of valid YAML snippets, separated by a line with just the word DELIMITER
on it. So, split those apart using csplit
, the
context-based file splitter.
csplit -f comment -n 5 -s commentdump.txt -b '%d.yml' '/DELIMITER/+1' {*}
And also, there were a bunch of ^M
characters in there. Got rid of those.
sed --in-place 's#^M\\##g' *.yml
# bash will insist that you type the ^M character over again (Ctrl-V, Ctrl-M)
Then a bunch of massaging:
# remove leading slash from blogger_permalink field
sed -i 's#^legacyslug:\ /#legacyslug:\ #g' *.yml;
# convert slashes in blogger_permalink a.k.a. legacyslug into hyphens
for file in *.yml; do num=`grep -n 'legacyslug:' $file | cut -d : -f 1`; sed -i "${num}s:/:-:g" $file; done;
# remove .html from the end of the legacyslug field
for file in *.yml; do slug=`cat $file | grep "legacyslug" | sed "s/\.html//"`; sed -i "s/legacyslug.*/$slug/" $file; done;
# rename comments so they're sequential in the right order
for file in *.yml; do comwpid=`grep "^_id:" $file | awk -F" " '{print $NF}'`; fname=comment-$((9999 - comwpid)).yml; mv $file $fname; done
# create a folder for each slug that exists in the comments
for file in *.yml; do slug=`cat $file | grep 'legacyslug: ' | awk -F' ' ' { print $NF } ' | sed "s/\.html//"`; mkdir -p ~/PATH/TO/YOUR/_data/comments/$slug; done
# sort the comments into the folders you just made
for file in *.yml; do slug=`cat $file | grep 'legacyslug: ' | awk -F' ' ' { print $NF } '`; mv $file ~/PATH/TO/YOUR/_data/comments/$slug/; done;
# get rid of my old Blogger user url and replace with chuckmasterson.com
find . *.yml -exec sed 's#https://www.blogger.com/profile/03918675492238901083#http://www.chuckmasterson.com#' {} \;
# deletes everything in all the comment subfolders in _data/comments/,
# in case you mess things up and need to start over
find _data/comments -maxdepth 2 -type f -exec rm {} \;
If all those commands worked, you should be left with a folder called _comments
with lots of
subfolders each with the name of one of your posts, in the format yyyy-mm-name-of-post
. It’s
entirely likely that these won’t transfer exactly to your situation and you’ll need to learn a bit
about sed
, grep
, cat
, awk
, tail
, and all those other bizarre file-manipulating Bash
commands. But you’ll get it eventually. I have faith. I knew nothing about those commands before
I started this project.
Step 7: Thread the comments
As if that weren’t enough, I wanted threaded commenting too, because I decided I could be as perfectionist as I wanted. And so I found Michael Rose’s also-excellent instructions for how to get that done.
I more or less lifted my comments section straight out of Michael’s source
code at /src/_layouts/page__comments.html
and
/src/_layouts/comment.html/
. page__comments.html
goes into the post
layout, and it loops
through each comment in the relevant directory, rendering them with comment.html
and adding the
CSS class child
to posts that have been made by replying to another post. It’s all quite clever
and Liquidy.
Trouble was, we both discovered later that it was incorrect. We were misusing a Staticman option
named options[parent]
. options[parent]
should later be very useful for setting up comment reply
emails, so we don’t want to misuse it in a way that makes that impossible. So I needed to
change some stuff in the comment rendering. The commit where I did
that
is much more useful to look at than anything I could write here, since you can see all the different
files I had to change, and how.
Step 8 (??): Comment reply notification emails
Ostensibly, Staticman has the capability to send a notification email to everyone who’s subscribed to a post. However, several people in the comments on a related issue have tried to get it set up, by the book, and it seems not to work quite yet. It could be that we’re all doing it wrong, or it could be that Staticman is buggy. I don’t know yet, but I look forward to finding out and getting the emails working. It’ll be the final piece in this puzzle.
Step 9: Forms
I’ve got a couple forms—one for subscribing and one for manually commenting in case comments aren’t working. These come to me through Formspree.
Step 10: Redirect
As Yevgeniy Brikman has pointed
out, Blogger
doesn’t give you much flexibility in redirecting to a better site. Its templates use what he calls
“some sort of arcane XML syntax that only supports basic variable lookups, loops, and if/else
statements”. In other words, if you want to redirect from
chuckmasterson.blogspot.com/2015/04/something.html to
chuckmasterson.com/blog/2015/04/something, you’re not going to get the templating language to do it
for you. It can’t change the string that way. You’ll have to use a JavaScript redirect, unless you
want to do what Yevgeniy did and put an if-else statement in there that contains the URLs of every
post you’ve ever written. (If that sounds like fun, he’s got some Ruby on the post I linked, but
I couldn’t make it work since I’m not a Rubyist.) But you can salvage some SEO cred by setting your
canonical
to the new blog.
My JavaScript redirect looks like this—including the ridiculous quote-escaping that Blogger’s arcane XML garbage makes you do:
<script>
var bsUrl="<data:blog.url/>";
ghUrl = bsUrl.replace("blogspot.", "");
</script>
<body onload='window.location = ghUrl'>
[...]
Once you’ve got that, you also need to add jekyll-redirect-from
to your Gemfile, and change
whatever bit of your YAML metadata has the original blogger URL so that the field is named
redirect_from
. That’ll set it up so that when someone arrives from an old Blogger link, they’ll
get to the right Jekyll post.
Step 11: Enjoy
Blogging with Jekyll pairs well with vim-jekyll.
Note: comments are temporarily disabled because Google’s spam-blocking software cannot withstand spammers’ resolve.
1 Comment
Chuck
HistoryOne downfall of Staticman is that the spam filtering feature hasn’t been fully developed. Spammers, for some reason, started heavily bombarding this post. So I’m locking comments, but you can still leave one via the manual comment form. You should copy-paste the address of this page first.