Saturday, May 12, 2007

How to make a backup of all blog content?

I like to have a backup of my blog on my notebook, so that I can run searches in it when I am not connected.

Blogspot has nice URLs for each post - e.g.
is the URL for a post that I wrote on making email -> blogger (mostly) work. This suggests a file system where there are directories 2007, 2007/03 and then a file 2007/03/how-to-make-email-to-blogger-work.html, which would be a case of nice software engineering.
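
The URL-to-file mapping described above can be sketched in a few lines of shell. This is only an illustration: the function name `url_to_path` and the host `example.blogspot.com` are made up for the example, and it assumes every post URL has the form scheme://host/YYYY/MM/name.html.

```shell
#!/bin/sh
# url_to_path URL: drop the "scheme://host/" prefix from a post URL,
# leaving the relative path YYYY/MM/name.html (hypothetical helper).
url_to_path() {
  printf '%s\n' "$1" | sed -e 's|^[a-zA-Z]*://[^/]*/||'
}

# Example use (the host is a placeholder, not the real blog address):
#   path=$(url_to_path "http://example.blogspot.com/2007/03/how-to-make-email-to-blogger-work.html")
#   mkdir -p "$(dirname "$path")"    # creates 2007/03
```

Each post would then live in a local file whose name is the path part of its URL.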

How would I make a personal file system that mirrors my blog with this structure? I'm unable to do this. I tried to use wget with recursive get options and it gets lost. A key feature that I want is to be able to say wget -N (timestamping) so that only new or modified posts are picked up and unchanged posts are not brought down again.
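
For picking up only modified posts, wget's timestamping option -N is the relevant one (-c only resumes interrupted downloads). Here is a rough sketch, assuming a list of post URLs is already available in a file; the function name `fetch_posts`, the file name `urls.txt` and the `WGET` override are inventions for this example. Whether -N actually saves anything depends on the server sending usable Last-Modified headers, which I have not verified for Blogspot.

```shell
#!/bin/sh
# fetch_posts FILE: mirror every URL listed in FILE (one URL per line).
# Set WGET="echo wget" for a dry run that only prints the command.
fetch_posts() {
  # -N   re-download only when the server copy is newer than the local file
  # -x   recreate the URL's directory path locally (2007/03/...)
  # -nH  do not add a top-level directory named after the host
  # -i   read the URLs from the given file
  ${WGET:-wget} -N -x -nH -i "$1"
}

# Example use:
#   fetch_posts urls.txt
```

With -x and -nH together, a post at /2007/03/name.html on the server lands in 2007/03/name.html under the current directory.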

Right now, I have a simple and dumb solution: I take one archive file per month, and I fetch the whole lot every time (which is wasteful of Google's resources). I use this script:


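# BLOG is a placeholder: set it to your blog's base URL.
BLOG="http://YOURBLOG.blogspot.com"
rm -f *.html *.text
for year in 2005 2006 2007 ; do
  for month in 01 02 03 04 05 06 07 08 09 10 11 12 ; do
    f="${year}_${month}_01_archive.html"
    # Fetch the monthly archive page, then dump it to plain text for searching.
    wget "$BLOG/$f" && links -dump "$f" > "$year$month.text"
  done
done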

This works, but it's not a nice solution: (a) I'm wasting bandwidth and Google's resources, and the waste will grow as the years go by, and (b) it doesn't get me the clean, well-organised file system with nice file names that ought to be possible.
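
One small step towards reducing the waste is to generate the list of monthly archive names up to the current month, instead of hard-coding the years, and to hand that list to wget -N. The following is a sketch only: the function name `archive_names` is made up, the `BLOG` variable in the usage comment is a placeholder for the blog's base URL, and the saving depends on the server honouring timestamps.

```shell
#!/bin/sh
# archive_names FIRST_YEAR LAST_YEAR LAST_MONTH
# Print one monthly archive file name per line, from January of
# FIRST_YEAR up to and including LAST_MONTH of LAST_YEAR.
archive_names() {
  y=$1
  last_month=${3#0}              # accept "05" as well as "5"
  while [ "$y" -le "$2" ]; do
    m=1
    while [ "$m" -le 12 ]; do
      # stop after the last wanted month of the last year
      if [ "$y" -eq "$2" ] && [ "$m" -gt "$last_month" ]; then
        break
      fi
      printf '%d_%02d_01_archive.html\n' "$y" "$m"
      m=$((m + 1))
    done
    y=$((y + 1))
  done
}

# Example use (BLOG is a placeholder for the blog's base URL):
#   archive_names 2005 "$(date +%Y)" "$(date +%m)" |
#     sed "s|^|$BLOG/|" | wget -N -i -
```

This still fetches whole monthly pages rather than individual posts, so it addresses (a) only partly and (b) not at all.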


  1. This is a good guide ...

    And BTW, Blogger does not store articles in a directory structure as you think it does. That's only a virtual representation of the articles, which are stored in a flat database.

    Nonetheless, some of the tools in the article above should do what you want.


  2. sir,

    this might be useful

  3. Sir,

    Firefox has a wonderful add-on, DownThemAll. It allows one to download all the files from a particular blog directory (one at a time), e.g. 2007, 2006, 2005, etc.

    The good thing is that we can choose the format of the files we wish to download from the site or blog. This could be .pdf, .html or .doc, or one can even enter a different file format.

    In case of a blog, the downloaded html pages will look exactly as they appear online, i.e. the RHS index, blog-owner's pic, etc.

  4. Ravi, thanks for the pointer. But you know me: I don't like doing anything which requires interaction. That takes too much time. I want a 100% automatable solution that can be stuck into a crontab and then I can forget about it.

    If I interacted with software, I'd get a lot less done! :-)

