Database with all XKCD comics

posted on 2013-02-05

I've decided to take a large collection of images and create a mega image to scroll through using LeafJS. The plugin is extremely simple and if I create files that mimic the OpenStreetMap API I should be fine with just adding the correct URL. One problem, I need a large collection of images.

One thing I could come up with is downloading a comic that doesn't really mind being downloaded and republished with the right attribution: XKCD.

The images have random names, but you can find the name using the json metadata they host for every comic. The following Haskell script downloads the metadata and outputs the filename to the stdout.

{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP
import Data.Aeson             ((.:), (.:?), decode, FromJSON(..), Value(..))
import Control.Monad (mzero)
import Control.Applicative ((<$>))
import qualified Data.ByteString.Lazy.Char8 as BS

data XkcdMeta = XkcdMeta { img :: String } deriving (Show)

instance FromJSON XkcdMeta where
     parseJSON (Object v) = XkcdMeta <$>
                            v .: "img"
     -- A non-Object value is of the wrong type, so fail.
     parseJSON _          = mzero

getUrlOrEmptyJsonString url = do
    result <- Network.HTTP.simpleHTTP (getRequest url)
    case result of
        Left _ -> return "{\"img\": \"\"}"
        Right response -> return (rspBody response)

decodeOrEmpty json = case decode (BS.pack json) :: Maybe XkcdMeta of
    Just xkcd -> img xkcd
    Nothing -> ""

jsonUrls = map (\x -> "" ++ (show x) ++ "/info.0.json") [1,2..1169]

downloadAndShowImageUrl url = do
    json <- getUrlOrEmptyJsonString url
    putStrLn $ (decodeOrEmpty json)

download = mapM_ downloadAndShowImageUrl

main = download jsonUrls

Second part of the equation is downloading all the files and extracting their size to a json database. Because it dependens on libgd for graphics, I've created a small cabal project with the the dependencies described in it. The only source file extract the size data of each image and outputs a json record.

module Main (main) where

import System.Environment (getArgs)
import Graphics.GD (loadPngFile, loadJpegFile, imageSize)
import Data.Aeson (ToJSON(toJSON), encode, object, (.=))
import qualified Data.ByteString.Lazy.Char8 as BS
import qualified Data.Text as T
import Data.List (isSuffixOf)

data ImageMeta = ImageMeta {
    filename :: String,
    index :: Int,
    pixelWidth :: Int,
    pixelHeight :: Int
    } deriving (Eq, Show)

instance ToJSON ImageMeta where
  toJSON (ImageMeta filename index pixelWidth pixelHeight) = object [
        (T.pack "index") .= index,
        (T.pack "filename") .= filename,
        (T.pack "pixelWidth") .= pixelWidth,
        (T.pack "pixelHeight") .= pixelHeight ]

loadImage filename = if isSuffixOf ".png" filename
    then loadPngFile filename
    else loadJpegFile filename

dumpMetaJsonOfFileAt (index, filename) = do
    img <- loadImage filename
    size <- imageSize img
    let metadata = ImageMeta filename index (fst size) (snd size)
    putStrLn $ BS.unpack (encode metadata)

-- Each commandline argument is considered a png file to load and dump about
main = do
    args <- getArgs
    mapM_ dumpMetaJsonOfFileAt (zip [1,2..] args)

The complete source and database with all images can be downloaded here: xkcd_comics_db.tar.xz mirror. It contains XKCD 1 to XKCD 1169. The comics.json file only contains PNG and JPEG images, I've decided to drop all GIF images at that point.

Next step will be to create a program to implement any kind of Packing algorithm. As soon as I have, I'll write another post. In the meantime, I hope somebody can enjoy the database and the code snippets and Learn You a Haskell for Great Good!