How to serve IA-style books from your own cluster

The Internet Archive BookReader is designed so that you can run it on your own server. Once you download the BookReader source code to your webserver, you can load the BookReaderDemo, which will run the bookreader code with static images. You can change the location of the images to anywhere on your webserver, and you should be up and running!

Others have modified the IA BookReader to read image files from an image server, such as the Djatoka JPEG 2000 Image Server, instead of using static files on disk. This is also pretty easy to do.

These two scenarios should cover most use cases. Most likely, your book images are either static images in a directory, or they are served by an image server. However, what if your images are stored in a zip file, similar to how archive.org stores book images? We’ll walk you through how to set up your webserver (or cluster) to serve images using IA-style book data.

Internet Archive Storage for Book Data

The Internet Archive stores book images in JPEG 2000 format, and the individual images are sequentially-numbered and stored in a ZIP file. There are various other files that describe a book, and these files are grouped together in an Internet Archive item. An item has an identifier that is unique within the IA cluster.

Here is a breakdown of the files in an item with the identifier bookid, which is located in the directory /1/items/bookid:

Files used by the bookreader:

  • bookid_abbyy.gz – contains OCR data in XML format, used by full-text search
  • bookid_jp2.zip – contains processed JPEG 2000 images, which are scaled and displayed by the bookreader
  • bookid_meta.xml – contains bibliographic metadata about the book
  • bookid_scandata.xml – contains image size and page number information
  • scandata.xml – older variant of bookid_scandata.xml
  • scandata.zip – older variant of bookid_scandata.xml

The structure of bookid_jp2.zip looks like this:

> unzip -l bookid_jp2.zip |head
Archive:  bookid_jp2.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
        0  09-04-07 17:25   bookid_jp2/
   677279  09-04-07 17:21   bookid_jp2/bookid_0001.jp2
   418643  09-04-07 17:21   bookid_jp2/bookid_0002.jp2
   400545  09-04-07 17:21   bookid_jp2/bookid_0003.jp2
   367304  09-04-07 17:21   bookid_jp2/bookid_0004.jp2
   447760  09-04-07 17:21   bookid_jp2/bookid_0005.jp2
   383252  09-04-07 17:21   bookid_jp2/bookid_0006.jp2
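
The naming pattern in this listing can be sketched in a few lines of Python. This is only an illustration and not part of the BookReader code: page_member and list_page_members are hypothetical helpers, and the four-digit zero-padded numbering is inferred from the listing above.

```python
import io
import zipfile


def page_member(bookid, index):
    """Return the archive member name for a sequentially numbered page image.

    Assumes the four-digit, zero-padded numbering seen in the listing.
    """
    return f"{bookid}_jp2/{bookid}_{index:04d}.jp2"


def list_page_members(zip_file):
    """Return the .jp2 member names in a zip (path or file object), sorted by name."""
    with zipfile.ZipFile(zip_file) as archive:
        return sorted(n for n in archive.namelist() if n.endswith(".jp2"))


if __name__ == "__main__":
    # Build a tiny stand-in archive in memory so the sketch runs anywhere.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for i in (2, 1):
            archive.writestr(page_member("bookid", i), b"placeholder")
    buffer.seek(0)
    print(list_page_members(buffer))
```

Sorting by name returns the pages in reading order only because the numbers are zero-padded.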

Structure of the Bookreader codebase

The BookReader code is designed to be split across two different kinds of cluster nodes: web nodes and data nodes. However, it is easy to run the BookReader on a single machine that serves both roles.

The static files, such as BookReader.js and BookReader.css, are served from a web node. They are located in the top-level BookReader directory in the git repository.

The IA-specific backend PHP and Python files that parse meta.xml and extract the JPEG 2000 images are normally served from a data node. They live in the BookReaderIA directory in the repository. These files are not necessary for a simple BookReader deployment using static images or an image server.

Setting up the Datanode and the BookReader Image Server

BookReaderImages.php turns your data node into a very simple image server. It extracts, decompresses, scales, rotates, and recompresses images that are stored in the Internet Archive storage format.

Supported input image formats:

  • JPEG 2000
  • JPEG
  • TIFF
  • PNG

Supported output image formats (these are formats that all browsers can display):

  • JPEG
  • PNG

Supported archive formats:

  • ZIP
  • Tar

Supported image operations:

  • Scaling by powers of two
  • Scaling by an arbitrary factor (increases server load)
  • Rotation by 90 degrees
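
Scaling by powers of two is the cheap case because a JPEG 2000 file typically stores the image as a series of resolution levels, each half the size of the one before, so the decoder can stop early instead of resizing afterwards. The arithmetic can be sketched as follows. The helper names are hypothetical, and the mapping from a scale factor to a number of halvings (scale 4 means two halvings) is an assumption for illustration, not taken from BookReaderImages.inc.php.

```python
def reduce_level(scale):
    """Map a power-of-two scale factor to a number of halvings.

    Assumption: scale=1 is full size, scale=2 is half size, scale=4 is quarter size.
    """
    if scale < 1 or scale & (scale - 1):
        raise ValueError("scale must be a positive power of two")
    return scale.bit_length() - 1


def scaled_size(width, height, scale):
    """Return (width, height) after reduce_level(scale) halvings, rounding up."""
    factor = 2 ** reduce_level(scale)
    return (width + factor - 1) // factor, (height + factor - 1) // factor


if __name__ == "__main__":
    print(reduce_level(4))
    print(scaled_size(2481, 3508, 4))
```

Any other scale factor needs a real resampling step after decoding, which is why arbitrary scaling costs more on the server.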

In addition to serving images, the data node also executes scripts that directly interface with the files in an item. Most importantly, BookReaderJSIA.php reads meta.xml and scandata in order to instantiate the BookReader with the correct parameters.
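
To give a rough idea of what reading meta.xml involves, here is a small Python sketch. It is not how BookReaderJSIA.php does it: read_meta is a hypothetical helper, and the sample XML is a made-up stand-in that assumes a flat list of elements under one root, so check it against your own bookid_meta.xml.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for bookid_meta.xml (assumed layout, invented values).
SAMPLE_META = """<metadata>
  <identifier>bookid</identifier>
  <title>An Example Title</title>
</metadata>"""


def read_meta(xml_text):
    """Collect the text of each top-level element, keeping the first occurrence."""
    fields = {}
    for child in ET.fromstring(xml_text):
        fields.setdefault(child.tag, (child.text or "").strip())
    return fields


if __name__ == "__main__":
    print(read_meta(SAMPLE_META))
```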

1. Configure PHP

First, you will need to configure a web server on the datanode to serve PHP scripts. You can use a standard web server such as Apache or Nginx with fastcgi-php enabled.

In this example, we will set the docroot of the webserver to /var/www, and it will be able to serve PHP scripts from /var/www/BookReader (note the capitalization).

Phil at the Biodiversity Heritage Library has written detailed instructions on how to configure Nginx and fastcgi-php for hosting the BookReader.

2. Install BookReader PHP code

Now that your data node’s webserver has been configured to serve files from /var/www, create a directory called /var/www/BookReader (note the capitalization).

In the /var/www/BookReader directory, install the following scripts from the BookReaderIA/datanode repo directory:

  • BookReaderImages.php
  • BookReaderImages.inc.php
  • BookReaderMeta.inc.php
  • BookReaderJSIA.php

3. Test to see if the data node is properly serving PHP scripts

Let’s see if the webserver on the data node is properly configured. Try loading the BookReaderImages.php script. An example URL would look like:

http://cluster.biodiversitylibrary.org/BookReader/BookReaderImages.php

Without any script arguments, your script should return a 404 HTTP status. If you aren’t using a custom 404 handler, you should see something like:

Error serving request:
  Image error: Image stack does not exist at 

Debugging information:
#0 /var/www/BookReader/BookReaderImages.inc.php(245): BookReaderImages->BRfatal('Image stack doe...')
#1 /var/www/BookReader/BookReaderImages.php(38): BookReaderImages->serveRequest(Array)
#2 {main}

If you have php-cli installed, you can also run php BookReaderImages.php from the command line and you should see a similar error message.

This will verify that the webserver is configured to serve php scripts, and that the scripts are in the correct location.
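
The same check can be scripted. The sketch below is one possible approach in Python, with hypothetical helper names. It treats a 404 whose body contains the "Image stack does not exist" message shown above as a sign that PHP is executing the script. That test will not hold if you use a custom 404 handler.

```python
import urllib.error
import urllib.request


def script_status(url, opener=urllib.request.urlopen):
    """Return (HTTP status, start of body) for a URL, including error responses."""
    try:
        with opener(url) as response:
            return response.status, response.read(2000).decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; the error object still carries the body.
        return error.code, error.read(2000).decode("utf-8", "replace")


def looks_like_php_is_running(status, body):
    """True when a no-argument request produced the script's own error page."""
    return status == 404 and "Image stack does not exist" in body


if __name__ == "__main__":
    # Replace with the URL of your own data node.
    url = "http://cluster.biodiversitylibrary.org/BookReader/BookReaderImages.php"
    print(looks_like_php_is_running(*script_status(url)))
```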

4. Install binaries required for serving images

The following tools are needed to extract and decompress images. They must be installed on the executable $PATH of the webserver process owner (www-data):

  • unzip
  • 7z (for efficiently extracting images from tar archives, since tar does not seek() to the requested file)
  • netpbm tools (we call bmptopnm, jpegtopnm, tifftopnm, pngtopnm, pnmtopng, and pnmtojpeg)
  • exiftool (this can be installed at any path)

In addition to these binaries, you need to install the Kakadu JPEG 2000 software. We use Kakadu because it is fast, but you could modify BookReaderImages.inc.php to use JasPer or OpenJPEG instead.

Although we have a license to the Kakadu SDK, it is possible to use the freely distributed pre-compiled Kakadu binaries with BookReaderImages.php. Download the 32-bit Linux Kakadu binaries from here. If you are on 64-bit Linux, you will also need to install ia32-libs.

5. Edit paths in BookReaderImages.inc.php

Paths to the exiftool and kdu_expand command-line binaries are hard-coded in BookReaderImages.inc.php. These will need to be edited in your copy to your install path for these binaries:

   // Paths to command-line tools
    var $exiftool = '/petabox/sw/books/exiftool/exiftool';
    var $kduExpand = '/petabox/sw/bin/kdu_expand';

Also, the path to the Kakadu shared library (libkdu_vXXX.so) is hardcoded and will need to be changed:

        putenv('LD_LIBRARY_PATH=/petabox/sw/lib/kakadu');

6. Edit path in BookReaderJSIA.php

archive.org stores items in a directory structure that looks like /XX/items/bookid, where XX is a 1 or 2 digit integer. If your directory structure is different, you will need to remove or edit this path check in BookReaderJSIA.php:

if (!preg_match("|^/\d+/items/{$id}$|", $itemPath)) {
    BRFatal("Bad id!");
}
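
To see which paths this check accepts, here is the same pattern re-expressed in Python. It is only an illustration with a hypothetical function name. Unlike the PHP above, it escapes the id before putting it into the pattern. Note that \d+ accepts any number of digits, not just one or two.

```python
import re


def item_path_ok(book_id, item_path):
    """Accept only paths of the form /<digits>/items/<book_id>."""
    pattern = r"/\d+/items/" + re.escape(book_id)
    return re.fullmatch(pattern, item_path, flags=re.ASCII) is not None


if __name__ == "__main__":
    print(item_path_ok("bookid", "/1/items/bookid"))
    print(item_path_ok("bookid", "/data/b/bookid"))
```

A layout such as /data/b/bookid fails this check, so you would need to adapt or remove it to use a directory scheme like that.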

7. Test the data node PHP scripts

You should be finished setting up the data node at this point.

You can test BookReaderImages.php by passing in four cgi parameters:

  • zip – path to image zip file
  • file – image inside zip file to decompress
  • scale – radix-2 reduction parameter
  • rotate – rotation angle in 90-degree increments

An example URL will look like:

http://ia600307.us.archive.org/BookReader/BookReaderImages.php?zip=/35/items/flatlandromanceo00abbouoft/flatlandromanceo00abbouoft_jp2.zip&file=flatlandromanceo00abbouoft_jp2/flatlandromanceo00abbouoft_0007.jp2&scale=4&rotate=0
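
If you want to generate test URLs like this for your own books, the four parameters can be assembled as in the Python sketch below. image_url is a hypothetical helper, and it assumes the archive.org naming conventions described earlier (a bookid_jp2.zip containing bookid_jp2/bookid_NNNN.jp2).

```python
from urllib.parse import urlencode


def image_url(server, item_dir, bookid, page, scale=1, rotate=0):
    """Build a BookReaderImages.php URL for one page of an IA-style item."""
    query = urlencode(
        {
            "zip": f"{item_dir}/{bookid}_jp2.zip",
            "file": f"{bookid}_jp2/{bookid}_{page:04d}.jp2",
            "scale": scale,
            "rotate": rotate,
        },
        safe="/",  # keep slashes readable, as in the example URL
    )
    return f"http://{server}/BookReader/BookReaderImages.php?{query}"


if __name__ == "__main__":
    print(image_url("ia600307.us.archive.org",
                    "/35/items/flatlandromanceo00abbouoft",
                    "flatlandromanceo00abbouoft", 7, scale=4))
```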

You can test BookReaderJSIA.php by passing four cgi parameters:

  • id – the bookid
  • itemPath – the path on disk to the item (not the web-accessible path)
  • server – the domain name of the datanode
  • subPrefix – this is usually the same as the id, if you follow archive.org naming conventions

An example URL will look like:

http://ia600307.us.archive.org/BookReader/BookReaderJSIA.php?id=flatlandromanceo00abbouoft&itemPath=/35/items/flatlandromanceo00abbouoft&server=ia600307.us.archive.org&subPrefix=flatlandromanceo00abbouoft

Read Aloud and full-text search have not yet been installed, but you can now use the datanode to serve book images!

If you have trouble with BookReaderImages.php, try running kdu_expand on the command line. The php script will set LD_LIBRARY_PATH, create a symlink called /tmp/stdout.bmp that points to /dev/stdout, and execute a command like:

unzip -p '/data/b/bookid/bookid_jp2.zip' 'bookid_jp2/bookid_0001.jp2' | /petabox/sw/bin/kakadu/kdu_expand -no_seek -quiet -reduce 2 -rotate 0 -i /dev/stdin -o /tmp/stdout.bmp | (bmptopnm 2>/dev/null) | pnmtojpeg -quality 75
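
When debugging several pages, it can help to generate this command instead of typing it. The Python sketch below only builds the command string, mirroring the pipeline above. decode_pipeline is a hypothetical helper, and the default kdu_expand path is simply the one from the example, so change it to your install path.

```python
import shlex


def decode_pipeline(zip_path, member, reduce=0, rotate=0,
                    kdu_expand="/petabox/sw/bin/kakadu/kdu_expand", quality=75):
    """Build the shell pipeline as a string, quoting the file arguments."""
    return (
        f"unzip -p {shlex.quote(zip_path)} {shlex.quote(member)}"
        f" | {shlex.quote(kdu_expand)} -no_seek -quiet -reduce {int(reduce)}"
        f" -rotate {int(rotate)} -i /dev/stdin -o /tmp/stdout.bmp"
        " | (bmptopnm 2>/dev/null)"
        f" | pnmtojpeg -quality {int(quality)}"
    )


if __name__ == "__main__":
    print(decode_pipeline("/data/b/bookid/bookid_jp2.zip",
                          "bookid_jp2/bookid_0001.jp2", reduce=2))
```

Before running the printed command, remember that LD_LIBRARY_PATH must be set and the /tmp/stdout.bmp symlink must exist, as described above. Redirect the output to a .jpg file to inspect the result.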

Set up the Webnode

1. Install a web server

If you are using a single server for both the webnode and the datanode, this step is already done.

2. Install the static BookReader webnode scripts

If your docroot is /var/www, create a directory called /var/www/bookreader (note the capitalization).

If you are using a single server for both webnode and datanode, you will now have two directories in /var/www called “BookReader” and “bookreader”. You will need a case-sensitive file system for this to work (HFS+ on OS X won’t work with this naming scheme).
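
If you are not sure whether your file system is case-sensitive, you can probe it. The Python sketch below is a hypothetical helper: it creates a temporary file with a lowercase name in the given directory and checks whether the uppercase spelling of that name resolves to the same file. It needs write permission in that directory.

```python
import os
import tempfile


def is_case_sensitive(directory):
    """Return True if names differing only by case are distinct in `directory`."""
    with tempfile.NamedTemporaryFile(prefix="casetest_", dir=directory) as probe:
        # Flip the case of the probe's file name and see whether it still resolves.
        head, name = os.path.split(probe.name)
        return not os.path.exists(os.path.join(head, name.upper()))


if __name__ == "__main__":
    print(is_case_sensitive("/var/www"))
```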

In /var/www/bookreader, install the javascript and css files from the main BookReader git directory.

In addition, you will have to install the following (links provided):

3. Create a launch script

The BookReader.inc draw() method writes the necessary HTML to render the bookreader. You can create a simple file called book.php that calls draw() and takes the bookid as a parameter:

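<?php
require_once('BookReader.inc');

// Assuming your book path is /data/b/bookid, where "b" is the first letter of the id.

// Reject a missing id, or one containing characters that could alter the path.
// (The allowed character set is an assumption; adjust it to your identifiers.)
if (!isset($_GET['id']) || !preg_match('/^[A-Za-z0-9][A-Za-z0-9._-]*$/D', $_GET['id'])) {
    header('HTTP/1.0 400 Bad Request');
    exit('Bad id');
}

$id = $_GET['id'];
$first_letter = $id[0];

BookReader::draw('cluster.biodiversitylibrary.org',
    '/data/'.$first_letter.'/'.$id,
    $id,
    '',
    'test title');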
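
The directory scheme assumed in this script shards books by the first character of the id. As a quick illustration of the mapping, here is the same rule in Python. The function name and the /data default root are hypothetical and simply mirror the example above.

```python
def item_path(book_id, data_root="/data"):
    """Return the on-disk item directory: <root>/<first character>/<id>."""
    if not book_id:
        raise ValueError("book_id must not be empty")
    return f"{data_root}/{book_id[0]}/{book_id}"


if __name__ == "__main__":
    print(item_path("journalofnatural11lond"))
```

This layout differs from the /XX/items/bookid structure used by archive.org, so it only works once the path check in BookReaderJSIA.php has been edited or removed as described in the data node setup.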

Be sure you have placed both BookReader.inc and book.php in the webserver’s docroot. You should now be able to launch the bookreader with a URL such as:

http://cluster.biodiversitylibrary.org/book.php?id=journalofnatural11lond

Thanks for using the Internet Archive BookReader! Happy Reading!