Jacob Vosmaer's blog

Adding a table of contents to a PDF

2024-01-02

In this post I talk about how I use pdftk-java to add tables of contents to PDF scans of old synthesizer manuals.

Synthesizer manual PDFs

Because I am a music technology enthusiast, I read manuals of (old) electronic music equipment. Often these manuals are PDF image scans that are not searchable and do not have a table of contents. This is annoying because I read them on an e-reader, where I cannot quickly flick through the pages. It is faster to open the table of contents on the e-reader and jump straight to the section I want to read.

Because I have run into this problem more than once, I have developed a workflow to add a table of contents to an existing PDF.

pdftk

The tool I use to modify the PDFs is called pdftk ("PDF toolkit"). The original pdftk is apparently deprecated but there is a Java rewrite available: pdftk-java. I can install this on my Mac by running port install pdftk.

Pdftk has a sub-command, dump_data, that extracts metadata from a PDF into a text file. You can also update the metadata of a PDF by reading such a text file back in with update_info. My solution is to dump the metadata, insert the table of contents into the text file, and apply the text file to the PDF again.

Generating table of contents metadata

The metadata format generated by pdftk is verbose. It looks something like this:


BookmarkBegin
BookmarkTitle: Important notes
BookmarkLevel: 1
BookmarkPageNumber: 7

And that is just one line in the table of contents. To generate this text I wrote a Ruby script.

Consider this table of contents from the Roland R-8M manual.

[Image: table of contents of the R-8M manual]

To convert this particular table into pdftk metadata, I use the following Ruby script.


# om-toc.rb

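# One table-of-contents entry: PDF page number, title text, nesting level.
Entry = Struct.new(:page, :title, :level)
DATA.each_line do |line|
  line.chomp!
  next if line.empty?

  entry = Entry.new
  # The first three characters are the printed page number. Base 10 is
  # explicit because Integer("008") would be rejected as invalid octal.
  page = line[0,3]
  entry.page = Integer(page, 10) + 2 # the PDF page is 2 higher in this manual
  entry.title = line[3,line.size]
  entry.level = 1

  # Each leading '=' nests the entry one level deeper.
  while entry.title[0] == '='
    entry.level += 1
    entry.title = entry.title[1, entry.title.size]
  end

  # Emit one bookmark record in pdftk's metadata format.
  puts <<~EOS
    BookmarkBegin
    BookmarkTitle: #{entry.title}
    BookmarkLevel: #{entry.level}
    BookmarkPageNumber: #{entry.page}
  EOS
end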

__END__
005Important notes
006Front and rear panel
007Try out the R-8M
008=CONNECTIONS
009=HOW TO PLAY THE SOUNDS
009==Turn the Power on
009==Listen to the ROM Play Demonstration
010==How to Play Each Instrument
010===Play the internal instruments
011===Play the instruments of a sound ROM card
014===Selecting the Feel Patches
015Before You Modify the Settings
(etc.)

Note how this script combines code and data. Everything after __END__ is no longer Ruby code. Ruby makes it available as a File object named DATA, which the script reads with DATA.each_line. I do this because I often need to tweak the code to fit the shape of the table of contents.

The data at the bottom of the script is organized with two goals:

  1. Easy to parse
  2. Easy to (touch) type while looking at the original table of contents

The first three characters are the page number as listed in the real table. Then there are zero or more = symbols to indicate chapter / section / subsection nesting. Everything else on the line is title text.
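
As an aside, the line format described above can also be captured in a single regular expression. This is only a sketch of an alternative to the string slicing in om-toc.rb, not what the script does; the names TOC_LINE and parse_toc_line are made up for this example.

```ruby
# Sketch: parse one DATA line with a regular expression instead of slicing.
# TOC_LINE and parse_toc_line are hypothetical names, not part of om-toc.rb.
#   group 1: three-digit printed page number
#   group 2: zero or more '=' nesting markers
#   group 3: title text
TOC_LINE = /\A(\d{3})(=*)(.*)\z/

# Returns [printed_page, level, title], or nil when the line does not match.
def parse_toc_line(line)
  m = TOC_LINE.match(line.chomp)
  return nil unless m

  [Integer(m[1], 10), m[2].size + 1, m[3]]
end

p parse_toc_line("005Important notes")     # [5, 1, "Important notes"]
p parse_toc_line("009==Turn the Power on") # [9, 3, "Turn the Power on"]
```

A nil return makes it easy to spot typos in the hand-typed data, such as a page number with only two digits.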

The page numbers in the table often don't match the actual PDF pages. To ease data entry, the DATA section uses the numbers as printed in the manual. The script adds 2 to each number, which happens to be the offset here. Another problem I sometimes see is a table of contents that numbers pages as "6-2", meaning "chapter 6, page 2". I then use the Ruby code to automatically convert that to e.g. page 18 in the PDF.
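
A minimal sketch of such a chapter-relative conversion, assuming you look up by hand the PDF page on which each chapter starts. The CHAPTER_START table and the pdf_page helper are hypothetical, and the page numbers in the table are invented so that "6-2" lands on page 18 as in the example above.

```ruby
# Hypothetical lookup table: the PDF page on which each chapter begins.
# These values are invented for illustration; fill them in per manual.
CHAPTER_START = { 5 => 11, 6 => 17 }.freeze

# Convert a printed label such as "6-2" (chapter 6, page 2) to a PDF page.
# Page 1 of a chapter is the chapter's first PDF page, hence the "- 1".
def pdf_page(label, chapter_start = CHAPTER_START)
  chapter, page = label.split("-", 2).map { |s| Integer(s, 10) }
  chapter_start.fetch(chapter) + page - 1
end

puts pdf_page("6-2") # 18
```

Using fetch means an unknown chapter number raises a KeyError instead of silently producing a wrong bookmark.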

The output of this particular script looks like this:


BookmarkBegin
BookmarkTitle: Important notes
BookmarkLevel: 1
BookmarkPageNumber: 7
BookmarkBegin
BookmarkTitle: Front and rear panel
BookmarkLevel: 1
BookmarkPageNumber: 8
BookmarkBegin
BookmarkTitle: Try out the R-8M
(etc.)

To inject this data into the PDF I use the following shell script, which calls pdftk twice.


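#!/bin/sh
set -e

if [ $# != 2 ]; then
  # Double quotes so that $0 expands to the script name.
  echo "Usage: $0 ORIG_PDF NEW_WITH_TOC_PDF" >&2
  exit 1
fi

tmp=$(mktemp)
tmp2=${tmp}.2
trap 'rm -f "$tmp" "$tmp2"' EXIT

# Dump the existing metadata to a text file.
pdftk "$1" dump_data output "$tmp"
# Copy the metadata and insert the bookmarks after the NumberOfPages line.
awk '
  { print }
  /NumberOfPages/ { system("ruby om-toc.rb") }
' "$tmp" > "$tmp2"
# Write a new PDF with the updated metadata.
pdftk "$1" update_info "$tmp2" output "$2"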
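
The awk step in the middle copies the metadata line by line and inserts the bookmark records right after the NumberOfPages line. For readers who prefer to stay in Ruby, here is a sketch of the same splice as a plain function. The name insert_bookmarks is hypothetical, and the sample metadata is a made-up fragment for illustration, not real pdftk output.

```ruby
# Sketch of the awk step: copy every metadata line and insert the bookmark
# records directly after the line that mentions NumberOfPages.
# insert_bookmarks is a hypothetical helper, not part of the scripts above.
def insert_bookmarks(metadata, bookmarks)
  out = +""
  metadata.each_line(chomp: true) do |line|
    out << line << "\n"
    out << bookmarks if line.include?("NumberOfPages")
  end
  out
end

# Made-up sample input; real dump_data output contains more fields.
metadata = "NumberOfPages: 120\nPageMediaBegin\n"
bookmarks = <<~EOS
  BookmarkBegin
  BookmarkTitle: Important notes
  BookmarkLevel: 1
  BookmarkPageNumber: 7
EOS

puts insert_bookmarks(metadata, bookmarks)
```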

The end result is a new PDF that has a table of contents!

[Image: table of contents after running the script]

Conclusion

This technique works. It involves some data entry, which is not fun, but it automates the rest and having a table of contents is nice. For me it's worth the data entry.
