data2check documentation – The EpubCheck

1. The EpubCheck

EpubCheck is an open source tool for validating EPUB documents. It checks whether a file complies with the general EPUB standards and is therefore a valid EPUB that should work without errors on all devices. Among other things, it checks the OCF container structure, the OPF and OPS markup, and the consistency of internal references. EpubCheck was developed by the IDPF (International Digital Publishing Forum) and is freely available at https://github.com/IDPF/epubcheck.

The result of this check is an XML check report. Further information on the check result can be found under 2.2 The output.

1.1 How EpubCheck processes an EPUB document

(The source of the following information is the EpubCheck wiki.)

When EpubCheck validates a document, it uses a set of "checkers", each of which examines a particular portion of the EPUB file. As the tool examines the file, it uses an OCFChecker to validate the OCF structure, an OPFChecker to validate the OPF file, and so on.

The following sections describe what each of the checkers does. The purpose of this document is to outline what the tool does, not necessarily how it does it, so many details of how the checkers work are glossed over.

1.1.1 Examining the file

The first thing validated is the ZIP file. EpubCheck ensures that the file has a "ZIP header" section at the beginning. It also checks that the mimetype file is at the proper location and has the appropriate content. Technically, it does this by reading from byte 30 in the file, looking for "mimetype", and then from byte 38, looking for application/epub+zip.

After these checks the ZIP file is loaded as a ZIP package, which will fail if the ZIP file is corrupt, has bad info in the header, or is otherwise incomplete.
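The byte-level checks described above can be sketched in Python as follows. This is a minimal illustration, not EpubCheck's actual implementation: the function names check_epub_header and build_minimal_epub are hypothetical, and the offsets rely on the 30-byte ZIP local file header being followed directly by the file name and then the uncompressed content.

```python
import io
import zipfile

ZIP_LOCAL_HEADER = b"PK\x03\x04"          # signature of a ZIP local file header
MIMETYPE_NAME = b"mimetype"               # expected at bytes 30..37
MIMETYPE_VALUE = b"application/epub+zip"  # expected at bytes 38..57


def check_epub_header(data):
    """Return a list of problems found in the raw bytes of an EPUB file."""
    problems = []
    if data[:4] != ZIP_LOCAL_HEADER:
        problems.append("missing ZIP header")
    if data[30:38] != MIMETYPE_NAME:
        problems.append("mimetype entry is not at the expected location")
    if data[38:58] != MIMETYPE_VALUE:
        problems.append("mimetype content is not application/epub+zip")
    # Loading the archive fails if it is corrupt or incomplete.
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            archive.testzip()
    except zipfile.BadZipFile:
        problems.append("file cannot be opened as a ZIP package")
    return problems


def build_minimal_epub():
    """Build an in-memory archive whose first entry is an uncompressed mimetype file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
    return buffer.getvalue()


if __name__ == "__main__":
    print(check_epub_header(build_minimal_epub()))       # no problems expected
    print(check_epub_header(build_minimal_epub()[:60]))  # truncated archive
```

The mimetype entry is written uncompressed and first, because otherwise its name and content would not appear as plain bytes at offsets 30 and 38.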

1.1.2 Parsing and validating the files in the package

Most of the files in an EPUB document are XML files. Each XML file is checked to make sure it is well-formed, and that it validates.

For each of the XML-based file types (OPF, NCX, XHTML, DTBook, SVG), the tool has one or more schema files that define the structure of the file, and it validates the files against those schemas. In addition, the tool runs a set of checks for things that are not related to the validity of the individual XML files but are required for a valid EPUB document. For example, if an XHTML file references an image, the tool ensures that the image really exists in the ZIP file and is listed in the manifest.
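The two kinds of checks can be illustrated with a short Python sketch. It covers only well-formedness and the image cross-reference; schema validation is omitted because the Python standard library has no schema validator. The function names and the sample paths are hypothetical, and the sketch assumes that all image references are relative paths inside the package.

```python
import posixpath
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def is_well_formed(xml_text):
    """Return True if the text parses as XML (well-formedness only, no schema)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def unresolved_images(xhtml_text, xhtml_path, package_files, manifest_files):
    """List images referenced by an XHTML file that are missing from the
    package or from the manifest. All paths are relative to the ZIP root."""
    root = ET.fromstring(xhtml_text)
    base = posixpath.dirname(xhtml_path)
    missing = []
    for img in root.iter(XHTML_NS + "img"):
        src = img.get("src")
        if src is None:
            continue
        # Resolve the reference relative to the folder of the XHTML file.
        target = posixpath.normpath(posixpath.join(base, src))
        if target not in package_files or target not in manifest_files:
            missing.append(target)
    return missing


SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <img src="../images/cover.png" alt="Cover"/>
    <img src="logo.png" alt="Logo"/>
  </body>
</html>"""

if __name__ == "__main__":
    files = {"OEBPS/images/cover.png", "OEBPS/text/ch1.xhtml"}
    print(is_well_formed(SAMPLE_PAGE))
    print(unresolved_images(SAMPLE_PAGE, "OEBPS/text/ch1.xhtml", files, files))
```

In this sample, logo.png is reported because it resolves to OEBPS/text/logo.png, which is neither in the package nor in the manifest.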

1.1.3 Checking the OCF related content

The encryption.xml, container.xml and signatures.xml files, if they exist, are checked against the respective schemas. The OCFChecker also retrieves the OPF file.
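How the OPF file can be retrieved from container.xml is sketched below. This is an assumption-laden illustration rather than the OCFChecker's code: the function name find_opf_path and the sample path OEBPS/content.opf are hypothetical, and only the first rootfile entry is considered.

```python
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in the OCF specification.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

SAMPLE_CONTAINER = """<container version="1.0"
    xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""


def find_opf_path(container_xml):
    """Return the full-path of the first rootfile, or None if there is none."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        return None
    return rootfile.get("full-path")


if __name__ == "__main__":
    print(find_opf_path(SAMPLE_CONTAINER))
```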

1.1.4 The OPF file

  • Validates the OPF against the schema.
  • Checks the unique-identifier to ensure that it references an actual id in the OPF file.
  • Checks the existence of the NCX file.
  • Checks each item in the manifest:
    • that it exists in the package.
    • for invalid content in the media-type attribute.
    • for text/html, which is not appropriate for EPUBs.
    • for deprecated media-types in OPS documents.
    • for newer media-types in OEBPS 1.2 documents.
    • for fallbacks for unknown media-types.
  • Opens each item and runs the appropriate checker (OPSChecker for XHTML, DTBookChecker for DTBook, BitmapChecker for images, etc.).
  • Checks each item in the spine element:
    • it must be valid for the spine.
    • it must be a preferred type, or have a provided fallback.
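A few of the OPF checks listed above can be sketched in Python. This is a simplified illustration under stated assumptions, not the OPFChecker itself: the function name check_opf is hypothetical, the set of preferred spine types is an assumption made for this example, and the checks for the NCX file, deprecated media types and schema validity are left out.

```python
import posixpath
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
# Assumption for this sketch: media types accepted in the spine without a fallback.
PREFERRED_SPINE_TYPES = {"application/xhtml+xml", "application/x-dtbook+xml"}

SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0"
    unique-identifier="BookId">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="BookId">urn:uuid:hypothetical-example</dc:identifier>
  </metadata>
  <manifest>
    <item id="ch1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
    <item id="old" href="old.html" media-type="text/html"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="old"/>
  </spine>
</package>"""


def check_opf(opf_text, opf_path, package_files):
    """Return a list of problems found in an OPF file (simplified)."""
    problems = []
    root = ET.fromstring(opf_text)

    # The unique-identifier must reference an id that exists in the OPF file.
    uid = root.get("unique-identifier")
    ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    if uid not in ids:
        problems.append(f"unique-identifier {uid!r} does not match any id")

    # Each manifest item must exist in the package and must not be text/html.
    base = posixpath.dirname(opf_path)
    items = {}
    for item in root.findall("opf:manifest/opf:item", OPF_NS):
        items[item.get("id")] = item
        href = item.get("href", "")
        target = posixpath.normpath(posixpath.join(base, href))
        if target not in package_files:
            problems.append(f"manifest item {href!r} is not in the package")
        if item.get("media-type") == "text/html":
            problems.append(f"manifest item {href!r} uses text/html")

    # Each spine item must be a preferred type or have a fallback.
    for itemref in root.findall("opf:spine/opf:itemref", OPF_NS):
        idref = itemref.get("idref")
        item = items.get(idref)
        if item is None:
            problems.append(f"spine item {idref!r} is not in the manifest")
        elif (item.get("media-type") not in PREFERRED_SPINE_TYPES
              and item.get("fallback") is None):
            problems.append(f"spine item {idref!r} needs a fallback")
    return problems


if __name__ == "__main__":
    package = {"OEBPS/content.opf", "OEBPS/ch1.xhtml"}
    for problem in check_opf(SAMPLE_OPF, "OEBPS/content.opf", package):
        print(problem)
```

For the sample, the sketch reports the item old.html three times: it is missing from the package, it uses text/html, and it appears in the spine without a fallback.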

1.1.5 Further file formats

XHTML files
  • Validates the XHTML file against the schema files.
  • Checks that each referenced image exists in the package.

DTBook files
  • Validates the DTBook file against the schema.

Bitmaps
  • Validates the image header and image type.

NCX file
  • Validates the NCX against the schema.
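The bitmap check mentioned above compares the image header with the image type. A minimal Python sketch of that idea follows. It assumes only PNG, JPEG and GIF and recognizes them by their leading signature bytes; the function names are hypothetical, and the real BitmapChecker may inspect more of the header.

```python
# Leading bytes ("magic numbers") that identify common bitmap formats.
IMAGE_SIGNATURES = {
    "image/png": (b"\x89PNG\r\n\x1a\n",),
    "image/jpeg": (b"\xff\xd8\xff",),
    "image/gif": (b"GIF87a", b"GIF89a"),
}


def sniff_image_type(data):
    """Return the media type suggested by the image header, or None."""
    for media_type, signatures in IMAGE_SIGNATURES.items():
        if data.startswith(signatures):
            return media_type
    return None


def image_matches_declared_type(data, declared_media_type):
    """Return True if the header agrees with the media type in the manifest."""
    return sniff_image_type(data) == declared_media_type


if __name__ == "__main__":
    png_header = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
    print(sniff_image_type(png_header))
    print(image_matches_declared_type(png_header, "image/jpeg"))
```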

2. Checking a document

Under the menu option »Documents«, you can check Word and InDesign documents with the help of a previously created configuration. You can also check your EPUBs there with the help of the EpubCheck.

2.1 The checking process

Clicking the »Choose file« button (see figure 1) opens your file manager, where you can select the EPUB document to be validated (see figure 2).

NOTICE: All files to be uploaded must be in an XML-compatible format. For an EPUB, this means: please upload only files with the extension .epub!

Figure 1: Upload of an EPUB document - Step 1: Clicking »Choose file« to open the file manager.

Figure 2: Upload of an EPUB document - Step 2: Choosing an .epub file.

After you select an EPUB document by double-clicking it, the configuration type EPUB - EpubCheck appears in the dropdown list next to »Choose configuration«. Please select this type (see figure 3).

Figure 3: Available configuration type after selecting an EPUB document.

After selecting a document to be validated and the configuration type EPUB - EpubCheck, please click the green »Upload file and start a check« button to start the checking process. You will then see a clock symbol that displays the progress of the check. This process may take a few seconds (see figure 4).

Figure 4: The validation of the EPUB is in progress.

Once the document you selected has been checked, the resulting XML file appears on the right-hand side (see figure 5).

Figure 5: The EpubCheck is completed.

The output document is described in detail in the following section.

2.2 The output

Irrespective of whether a check was "successful" (no errors found) or "not successful" (errors found), one output document is always displayed in the form of a link:

epubcheck-report.xml: Following this link downloads an XML check report (see figure 6).

Figure 6: The XML error report of the EpubCheck.

This XML file contains, among other things, the metadata of the respective publication, for example information on the copyright, the fonts used and references. At the beginning of the file you will find a messages element that lists all the errors found by the EpubCheck, each in its own message element (see figure 7).

Figure 7: Excerpt from the epubcheck-report.xml.

There is one error message in our example: the cover file could not be found.

Frequent error messages generated by the EpubCheck, together with explanations, can be found in the EpubCheck wiki.
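If you want to read the error list from a downloaded epubcheck-report.xml with a script, the following Python sketch shows one possible approach. It relies only on the messages and message element names described above. The function names, the root element and the message text in the sample are hypothetical, and namespaces are ignored so that the sketch does not depend on the exact report schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for a real report file.
SAMPLE_REPORT = """<report>
  <messages>
    <message>Hypothetical error: cover file could not be found</message>
  </messages>
</report>"""


def local_name(tag):
    """Strip a namespace of the form {uri} from an element tag."""
    return tag.rsplit("}", 1)[-1]


def list_messages(report_xml):
    """Return the text of every message element inside a messages element."""
    root = ET.fromstring(report_xml)
    found = []
    for element in root.iter():
        if local_name(element.tag) == "messages":
            for child in element:
                if local_name(child.tag) == "message":
                    found.append((child.text or "").strip())
    return found


if __name__ == "__main__":
    for message in list_messages(SAMPLE_REPORT):
        print(message)
```

An empty list means that the report contains no message elements, which corresponds to a check without errors.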

3. The history of checked documents

The History of checked documents under the menu option »Documents« (left-hand side) lists all checks already carried out in chronological order and gives you access to the individual check results. The History of checked documents thus serves as your personal database.

This table lists all previous checks, including the time of the check (»Date of check«), the name of the checked file (»Test file«), the result of the check and the configuration used for the individual check (see figure 8). The result is shown as an icon: a green check icon for "successful check and no errors found", an orange pin for "successful check but errors found" and a red flash for "check has failed, e.g. due to a system error".

Figure 8: The History of checked documents.

When you click one of the documents linked under »Test file«, the corresponding output document appears on the right-hand side and can be viewed as described under 2.2 The output (see figure 6).

Your History of checked documents can also be found on the data2check start page under the menu option »Home« (see under 4.1 Home in the general part of this help).

Copyright © 2018 data2check, all rights reserved
