Optical character recognition (OCR) from data fields in digital images

Extracts information from the data fields embedded in camera trap images (not from the image metadata). Many camera traps stamp data fields onto their images, usually the date and time of the image and sometimes other information. This function extracts the contents of these fields using optical character recognition (OCR) provided by the package tesseract, after reading the images with the package magick.

OCRdataFields(inDir, geometries, invert = FALSE)

Arguments

inDir: character. Directory containing camera trap images (or subdirectories containing images)
geometries: list. A (possibly named) list of geometry strings defining the image area(s) to extract.
invert: logical. Invert colors in the image? Set to TRUE if the text in the data field is white on a black background. Leave as FALSE if the text is black on a white background.

Value

A data.frame with original directory and file names, and additional columns for the OCR data of each extracted geometry.

Details

Normally all of this information should be available in the image metadata. This function is meant as a last resort if image metadata are unreadable or were removed from the images. OCR is not perfect and may misidentify characters, so check the output carefully.

The output of this function can be used in writeDateTimeOriginal to write date/time into the DateTimeOriginal tag in image metadata, making these images available for automatic processing with recordTable and other functions that extract image metadata.

This function reads all images in inDir (including subdirectories), crops them to the geometries in the "geometries" list, and performs optical character recognition (OCR) on each of these fields (leveraging the magick and tesseract packages).
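
The read, crop, invert and OCR steps described above can be sketched roughly as follows. This is an illustration only and not the package's actual implementation: the helper names ocr_field and ocr_dir_sketch, the output column names Directory and FileName, the fallback column names for unnamed geometries and the restriction to JPEG files are all assumptions made for this sketch. It relies on image_read, image_crop, image_negate and image_ocr from magick (image_ocr in turn requires tesseract).

```r
# Hypothetical helper (not part of camtrapR): OCR one data field of one image.
ocr_field <- function(path, geometry, invert = FALSE) {
  if (!requireNamespace("magick", quietly = TRUE) ||
      !requireNamespace("tesseract", quietly = TRUE)) {
    stop("packages 'magick' and 'tesseract' are required")
  }
  img   <- magick::image_read(path)
  field <- magick::image_crop(img, geometry)        # keep only the data field
  if (invert) field <- magick::image_negate(field)  # white-on-black -> black-on-white
  trimws(magick::image_ocr(field))
}

# Hypothetical sketch: apply every geometry to every JPEG found under inDir.
ocr_dir_sketch <- function(inDir, geometries, invert = FALSE) {
  files <- list.files(inDir, pattern = "\\.jpe?g$", ignore.case = TRUE,
                      recursive = TRUE, full.names = TRUE)
  out <- data.frame(Directory = dirname(files), FileName = basename(files),
                    stringsAsFactors = FALSE)
  # Assumed fallback: unnamed geometries get generic column names.
  field_names <- names(geometries)
  if (is.null(field_names)) {
    field_names <- paste0("geometry", seq_along(geometries))
  }
  for (i in seq_along(geometries)) {
    out[[field_names[i]]] <- vapply(files, ocr_field, character(1),
                                    geometry = geometries[[i]],
                                    invert = invert, USE.NAMES = FALSE)
  }
  out
}
```

Each geometry becomes one column of the result, which mirrors the structure described under Value. The real function may differ in details such as column naming and supported file types.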

Geometries are defined with geometry_area from magick. See geometry for details on how to specify geometries with geometry_area. The format is: "widthxheight+x_off+y_off", where:

width: width of the area of interest
height: height of the area of interest
x_off: offset from the left side of the image
y_off: offset from the top of the image

Units are pixels for all fields. digiKam can help in identifying the correct specification for geometries. Open the Image Editor, then left-click and drag to draw a box around the data field of interest. Make sure the box contains the entire text field and nothing else. Then note the two pairs of numbers at the bottom of the window, which show the offsets and the box size, for example:

"(400, 1800) (300 x 60)"

This corresponds to the geometry values as follows:

"(x_off, y_off) (width x height)"

Using these values, you'd run:

geometry_area(x_off = 400, y_off = 1800, width = 300, height = 60)

and receive

"300x60+400+1800"

as your geometry.
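
If many fields need to be defined, the digiKam readout can also be converted to a geometry string directly in base R. The helper below is a minimal sketch; digikam_to_geometry is a hypothetical name and not part of camtrapR or magick, and it assumes the readout has exactly the "(x_off, y_off) (width x height)" layout shown above.

```r
# Hypothetical helper: convert digiKam's "(x_off, y_off) (width x height)"
# readout into a "widthxheight+x_off+y_off" geometry string.
digikam_to_geometry <- function(selection) {
  # Pull out all runs of digits, in order of appearance.
  nums <- as.integer(regmatches(selection, gregexpr("[0-9]+", selection))[[1]])
  if (length(nums) != 4L) {
    stop("expected four integers: (x_off, y_off) (width x height)")
  }
  # digiKam order is x_off, y_off, width, height; geometry order differs.
  sprintf("%dx%d+%d+%d", nums[3], nums[4], nums[1], nums[2])
}

digikam_to_geometry("(400, 1800) (300 x 60)")
# returns "300x60+400+1800"
```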

OCR in tesseract has problems with white text on a black background. If that is the case in your images, set invert to TRUE to invert the image colors so that OCR is run on black text on a white background.

Even then, output will not be perfect. Error rates in OCR depend on multiple factors, including the text size and font type used. We don't have control over these, so check the output carefully and edit as required.
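
One way to find rows that need manual correction is to try parsing the extracted date and time and to flag everything that fails. The sketch below uses made-up data: the column names date and time, the example values and the date/time format are assumptions that depend on your geometries and camera model.

```r
# Made-up OCR output for illustration; column names are placeholders.
ocr_out <- data.frame(
  FileName = c("IMG0001.JPG", "IMG0002.JPG", "IMG0003.JPG"),
  date     = c("2019-03-14", "2O19-03-15", "2019-13-01"),  # letter O; month 13
  time     = c("06:12:45", "18:30:02", "07:00:00"),
  stringsAsFactors = FALSE
)

# Strings that do not match the expected format parse to NA.
parsed <- as.POSIXct(paste(ocr_out$date, ocr_out$time),
                     format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Flag rows that should be checked against the original image.
ocr_out$needs_review <- is.na(parsed)
ocr_out[ocr_out$needs_review, ]
```

A successful parse does not prove the value is correct (a misread digit can still yield a valid date), so this check complements visual inspection rather than replacing it.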

Author

Juergen Niedballa

Examples
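
A minimal usage sketch, using the arguments documented above. The image directory is a placeholder and the pixel values in the geometries are illustrative only; replace both with values that match your own images. The call is skipped unless camtrapR is installed and the directory exists.

```r
# Placeholder directory; replace with a folder of your own images.
image_dir <- "path/to/camera_trap_images"

# Named list of geometry strings ("widthxheight+x_off+y_off", in pixels).
# The pixel values are illustrative, not taken from real images.
geometries <- list(date = "300x60+400+1800",
                   time = "300x60+720+1800")

if (requireNamespace("camtrapR", quietly = TRUE) && dir.exists(image_dir)) {
  # invert = TRUE assumes white text on a black background.
  ocr_data <- camtrapR::OCRdataFields(inDir      = image_dir,
                                      geometries = geometries,
                                      invert     = TRUE)
  head(ocr_data)
}
```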