<!-- markdownlint-disable-file MD033 -->

Multilingual OCR for Swift: using C/C++ libs with iOS (and macOS)

Welcome to our project, Tesseract OCR in your Xcode Project.

We started this project because we really wanted Japanese OCR in an iOS app. Apple's Vision framework doesn't support Japanese, so we picked the open-source Tesseract OCR engine, and... we promptly ran into issues building C/C++ libs for iOS (and macOS). We worked through it all, though, and are happy to say "we did it!". And it doesn't have to be hard; it can be easy...

...like, this easy:

  1. git clone or download this repo
  2. cd to the repo
  3. run ./Scripts/Build_All.sh
    1. watch just the right amount of progress messages as you wait for a successful build...
  4. run ./Scripts/test_tesseract.sh to download the trained language data and test the build
  5. open iOCR/iOS/iOCR-iOS.xcodeproj
  6. run the iOCR target on an iPad Pro 12.9-inch Simulator and see that Chinese, English, and Japanese text are recognized in the app

We always wanted you to be able to do something like this too, so we made this: Our Guide to Building and Integrating C/C++ libs in Xcode.

This Guide

This Guide covers how we made this project/repo (the "build environment"), how to target builds for different frameworks and processor architectures, and finally how to import the libs into Xcode and use their C APIs with Swift.

If you want to know more, read on to...

The project environment

When you clone this repo (or download the ZIP and expand it), you'll be looking at a directory/folder we refer to as the TBE_PROJECTDIR. Navigate to your TBE_PROJECTDIR, and you'll see the initial state of the repo, which looks mostly like this (only showing the important stuff, for now):

iOCR:
README.md       iOCR.xcconfig    iOS    macOS

Root:
README.md       include

Scripts:
Build_All.sh    README.md        gnu-tools      set_env.sh      test_tesseract.sh      xcode-libs

Sources:
config_sub

The build scripts will also create new directories—Downloads, Logs, Sources—that will be populated with artifacts of the build process.

Let's move on to what we're building, and see how it all comes together.

Build from source

The top-level libraries needed to perform OCR are, in hierarchical order:

  1. tesseract, the OCR engine itself
  2. leptonica, the image-processing library that Tesseract is built on
  3. libjpeg, libpng, and libtiff, the image-format libraries that Leptonica uses

We've also configured Leptonica and Tesseract to build with zlib for image compression, which is specified with the -lz flag in their config-make files. Less apparent is that the build products also depend on the C++ standard library. We need to account for these two dependencies later, when we set up the Xcode project to use our build products; I cover this in the Xcode configuration walkthrough.

There is additional tooling to support the process of building the top-level libs, packages like autoconf and automake from GNU.

The final arrangement of the packages we settled on looks like:

  1. autoconf
  2. automake
  3. pkgconfig
  4. libtool
  5. libjpeg
  6. libpng
  7. libtiff
  8. leptonica
  9. tesseract

Starting the build

For each of the packages above, the build process (roughly sketched below):

  1. downloads the package's TGZ/ZIP to Downloads
  2. extracts that TGZ/ZIP to Sources
  3. configures and makes that source, then installs those build products into Root
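
With illustrative names (PACKAGE and the download URL here are placeholders) and plain autotools commands, one pass of those three steps looks roughly like this; the real, per-architecture details live in the build scripts we'll look at next:

 % curl -L -o Downloads/PACKAGE.tar.gz <package URL>
 % tar -xzf Downloads/PACKAGE.tar.gz -C Sources
 % cd Sources/PACKAGE
 % ./configure --host=arm64-apple-ios14.3 --prefix="$ROOT/ios_arm64"
 % make && make install

Some packages also need the "Preconfiguring" step you'll see in the build output before ./configure can run.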

The Scripts/build directory contains all the shell scripts to execute those three steps. Looking in there:

 % ls Scripts/build
build_all.sh*
...
build_leptonica.sh*
build_tesseract.sh*  
...
config-make-install_leptonica.sh
config-make-install_tesseract.sh
...
set_env.sh
utility.sh

Any of the build_PACKAGE-NAME.sh scripts can be run by itself. The top-level libraries also have a config-make-install helper script that covers the details of building and installing for multiple architectures and platforms, which we'll cover after we see the completed build.
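
For example, to (re)build just Leptonica for every architecture and platform:

 % ./Scripts/build/build_leptonica.sh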

build_all.sh is the build chain; running this one script will produce all the files that we will need for Xcode:

 % ./Scripts/build/build_all.sh

...
(some time later)
...

======== tesseract-4.1.1 ========
Downloading... done.
Extracting... done.
Preconfiguring... done.
--**!!**-- Overriding $SOURCES/tesseract-4.1.1/config/config.sub
ios_arm64: configuring... done, making... done, installing... done.
ios_arm64_sim: configuring... done, making... done, installing... done.
ios_x86_64_sim: configuring... done, making... done, installing... done.
macos_x86_64: configuring... done, making... done, installing... done.
macos_arm64: configuring... done, making... done, installing... done.
lipo: ios... done.
lipo: sim... done.
lipo: macos... done.
tesseract command-line: copying... sym-linking to arm64 binary... done.

After a while, we see that Tesseract was finally configured, made, and installed. And then there was a final lipo step.

The builds are targeted for two different processor architectures, arm64 and x86_64, across three different platform configurations: ios, macos, and ios_sim (the iOS Simulator). This results in the following five files for every library, each needed for a different use case:

| lib name | use |
| --- | --- |
| Root/ios_arm64/lib/libtesseract.a | running on an iOS device |
| Root/ios_arm64_sim/lib/libtesseract.a | running in the iOS Simulator, on an M1 Mac |
| Root/ios_x86_64_sim/lib/libtesseract.a | running in the iOS Simulator, on an Intel Mac |
| Root/macos_arm64/lib/libtesseract.a | running on an M1 Mac (AppKit) |
| Root/macos_x86_64/lib/libtesseract.a | running on an Intel Mac (AppKit) |

We use the lipo tool to stitch the per-platform files for the two different architectures (arm64 and x86_64) together into a single lib that we can plug into Xcode; the iOS-device lib keeps just its single arm64 slice. This finally leaves us with a set of three binary files for each library, installed to the common location Root/lib:

| lipo these architecture_platform libs | into this final lib |
| --- | --- |
| Root/ios_arm64/lib/libtesseract.a | Root/lib/libtesseract-ios.a |
| Root/ios_arm64_sim/lib/libtesseract.a <br/> Root/ios_x86_64_sim/lib/libtesseract.a | Root/lib/libtesseract-sim.a |
| Root/macos_x86_64/lib/libtesseract.a <br/> Root/macos_arm64/lib/libtesseract.a | Root/lib/libtesseract-macos.a |
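
Under the hood, each of those lipo steps boils down to something like this (shown for the simulator lib; the exact invocations live in the build scripts):

 % lipo -create Root/ios_arm64_sim/lib/libtesseract.a \
                Root/ios_x86_64_sim/lib/libtesseract.a \
        -output Root/lib/libtesseract-sim.a
 % lipo -info Root/lib/libtesseract-sim.a

lipo -info is a quick check that the output file really does contain both the arm64 and x86_64 slices.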

Now that Tesseract is built and installed, we can test it out and see some payoff for all this hard work.

Test Tesseract

To get a very quick and basic validation of our hard work, we'll ignore those installed libs for a moment and focus on a command-line (CL) tesseract program that was also built and installed as a part of our process.

For the CL (and lib-based Xcode) Tesseract to work, we need to get the trained data for the languages we want recognized. We'll get Traditional Chinese (vertical), Japanese (horizontal and vertical), and English.

The data is downloaded to Root/share/tessdata. For this test, the data is made known to the CL tesseract program by exporting an environment variable, export TESSDATA_PREFIX=$ROOT/share/tessdata, in the test script.

Run Scripts/test_tesseract.sh to download the trained data and run a quick OCR test on these sample images:

<table> <tr> <td> <img height="300" src="iOCR/Assets.xcassets/chinese_traditional_vert.imageset/cropped.png"/> </td> <td> <img width="300" src="iOCR/Assets.xcassets/japanese.imageset/test_hello_hori.png "/> </td> <td> <img height="300" src="iOCR/Assets.xcassets/japanese_vert.imageset/test_hello_vert.png"/> </td> <td> <img height="300" src="iOCR/Assets.xcassets/english_left_just_square.imageset/hexdreams.png"/> </td> </tr> <tr> <td>Chinese (trad, vert)</td> <td>Japanese</td> <td>Japanese (vert)</td> <td>English</td></tr> </table>
% ./Scripts/test_tesseract.sh
# Checking for trained data language files...
downloading chi_tra.traineddata...done
downloading chi_tra_vert.traineddata...done
downloading eng.traineddata...done
downloading jpn.traineddata...done
downloading jpn_vert.traineddata...done
# Recognizing sample images...
testing Japanese...passed
testing Japanese (vert)...passed
testing Chinese (trad, vert)...passed
testing English...passed
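
With the trained data downloaded, you can also poke at the CL program by hand. A sketch, assuming ROOT points at the repo's Root directory and the tesseract binary that the build copied/sym-linked (see the last line of the build output) is on your PATH:

 % export TESSDATA_PREFIX="$ROOT/share/tessdata"
 % tesseract iOCR/Assets.xcassets/japanese_vert.imageset/test_hello_vert.png stdout -l jpn_vert --dpi 144

The recognized text is printed to stdout; for this image it should be roughly the "Hello / ,世界" we'll see again from the Swift API in Xcode.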

And with that little test completed, we can get into Xcode.

Write an app

The main API for Tesseract is in C++, but Swift can't call C++ directly. Swift does support C APIs, though, and Tesseract also has a C API, so we'll use that.

If you're not familiar with the Tesseract API, here are the basics, with illustrative code samples.

Tesseract API basics, in Swift

The following Swift excerpts were taken from testGuideExample() in iOCR/iOCRTests/iOCRTests.swift. We'll also ignore the destroy/teardown code.

Initialize API handler

Create an API handler and initialize it with the trained data's parent folder, the trained data's language name, and an OCR engine mode (OEM):

let tessAPI = TessBaseAPICreate()!
TessBaseAPIInit2(tessAPI, trainedDataFolder, "jpn_vert", OEM_LSTM_ONLY)

TessBaseAPIInit2() is one of 4 API initializers, and lets us set the OEM. OEM_LSTM_ONLY is the latest neural-net recognition engine, which has some advantage in text-line recognition over the previous engine.

For the API to be able to find the tessdata parent-folder, we added Root/share as a folder reference in our Xcode project, then:

let trainedDataFolder = Bundle.main.path(
            forResource: "tessdata", ofType: nil, inDirectory: "share")

Prepare the image

We use TessBaseAPISetImage2() to set the image, and that API requires the image to be in Leptonica's PIX format. We create the PIX object by getting a pointer to the raw bytes of a UIImage's PNG data, and then passing that pointer to the pixReadMem() function:

let uiImage = UIImage(named: "japanese_vert")!
let data = uiImage.pngData()!
let rawPointer = (data as NSData).bytes
let uint8Pointer = rawPointer.assumingMemoryBound(to: UInt8.self)

var image = pixReadMem(uint8Pointer, data.count)

Image settings & Perform OCR

Set our image, the resolution, and page segmentation mode (PSM). PSM defines how Tesseract sees or treats the image, like 'Assume a single column of text of variable sizes' or 'Treat the image as a single word'. All the images in this guide have been cropped to just the text, and letting Tesseract figure this out for itself (PSM_AUTO) works just fine:

TessBaseAPISetImage2(tessAPI, image)
TessBaseAPISetSourceResolution(tessAPI, 144)
TessBaseAPISetPageSegMode(tessAPI, PSM_AUTO)

Finally, we call the GetUTF8Text method which runs the recognize functions inside Tesseract, and get some text back:

let txt = TessBaseAPIGetUTF8Text(tessAPI)

and looking at the result in the debugger:

print String(cString: txt!)

  (String) $R3 = "Hello\n\n,世界\n"

Now that we have some text from the image, we want to know where in the image Tesseract found it.

Iterate over results

The API can recognize varying levels of text in this top-down order: blocks, paragraphs, lines, words, symbols. RIL_TEXTLINE is the page-iterator level for working with individual lines of text. Here we're using a textline iterator and getting the (x1,y1) and (x2,y2) coordinates of the recognized line's bounding box:

let iterator = TessBaseAPIGetIterator(tessAPI)
let level = RIL_TEXTLINE

var x1: Int32 = 0
var y1: Int32 = 0
var x2: Int32 = 0
var y2: Int32 = 0

TessPageIteratorBoundingBox(iterator, level, &x1, &y1, &x2, &y2)

Note: TessBaseAPIGetUTF8Text() or TessBaseAPIRecognize() must be called before using the TessPageIterator and TessResultIterator functions.
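
TessPageIteratorBoundingBox() only reports the box for the iterator's current position, so to walk every recognized line we advance the iterator with TessPageIteratorNext(). A minimal sketch, reusing the iterator and level from above, that prints each line's text, confidence, and box:

repeat {
    var x1: Int32 = 0, y1: Int32 = 0, x2: Int32 = 0, y2: Int32 = 0
    TessPageIteratorBoundingBox(iterator, level, &x1, &y1, &x2, &y2)

    if let cText = TessResultIteratorGetUTF8Text(iterator, level) {
        // Confidence for the current element at this level, on a 0-100 scale
        let confidence = TessResultIteratorConfidence(iterator, level)
        print(String(cString: cText), confidence, (x1, y1), (x2, y2))
        TessDeleteText(cText)  // free the C string that Tesseract allocated
    }
} while TessPageIteratorNext(iterator, level) != 0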

There is a small test and full working example of these basics in iOCRTests.swift::testGuideExample(), in the following Xcode project.

iOCR Xcode project

TBE_PROJECTDIR/iOCR/iOCR.xcodeproj is an example of putting everything together into a working project and running an app in the simulator that shows off those API basics.

Open the project and run the iOCR target for an iPad Pro (12.9-in) (some of the UI was coded specifically for that device's screen size).

<img height="683" src="Notes/static/Guide/ipad_app_all_good.png"/> <!-- We can also see between horizontal and vertical Japanese, and vertical Chinese, that the results of a "line" varies depending on some combination of language and the text's orientation: - In the horizontal Japanese and vertical Chinese examples, we get what we'd expect - In the vertical Japanese example, <span style="font-size: 1.1em">Hello</span> and <span style="font-size: 1.1em">,世界</span> are recognized as two separate lines -->

Each of the four cards consists of a sample image against a gray background; images were run through Tesseract at the TEXTLINE level. Colored rectangles drawn on top of the image represent lines that Tesseract recognized. Each recognized line is also represented in the table below the image. The recognized line's bounding box, utf8 text, and confidence score are wrapped up in a RecognizedRectangle:

struct RecognizedRectangle: Equatable {
    let id = UUID()
    public var text: String
    public var boundingBox: CGRect
    public var confidence: Float
}
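
The boundingBox presumably comes straight from the iterator coordinates we saw earlier; mapping them into a CGRect looks roughly like this (a sketch, not the project's actual code; cText, x1...y2, iterator, and level are the values from the iterator example above):

let rect = RecognizedRectangle(
    text: String(cString: cText),
    boundingBox: CGRect(x: CGFloat(x1), y: CGFloat(y1),
                        width: CGFloat(x2 - x1), height: CGFloat(y2 - y1)),
    confidence: TessResultIteratorConfidence(iterator, level))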

The Recognizer class manages that struct, along with all the API setup and teardown. It has two main methods, getAllText() and getRecognizedRects(), for getting all text and/or RecognizedRectangles.

We create a recognizer:

let recognizer = Recognizer(imgName: "japanese_vert", trainedDataName: "jpn_vert", imgDPI: 144)

and, to simply show the results, call these methods in the debugger:

print recognizer.getAllText()

  (String) $R2 = "Hello\n\n,世界\n"

// and...

print recognizer.getRecognizedRects()

  ([iOCR.RecognizedRectangle]) $R8 = 2 values {
    [0] = {
      id = {}
      text = "Hello\n\n"
      boundingBox = (origin = (x = 9, y = 12), size = (width = 22, height = 166))
      confidence = 88.5363388
    }
    [1] = {
      id = {}
      text = ",世界\n"
      boundingBox = (origin = (x = 7, y = 210), size = (width = 30, height = 83))
      confidence = 78.3088684
    }
}

Everything looks good, now.

Better configuration is better recognition

But—to make a point about better configuration making for better recognition—with a small, bad tweak we can get an odd result:

  1. Open ContentView.swift
  2. Locate the Recognizer for vertical Japanese
  3. Change the imgDPI from the correct value of 144 to the incorrect value of 72
RecognizedView(
    caption: "Japanese (vertical)",
    recognizer: Recognizer(imgName: "japanese_vert", trainedDataName: "jpn_vert", imgDPI: 72)
)

Re-run the app and we can see the text value <*blank*> with a confidence of 95%. This value corresponds to the unexpected recognition of a single stroke inside one of the kanji as a whole other valid character:

<img height="404" src="Notes/static/Guide/ipad_app_bad_blank_cropped.png"/>

Learning Tesseract

Configuration can matter a lot for Tesseract. You might need to dig in if you don't immediately get good results. Two resources we've consulted to get a quick picture of this configuration landscape were:

The Tesseract User Group and its GitHub Issues are also good resources.

Lessons Learned

Updating targets for M1/Apple Silicon

One problem I ran into building for M1/Apple Silicon was that the GNU Autotools were not up to date with the new platform names expressed in the host triplet, like arm64-apple-ios14.3.

The first error I had to tackle came while configuring top-level libs:

Invalid configuration `arm64-apple-ios14.3': machine `arm64-apple' not recognized
configure: error: /bin/sh .././config.sub arm64-apple-ios14.3 failed

After searching around, I found a number of cases like this documented on Stack Overflow and in GitHub issues. A common solution was to simply echo back whatever host triplet was passed in, by re-writing config.sub itself before ./configure runs it:

echo 'echo $1' > ../config.sub

I added that re-write to the top-level build scripts, which did clear the configuration error for all the top-level libs, and the build chain was whole again... until it came to running make on Tesseract:

ld: library not found for -lrt
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [tesseract] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

The first question was: What changed, now, such that make fails trying to find a non-existent library? The second question should have been: How do I roll back and get it making again?

Maybe I could have stopped there and given those questions more thought, but I quickly (instantly?) convinced myself that digging into the error message was the clear path forward.

I eventually found that the ADD_RT_TRUE and ADD_RT_FALSE flags were what ultimately controlled the inclusion or exclusion of -lrt, and I followed them until I hit what felt like pay dirt with this case statement in ./configure:

...
*darwin*)
  OPENCL_LIBS=""
  OPENCL_INC=""
  if false; then
    ADD_RT_TRUE=
    ADD_RT_FALSE='#'
  else
    ADD_RT_TRUE='#'
    ADD_RT_FALSE=
  fi
...

and I'm still not convinced I understand the logic of "if false, do the thing I want you to do; otherwise do the thing I'm trying to remedy/avoid"... but!

I did understand that none of these lines would ever execute, because darwin was missing from the host triplet. So: why was darwin missing, now?

Because that's the thing I had changed without realizing the consequences. The targets used to look like:

export TARGET='arm-apple-darwin64'

before they were changed for Apple Silicon and the latest version of Xcode to:

export TARGET='arm64-apple-ios14.3'

so just passing the new target as a host triplet through config.sub:

echo 'echo $1' > ../config.sub

is wrong. darwin has to be seen in the host triplet and I need to retain my target configuration, so:

print -- "--**!!**-- Overriding ../config.sub"
echo 'echo arm-apple-darwin64' > ../config.sub

 ...

export TARGET='arm64-apple-ios14.3'

is the fix I needed.