Providing Better Feedback in Realtime Object Detection Apps

 


Recent advances in computer vision technology and computational resources have made it easier than before to build real-time object recognition apps on iOS devices. However, implementing the recognition technology itself is only one part of app development. Combining it with user interaction and providing appropriate feedback are crucial for user-friendly apps. In this lightning talk, I would like to talk about real problems we faced and solved to give better feedback to users while developing Wantedly People, an iOS app that instantly recognizes business cards through the camera.


I’ll be talking about providing better feedback in real-time object detection apps. I’m Shinichi, and I’m a software engineer at Wantedly, a startup in Japan.

Object detection is the process of finding objects in images or videos (e.g. cats, faces, or cat faces).

Wantedly People

Half a year ago, we released a real-time object detection app called Wantedly People. It’s an app for managing business cards. The great thing about this app is that it can automatically detect multiple business cards through the camera.

This app detects business cards. For example, when the user taps the shutter button, it scans them and converts them into data. Here I will focus only on this detection process and present what we’ve done to improve our app’s feedback to users.

We want to draw bubbling circle layers on detected cards and animate them toward their positions in the next frame.
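
As a rough idea of what such a “bubbling” circle could look like, here is a minimal sketch (not our production code) of a CAShapeLayer whose scale pulses repeatedly; the function name and styling are illustrative.

import UIKit

func makeBubblingCircleLayer(center: CGPoint, radius: CGFloat) -> CAShapeLayer {
    let layer = CAShapeLayer()
    // A circle centred on the layer's own anchor point.
    layer.path = UIBezierPath(ovalIn: CGRect(x: -radius, y: -radius,
                                             width: radius * 2, height: radius * 2)).cgPath
    layer.fillColor = UIColor.white.withAlphaComponent(0.4).cgColor
    layer.position = center

    // Pulse the scale forever to get the "bubbling" effect.
    let pulse = CABasicAnimation(keyPath: "transform.scale")
    pulse.fromValue = 0.8
    pulse.toValue = 1.2
    pulse.duration = 0.6
    pulse.autoreverses = true
    pulse.repeatCount = .infinity
    layer.add(pulse, forKey: "bubble")
    return layer
}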

Assumption: we already have detection logic

You might be wondering how we can detect business cards in the camera, but today let’s assume we already have the detection logic - we know where cards are positioned in the frame.

struct CardFeature {
    let coordinates: [CGPoint]
    ...
}

We define each card in the frame as a CardFeature. It has the coordinates of the area where a card is detected. In this case, after detection we have three CardFeatures in both frames.
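
The detector itself is out of scope here, but for the rest of the examples you can think of it as something like this hypothetical interface, matching how it is called later:

import UIKit

protocol CardDetecting {
    // Returns one CardFeature per business card found in the frame.
    func cardFeatures(in frame: UIImage) -> [CardFeature]
}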

Then we have DetectionViewController.

extension DetectionViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
  func captureOutput(
    _ captureOutput: AVCaptureOutput!,
    didOutputSampleBuffer sampleBuffer: CMSampleBuffer!, ...) {
      ...
      guard let frame = sampleBuffer.toUIImage() else { return }
      let cardFeatures = detector.cardFeatures(in: frame)  // Detect cards
      updateCircleLayersView(with: cardFeatures)  // Add CALayers
      ...
  } 
}

It’s an iOS app, so we use AVFoundation and adopt this protocol. captureOutput is the delegate method where we receive each frame. We call a method to detect card features, and then update the view with those features, adding circle layers on top of the detected cards.

In each frame, we reset the circles from the previous frame and draw new ones. The circled areas are discontinuous, because each detection takes time and there are gaps between frames. This feedback should be more understandable. Besides, the circles are supposed to have bubbling animations, but that is hard when the layers are recreated every frame.
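
A sketch of what that naive update amounted to (illustrative names; circleLayersView is a hypothetical container view, centroid(of:) a hypothetical helper that averages the corner coordinates, and makeBubblingCircleLayer is the sketch from earlier):

func updateCircleLayersView(with cardFeatures: [CardFeature]) {
    // Throw away every circle from the previous frame...
    circleLayersView.layer.sublayers?.forEach { $0.removeFromSuperlayer() }
    // ...and draw fresh ones at the newly detected positions.
    for feature in cardFeatures {
        let circle = makeBubblingCircleLayer(center: centroid(of: feature.coordinates),
                                             radius: 20)
        circleLayersView.layer.addSublayer(circle)
    }
    // Because the layers are recreated every frame, the circles jump instead of
    // moving, and any in-flight bubbling animation is cut short.
}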

One object

If the app only needs to detect one object, it’s easy: when we find the object in consecutive frames, we can always say it’s object A, even if it has moved. All we have to do is add a circle layer and move it between frames.
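
In code, the single-object case could be as simple as keeping one layer around and moving it (a sketch using the same hypothetical helpers as above):

let circleLayer = makeBubblingCircleLayer(center: .zero, radius: 20)

func update(with feature: CardFeature) {
    // Changing `position` on a standalone CALayer animates implicitly,
    // so the single circle glides from its old position to the new one.
    circleLayer.position = centroid(of: feature.coordinates)
}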

Multiple objects

This doesn’t work if we have multiple objects. Let’s say we have three objects in the current frame and three in the next frame. People can easily guess that object A will be D and B will be E, but our app cannot unless we write logic for it. We want the labels A, B, and C to stay consistent across frames. This means we need tracking.

Tracking

Tracking means assigning consistent labels to objects across frames. There are generic tracking algorithms available in OpenCV (a computer vision library). We tried them, but we found performance issues on some iOS devices, so we decided to build the tracking logic ourselves by combining several pieces of information. Briefly, these signals are comparatively lightweight and fast to compute; dHash was particularly good.
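
To illustrate the kind of signal involved, here is a minimal dHash sketch (not our production code), assuming you can crop each detected card out of the frame as a CGImage: scale the crop down to 9x8 grayscale pixels, compare each pixel with its right-hand neighbour to build a 64-bit hash, and treat features whose hashes are close as the same card.

import CoreGraphics

func dHash(of image: CGImage) -> UInt64? {
    let width = 9, height = 8
    var pixels = [UInt8](repeating: 0, count: width * height)
    let drawn = pixels.withUnsafeMutableBytes { (buffer) -> Bool in
        guard let context = CGContext(data: buffer.baseAddress,
                                      width: width, height: height,
                                      bitsPerComponent: 8, bytesPerRow: width,
                                      space: CGColorSpaceCreateDeviceGray(),
                                      bitmapInfo: CGImageAlphaInfo.none.rawValue) else { return false }
        // Downscale the card crop into the tiny grayscale buffer.
        context.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))
        return true
    }
    guard drawn else { return nil }

    var hash: UInt64 = 0
    for y in 0..<height {
        for x in 0..<(width - 1) {
            // One bit per horizontal gradient: is this pixel darker than its neighbour?
            let left = pixels[y * width + x]
            let right = pixels[y * width + x + 1]
            hash = (hash << 1) | (left < right ? 1 : 0)
        }
    }
    return hash
}

// Small Hamming distance = the two crops look alike, so they are probably the same card.
func hammingDistance(_ a: UInt64, _ b: UInt64) -> Int {
    return (a ^ b).nonzeroBitCount
}

In practice you would combine this with how far the detected areas moved between frames, and accept a match only when both the distance and the hash difference are small.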

struct CardFeature {
    let coordinates: [CGPoint]
    let trackingID: Int
    ...
}

Thanks to tracking, we now have tracked CardFeatures. The only difference is the trackingID, but it is indispensable.

extension DetectionViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
  func captureOutput(
    _ captureOutput: AVCaptureOutput!,
    didOutputSampleBuffer sampleBuffer: CMSampleBuffer!, ...) {
      ...
      guard let frame = sampleBuffer.toUIImage() else { return }
      let cardFeatures = detector.cardFeatures(in: frame)
      //
      // Now we can animate circle layers
      // from previous positions to current positions
      //
      updateCircleLayersView(with: cardFeatures)
      ...
  } 
}

Back to the ViewController. When we update the view, we can animate each layer from its previous position to its current position. Now we can distinguish cards across frames. It’s much better: you can see the circles moving appropriately. All we have to do now is add bubbling animations to each layer. This feedback helps users understand what’s going on.
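
To make that concrete, here is a hedged sketch of what updateCircleLayersView(with:) might look like once features carry a trackingID; circleLayersView, centroid(of:), and makeBubblingCircleLayer are the same hypothetical helpers used in the earlier sketches.

var circleLayers: [Int: CAShapeLayer] = [:]  // keyed by trackingID

func updateCircleLayersView(with cardFeatures: [CardFeature]) {
    var visibleIDs = Set<Int>()
    for feature in cardFeatures {
        visibleIDs.insert(feature.trackingID)
        if let layer = circleLayers[feature.trackingID] {
            // Same card as before: just move its circle. The position change
            // animates implicitly, so it glides instead of jumping.
            layer.position = centroid(of: feature.coordinates)
        } else {
            // A card we have not seen yet: give it a new bubbling circle.
            let layer = makeBubblingCircleLayer(center: centroid(of: feature.coordinates),
                                                radius: 20)
            circleLayersView.layer.addSublayer(layer)
            circleLayers[feature.trackingID] = layer
        }
    }
    // Remove circles for cards that left the frame.
    for (id, layer) in circleLayers where !visibleIDs.contains(id) {
        layer.removeFromSuperlayer()
        circleLayers[id] = nil
    }
}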

You might think the problem I’ve just talked about is specific to our application. It does depend on the situation: what you want to detect, what feedback you want to add, and how long detection takes.

open class CIFaceFeature : CIFeature {
    open var bounds: CGRect { get }
    open var trackingID: Int32 { get }
    ...
}

If you are using Core Image, you can get a trackingID at the time of detection, although it’s only available for face detection. You get it from CIFaceFeature. Since I cannot publish our product code, I put a sample using CIFaceFeature on GitHub - check it out and adapt it.
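
For reference, using the tracking option with CIDetector looks roughly like this. The detector options and CIFaceFeature properties are real Core Image API; the surrounding function and names are illustrative, not taken from the sample.

import CoreImage
import UIKit

// Asking for tracking when the detector is created makes CIFaceFeature.trackingID
// stay consistent for the same face across frames.
let faceDetector = CIDetector(ofType: CIDetectorTypeFace,
                              context: nil,
                              options: [CIDetectorTracking: true,
                                        CIDetectorAccuracy: CIDetectorAccuracyHigh])

func faceFeatures(in frame: UIImage) -> [CIFaceFeature] {
    guard let detector = faceDetector, let ciImage = CIImage(image: frame) else { return [] }
    let features = detector.features(in: ciImage).compactMap { $0 as? CIFaceFeature }
    for face in features where face.hasTrackingID {
        // The same face keeps the same trackingID from frame to frame.
        print("face \(face.trackingID) at \(face.bounds)")
    }
    return features
}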

Summary

In a real-time object detection app, the difficulty of giving feedback differs depending on the situation, but sometimes tracking can improve it. When you build a real-time object detection app, it’s a good idea to think about tracking from the beginning of the project, in addition to the detection process itself.


Shinichi Goto


Shinichi Goto is a software engineer in Tokyo who's enthusiastic about mobile app development. After earning his MS in the field of computer vision and working as an infrastructure engineer, he was drawn to the mobile world and moved into it. At Wantedly he writes code for iOS, the server side, image processing, and more.

Transcribed by Sandra Sanchez-Roige