ACCEPTED CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION 2001
Rapid Object Detection using a Boosted Cascade of Simple Features
Paul Viola, viola@merl.com
Mitsubishi Electric Research Labs, 201 Broadway, 8th FL, Cambridge, MA 02139
Michael Jones, mjones@crl.dec.com
Compaq CRL (Cambridge Research Lab), One Cambridge Center, Cambridge, MA 02142
Abstract
This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers [6]. The third contribution is a method for combining increasingly more complex classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object-specific focus-of-attention mechanism which, unlike previous approaches, provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.
1. Introduction
This paper brings together new algorithms and insights to construct a framework for robust and extremely rapid object detection. This framework is demonstrated on, and in part motivated by, the task of face detection. Toward this end we have constructed a frontal face detection system that achieves detection and false positive rates equivalent to the best published results [16, 12, 15, 11, 1]. This face detection system is most clearly distinguished from previous approaches in its ability to detect faces extremely rapidly. Operating on 384 by 288 pixel images, faces are detected at 15 frames per second on a conventional 700 MHz Intel Pentium III. In other face detection systems, auxiliary information, such as image differences in video sequences or pixel color in color images, has been used to achieve high frame rates. Our system achieves high frame rates working only with the information present in a single grey scale image. These alternative sources of information can also be integrated with our system to achieve even higher frame rates.
There are three main contributions of our object detection framework. We will introduce each of these ideas briefly below and then describe them in detail in subsequent sections.
The first contribution of this paper is a new image representation called an integral image that allows for very fast feature evaluation. Motivated in part by the work of Papageorgiou et al., our detection system does not work directly with image intensities [10]. Like these authors we use a set of features which are reminiscent of Haar basis functions (though we will also use related filters which are more complex than Haar filters). In order to compute these features very rapidly at many scales we introduce the integral image representation for images. The integral image can be computed from an image using a few operations per pixel. Once computed, any one of these Haar-like features can be computed at any scale or location in constant time.
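The constant-time claim can be made concrete with a minimal sketch. It assumes a common convention in which the integral image carries one extra row and column of zeros, so that any rectangle sum needs exactly four lookups; the function names `integral_image` and `rect_sum` and the toy image are illustrative and do not come from the paper.

```python
def integral_image(img):
    """Return ii, padded with a zero row and column, such that
    ii[y][x] is the sum of img[r][c] over all r < y and c < x."""
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0  # running sum of the current row
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y).
    Four array references, independent of the rectangle size."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


# Toy 3x4 image (made-up values).
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
ii = integral_image(image)
total = rect_sum(ii, 0, 0, 4, 3)   # whole image: 78
inner = rect_sum(ii, 1, 1, 2, 2)   # 6 + 7 + 10 + 11 = 34
```

Building `ii` is a single pass over the pixels, after which the cost of `rect_sum` does not depend on `w` or `h`; this is what makes evaluation at any scale equally cheap.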
The second contribution of this paper is a method for constructing a classifier by selecting a small number of important features using AdaBoost [6]. Within any image sub-window the total number of Haar-like features is very large, far larger than the number of pixels. In order to ensure fast classification, the learning process must exclude a large majority of the available features, and focus on a small set of critical features. Motivated by the work of Tieu and Viola, feature selection is achieved through a simple modification of the AdaBoost procedure: the weak learner is constrained so that each weak classifier returned can depend on only a single feature [2]. As a result each stage of the boosting process, which selects a new weak classifier, can be viewed as a feature selection process. AdaBoost provides an effective learning algorithm and strong bounds on generalization performance [13, 9, 10].
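As a rough illustration of boosting as feature selection, the sketch below restricts every weak classifier to a threshold on a single feature, so each boosting round picks exactly one feature. It follows a commonly described discrete AdaBoost update (weights of correctly classified examples are scaled by beta = err / (1 - err)); the helper names, the exhaustive threshold search, and the toy data are assumptions made for illustration, not details taken from this section.

```python
import math


def stump_predict(stump, row):
    """A stump is (error, feature index, threshold, polarity)."""
    _, j, theta, polarity = stump
    return 1 if polarity * row[j] < polarity * theta else 0


def best_stump(X, y, weights):
    """Search every feature, threshold and polarity; keep the stump
    with the lowest weighted error. Labels y are in {0, 1}."""
    best = None
    for j in range(len(X[0])):
        for theta in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                candidate = (0.0, j, theta, polarity)
                err = sum(w for row, label, w in zip(X, y, weights)
                          if stump_predict(candidate, row) != label)
                if best is None or err < best[0]:
                    best = (err, j, theta, polarity)
    return best


def adaboost(X, y, rounds):
    """Return a list of (alpha, stump); one feature is chosen per round."""
    weights = [1.0 / len(X)] * len(X)
    chosen = []
    for _ in range(rounds):
        total = sum(weights)
        weights = [w / total for w in weights]       # normalize
        stump = best_stump(X, y, weights)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # avoid log(0)
        beta = err / (1.0 - err)
        chosen.append((math.log(1.0 / beta), stump))
        # Down-weight the examples this stump already gets right.
        weights = [w * (beta if stump_predict(stump, row) == label else 1.0)
                   for w, row, label in zip(weights, X, y)]
    return chosen


def strong_predict(chosen, row):
    """Weighted vote of the selected single-feature classifiers."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in chosen)
    return 1 if score >= 0.5 * sum(alpha for alpha, _ in chosen) else 0


# Made-up data: only feature 1 separates the two classes.
X = [[5, 1, 9], [3, 2, 1], [8, 1.5, 4], [4, 7, 8], [6, 8, 2], [2, 9, 5]]
y = [1, 1, 1, 0, 0, 0]
chosen = adaboost(X, y, rounds=2)
selected_features = [stump[1] for _, stump in chosen]
```

On this toy data the first round selects feature 1, the only informative one. In a real detector each column of `X` would hold the value of one rectangle feature over the training sub-windows, and the exhaustive threshold search would be replaced by something far more efficient.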
The third major contribution of this paper is a method for combining successively more complex classifiers in a cascade structure which dramatically increases the speed of the detector by focusing attention on promising regions of the image. The notion behind focus-of-attention approaches is that it is often possible to rapidly determine where in an image an object might occur [17, 8, 1]. More complex processing is reserved only for these promising regions. The key measure of such an approach is the “false negative” rate of the attentional process, that is, the rate at which true object instances are labeled as non-objects. It must be the case that all, or almost all, object instances are selected by the attentional filter.
We will describe a process for training an extremely simple and efficient classifier which can be used as a “supervised” focus-of-attention operator. The term supervised refers to the fact that the attentional operator is trained to detect examples of a particular class. In the domain of face detection it is possible to achieve fewer than 1% false negatives and 40% false positives using a classifier constructed from two Haar-like features. The effect of this filter is to reduce by over one half the number of locations where the final detector must be evaluated.
Those sub-windows which are not rejected by the initial classifier are processed by a sequence of classifiers, each slightly more complex than the last. If any classifier rejects the sub-window, no further processing is performed. The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Geman and colleagues [1, 4].
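The early-exit behaviour described above can be sketched as a loop over stages. This is only an illustration of the control flow: the stage scoring functions, thresholds, and four-value "windows" below are hypothetical stand-ins for boosted classifiers evaluated on image sub-windows.

```python
def cascade_classify(stages, window):
    """Run the window through the stages in order.

    stages: list of (score_fn, threshold) pairs, cheapest first.
    Returns (accepted, stages_evaluated). A window is rejected by the
    first stage whose score falls below its threshold, and accepted
    only if every stage passes it.
    """
    for index, (score_fn, threshold) in enumerate(stages):
        if score_fn(window) < threshold:
            return False, index + 1   # rejected, later stages are skipped
    return True, len(stages)


# Hypothetical stages operating on a flat list of four pixel values.
def contrast(window):
    return max(window) - min(window)


def left_minus_right(window):
    half = len(window) // 2
    return sum(window[:half]) - sum(window[half:])


stages = [(contrast, 10), (left_minus_right, 5)]

flat = [50, 50, 50, 50]          # no contrast: rejected by stage 1
reversed_edge = [10, 20, 80, 90]  # passes stage 1, rejected by stage 2
candidate = [90, 80, 20, 10]      # passes both stages

results = [cascade_classify(stages, w) for w in (flat, reversed_edge, candidate)]
# results == [(False, 1), (False, 2), (True, 2)]
```

Because most sub-windows in an image are background, most calls return after the first, cheapest stage, which is where the speed-up comes from.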
An extremely fast face detector will have broad practical applications. These include user interfaces, image databases, and teleconferencing. In applications where rapid frame rates are not necessary, our system will allow for significant additional post-processing and analysis. In addition our system can be implemented on a wide range of small low-power devices, including hand-helds and embedded processors. In our lab we have implemented this face detector on the Compaq iPaq handheld and have achieved detection at two frames per second (this device has a low-power 200 MIPS StrongARM processor which lacks floating point hardware).
The remainder of the paper describes our contributions and a number of experimental results, including a detailed description of our experimental methodology. Discussion of closely related work takes place at the end of each section.
2. Features
Our object detection procedure classifies images based on the value of simple features. There are many motivations for using features rather than the pixels directly. The most common reason is that features can act to encode ad-hoc domain knowledge that is difficult to learn using a finite quantity of training data. For this system there is also a second critical motivation for features: the feature-based system operates much faster than a pixel-based system.
The simple features used are reminiscent of Haar basis functions which have been used by Papageorgiou et al. [10]. More specifically, we use three kinds of features. The value of a two-rectangle feature is the difference between the sum of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature computes the difference between diagonal pairs of rectangles.
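The three kinds of features can be written down directly from these definitions. The sketch below is illustrative only: it sums pixels naively with a hypothetical `box_sum` helper (a real detector would take these sums from the integral image), and the sign conventions for the two-rectangle and four-rectangle features are assumptions, since the text only says "difference".

```python
def box_sum(img, x, y, w, h):
    """Naive sum of the w-by-h rectangle with top-left pixel (x, y)."""
    return sum(img[r][c] for r in range(y, y + h) for c in range(x, x + w))


def two_rect_horizontal(img, x, y, w, h):
    """Two side-by-side w-by-h rectangles: right minus left (assumed sign)."""
    return box_sum(img, x + w, y, w, h) - box_sum(img, x, y, w, h)


def two_rect_vertical(img, x, y, w, h):
    """Two stacked w-by-h rectangles: bottom minus top (assumed sign)."""
    return box_sum(img, x, y + h, w, h) - box_sum(img, x, y, w, h)


def three_rect_horizontal(img, x, y, w, h):
    """Center rectangle minus the two outside rectangles."""
    center = box_sum(img, x + w, y, w, h)
    outside = box_sum(img, x, y, w, h) + box_sum(img, x + 2 * w, y, w, h)
    return center - outside


def four_rect(img, x, y, w, h):
    """Difference between the two diagonal pairs of a 2x2 block layout."""
    main_diagonal = box_sum(img, x, y, w, h) + box_sum(img, x + w, y + h, w, h)
    anti_diagonal = box_sum(img, x + w, y, w, h) + box_sum(img, x, y + h, w, h)
    return main_diagonal - anti_diagonal


# Made-up 4x6 patch with a bright band in the top middle.
patch = [
    [1, 1, 9, 9, 1, 1],
    [1, 1, 9, 9, 1, 1],
    [5, 5, 0, 0, 5, 5],
    [5, 5, 0, 0, 5, 5],
]
f_two_h = two_rect_horizontal(patch, 0, 0, 2, 2)     # 36 - 4 = 32
f_two_v = two_rect_vertical(patch, 0, 0, 2, 2)       # 20 - 4 = 16
f_three = three_rect_horizontal(patch, 0, 0, 2, 2)   # 36 - (4 + 4) = 28
f_four = four_rect(patch, 0, 0, 2, 2)                # (4 + 0) - (36 + 20) = -52
```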
Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete.
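To get a feel for the size of the feature set, the sketch below counts every position and integer scaling of five rectangle templates inside a 24x24 window. The exact total depends on which templates, orientations, and sizes are enumerated; the template list here is an assumption, and under it the count is 162,336 rather than the figure quoted above. Either way, the set is vastly larger than the 576 pixels of the window, which is the sense in which it is overcomplete.

```python
def count_features(window, base_w, base_h):
    """Count all placements of a template whose overall size is an
    integer multiple of base_w by base_h and fits in the window."""
    count = 0
    for w in range(base_w, window + 1, base_w):
        for h in range(base_h, window + 1, base_h):
            count += (window - w + 1) * (window - h + 1)
    return count


# Assumed templates, given as the smallest overall (width, height).
TEMPLATES = {
    "two-rectangle, horizontal": (2, 1),
    "two-rectangle, vertical": (1, 2),
    "three-rectangle, horizontal": (3, 1),
    "three-rectangle, vertical": (1, 3),
    "four-rectangle": (2, 2),
}

WINDOW = 24
counts = {name: count_features(WINDOW, bw, bh)
          for name, (bw, bh) in TEMPLATES.items()}
total = sum(counts.values())   # 162336 under these assumptions
pixels = WINDOW * WINDOW       # 576
```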