
Joint Calibration for Semantic Segmentation

Holger Caesar ([email protected])

Jasper Uijlings ([email protected])

Vittorio Ferrari ([email protected])

School of Informatics, University of Edinburgh, Edinburgh, UK

Abstract

Semantic segmentation is the task of assigning a class-label to each pixel in an image. We propose a region-based semantic segmentation framework which handles both full and weak supervision, and addresses three common problems: (1) Objects occur at multiple scales and therefore we should use regions at multiple scales. However, these regions are overlapping which creates conflicting class predictions at the pixel-level. (2) Class frequencies are highly imbalanced in realistic datasets. (3) Each pixel can only be assigned to a single class, which creates competition between classes. We address all three problems with a joint calibration method which optimizes a multi-class loss defined over the final pixel-level output labeling, as opposed to simply region classification. Our method outperforms the state-of-the-art on the popular SIFT Flow [18] dataset in both the fully and weakly supervised setting by a considerable margin (+6% and +10%, respectively).

1 Introduction

Semantic segmentation is the task of assigning a class label to each pixel in an image (Fig. 1). In the fully supervised setting, we have ground-truth labels for all pixels in the training images. In the weakly supervised setting, class-labels are only given at the image-level. We tackle both settings in a single framework which builds on region-based classification.

Our framework addresses three important problems common to region-based semantic segmentation. First of all, objects naturally occur at different scales within an image [3, 37]. Performing recognition at a single scale inevitably leads to regions covering only parts of an object which may have ambiguous appearance, such as wheels or fur, and to regions straddling over multiple objects, whose classification is harder due to their mixed appearance. Therefore many recent methods operate on pools of regions computed at multiple scales, which have a much better chance of containing some regions covering complete objects [3, 4, 11, 12, 16, 24, 45]. However, this leads to overlapping regions which may lead to conflicting class predictions at the pixel-level. These conflicts need to be properly resolved.

Figure 1: Semantic segmentation is the task of assigning class labels to all pixels in the image. During training, with full supervision we have ground-truth labels of all pixels. With weak supervision we only have labels at the image-level.

Secondly, classes are often unbalanced [2, 7, 14, 19, 20, 28, 29, 31, 35, 39, 41, 42, 43]: “cars” and “grass” are frequently found in images while “tricycles” and “gravel” are much rarer. Due to the nature of most classifiers, without careful consideration these rare classes are largely ignored: even if the class occurs in an image the system will rarely predict it. Since class-frequencies typically follow a power-law distribution, this problem becomes increasingly important with the modern trend towards larger datasets with more and more classes.

Finally, classes compete: a pixel can only be assigned to a single class (e.g. it cannot belong to both “sky” and “airplane”). To properly resolve such competition, a semantic segmentation framework should take into account predictions for multiple classes jointly.

In this paper we address these three problems with a joint calibration method over an ensemble of SVMs, where the calibration parameters are optimized over all classes, and for the final evaluation criterion, i.e. the accuracy of pixel-level labeling, as opposed to simply region classification. While each SVM is trained for a single class, their joint calibration deals with the competition between classes. Furthermore, the criterion we optimize for explicitly accounts for class imbalance. Finally, competition between overlapping regions is resolved through maximization: each pixel is assigned the highest scoring class over all regions covering it. We jointly calibrate the SVMs for optimal pixel labeling after this maximization, which effectively takes into account conflict resolution between overlapping regions. Experiments on the popular SIFT Flow [18] dataset show a considerable improvement over the state-of-the-art in both the fully and weakly supervised setting (+6% and +10%, respectively).

2 Related work

Early works on semantic segmentation used pixel- or patch-based features over which they define a Conditional Random Field (CRF) [30, 38]. Many modern successful works use region-level representations, both in the fully supervised [1, 4, 9, 11, 12, 16, 20, 24, 28, 29, 33, 34, 36, 43] and weakly supervised [39, 40, 41, 42, 44, 45] settings. A few recent works use CNNs to learn a direct mapping from image to pixel labels [7, 17, 21, 22, 23, 27, 28, 29, 31, 46], although some of them [7, 28, 29] use region-based post-processing to impose label smoothing and to better respect object boundaries. Other recent works use CRFs to refine the CNN pixel-level predictions [5, 17, 21, 27, 46]. In this work we focus on region-based semantic segmentation, which we discuss in light of the three problems raised in the introduction.

Overlapping regions. Traditionally, semantic segmentation systems use superpixels [1, 9, 20, 28, 29, 33, 34, 36, 43], which are non-overlapping regions resulting from a single-scale oversegmentation. However, appearance-based recognition of superpixels is difficult as they typically capture only parts of objects, rather than complete objects. Therefore, many recent methods use overlapping multi-scale regions [3, 4, 11, 12, 16, 24, 45]. However, these may lead to conflicting class predictions at the pixel-level. Carreira et al. [4] address this simply by taking the maximum score over all regions containing a pixel. Both Hariharan et al. [12] and Girshick et al. [11] use non-maximum suppression, which may give problems for nearby or interacting objects [16]. Li et al. [16] predict class overlap scores for each region at each scale. Then they create superpixels by intersecting all regions. Finally, they assign overlap scores to these superpixels using maximum composite likelihood (i.e. taking all multi-scale predictions into account). Plath et al. [24] use classification predictions over a segmentation hierarchy to induce label consistency between parent and child regions within a tree-based CRF framework. After solving their CRF formulation, only the smallest regions (i.e. leaf-nodes) are used for class prediction. In the weakly supervised setting, most works use superpixels [39, 40, 41, 42] and so do not encounter problems of conflicting predictions. Zhang et al. [44] use overlapping regions to enforce a form of class-label smoothing, but they all have the same scale. A different Zhang et al. [45] use overlapping region proposals at multiple scales in a CRF.

Class imbalance. As the PASCAL VOC dataset [6] is relatively balanced, most works that experiment on it did not explicitly address this issue [1, 4, 5, 11, 12, 16, 17, 19, 23, 24, 27, 46]. On highly imbalanced datasets such as SIFT Flow [18], Barcelona [33] and LM+SUN [35], rare classes pose a challenge. This is observed and addressed by Tighe et al. [35] and Yang et al. [43]: for a test image, only a few training images with similar context are used to provide class predictions, but for rare classes this constraint is relaxed and more training images are used. Vezhnevets et al. [39] balance rare classes by normalizing scores for each class to range [0,1]. A few works [20, 41, 42] balance classes by using an inverse class frequency weighted loss function.

Competing classes. Several works train one-vs-all classifiers separately and resolve labeling through maximization [4, 11, 12, 16, 20, 23, 24, 35]. This is suboptimal since the scores of different classes may not be properly calibrated. Instead, Tighe et al. [33, 35] and Yang et al. [43] use Nearest Neighbor classification which is inherently multi-class. In the weakly supervised setting appearance models are typically trained in isolation and remain uncalibrated [39, 41, 42, 44]. To the best of our knowledge, Boix et al. [1] is the only work in semantic segmentation to perform joint calibration of SVMs. While this enables handling competing classes, in their work they use non-overlapping regions. In contrast, in our work we use overlapping regions where conflicting predictions are resolved through maximization. In this setting, joint calibration is particularly important, as we will show in Sec. 4. As another difference, Boix et al. [1] address only full supervision whereas we address both full and weak supervision in a unified framework.

3 Method

3.1 Model

We represent an image by a set of overlapping regions [37] described by CNN features [11] (Sec. 3.4). Our semantic segmentation model infers the label o_p of each pixel p in an image:

o_p = argmax_{c, r ∋ p} σ(w_c · x_r, a_c, b_c)    (1)

As appearance models, we have a separate linear SVM w_c per class c. These SVMs score the features x_r of each region r. The scores are calibrated by a sigmoid function σ, with different parameters a_c, b_c for each class c. The argmax returns the class c with the highest score over all regions that contain pixel p. This involves maximizing over classes for a region, and over the regions that contain p.

During training we find the SVM parameters w_c (Sec. 3.2) and calibration parameters a_c and b_c (Sec. 3.3). The training of the calibration parameters takes into account the effects of the two maximization operations, as they are optimized for the output pixel-level labeling performance (as opposed to simply accuracy in terms of region classification).
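For concreteness, a minimal Python sketch of the inference rule in Eq. (1) is given below. It is not the authors' code: helper names such as calibrated_score, region_masks and svm_weights are illustrative assumptions, and the loop is deliberately naive (a much faster hierarchy-based evaluation is described in Sec. 3.3).

```python
import numpy as np

def calibrated_score(w_c, x_r, a_c, b_c):
    """Sigmoid-calibrated SVM score, Eq. (2)."""
    return 1.0 / (1.0 + np.exp(a_c * np.dot(w_c, x_r) + b_c))

def label_pixels(region_masks, region_feats, svm_weights, calib):
    """Naive Eq. (1): each pixel takes the class of the highest-scoring
    (class, region) pair over all regions containing it.

    region_masks: list of boolean HxW arrays, one per region proposal
    region_feats: list of D-dim feature vectors x_r, one per region
    svm_weights:  C x D array, one linear SVM w_c per class
    calib:        C x 2 array of per-class sigmoid parameters (a_c, b_c)
    """
    H, W = region_masks[0].shape
    best_score = np.full((H, W), -np.inf)
    labels = -np.ones((H, W), dtype=int)
    for mask, x_r in zip(region_masks, region_feats):
        # calibrated score of this region for every class
        scores = np.array([calibrated_score(svm_weights[c], x_r, *calib[c])
                           for c in range(len(svm_weights))])
        c_best, s_best = int(scores.argmax()), float(scores.max())
        # update only the pixels of this region that gain a higher score
        update = mask & (s_best > best_score)
        labels[update] = c_best
        best_score[update] = s_best
    return labels
```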

3.2 SVM training

Fully supervised. In this setting we are given ground-truth pixel-level labels for all images in the training set (Fig. 1). This leads to a natural subdivision into ground-truth regions, i.e. non-overlapping regions perfectly covering a single class. We use these as positive training samples. However, such idealized samples are rarely encountered at test time since there we have only imperfect region proposals [37]. Therefore we use as additional positive samples for a class all region proposals which overlap heavily with a ground-truth region of that class (i.e. Intersection-over-Union greater than 50% [6]). As negative samples, we use all regions from all images that do not contain that class. In the SVM loss function we apply inverse frequency weighting in terms of the number of positive and negative samples.
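The following sketch illustrates this positive-sample selection, assuming each region (ground-truth or proposal) is represented as a boolean pixel mask; mask_iou and collect_positives are hypothetical helper names, not functions from the paper.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-Union of two boolean pixel masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def collect_positives(proposals, gt_regions, cls, iou_thresh=0.5):
    """Positive samples for class `cls`: its ground-truth regions plus every
    proposal overlapping one of them with IoU > 0.5.

    proposals:  list of boolean masks (region proposals of one image)
    gt_regions: list of (mask, class) pairs from the ground-truth labeling
    """
    positives = [m for m, c in gt_regions if c == cls]
    for prop in proposals:
        if any(c == cls and mask_iou(prop, m) > iou_thresh
               for m, c in gt_regions):
            positives.append(prop)
    return positives
```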

Weakly supervised. In this setting we are only given image-level labels on the training images (Fig. 1). Hence, we treat region-level labels as latent variables which are updated using an alternated optimization process (as in [39, 40, 41, 42, 45]). To initialize the process, we use as positive samples for a class all regions in all images containing it. At each iteration we alternate between training SVMs based on the current region labeling and updating the labeling based on the current SVMs (by assigning to each region the label of the highest scoring class). In this process we keep our negative samples constant, i.e. all regions from all images that do not contain the target class. In the SVM loss function we apply inverse frequency weighting in terms of the number of positive and negative samples.
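A possible implementation of this alternation is sketched below, with scikit-learn's LinearSVC as a stand-in for the authors' SVM solver; class_weight='balanced' only approximates the inverse frequency weighting, and all other names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def weakly_supervised_training(feats, image_labels, image_of_region, n_iters=5):
    """Alternating optimization with latent region labels (a sketch).
    feats: N x D array of region features; image_labels: list of label sets,
    one per image; image_of_region: image index of each region."""
    n_regions = feats.shape[0]
    classes = sorted(set().union(*image_labels))
    # initialization: every region is a positive for all labels of its image
    latent = [set(image_labels[image_of_region[r]]) for r in range(n_regions)]
    svms = {}
    for _ in range(n_iters):
        # 1) train one-vs-all SVMs from the current region labeling;
        #    negatives stay constant: regions of images not containing c
        for c in classes:
            pos = [r for r in range(n_regions) if c in latent[r]]
            neg = [r for r in range(n_regions)
                   if c not in image_labels[image_of_region[r]]]
            y = np.array([1] * len(pos) + [-1] * len(neg))
            svms[c] = LinearSVC(C=0.01, class_weight='balanced').fit(
                feats[pos + neg], y)
        # 2) re-assign each region the highest-scoring label of its image
        for r in range(n_regions):
            cands = image_labels[image_of_region[r]]
            scores = {c: svms[c].decision_function(feats[r:r + 1])[0]
                      for c in cands}
            latent[r] = {max(scores, key=scores.get)}
    return svms
```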


Figure 2: The first row shows multiple region proposals (left) extracted from an image (right). The following rows show the per-class SVM scores of each region (left) and the pixel-level labeling (right). Row 2 shows the results before and row 3 after joint calibration.

3.3 Joint Calibration

We now introduce our joint calibration procedure, which addresses three common problems in semantic segmentation: (1) conflicting predictions of overlapping regions, (2) class imbalance, and (3) competition between classes.

To better understand the problem caused by overlapping regions, consider the example of Fig. 2. It shows three overlapping regions, each with different class predictions. The final goal of semantic segmentation is to output a pixel-level labeling, which is evaluated in terms of pixel-level accuracy. In our framework we employ a winner-takes-all principle: each pixel takes the class of the highest scored region which contains it. Now, using uncalibrated SVMs is problematic (second row in Fig. 2). SVMs are trained to predict class labels at the region-level, not the pixel-level. However, different regions have different area, and, most importantly, not all regions contribute all of their area to the final pixel-level labeling: predictions of small regions may be completely suppressed by bigger regions (e.g. in Fig. 2, row 3, the inner-boat region is suppressed by the prediction of the complete boat). In other cases, bigger regions may be partially overwritten by smaller regions (e.g. in Fig. 2 the boat region partially overwrites the prediction of the larger boat+sky region). Furthermore, the SVMs are trained in a one-vs-all manner and are unaware of other classes. Hence they are unlikely to properly resolve competition between classes even within a single region. The problems above show that without calibration, the SVMs are optimized for the wrong criterion. We propose to jointly calibrate SVMs for the correct criterion, which corresponds better to the evaluation measure typically used for semantic segmentation (i.e. pixel labeling accuracy averaged over classes). We do this by applying sigmoid functions σ to all SVM outputs:

σ(w_c · x_r, a_c, b_c) = (1 + exp(a_c · w_c · x_r + b_c))^(−1)    (2)

where a_c, b_c are the calibration parameters for class c. We calibrate the parameters of all classes jointly by minimizing a loss function L(o, l), where o is the pixel labeling output of our method on the full training set (o = {o_p; p = 1 ... P}) and l the ground-truth labeling.

We emphasize that the pixel labeling output o is the result after the maximization over classes and regions in Eq. (1). Since we optimize for the accuracy of this final output labeling, and we do so jointly over classes, our calibration procedure takes into account both problems of conflicting class predictions between overlapping regions and competition between classes. Moreover, we also address the problem of class imbalance, as we compensate for it in our loss functions below.

Fully supervised loss. In this setting our loss directly evaluates the desired performance measure, which is typically pixel labeling accuracy averaged over classes [7, 19, 28, 33, 43]:

L(o, l) = 1 − (1/C) Σ_{c=1}^{C} (1/P_c) Σ_{p: l_p = c} [l_p = o_p]    (3)

where l_p is the ground-truth label of pixel p, o_p is the output pixel label, P_c is the number of pixels with ground-truth label c, and C is the number of classes. [·] is 1 if the condition is true and 0 otherwise. The inverse frequency weighting factor 1/P_c deals with class imbalance.
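A direct translation of Eq. (3), assuming the pixel labels of the whole training set are stacked into integer arrays (a sketch, not the authors' implementation):

```python
import numpy as np

def fully_supervised_loss(output, gt, n_classes):
    """Eq. (3): one minus class-average pixel accuracy.
    output, gt: integer label arrays of identical shape covering the
    full training set, with ground-truth classes in 0..n_classes-1."""
    acc = 0.0
    for c in range(n_classes):
        in_class = (gt == c)
        P_c = in_class.sum()
        if P_c > 0:  # classes absent from the ground truth contribute nothing
            acc += (output[in_class] == c).sum() / P_c
    return 1.0 - acc / n_classes
```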

Weakly supervised loss. Also in this setting the performance measure is typically class-average pixel accuracy [39, 40, 42, 45]. Since we do not have ground-truth pixel labels, we cannot directly evaluate it. We do however have a set of ground-truth image labels l_i which we can compare against. We first aggregate the output pixel labels o_p over each image m_i into output image labels o_i = ∪_{p ∈ m_i} o_p. Then we define as loss the difference between the ground-truth label set l_i and the output label set o_i, measured by the Hamming distance between their binary vector representations:

L(o, l) = Σ_{i=1}^{I} Σ_{c=1}^{C} (1/I_c) |l_{i,c} − o_{i,c}|    (4)

where l_{i,c} = 1 if label c is in l_i, and 0 otherwise (analogously for o_{i,c}). I is the total number of training images. I_c is the number of images having ground-truth label c, so the loss is weighted by the inverse frequency of class labels, measured at the image-level. Note how also in this setting the loss looks at performance after the maximization over classes and regions (Eq. (1)).
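Eq. (4) can be evaluated from the predicted and ground-truth label sets of each image, for example as follows (a sketch with illustrative names):

```python
def weakly_supervised_loss(pred_label_sets, gt_label_sets, n_classes):
    """Eq. (4): Hamming distance between predicted and ground-truth
    image-level label sets, weighted by inverse class frequency I_c."""
    # I_c: number of training images whose ground truth contains class c
    I_c = [sum(c in gt for gt in gt_label_sets) for c in range(n_classes)]
    loss = 0.0
    for pred, gt in zip(pred_label_sets, gt_label_sets):
        for c in range(n_classes):
            if I_c[c] > 0:
                loss += abs(int(c in gt) - int(c in pred)) / I_c[c]
    return loss
```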

Optimization. We want to minimize our loss functions over the calibration parameters a_c, b_c of all classes. This is hard, because the output pixel labels o_p depend on these parameters in a complex manner due to the max over classes and regions in Eq. (1), and because of the set-union aggregation in the case of the weakly supervised loss. Therefore, we apply an approximate minimization algorithm based on coordinate descent. Coordinate descent is different from gradient descent in that it can be used on arbitrary loss functions that are not differentiable, as it only requires their evaluation for a given setting of parameters.

Figure 3: Our efficient evaluation algorithm uses the bottom-up structure of Selective Search region proposals to simplify the spatial maximization. We start from the root node and propagate the maximum score with its corresponding label down the tree. We label the image based on the labels of its superpixels (leaf nodes).

Coordinate descent iteratively applies line search to optimize the loss over a single parameter at a time, keeping all others fixed. This process cycles through all parameters until convergence. As initialization we use constant values (a_c = −7, b_c = 0). During line search we consider 10 equally spaced values (a_c in [−12, −2], b_c in [−10, 10]).

This procedure is guaranteed to converge to a local minimum on the search grid. While this might not be the global optimum, in repeated trials we found the results to be rather insensitive to initialization. Furthermore, in our experiments the number of iterations was roughly proportional to the number of parameters.
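The coordinate descent described above can be sketched as follows; loss_fn stands for an evaluation of Eq. (3) or (4) on the full training set for given calibration parameters, and is assumed rather than implemented here.

```python
import itertools
import numpy as np

def joint_calibration(loss_fn, n_classes, n_cycles=10):
    """Coordinate descent over the calibration parameters (a sketch).
    loss_fn(a, b) evaluates the pixel-level loss for the full parameter
    vectors a, b (length n_classes each)."""
    a = np.full(n_classes, -7.0)          # initialization used in the paper
    b = np.zeros(n_classes)
    a_grid = np.linspace(-12, -2, 10)     # 10 equally spaced line-search values
    b_grid = np.linspace(-10, 10, 10)
    for _ in range(n_cycles):             # in practice: cycle until convergence
        for c, (param, grid) in itertools.product(
                range(n_classes), [(a, a_grid), (b, b_grid)]):
            best_val, best_loss = param[c], loss_fn(a, b)
            for v in grid:                # line search on a single coordinate
                param[c] = v
                l = loss_fn(a, b)
                if l < best_loss:
                    best_val, best_loss = v, l
            param[c] = best_val
    return a, b
```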

Efficient evaluation. On a typical training set with C = 30 classes, our joint calibration procedure evaluates the loss thousands of times. Hence, it is important to evaluate pixel-level accuracy quickly. As the model involves a maximum over classes and a maximum over regions at every pixel, a naive per-pixel implementation would be prohibitively expensive. Instead, we propose an efficient technique that exploits the nature of the Selective Search region proposals [37], which form a bottom-up hierarchy starting from superpixels. As shown in Fig. 3, we start from the region proposal that contains the entire image (root node). Then we propagate the maximum score over all classes down the region hierarchy. Eventually we assign to each superpixel (leaf nodes) the label with the highest score over all regions that contain it. This label is assigned to all pixels in the superpixel. To compute class-average pixel accuracy, we normally need to compare each pixel label to the ground-truth label. However since we assign the same label to all pixels in a superpixel, we can precompute the ground-truth label distribution for each superpixel and use it as a lookup table. This reduces the runtime complexity for an image from O(P_i · R_i · C) to O(R_i · C), where P_i and R_i are the number of pixels and regions in an image respectively, and C is the number of classes.
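A sketch of this top-down propagation over the Selective Search hierarchy; children, region_scores and region_labels are assumed to be precomputed per image (all names are illustrative):

```python
def label_superpixels(children, region_scores, region_labels, root=0):
    """Propagate the best (score, label) from the root of the region
    hierarchy down to its leaves (superpixels), as in Fig. 3 (a sketch).

    children[r]:      list of child region indices of region r ([] for leaves)
    region_scores[r]: best calibrated class score of region r
    region_labels[r]: class attaining that score
    Returns {leaf region index: class label}."""
    leaf_labels = {}
    stack = [(root, region_scores[root], region_labels[root])]
    while stack:
        r, best_score, best_label = stack.pop()
        # a region keeps its ancestors' best score unless it beats it itself
        if region_scores[r] > best_score:
            best_score, best_label = region_scores[r], region_labels[r]
        if not children[r]:               # leaf node = superpixel
            leaf_labels[r] = best_label
        else:
            for child in children[r]:
                stack.append((child, best_score, best_label))
    return leaf_labels
```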

Why no Platt scaling. At this point the reader may wonder why we do not simply use Platt scaling [25] as is commonly done in many applications. Platt scaling is used to convert SVM scores to range [0,1] using sigmoid functions, as in Eq. (2). However, in Platt scaling the parameters a_c, b_c are optimized for each class in isolation, ignoring class competition. The loss function L_c in Platt scaling is the cross-entropy function

L_c(σ_c, l) = − Σ_r [ t_{r,c} log(σ_c(x_r)) + (1 − t_{r,c}) log(1 − σ_c(x_r)) ]    (5)

where N_+ is the number of positive samples, N_− the number of negative samples, and t_{r,c} = (N_+ + 1)/(N_+ + 2) if l_r = c, or t_{r,c} = 1/(N_− + 2) otherwise; l_r is the region-level label. This loss function is inappropriate for semantic segmentation because it is defined in terms of accuracy of training samples, which are regions, rather than in terms of the final pixel-level accuracy. Hence it ignores the problem of overlapping regions. There is also no inverse frequency term to deal with class imbalance. We experimentally compare our method with Platt scaling in Sec. 4.
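For comparison, a sketch of the per-class Platt scaling objective of Eq. (5), which we argue is the wrong criterion for semantic segmentation (scores, labels and the function name are illustrative):

```python
import numpy as np

def platt_scaling_loss(scores, labels, a, b):
    """Per-class Platt scaling objective of Eq. (5) for one class (a sketch).
    scores: raw SVM scores w_c . x_r of all regions
    labels: boolean array, True where the region-level label equals the class."""
    n_pos, n_neg = labels.sum(), (~labels).sum()
    # smoothed targets used by Platt scaling
    t = np.where(labels, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))
    sigma = 1.0 / (1.0 + np.exp(a * scores + b))
    return -np.sum(t * np.log(sigma) + (1 - t) * np.log(1 - sigma))
```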

3.4 Implementation Details

Region proposals. We use Selective Search [37] region proposals using a subset of the “Fast” mode: we keep the similarity measures, but we restrict the scale parameter k to 100 and the color-space to RGB. This leads to two bottom-up hierarchies of one initial oversegmentation [8].

Features. We show experiments with features generated by two CNNs (AlexNet [15], VGG16 [32]) using the Caffe implementations [13]. We use the R-CNN [11] framework for AlexNet, and Fast R-CNN (FRCN) [10] for VGG16, in order to maintain high computational efficiency. Regions are described using all pixels in a tight bounding box. Since regions are free-form, Girshick et al. [11] additionally propose to set pixels not belonging to the region to zero (i.e. not affecting the convolution). However, in our experiments this did not improve results so we do not use it. For the weakly supervised setting we use the CNNs pre-trained for image classification on ILSVRC 2012 [26]. For the fully supervised setting we finetune them on the training set of SIFT Flow [18] (i.e. the semantic segmentation dataset we experiment on). For both settings, following [11] we use the output of the fc6 layer of the CNN as features.

SVM training. Like [11] we set the regularization parameter C to a fixed value in all our experiments. The SVMs minimize the L2-loss for region classification. We use hard-negative mining to reduce memory consumption.

4 Experiments

Datasets. We evaluate our method on the challenging SIFT Flow dataset [18]. It consists of 2488 training and 200 test images, pixel-wise annotated with 33 class labels. The class distribution is highly imbalanced in terms of overall region count as well as pixel count. As evaluation measure we use the popular class-average pixel accuracy [7, 19, 22, 29, 33, 36, 40, 42, 43, 45]. For both supervision settings we report results on the test set.


Fully supervised               Class Acc.
Byeon et al. [2]               22.6%
Tighe et al. [33]              29.1%
Pinheiro et al. [22]           30.0%
Shuai et al. [31]              39.7%
Tighe et al. [35]              41.1%
Kekeç et al. [14]              45.8%
Sharma et al. [28]             48.0%
Yang et al. [43]               48.7%
George et al. [9]              50.1%
Farabet et al. [7]             50.8%
Long et al. [19]               51.7%
Sharma et al. [29]             52.8%
Ours SVM (AlexNet)             28.7%
Ours SVM+PS (AlexNet)          27.7%
Ours SVM+JC (AlexNet)          55.6%
Ours SVM+JC (VGG16)            59.2%

Weakly supervised              Class Acc.
Vezhnevets et al. [39]         14.0%
Vezhnevets et al. [40]         21.0%
Zhang et al. [44]              27.7%
Xu et al. [41]                 27.9%
Zhang et al. [45]              32.3%
Xu et al. [42]                 35.0%
Xu et al. [42] (transductive)  41.4%
Ours SVM (AlexNet)             21.2%
Ours SVM+PS (AlexNet)          16.8%
Ours SVM+JC (AlexNet)          37.4%
Ours SVM+JC (VGG16)            44.8%

Table 1: Class-average pixel accuracy in the fully supervised (left) and the weakly supervised (right) setting. We show results for our model on the test set of SIFT Flow using uncalibrated SVM scores (SVM), traditional Platt scaling (PS) and joint calibration (JC).

Fully supervised setting. Table 1 evaluates various versions of our model in the fully supervised setting, and compares to other works on SIFT Flow. Using AlexNet features and uncalibrated SVMs, our model achieves a class-average pixel accuracy of 28.7%. If we calibrate the SVM scores with traditional Platt scaling results do not improve (27.7%). Using our proposed joint calibration to maximize class-average pixel accuracy improves results substantially to 55.6%. This shows the importance of joint calibration to resolve conflicts between overlapping regions at multiple scales, to take into account competition between classes, and generally to optimize a loss mirroring the evaluation measure.

Fig. 4 (column “SVM”) shows that larger background regions (i.e. sky, building) swallow smaller foreground regions (i.e. boat, awning). Many of these small objects become visible after calibration (column “SVM+JC”). This issue is particularly evident when working with overlapping regions. Consider a large region on a building which contains an awning. As the surface of the awning is small, the features of the large region will be dominated by the building, leading to a strong classification score for the ‘building’ class. When these are higher than the classification score for ‘awning’ on the small awning region, the latter gets overwritten. Instead, this problem does not appear when working with superpixels [1]. A superpixel is either part of the building or part of the awning, so a high scoring awning superpixel cannot be overwritten by neighboring building superpixels. Hence, joint calibration is particularly important when working with overlapping regions.


Regions    Class Acc.
FH [8]     43.4%
SS [37]    55.6%

Table 2: Comparison of single-scale (FH) and multi-scale (SS) regions using SVM+JC (AlexNet).

Finetuned  Class Acc.
no         49.4%
yes        55.6%

Table 3: Effect of CNN finetuning in the fully supervised setting using SVM+JC (AlexNet).

Using the deeper VGG16 CNN the results improve further, leading to our final performance of 59.2%. This outperforms the state-of-the-art [29] by 6.4%.

Weakly supervised setting. Table 1 shows results in the weakly supervised setting. The model with AlexNet and uncalibrated SVMs achieves an accuracy of 21.2%. Using traditional Platt scaling the result is 16.8%, again showing it is not appropriate for semantic segmentation. Instead, our joint calibration almost doubles accuracy (37.4%). Using the deeper VGG16 CNN results improve further to 44.8%.

Fig. 5 illustrates the power of our weakly supervised method. Again rare classes appear only after joint calibration. Our complete model outperforms the state-of-the-art [42] (35.0%) in this setting by 9.8%. Xu et al. [42] additionally report results on the transductive setting (41.4%), where all (unlabeled) test images are given to the algorithm during training.

Region proposals. To demonstrate the importance of multi-scale regions, we also analyze oversegmentations that do not cover multiple scales. To this end, we keep our framework the same, but instead of Selective Search (SS) [37] region proposals we used a single oversegmentation using the method of Felzenszwalb and Huttenlocher (FH) [8] (for which we optimized the scale parameter). As Table 2 shows, SS regions outperform FH regions by a good margin of 12.2% in the fully supervised setting. This confirms that overlapping multi-scale regions are superior to non-overlapping oversegmentations.

CNN finetuning. As described in Sec. 3.4 we finetune our network for detection in the fully supervised case. Table 3 shows that this improves results by 6.2% compared to using a CNN trained only for image classification on ILSVRC 2012.

5 Conclusion

We addressed three common problems in semantic segmentation based on region proposals: (1) overlapping regions yield conflicting class predictions at the pixel-level; (2) class-imbalance leads to classifiers unable to detect rare classes; (3) one-vs-all classifiers do not take into account competition between multiple classes. We proposed a joint calibration strategy which optimizes a loss defined over the final pixel-level output labeling of the model, after maximization over classes and regions. This tackles all three problems: joint calibration deals with multi-class predictions, while our loss explicitly deals with class imbalance and is defined in terms of pixel-wise labeling rather than region classification accuracy. As a result we take into account conflict resolution between overlapping regions. Our method outperforms the state-of-the-art in both the fully and the weakly supervised setting on the popular SIFT Flow [18] benchmark.

Acknowledgements. Work supported by the ERC Starting Grant VisCul.


Figure 4: Fully supervised semantic segmentation on SIFT Flow. We present uncalibrated SVM results (SVM) and jointly calibrated results (SVM+JC), both with VGG16. Columns: image, ground-truth, SVM, SVM+JC.


Figure 5: Weakly supervised semantic segmentation on SIFT Flow. We present uncalibrated SVM results (SVM) with AlexNet, jointly calibrated results (SVM+JC) with AlexNet, and with VGG16. Columns: image, ground-truth, SVM (AlexNet), SVM+JC (AlexNet), SVM+JC (VGG16).


References

[1] X. Boix, J. Gonfaus, J. van de Weijer, A. D. Bagdanov, J. Serrat, and J. Gonzàlez. Harmony potentials: Fusing global and local scale for semantic image segmentation. IJCV, 2012.

[2] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with LSTM recurrent neural networks. In CVPR, 2015.

[3] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, 2010.

[4] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.

[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.

[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.

[7] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Trans. on PAMI, 35(8):1915–1929, 2013.

[8] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 2004.

[9] M. George. Image parsing with a wide range of classes and scene-level context. In CVPR, 2015.

[10] R. Girshick. Fast R-CNN. arXiv, 1504.08083v1, 2015.

[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

[12] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.

[13] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.

[14] T. Kekeç, R. Emonet, E. Fromont, A. Trémeau, and C. Wolf. Contextually constrained deep networks for scene labeling. In BMVC, 2014.

[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[16] F. Li, J. Carreira, G. Lebanon, and C. Sminchisescu. Composite statistical inference for semantic segmentation. In CVPR, 2013.

[17] G. Lin, C. Shen, I. Reid, and A. van den Hengel. Efficient piecewise training of deep structured models for semantic segmentation. arXiv preprint arXiv:1504.01013, 2015.

[18] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing via label transfer. IEEE Trans. on PAMI, 33(12):2368–2382, 2011.

[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

[20] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015.

[21] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. arXiv preprint arXiv:1502.02734, 2015.

[22] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene parsing. In ICML, 2014.

[23] P. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.

[24] N. Plath, M. Toussaint, and S. Nakajima. Multi-class image segmentation using conditional random fields and global classification. In ICML, 2009.

[25] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 1999.

[26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.

[27] A. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv preprint arXiv:1503.02351, 2015.

[28] A. Sharma, O. Tuzel, and M.-Y. Liu. Recursive context propagation network for semantic scene labeling. In NIPS, pages 2447–2455, 2014.

[29] A. Sharma, O. Tuzel, and D. W. Jacobs. Deep hierarchical parsing for semantic segmentation. In CVPR, 2015.

[30] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling appearance, shape and context. IJCV, 81(1):2–23, 2009.

[31] B. Shuai, G. Wang, Z. Zuo, B. Wang, and L. Zhao. Integrating parametric and non-parametric models for scene labeling. In CVPR, 2015.

[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[33] J. Tighe and S. Lazebnik. SuperParsing: Scalable nonparametric image parsing with superpixels. In ECCV, 2010.

[34] J. Tighe and S. Lazebnik. Understanding scenes on many levels. In ICCV, 2011.

[35] J. Tighe and S. Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. In CVPR, 2013.

[36] J. Tighe, M. Niethammer, and S. Lazebnik. Scene parsing with object instances and occlusion ordering. In CVPR, 2014.

[37] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. IJCV, 2013.

[38] J. Verbeek and B. Triggs. Region classification with Markov field aspect models. In CVPR, 2007.

[39] A. Vezhnevets, V. Ferrari, and J. M. Buhmann. Weakly supervised semantic segmentation with multi image model. In ICCV, 2011.

[40] A. Vezhnevets, V. Ferrari, and J. M. Buhmann. Weakly supervised structured output learning for semantic segmentation. In CVPR, 2012.

[41] J. Xu, A. Schwing, and R. Urtasun. Tell me what you see and I will show you where it is. In CVPR, 2014.

[42] J. Xu, A. G. Schwing, and R. Urtasun. Learning to segment under various forms of weak supervision. In CVPR, 2015.

[43] J. Yang, B. Price, S. Cohen, and Y. Ming-Hsuan. Context driven scene parsing with attention to rare classes. In CVPR, 2014.

[44] L. Zhang, Y. Gao, Y. Xia, K. Lu, J. Shen, and R. Ji. Representative discovery of structure cues for weakly-supervised image segmentation. IEEE Transactions on Multimedia, 16(2):470–479, 2014.

[45] W. Zhang, S. Zeng, D. Wang, and X. Xue. Weakly supervised semantic segmentation for social images. In CVPR, 2015.

[46] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. arXiv preprint arXiv:1502.03240, 2015.