
MEDIZINISCHE FAKULTÄT

RHEINISCH-WESTFÄLISCHE TECHNISCHE HOCHSCHULE AACHEN

INSTITUT FÜR MEDIZINISCHE INFORMATIK

Geschäftsführender Direktor: Universitätsprofessor

DR. DR. KLAUS SPITZER

Communicated by

Priv.-Doz. Dr. rer. nat. Dipl.-Ing. Thomas Lehmann

Thesis

DISCRIMINATIVE TRAINING METHODS FOR THE GENERALIZED HOUGH TRANSFORM

Ana Belén Martín Recuero

Supervisors

Priv.-Doz. Dr. rer. nat. Dipl.-Ing. Thomas Lehmann and

Dr. Hauke Schramm and Dr. Peter Beyerlein

Philips Research Laboratories Aachen

23rd January 2007


Life is an optimization problem...

experiences are observations,

family is a constant prior

and love a random variable.


Declaration

I hereby declare that I have written this diploma thesis on my own. Only the sources and aids explicitly named in this thesis have been used. Ideas taken over verbatim or in essence from other sources have been marked as such.

Place, date                                        Signature


Acknowledgements

First of all, I would like to express my gratitude to both of my advisors at Philips Research Laboratories Aachen, Dr. Hauke Schramm and Dr. Peter Beyerlein. From the scientific perspective, this piece of research would never have seen the light without their constant support and expert guidance. I would especially like to thank Dr. Hauke Schramm for his never-ending interest in my work, his constructive feedback and his heartfelt motivation to achieve our goals even in the most difficult moments of the project. From a personal point of view, his kindness, patience and incessant encouragement have earned my most sincere respect and admiration. I would also like to express my gratitude to Dr. Peter Beyerlein. I am indebted to him for introducing me to the challenging field of discriminative training. It would have been rather difficult to undertake this challenge without his overwhelming expertise in the topic.

I want to express my gratitude to Prof. Dr. Thomas Lehmann, my internal advisor at RWTH Aachen. I am very thankful to him for accepting the supervision of my master thesis and for giving me the opportunity to work at Philips on such an interesting topic in the field of medical imaging. I would also like to thank him for allowing me to work independently and for his constructive feedback. My gratitude also goes to Benedikt Fischer from the Institut für Medizinische Informatik, Universitätsklinikum Aachen, for his collaboration with Thomas Lehmann in matters regarding my thesis.

I would like to thank all the staff and students at Philips for creating a familiar, cosy atmosphere during my internship. I was pleased to repeat the experience for my master thesis. I am very grateful to all of them for integrating me at Philips and letting me experience a research atmosphere in an industrial company. To Salvador, Roger and Helko for their good humor, which relaxed me, and for their "lunch discussions". My most sincere thanks to them for their wise advice about this project and for cheering me up unconditionally throughout the course of this master thesis.

I would also like to express my sincere appreciation to the project colleagues, Dr. Jochen Peters and Dr. Olivier Ecabert, for helping me out with questions in Dr. Schramm's absence.

This master thesis marks the end of a long period of studies that began seven years ago in Madrid. I cannot forget all the people who have been part of my life and have contributed deeply to my well-being during those years. They are too many to mention here, but they all have a place in my memory and my heart. Neither will I forget my friends here, for giving me an unforgettable experience in Germany.

I could not finish my acknowledgements without deeply thanking my whole family. Their support has given me the energy necessary to be so far away from home these last two years. Thanks to my siblings Jesús and Vicky, and thanks to Martín, Sidd and Víctor for their encouraging mails and phone calls.

My last thanks go to my mother for believing in me and for instilling courage and perseverance in my studies. Sadly, she will not enjoy my latest achievements, but the way she educated me and her love will always remain with me and will make the difference in my personal and professional development.

Heartfelt thanks to everyone for always being with me, wherever I may be.


Summary

This master thesis presents two supervised discriminative training techniques for the optimization of the Generalized Hough Transform (GHT) [1]. An automatic procedure for detecting and segmenting anatomical objects in 3D images is necessary for achieving a high level of automation in medical applications. The GHT is a technique extensively used to detect objects in an image. Its major advantages are the possibility of addressing new objects with little manual effort and its robustness to noise and to partial object occlusions.

The GHT relies on two main knowledge sources:

1. The remaining edges after preprocessing the image.

2. A shape model of the target object, represented by a subset of model points referenced to, e.g., the center of gravity of the object.

The GHT matches the model points of the shape model against each of the remaining edges in the target image to find the reference point of the object in the image.

This master thesis aims at optimizing both GHT knowledge sources. The optimization procedures are applied to detect the right femur in Computed Tomography (CT) images containing the lower extremities and the pelvic region.

The state-of-the-art for 3D anatomical object detection using the GHT [2] reduces the computational complexity by (1) restricting the flexibility of the shape model (e.g. no rotation, scaling or inter-individual shape model variability is considered) and (2) restricting the number of model points to a randomly chosen subset of the initial shape model.

Despite the object detection success rate reported in [2], other solutions can be proposed in order to improve the trade-off between the detection accuracy and the computational cost of the GHT. In this master thesis, discriminative training methods are applied to carry out a smart selection of those edges and model points that contribute to a high detection accuracy.

Additionally, the current GHT-preprocessing framework and shape model generation require substantial user interaction. The former requires setting a manual threshold on grey value and gradient magnitude to reduce the number of considered edges in the image. The latter requires a manual delineation of each new object to be detected. In this master thesis, both the GHT-preprocessing framework and the shape model generation are also automated.

The optimization approaches that were proposed at the beginning of this master thesis are:

1. Automatic discriminative voxel classifier.

This system aims at reducing the number of voxels coming from concurrent shapes while keeping as many edge voxels of the target object as possible.

The problem consists of classifying each voxel in an image into one of two classes: object vs. non-object. The class object (ground truth) is defined as the object edges contained in a bit volume mask. A set of image features is extracted for each voxel to be classified. These features can be log-linearly combined into a classifier [3], which can be trained to optimally integrate all weighted features according to their discrimination capability within the specific problem. Image features with larger absolute weight values have a stronger influence on the overall classification. A subset of the most influential (or discriminative) features is selected, reaching a compromise between the total number of selected features and the absolute classification error.

Results showed a classification performance in line with the quality of the features introduced into the classifier. Microscopic features of a voxel (e.g. the mean grey value of the 27-neighborhood) do not help to distinguish macroscopic properties (e.g. left vs. right femur). We compared the manual tuning set-up, given the parameters currently used, with one possible discriminative voxel classifier set-up. For similar image compression rates, slightly more object voxels remained. However, further comparison experiments should be performed to generalize the goodness of this new method with respect to the current manual threshold approach.

Features could be selected according to their discrimination capability. However, the optimization algorithm does not take feature dependencies into account when optimizing the weights. A further criterion, based on the observation of constant misclassification rates, is proposed to remove feature redundancy.

2. Automatic learning of discriminative shape models for the GHT.

This problem can be addressed by interpreting the points of the shape model as individual knowledge sources (or models) [4]. Following the maximum-entropy principle, the dependency on the input models can be decomposed into "smaller" dependencies that are log-linearly combined into the decision function. A discriminative optimization of the model weights allows determining the importance of specific points for the classification. Thereby, it is possible to eliminate the less important (or even confusing) points in the shape model.

In this master thesis, we propose to create a random cloud of model points and provide it with the necessary discrimination potential to make the model feasible for 3D object detection using the GHT. Each model point in the cloud obtains a weight based on its capability to discriminate the target object in the image. Thereafter, we select the model points in the cloud which minimize the error count in the object localization. Thereby, the GHT computational complexity is reduced until detection (segmentation) failure is encountered.

This resulted in shape models built from scratch containing 100 model points. A first validation was carried out on right femur detection in CT. A successful segmentation was possible on six of seven test images, given the outcome of the GHT as the initial pose. In the failed case, only a small displacement occurred.

This master thesis presents a novel method to detect anatomical objects discriminatively. This is a clear advantage for object detection tasks where other concurrent shapes are present. The new concept of the shape model contains not only information about the target object but also information about the remaining edges in the image. The most discriminative parts of the object or its surroundings contribute positive votes during the GHT algorithm; this "shape" information helps to localize the object at its correct position. The most discriminative parts of other concurrent shapes in the image contribute negative votes during the GHT algorithm; this "anti-shape" information reduces the probability of a wrong classification in favour of the concurrent shapes.


List of Tables

4.1 Feature selection given the initial features: gradient magnitude, prior class probability and 27-neighborhood grey values.

4.2 Results of classification using location features x, y, z.

4.3 Effects of different class priors during training on a test image for MR-heart voxel classification.

4.4 Effects of different class priors during training on the training image for the femur voxel classifier.

4.5 Feature selection with binary gradient and grey profile features.

4.6 Classifier performance on test data. The features selected are the mean, x and z grey profiles. k0 → k0 corresponds to the specificity and k1 → k1 to the sensitivity of the system.

4.7 Performance of the manual threshold approach using a threshold window of [1000-5000] on grey value and gradient magnitude. k0 → k0 corresponds to the specificity and k1 → k1 to the sensitivity of the system.

4.8 Comparison of the performance of the discriminative GHT-based voxel classifier with the manual threshold preprocessing.

4.9 Mean and standard deviation of validation metrics for 10 CT images. Manual threshold approach within the range [1000-5000]; discriminative voxel classifier with 27-neighborhood mean grey, grey z profile and grey x profile features.

5.1 DGHT performance along with model point reduction. Γ(k, kn) is the Euclidean distance.

5.2 Comparison of GHT and DGHT.

6.1 Performance of the DGHT using 1000, 500, 250, 100 and 50 cloud model points. Images ct1, ct2, ct3 were used for training.

6.2 Mean and standard deviation of validation metrics for seven patients.


List of Figures

2.1 Object representation for the GHT.

2.2 Unknown correspondence between model points and edge points.

2.3 Voting procedure using gradient directions.

2.4 Structure of a pattern recognition system. Based on [8].

2.5 Design cycle of a classifier [8].

3.1 Preprocessed image using the current GHT-preprocessing framework.

3.2 Features x1 and x2 with respect to the voxel of class k for a classifier with two features, p(k|x1, x2).

3.3 Grey value features in triangulated distributions to overcome non-homogeneity within the target object.

3.4 Grey value features in triangulated distributions to overcome non-continuity of features within a class.

3.5 Effects in voxel classification without normalization (right image).

4.1 Index numbers of the 29 features referred to the voxel to classify (in the red square).

4.2 Bone voxel classifier using gradient magnitude information and the grey value of the voxel. Black means object and grey non-object.

4.3 Voxel classification on a training image using location features x, y, z. The brighter a voxel is, the larger the probability of the voxel being classified as object.

4.4 Effects of different prior class probabilities during training (k0/k1 ratios) on a test image for MR-heart voxel classification. Green represents class k1 and red class k0.

4.5 Effects of different prior class probabilities during training (k0/k1 ratios) on left femur voxel classification. Green represents class k1 and red class k0.

4.6 Bone voxel classifier using binary features of gradient and grey profiles. Green means object and red non-object.

4.7 Bone voxel classifier on test image ct1 using binary features of gradient and grey profiles. Green means object and red non-object.

4.8 Comparison of manual threshold preprocessing and the discriminative GHT-based voxel classifier.

5.1 Triangulated surface model for right femur detection.

5.2 GHT framework. From top to bottom and from left to right: original image, image preprocessed by manual threshold [1000-5000], and HS after applying the femur-based GHT.

5.3 Split HS: (a) complete HS, (b, c) HS from individual model points.

5.4 Detection example of the lower part of a semicircle.

5.5 GHT procedure to find semicircles using a circle as HM.

5.6 Model point selection for the problem in figure 5.4. Red means negative λ value, green positive λ value, yellow model point not selected.

5.7 HS after applying the SGHT and the DGHT using the circle shape models shown under the HS images. Green means positive and red negative model point.

6.1 HS using all model points (1024) from the femur shape model (longitudinal perspective): left GHT, right DGHT.

6.2 Model point selection from the femur shape: the 50 most discriminative points selected out of the 1024 contained in the shape.

6.3 Random cloud of 10000 model points.

6.4 HS after applying the GHT and DGHT with 1000 random model points.

6.5 Model point selection (longitudinal perspective), from left to right and from top to bottom: the whole image with 1000 model points selected; selections of 1000, 500, 250 and 100 model points, respectively. Red means negatively and green positively weighted.

6.6 Model point selection (radial perspective), from left to right and from top to bottom: selections of 1000, 500, 250 and 100 model points. Red means negatively and green positively weighted.

6.7 The HS of one slice after applying the DGHT with an optimized selection of model points (longitudinal perspective), from left to right and from top to bottom: 1000, 500, 250, 100 and 50 model points.

6.8 Shape and anti-shape.


Contents

1 Introduction
  1.1 Introduction
  1.2 Objectives of this master thesis

2 Theoretical background
  2.1 Object detection using the Generalized Hough Transform
    2.1.1 Hough Transform
    2.1.2 Generalized Hough Transform
    2.1.3 Generalized Hough Transform for 3D Object Detection
  2.2 Pattern recognition theory
    2.2.1 Pattern recognition fundamentals and terminology
    2.2.2 Bayes' decision rule
    2.2.3 Maximum Entropy
    2.2.4 Discriminative Training
    2.2.5 Discriminative Model Combination
  2.3 Literature review

3 Discriminative training methods for a GHT-based voxel classification
  3.1 Scope of the current work
  3.2 Overview of the system
  3.3 Training Data Collection
  3.4 Features
  3.5 Model choice and Minimum Error Classification
  3.6 Image Feature Selection
  3.7 Classification

4 Implementation and evaluation of a GHT-based voxel classifier
  4.1 Implementation
  4.2 Medical data
  4.3 Evaluation
    4.3.1 Selection of image features
    4.3.2 Location features
    4.3.3 Prior probability effect in classification
    4.3.4 Testing a GHT-based voxel classifier
  4.4 Discussion on the experimental results

5 Discriminative Training Methods for the GHT-Model Optimization
  5.1 Scope of the current work: Object-based model
  5.2 Overview of the system
  5.3 Automatic Shape model generation
  5.4 Training Data Collection
  5.5 Features: Model point dependent distributions
  5.6 Model Choice
  5.7 Minimum Classification Error Training
  5.8 Model Point Selection
  5.9 Model Point selection from a cloud of model points
  5.10 Classification
  5.11 Discriminative-GHT (DGHT)

6 Implementation and evaluation of a learn-model for the GHT
  6.1 Implementation
  6.2 Medical data
  6.3 Evaluation
    6.3.1 Shape model-based optimization
    6.3.2 Generation of a GHT-based "shape" model from scratch
  6.4 Discussion on the experimental results

7 Conclusions and Outlook

Appendices

A Sobel Edge Operator
B Gaussian distributions and Maximum Entropy models
C Discriminative multi-object detection using multivariate Gaussian distributions for Hough Model Optimization
D Acronyms
E Mathematical symbols

Bibliography


1 Introduction

1.1 Introduction

An automatic procedure for detecting and segmenting anatomical objects in 3-D images is necessary for achieving a high level of automation in many medical applications, like orthopedic planning or diagnostic viewing of angiographic images. Today's segmentation techniques typically rely on user input for initialization and do not allow for a fully automatic workflow. Although specific localization algorithms exist for some anatomical objects like the heart or the lung, a general detection technique that can easily be adapted to new objects is lacking.

In this work, the Generalized Hough Transform (GHT) [1] is used for detecting anatomical objects with a well-defined shape in 3-D images. The GHT is a shape-based technique for the localization of well-defined arbitrary objects. It has frequently been applied to 2-D images, usually achieves high detection rates, allows addressing new objects with very little manual effort, and is robust to partial occlusion and noise. However, the computational complexity and memory requirements of the GHT are generally huge, especially when considering 3-D images and the detection of natural shapes. As a consequence, the GHT has hardly been used for real 3-D applications in the past.

Recently it has been demonstrated [2] that the GHT can be used for a coarse 3-D delineation of anatomical objects with well-defined shape in medical images. However, the trade-off between computational complexity and detection performance can be improved. Solutions beyond a simple restriction of the GHT flexibility and a random sampling of the model points (as proposed in [2]) are desirable.

Since the considered remaining image edges (voxels) and the selected model points in the shape model directly influence the accuracy and computational complexity of the GHT, techniques are required which allow for a "smart" selection of those points without decreasing the detection performance. Intuitively, a good choice would be to eliminate model and image parts which do not contribute to a high detection accuracy.

This master thesis studies discriminative training techniques to determine (and eliminate) voxels which are probably not part of the target object, and to remove or decrease the influence of model points (or model parts) that do not help to achieve a correct object detection.

This work has been carried out at Philips Forschungslaboratorien Aachen (PFLA). Therefore, it focuses on successful implementations and possible further applications.


1.2 Objectives of this master thesis

The main objective of this master thesis is to improve the trade-off between the computational complexity of the GHT and the object detection accuracy. Shape model-based 3D object detection requires large memory storage and high computational effort. Besides, shape models might partially match other concurrent shapes in the image, which introduces the possibility of a wrong object detection in favor of the concurrent shapes.

The current GHT-preprocessing framework is not discriminative either. That is, it does not eliminate edges coming from concurrent shapes; on the contrary, it might eliminate unique parts of the object that could help to detect it unequivocally.

This master thesis proposes a "smart" selection of edges and model points. The GHT computational complexity and accuracy can be improved if the selected edges and model points are those most decisive for achieving a high detection accuracy.

Thus, we firstly aim at building a system that learns the most important (or discriminative) parts of the shape model. Secondly, we aim at building a system that eliminates the maximum number of concurrent edges while keeping as many object edge voxels as possible. Such systems would improve the trade-off between GHT computational complexity and detection accuracy.

The GHT framework currently existing at PFLA requires substantial user interaction: firstly, to set a manual threshold based on grey value and gradient magnitude to reduce the number of considered edges in an image before applying the GHT; secondly, to repeat the current GHT shape model generation for each new object to detect. Therefore, we also aim at automating both procedures during this master thesis.

Both of these approaches should not increase the GHT computational complexity and should improve, or at least maintain, the GHT accuracy (object detection performance).


2 Theoretical background

This chapter provides a quick overview of some key concepts concerning pattern recognition, Bayes' decision rule, discriminative training, maximum entropy and the GHT. We orient these concepts towards our specific application, so as to establish a solid base for understanding the rest of this master thesis.

2.1 Object detection using the Generalized Hough Transform

2.1.1 Hough Transform

The Hough transform (HT) [5] and its variant, the Generalized Hough Transform (GHT) [1], are very well-known techniques for detecting analytical curves and for locating shapes, respectively. In this thesis, the GHT is used to locate anatomical objects in Computed Tomography (CT) images. The main concepts of both the HT and the GHT are explained here, followed by 3D object detection using the GHT, which is the starting point of this master thesis.

The Hough transform is a method for detecting curves by determining which of them pass through most of the points in an image. In order to determine whether two points lie on the same curve, a parametric representation is necessary. Every point in an image votes for one of the quantized transformation parameter sets of the curve. We will refer to the quantized parameter space as the Hough Space (HS). The transformation parameters with the most votes determine which curve fits the data in the image most closely. Let us illustrate the principle through an example. A line is represented by the parametric equation

r = x cos θ + y sin θ (2.1.1)

As usual, an edge operator (e.g. Sobel) is applied to the image to obtain boundary information. An edge pixel with coordinates (x0, y0) votes for certain transformation parameters (r, θ). If two points (x0, y0) and (x1, y1) belong to the same line, they vote for the same parameters:

r(θ) = x0 cos θ + y0 sin θ = x1 cos θ + y1 sin θ (2.1.2)

Alternatively, O'Gorman and Clowes [6] suggest the use of gradient information to restrict the edges that vote for a curve, which improves computational complexity and performance. (See the GHT voting procedure using gradient direction in figure 2.3.)
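To make the voting procedure concrete, the following is a minimal sketch of line detection with the Hough transform of equation 2.1.1, written in Python/NumPy. The function name and the accumulator resolution are our own illustrative choices, and the gradient-based restriction of [6] is omitted for brevity.

```python
import numpy as np

def hough_lines(edge_img, n_theta=180, n_r=200):
    """Minimal Hough transform for lines, r = x cos(theta) + y sin(theta).

    edge_img: 2-D binary array, nonzero at edge pixels.
    Returns the accumulator (the Hough space) and the parameter axes.
    """
    h, w = edge_img.shape
    r_max = np.hypot(h, w)                       # largest possible |r|
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rs = np.linspace(-r_max, r_max, n_r)
    acc = np.zeros((n_r, n_theta), dtype=np.int32)

    ys, xs = np.nonzero(edge_img)                # coordinates of edge pixels
    for x, y in zip(xs, ys):
        # each edge pixel votes for every (r, theta) it is consistent with
        r = x * np.cos(thetas) + y * np.sin(thetas)
        r_idx = np.round((r + r_max) / (2 * r_max) * (n_r - 1)).astype(int)
        acc[r_idx, np.arange(n_theta)] += 1
    return acc, rs, thetas

# The detected line is the accumulator cell with the most votes:
# i_r, i_t = np.unravel_index(acc.argmax(), acc.shape)
```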


2.1.2 Generalized Hough Transform

A simple way of detecting objects in images is to utilize object boundary information. The GHT [1, 7] uses the shape information of an object to transform edges in an image into a probabilistic set of parameters.

Figure 2.1: Object representation for the GHT.

Thus, arbitrary shapes can be detected in images without the need for a parametric representation. The underlying principle of the GHT is the mapping of the edge space into an accumulator space (which we will refer to as the Hough space (HS)) that describes the transformation parameters of the edges, using an object representation called the R-table. The R-table contains length and orientation information of the edges of an object with respect to a reference point, e.g. the center of mass (figure 2.1). Additionally, scales and rotations can be included at the expense of computational complexity. The GHT algorithm matches every edge in the image with the R-table of the model to find the reference point in a given image (see the problem in figure 2.2).

The HS represents a translation of the reference point and, optionally, quantizations of object scale and rotation.

Every match accumulates a vote for a set of parameters. The set of parameters with the highest number of votes determines the optimal transformation for matching the model to the image. The GHT is robust to partial occlusions, noise, slight object variability and discontinuities in the edges of the image. However, it requires high computational and storage complexity.

The quality of the preprocessed image is decisive for the efficiency of the HT and the GHT: noise and speckle should be suppressed as much as possible, and the visibility of concurrent shapes should be reduced to a minimum.
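The R-table construction and the translation-only voting can be sketched as follows; this is an illustrative reading of the procedure just described (gradient directions given as angles, quantized into bins), not the implementation used in this work.

```python
import numpy as np
from collections import defaultdict

def build_r_table(model_pts, model_grads, ref, n_bins=36):
    """R-table: for each quantized gradient direction, the displacement
    vectors from the boundary points to the reference point."""
    table = defaultdict(list)
    for p, g in zip(model_pts, model_grads):
        b = int(g / (2 * np.pi) * n_bins) % n_bins
        table[b].append(np.asarray(ref) - np.asarray(p))
    return table

def ght_vote(edge_pts, edge_grads, table, shape, n_bins=36):
    """Each edge point votes, via the R-table entries of its own gradient
    direction, for candidate positions of the reference point."""
    acc = np.zeros(shape, dtype=np.int32)
    bounds = np.array(shape)
    for p, g in zip(edge_pts, edge_grads):
        b = int(g / (2 * np.pi) * n_bins) % n_bins
        for d in table[b]:
            c = np.round(np.asarray(p) + d).astype(int)   # candidate reference
            if (c >= 0).all() and (c < bounds).all():
                acc[tuple(c)] += 1
    return acc
```

The argmax of the accumulator is the hypothesized reference-point position; covering scale and rotation would add accumulator dimensions, with the complexity cost mentioned above.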


Figure 2.2: Unknown correspondence between model points and edge points.

2.1.3 Generalized Hough Transform for 3D Object Detection

A GHT approach for 3D object detection has rarely been addressed until now. It becomes even more prohibitive in the case of the localization of natural shapes, since a higher flexibility of the GHT is required. Recently, Schramm et al. [2] proposed a method to detect anatomical objects in 3D medical images. We briefly describe its principles and the implementation status found at the beginning of this master thesis.

Firstly, a preprocessing of the image is carried out to extract the edge and gradient direction information; a grey-value and gradient threshold was used in this case. Thereafter, the GHT procedure itself is applied. For this purpose a shape model is necessary. The current shape model is based on the manual delineation of the shape in a training image to build triangulated surface models. The shape model is used to create a representation of the target object's boundary. This is done by determining the positions of M model points with respect to a reference point (e.g. the center of mass). These model points are stored, together with their corresponding gradient information, in a storage structure that we will call the Hough Model (HM). In this work, an HM of 1024 model points for a left femur is used for the detection of that object in CT images. Presently, these shape models do not cover any inter-individual variability.

Once the image is preprocessed, the object detection procedure is performed using an HM of the target. If we denote the remaining edges as E = {p^e_1, p^e_2, . . . , p^e_{Ne}} and the model points as M = {p^m_1, p^m_2, . . . , p^m_{Nm}}, the procedure consists in applying the transformation

p^e_i = A · p^m_j + t    (2.1.3)

where A is the transformation matrix, which in this case only contains scale information. The algorithm aims at finding the optimal estimate of the translation parameters t = (tx, ty, tz) for matching the given shape model to the remaining edges in the image. Rewriting equation 2.1.3, the HS can be filled using the following formula:

t(p^m_j, p^e_i, A) = p^e_i − A · p^m_j    (2.1.4)

The voting procedure (see figure 2.3) can be repeated for every set of transformation parameters A. The algorithm presented by Schramm et al. [2] restricts the parameters to three different scale representations of the object. An HS is obtained for each considered transformation parameter set. By performing a search in the HS, a maximum peak can be found, which corresponds to the hypothesized solution of the object detection task.

Figure 2.3: Voting procedure using gradient directions.

To speed up the GHT algorithm, only a random subset of model points is used to match all remaining edges in the image. Besides, a gradient direction is available for each voxel, which allows identifying likely correspondences between model points and edge points in the image. Thereby, not all model points need to vote for every edge point in the image. This increases the accuracy of the object localization and speeds up the processing time [1].

The computational complexity of the GHT algorithm is one of the important issues in this master thesis. We measure the computational complexity as

O(NmNeNp) (2.1.5)

where Nm is the number of model points, Ne the number of remaining edges in the image and Np the number of considered transformation matrices A.
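For illustration, the following sketch shows where this cost comes from: equation 2.1.4 is evaluated for every (model point, edge voxel, transformation) triple. The gradient-direction pruning used in the real system is omitted here, and all names are ours.

```python
import numpy as np

def ght_vote_3d(edges, model_pts, scales, hough_shape):
    """Fill one Hough space per transformation matrix A via
    t = p_e - A * p_m (equation 2.1.4); cost O(Nm * Ne * Np)."""
    bounds = np.array(hough_shape)
    spaces = []
    for s in scales:                          # Np considered matrices A
        A = np.diag([s, s, s])                # scale-only transformation
        acc = np.zeros(hough_shape, dtype=np.int32)
        for pe in edges:                      # Ne remaining edge voxels
            for pm in model_pts:              # Nm model points
                t = np.round(pe - A @ pm).astype(int)
                if (t >= 0).all() and (t < bounds).all():
                    acc[tuple(t)] += 1        # vote for translation t
        spaces.append(acc)
    return spaces  # the global peak over all spaces is the detected pose
```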


2.2 Pattern recognition theory

2.2.1 Pattern recognition fundamentals and terminology

To start with, we provide an insight into the fundamentals of pattern recognition. The design cycle followed during this thesis is presented, as well as the terminology used throughout the rest of this master thesis. The concepts exposed here are based on the book [8].

Pattern recognition is a field within machine learning which aims at classifying raw data into one of a set of categorized classes or patterns (see figure 2.4). In order to fix the terminology for the rest of this thesis, let us assume:

• A set of classes (patterns) k = 1, . . . , K.

• An observation to classify, denoted as x.

• A number N of observations. Each observation is denoted as xn, where n = 1, . . . , N.

• The correct class of every observation xn, denoted as kn. The tuple (xn, kn) is used for supervised training and evaluation.

• A feature vector x ∈ R^M, {x1, . . . , xM}, extracted from the observation x.

Figure 2.4: Structure of a pattern recognition system. Based on [8]

This way, a classifier will perform a classification relying on the features extracted from an observation:

r : R^M → {1 . . . K}, x → r(x)    (2.2.1)

For this purpose a decision function is necessary

r : x → r(x) = arg max_k {g(x, k)}    (2.2.2)


where g(x, k) is called the discriminant function and can be defined by many functional forms (e.g. neural networks, Bayes' decision rule, ...). The only prerequisite during the design is the following behavior:

g(x, k) → 1 if k = kn
g(x, k) → 0 if k ≠ kn    (2.2.3)

where a misclassification of an observation into another class can be denoted by a loss, measuredas a distance to the correct class.

One of the milestones in pattern recognition is the training of the classifier. The learning can be supervised (if the classes are labeled or contain a cost for each pattern in the training set) or unsupervised (the system forms the clustering of the classes by itself). Figure 2.5 presents a complete training cycle in pattern recognition.

Figure 2.5: Design cycle of a classifier [8]

1. Collection of data gathers all observations to be used in the training. The amount of data has to be large enough to avoid overgeneralization¹ and overfitting² during the training. The collected data must be representative to ensure optimality on independent test data.

2. Feature choice is highly dependent on the problem. The features should be easy to extract, invariant to transformations, robust enough to tolerate deterministic and stochastic errors, and continuous (e.g. a metric measurement). One interesting aspect to consider is the introduction of noisy features into the classifier. Such features reduce the classifier's performance or add no further information for the classification.

¹Inconsistency in the performance of the classifier when the training samples for a class are too few in comparison with the number of parameters to estimate.

²The learning fits the training data too specifically, leading to a suboptimal classifier on independent test sets.

3. Model choice for the statistical distributions of the features and their combination (e.g. class-dependent probability densities, a priori probabilities, (log-)linear combination).

4. Training a classifier that learns the free parameters of the model selected previously.

5. Evaluation of the system, trade-off between performance and computational complexity.

6. Alternatively, selection of features or variation of the number of classes to observe the performance of the classifier.

2.2.2 Bayes’ decision rule

Bayes' decision rule is a fundamental statistical approach for classification problems. It results in the minimum expected number of classification errors when the true posterior probabilities p(k|x) of class k ∈ {1, ..., K} given the observation x are known. The Bayes decision rule decides which class the observation most likely belongs to: the selected class maximizes the class posterior probability given an observation x:

x → r(x) = arg max_k {p(k|x)} = arg max_k {p(k) p(x|k)}    (2.2.4)

Here, p(k) is the a priori probability of the occurrence of class k and can be omitted when all classes are equally likely. Otherwise, it has to be defined, e.g. via the relative frequencies of occurrence of the classes.

The equivalent discriminant function of equation 2.2.4 can be written as:

g(x, k, k′) = log [ p(k|x) / p(k′|x) ]    (2.2.5)

This function determines the class boundaries for which the maximum number of correct classifications given an observation occurs. However, when the posterior probabilities are only approximations of the true probabilities, this statement no longer holds, and extra techniques must be used to achieve a minimum classification error. For this problem, discriminative training methods are suitable to overcome the deficiency of the classifier [9]. The data used in this work is taken from medical images (e.g. grey values), which entails data variability. Therefore, the true distributions of the posterior probabilities are not known, and proper models for the distributions have to be chosen.

As we will see in the forthcoming chapters, we apply the rule in equation 2.2.5 to discriminatively optimize the free parameters of our model, and the Bayes decision rule (equation 2.2.4) for the classification.
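As a small illustration of equation 2.2.4, the sketch below classifies a feature vector using Gaussian class-conditional models as stand-ins for the unknown true densities; the Gaussian choice is only one example of the "proper models" mentioned above, and the function is our own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_classify(x, priors, means, covs):
    """Bayes' decision rule: choose the class k maximizing p(k) * p(x|k),
    evaluated in the log domain for numerical stability."""
    scores = [np.log(priors[k]) + multivariate_normal.logpdf(x, means[k], covs[k])
              for k in range(len(priors))]
    return int(np.argmax(scores))
```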


2.2.3 Maximum Entropy

Maximum Entropy (ME) is a powerful method that can be used to estimate class posterior probabilities in pattern recognition tasks. The principle has already been used extensively in speech recognition, language modelling and text classification.

The term "entropy" was originally defined by Claude E. Shannon to estimate compression rates for data transmission in communication channels. The entropy itself measures the average uncertainty of a single random variable:

H(X) = − ∑_{x∈X} p(x) log2 p(x)    (2.2.6)

where p(x) is the probability distribution of the random variable X. The entropy tells us the average number of bits we need to transfer all the information contained in X.
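A quick numerical illustration of equation 2.2.6 (our own example; terms with p(x) = 0 contribute nothing by convention):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, H = -sum p(x) log2 p(x)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# A fair coin carries one full bit of uncertainty, a biased one less:
# entropy([0.5, 0.5]) == 1.0, entropy([0.9, 0.1]) ~= 0.47
```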

The entropy can be optimized in such a way that the uncertainty decreases as more information becomes available. At this point, Jaynes [10] introduced the concept of ME in 1957. But if entropy measures uncertainty, why do we maximize entropy when we actually want to reduce uncertainty? We can explain it through the principle of scientific objectivity:

Out of all probability distributions consistent with a given set of constraints, choose the one that has maximum uncertainty [11].

We want to be maximally uncertain about what we do not know and, at the same time, reduce the uncertainty by using all the information given to us. Since there are infinitely many probability distributions consistent with a given set of constraints, we choose the distribution that is closest to the uniform distribution or, equivalently, the distribution that has the maximum entropy. Translated to pattern recognition tasks, this means using all the information given to us in the training data to set constraints on the model distribution, which otherwise aims at being maximally uniform. We represent these constraints as features fi, i = 1, ..., M, or expected values related to the labeled classes k = 1, ..., K from the training corpus. Suppose that we define a set of features fi useful for the classification:

fi : R^M × {1, ..., K} → R : (x, k) → fi(x, k)    (2.2.7)

ME allows us to restrict the model distribution to have the same expected value for each feature as observed in the training corpus. The observed average value of feature fi in a training set consisting of N samples is

fi =1|N |

∑n

fi(xn, kn) (2.2.8)

Thus, we set the constraint for each feature i given the posterior distribution p(k|x)

∑_n ∑_k p(k|xn) fi(xn, k) = f̄i    (2.2.9)

∑_k p(k|x) = 1    (2.2.10)


Now, we search for the distribution that maximizes the entropy:

max_{p(k|x)} { − ∑_n ∑_k p(k|xn) log p(k|xn) }    (2.2.11)

It can be shown that the maximization of the previous expression subject to the former constraints results in the log-linear or exponential probability distribution of the maximum-entropy family [11]:

pΛ(k|x) = exp( −log Z(Λ, x) + ∑_{i=1}^{M} λi log pi(k|x) )    (2.2.12)

The value Z(Λ, x) is a normalization factor:

Z(Λ, x) = ∑_{k′} exp[ ∑_{i=1}^{M} λi log pi(k′|x) ]    (2.2.13)

The coefficients Λ = (λ1, ..., λM)^T can be interpreted as weights of the features i within the model combination.

A unique solution to equation 2.2.12 is guaranteed, since the function is a sum of convex functions, and the sum of convex functions is itself convex. Convex functions have a single global maximum that can be calculated through effective algorithms that perform hill climbing in the likelihood space.
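Computationally, evaluating the model of equations 2.2.12 and 2.2.13 is a weighted sum of log scores followed by normalization. A minimal sketch under our own naming (the max subtraction is a standard numerical safeguard, not part of the model):

```python
import numpy as np

def log_linear_posterior(log_p, lam):
    """p_Lambda(k|x) from equation 2.2.12.

    log_p: array (M, K) with log p_i(k|x) for M base models and K classes.
    lam:   weight vector Lambda of length M.
    """
    score = lam @ log_p            # (K,) weighted log scores
    score -= score.max()           # stabilize the exponentiation
    p = np.exp(score)
    return p / p.sum()             # the division realizes Z(Lambda, x)
```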

2.2.4 Discriminative Training

In most models for pattern recognition, a set of free parameters Λ = {λ1, ..., λM} needs to be determined. The term discriminative training refers to the procedure used to optimize the Λ set in such a way that the probability of misclassification is minimized. A discriminative model is characterized by the use of information about the competing classes during the optimization. In this way, the optimization focuses on the boundaries between the different classes, rather than maximizing the likelihood of an event. Thus, we maximize the log-likelihood of the class posterior probabilities:

Λ̂ = arg max_Λ ∑_{n=1}^{N} log pΛ(kn|xn)    (2.2.14)

If we rewrite the posterior probabilities using Bayes' theorem, we arrive at

∑_{n=1}^{N} log [ pΛ(xn|kn) p(kn) / ∑_{k=1}^{K} pΛ(xn|k) p(k) ]    (2.2.15)

where the denominator is a normalization term. The training procedure attempts to maximize the class conditional probabilities of the given training samples and to minimize a weighted sum over the class conditional probabilities of all competing classes.
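For illustration, the criterion of equations 2.2.14 and 2.2.15 can be evaluated as follows, assuming the log joint scores log p(xn|k) + log p(k) are available as an array (names and array layout are ours):

```python
import numpy as np

def discriminative_objective(log_joint, labels):
    """Sum over n of log p(k_n|x_n), written via Bayes' theorem as
    log p(x_n|k_n)p(k_n) minus the log of the normalizing sum over k.

    log_joint: array (N, K) holding log p(x_n|k) + log p(k).
    labels:    correct classes k_n, shape (N,).
    """
    n = np.arange(len(labels))
    log_norm = np.logaddexp.reduce(log_joint, axis=1)  # log denominator
    return np.sum(log_joint[n, labels] - log_norm)
```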


2.2.5 Discriminative Model Combination

Discriminative Model Combination (DMC) [12, 13] aims at the optimal integration of all given models into one log-linear posterior probability distribution. These models are combined into a probability distribution of the ME family, as explained in section 2.2.3. Given this model, the free parameters are optimized with respect to the discriminant function:

log [ pΛ(k|xn) / pΛ(kn|xn) ] = ∑_{j=1}^{M} λj log [ pj(k|xn) / pj(kn|xn) ]    (2.2.16)

Here, kn denotes the correct hypothesis.

The DMC training technique differs from the ME concept, where the lambda parameters are trained to fit the distributions to the observed data; here, they are optimized to obtain a Minimum Classification Error (MCE).

For the training procedure of the free Λ parameters, we assume samples xn, n = 1, ..., N, for which we know the correct class assignment kn. A classification of an observation into one of the K competing classes produces a loss denoted as Γ(k, kn), where we can directly assume Γ(k, k) = 0. The model combination should then minimize the classification error count E(Λ), formulated as the total loss over all classifications on representative training data:

E(Λ) = ∑_{n=1}^{N} Γ( kn, arg max_k log [ pΛ(k|xn) / pΛ(kn|xn) ] )    (2.2.17)

     = (1 / ∑_{n=1}^{N} Γn) ∑_{n=1}^{N} ∑_{k≠kn} Γ(k, kn) · δ( k, arg max_{k′} log [ pΛ(k′|xn) / pΛ(kn|xn) ] )    (2.2.18)

where 1/∑_{n=1}^{N} Γn is a normalization constant, the reciprocal of the maximum possible total loss. The δ function selects the class k′ that maximizes the discriminant function 2.2.16.

For the optimization of Λ, a gradient descent procedure is used. The function (2.2.18) is non-continuous with respect to the free parameters Λ and consequently non-differentiable. Therefore, an approximation is proposed using a smoothing function S(k, n, Λ) whose behavior is similar to that of the δ function in 2.2.18. A possible smoothing function that fulfills this is

S(k, n, Λ) = pΛ(k|xn)^η / ∑_{k′} pΛ(k′|xn)^η ,    (2.2.19)

where η is a suitable constant. From here on, we reformulate the error count as

E_S(Λ) = ∑_{n=1}^{N} ∑_{k≠kn} Γ(k, kn) S(k, n, Λ),    (2.2.20)

This function is now differentiable and can be used to optimize the Λ set through an iterative gradient descent scheme as follows [14]:


λ_j^(0) = 0    (uniform distribution)

λ_j^(I+1) = λ_j^(I) − ε · η ∑_{n=1}^{N} ∑_{k≠kn} S(k, n, Λ^(I)) · Γ(k, n, Λ^(I)) · log [ pj(k|xn) / pj(kn|xn) ]    (2.2.21)

Λ^(I) = (λ_1^(I), . . . , λ_M^(I))^T,    j = 1, . . . , M

Γ(k, n, Λ) = Γ(k, kn) − ∑_{k′≠kn} S(k′, n, Λ) Γ(k′, kn).

where ε is the step size of the gradient descent scheme.

This iteration scheme reduces the λ weights of the models j within the combination that lead to a wrong classification, and enhances the weights of those models that help to achieve a better classification. Since the weight λj of base model j within the combination depends on its ability to provide information for a correct classification, this technique allows for the integration of any set of base models into an optimal classifier.
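Putting equations 2.2.19 to 2.2.21 together, one training iteration can be sketched as below. This is a simplified reading of the update rule: the normalization constant of equation 2.2.18 is dropped, the base-model posteriors are assumed to be given, and all names are ours.

```python
import numpy as np

def dmc_update(lam, log_p, labels, loss, eta=1.0, eps=0.1):
    """One smoothed minimum-error gradient step (equations 2.2.19-2.2.21).

    lam:    current weights Lambda, shape (M,).
    log_p:  log p_j(k|x_n) of the base models, shape (N, M, K).
    labels: correct classes k_n, shape (N,).
    loss:   loss matrix Gamma(k, k'), shape (K, K), zero on the diagonal.
    """
    N, M, K = log_p.shape
    grad = np.zeros(M)
    for n in range(N):
        kn = labels[n]
        score = log_p[n].T @ lam                  # combined log score per class
        s = np.exp(eta * (score - score.max()))
        s /= s.sum()                              # smoothing S(k, n, Lambda)
        # effective loss: Gamma(k, k_n) - sum_k' S(k') Gamma(k', k_n)
        g_eff = loss[:, kn] - s @ loss[:, kn]
        # log( p_j(k|x_n) / p_j(k_n|x_n) ) for every model j and class k
        log_ratio = log_p[n] - log_p[n][:, kn][:, None]
        for k in range(K):
            if k != kn:
                grad += s[k] * g_eff[k] * log_ratio[:, k]
    return lam - eps * eta * grad                 # one gradient descent step
```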


2.3 Literature review

In this section, we briefly review the studies and results obtained by other researchers who have dealt with the main topics considered in this thesis. It is remarkable that more studies on ME and discriminative training have been carried out in speech recognition and text classification than in the image processing world, especially regarding the optimization of the GHT. In our study and literature review, we focus on those works of highest interest to us and discuss the feasibility of using such techniques in two different frameworks: voxel classification in medical image processing and shape model optimization for the GHT.

• Object classification for image processing.

Keysers et al. [3] propose a ME framework to estimate class posterior probabilities for pattern recognition. Through a study of the relation between ME and Gaussian models, they establish a correspondence between discriminative training for Gaussian densities and a functional form of the ME family with features of second order. This way, the principles of ME for calculating posterior probabilities (see section 2.2.3) can be used to discriminatively estimate the parameters of a Gaussian model. They apply the principle to detect handwritten digits in 2D images. In contrast to our 3D approach, Generalized Iterative Scaling (GIS) [15] is used to determine the free parameters Λ of the maximum-entropy-based model. The classification results are improved compared to the simple ME model. This approach is of high interest for us, since we can exploit the duality of multivariate Gaussian distributions and ME models for the optimal integration of all given models using the principles of DMC. A DMC approach for multivariate Gaussian distributions supports this duality of discriminative Gaussian distributions and ME models.

Lazebnik et al. [16] propose a discriminative maximum-entropy framework to learn the posterior probabilities of labeled classes given the occurrences of semi-local parts, i.e. groups of neighboring keypoints with a distinctive appearance and geometric layout. That is, this approach bases the object recognition on textural image features. This could be of interest to us when defining features for the model combination.

• Optimization of GHT.

In the literature, two different concepts for the HT are introduced: the probabilistic Hough transform and the non-probabilistic one. The authors use the first term to refer to (1) the use of only a subset of image points for the HT calculation or (2) the modeling of the accumulator space (or HS) by statistical models. For the first meaning (which matches our interest more closely), most researchers propose the idea of random sampling (e.g. [17] for 2-D HT or [2] for 3-D GHT sampling of the shape model). These techniques save computation and memory usage. However, the improvement in accuracy is arguable due to the randomized process. Compared to that, our approach goes a step further and supposes that certain points are more important than others for a specific object detection task. Instead of performing a random selection, we propose a "smart" selection of points based on discriminative training. This way, the trade-off between the accuracy of the solution and the computational ease should improve, or at least it will not be constrained by the goodness of a random process.

Lee et al. [18] propose an unsupervised learning of shape models to facilitate the localization and segmentation of natural objects in large-scale image analysis tasks. The method consists in clustering and averaging shapes when they are considered to form part of the same class. To take the decision of clustering or not, the shapes are first outlined and compared with the extracted templates. If the computed standard Euclidean error between the shape and the templates is below a certain threshold, the new shape is added to the model. Thereafter, a point distribution model can be created by averaging the shapes, as proposed in [19]. A variant of the method bases the clustering on the statistical variance rather than on the Euclidean distance [18]. This approach could be useful for our problem, since it introduces shape variability. However, it does not optimize the number of model points, and it does not discriminate which parts (or points) of the shape are decisive for the localization of the objects.

Deselaers et al. [20] present a method to recognize objects in images consisting of a collection of parts that can be learned for every object class. Their assumptions are close to our goals in this master thesis. They state that changes in the geometrical relation between image parts can be modeled to be flexible or even be ignored, focusing on the image parts that are more important to recognize the object. Their approach uses a single Gaussian to model the class posterior probabilities of the different object classes given the collection of patches. The improved method in [21] introduces more flexibility in the collection of patches to tolerate deformations. Applied to our problem, this approach could be interesting for two main reasons: (1) clustering model points of our shape for the GHT can lead to a reduction of dimensionality for the training procedure and (2) using the clustering of neighboring model points in our model could result in a new way of discriminatively training the GHT, based on patches instead of single model points. Last but not least, this is an example of discriminative training based on the maximization of the posterior class probability modeled by Gaussian densities, as explained by Keysers et al. [3].

• Optimization procedures.

There are several training techniques for the optimization of the free parameters given in a pattern recognition problem. The most extensive studies have been done on speech recognition and text classification.

The optimization approaches found in the literature (for multivariate Gaussian distribution models) do not take into account the information of all competing classes. Therefore, these approaches are not discriminative. They just estimate the free parameters to maximize the likelihood on the training data.

For the case of ME models, Beyerlein [12] introduced a discriminative model combination (DMC) applied to speech recognition. This approach is of the highest interest to us because our objective heads towards the estimation of the discrimination potential of our defined features (i.e. an image feature and a model point from the shape model).

Additionally, as far as we have researched, there are no studies concerning feature selection for discriminative training techniques.


3 Discriminative training methods for a GHT-based voxel classification

3.1 Scope of the current work

The aim of the preprocessing before applying the GHT is to extract shape information of the considered object from a given image while suppressing noise, other structures and as many concurrent shapes as possible. This directly affects the performance and computational complexity of the GHT, since it attempts to optimally match a shape model to the preprocessed feature image.

The present approach extracts edge information using a standard Sobel edge detection technique (see Appendix A). By choosing a window of gradient and grey values, it is possible to limit the edge information in the image. The threshold values are learned from the boundaries of each individual object and training image. This introduces noise and other artifacts which should be addressed in a subsequent processing step. This setup is not really discriminative either; other concurrent shapes are likely to remain. Besides, important parts of the target can be eliminated. For instance, the femur epiphysis is faded using this approach (see fig. 3.1). This part distinguishes one femur from the other, and its loss can lead to a wrong solution of the GHT in favor of the concurrent shape.

Figure 3.1: Preprocessed image using the current GHT-preprocessing framework.

Through our approach we aim at the implementation of a discriminative filter that extracts the most useful information for the GHT, consequently reducing the number of image points that have to be computed. This project studies how the use of different features in an image, and their combinations, can support the removal of as much unnecessary information as possible. Ideally, the result would be the direct segmentation of the object and consequently the direct success of the object detection. In practice, the number of voxels that do not form part of the object to be detected should be considerably reduced, thus decreasing the computation time and computational complexity for subsequent tasks. This can also be posed as a general image-processing problem, which aims at the classification of voxels into object versus non-object.

3.2 Overview of the system

Our main research interest for the discriminative image voxel filter is how to optimally combine image features to reduce the edges in an image which do not correspond to a predefined target object. To this end, we define an image voxel classification problem as follows:

$x \mapsto r(x) = \arg\max_k \{ p(k|x) \} = \arg\max_k \{ p(k)\, p(x|k) \}$   (3.2.1)

where $x$ represents a voxel and $k$ one of the two classes $k_1$, $k_0$ for object and non-object, respectively.

For the classification problem, it is necessary to determine a number of input image features that provide the information for the classification. We assume $x \in \mathbb{R}^M$, $x = \{x_1, \dots, x_M\}$, representing image features for the classification problem, e.g. $x_1$ = gradient magnitude, $x_2$ = grey value, etc.

From here on, we can write the conditional probabilities from Bayes' decision rule (3.2.1) as $p(k|x_1, x_2, \dots, x_M)$ or alternatively $p(x_1, x_2, \dots, x_M|k)$. Since these probabilities correspond to a stochastic process (image values vary within the object and between images), they need to be approximated. We propose the use of multivariate Gaussian distributions for the modeling of $p(x_1, x_2, \dots, x_M|k)$.
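As a small numerical illustration of this modeling step, the sketch below computes the class posteriors p(k|x) from assumed priors and Gaussian densities; all numbers are toy values, not taken from the thesis experiments:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_posteriors(x, priors, means, covs):
    """Class posteriors p(k|x) via Bayes' rule with Gaussian class models."""
    joint = np.array([
        p * multivariate_normal.pdf(x, mean=mu, cov=cov)   # p(k) p(x|k)
        for p, mu, cov in zip(priors, means, covs)
    ])
    return joint / joint.sum()  # normalize over all classes

# Toy example: two classes (k0 = non-object, k1 = object), M = 2 features.
posteriors = gaussian_posteriors(
    x=np.array([0.5, 1.0]),
    priors=[0.7, 0.3],                    # p(k0), p(k1)
    means=[np.zeros(2), np.ones(2)],      # mu_k
    covs=[np.eye(2), 0.5 * np.eye(2)],    # Sigma_k
)
print(posteriors)  # the two values sum to 1
```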

Introducing the Gaussian distributions into Bayes' decision rule, Keysers et al. [3] showed how to arrive at a classifier of the following functional form:

$\hat{k} = \arg\max_k \left\{ \lambda_{0,k} + \sum_i \lambda_{i,k}\, f_i(x) \right\}$   (3.2.2)

where $\hat{k}$ is the class estimated by the classifier for a voxel $x$. This class maximizes the overall sum of the weighted image features for the voxel $x$.

The class specific free parameters $\lambda_{i,k}$ represent the importance of an image feature in the overall classification. The $\lambda_{0,k}$ corresponds to the class prior probability. All class specific parameters can be calculated through an MCE training based on a DMC approach [14].

Once the free parameters are calculated, the most important image features for a given object detection problem can be selected. Since the estimated parameters give information about the discrimination capability within the problem, the features considered less relevant can be eliminated. This way, the number of features to calculate in a fast preprocessing step is reduced, as well as the dimensionality of the classifier.

Finally, image voxel classification is carried out using the classifier (3.2.2) explained above with the estimated weights of the selected image features.

In the forthcoming sections, we will explain the details of this image voxel classifier more precisely.


3.3 Training Data Collection

During the data collection, four different considerations are of importance for the performance of the classifier.

• Number of images H. Important to ensure data variability, avoid overfitting to the training image and achieve optimality on independent test data.

• Number of image features i = 1, ..., M. The larger the number, the longer the computation time. A compromise must be found between the number of features M and the classification error rate.

• Number of voxels per image. Each observation n extracted for the classifier corresponds to a voxel in an image. We denote each of these observations as x_n, n = 1, ..., N. This number should be large enough to avoid overgeneralization.

• Ratio between the number of voxels from class k0 and class k1. Due to the chosen approach, the prior probability p(k) plays a role during the optimization. The log p(k) is one of the parameters to be estimated, as we will see in section 3.5.

The precise values of these parameters will be given for each experiment performed.

3.4 Features

For each voxel x a set of image features x_i, i = 1, ..., M is extracted. These features provide the information to perform the classification. Hence, the quality of the classifier strongly depends on the discriminative capability of the features within a given context (e.g. within a body region and image modality).

The definition of the features can be considered an art in itself. The features must be thoughtfully chosen for each object. It is also important to take the other structures in an image into consideration: the features selected should be discriminative with respect to the rest of the image. Many studies about image features have already been carried out by other researchers, so we do not intend to treat this topic exhaustively in this thesis. Our main research interest is how to optimally combine a set of given image features to eliminate edges in an image which do not correspond to a predefined target object, and furthermore, how to determine the importance of an image feature in a classification or segmentation problem.

Features should hold the following characteristics within our framework:

• Discrimination

• Easy to extract in a fast preprocessing step

• Homogeneous within the boundary of the target object

• Robust to tolerate inter-image variances

• Continuity within every defined class.

It is also remarkable that features are not necessarily extracted from the observed voxel; they can be defined relative to it, for instance the grey value of a neighboring voxel (see figure 3.2).

Figure 3.2: Features x1 and x2 defined with respect to the voxel of class k for a classifier with two features p(k|x1, x2)

Ideally, features should not be discontinuous within a class (see figures 3.3, 3.4). Our classifier sets a threshold decision boundary where the cost of misclassifications is minimal. If feature discontinuity exists within classes, voxel misclassifications will always occur. For example, if the decision threshold is set to a grey value of 200 in figure 3.3, the voxels with grey values within the range [400-600] will be misclassified as class k1. This problem was observed in cardiac images, where the heart boundary did not have a unique range of grey values (lack of homogeneity within the object boundary). To overcome this problem, we tested triangulated distributions (see figures 3.3 and 3.4) and "binary"¹ features. The possible value range of a feature is divided into bins of identical widths. A feature obtains a constant positive value in the bin where its value lies and a negative value in the rest of the bins. For example, let us assume a feature value range of 2000 divided into bins of 200 units each. A feature with a value of 450 would obtain a value of e.g. '50' in the third bin and '-50' in the rest of the bins. Each bin is considered a feature. Feature values of '0' and '1' were not possible: since the feature value can only lie in one of the bins, most bins would usually obtain a value of '0', and during the feature normalization the mean and standard deviation would then be prone to '0' values and run into a division by zero.
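To make this binning scheme concrete, here is a minimal Python sketch (the helper name and default values are illustrative; the thesis software itself was implemented in C#):

```python
import numpy as np

def bin_feature(value, range_max=2000.0, bin_width=200.0,
                on_value=50.0, off_value=-50.0):
    """Expand a scalar feature into the described "binary" bin features.

    The value range [0, range_max) is divided into bins of identical
    width; the bin containing the value obtains on_value ('50'), all
    other bins obtain off_value ('-50'). No bin is ever exactly 0, so
    the subsequent normalization cannot run into a division by zero.
    """
    n_bins = int(range_max / bin_width)
    bins = np.full(n_bins, off_value)
    idx = min(int(value / bin_width), n_bins - 1)  # clamp to the last bin
    bins[idx] = on_value
    return bins

print(bin_feature(450.0))  # third bin (index 2) is 50, the rest are -50
```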

Figure 3.3: Grey value features in triangulated distributions to overcome non-homogeneity within the target object.

Feature complexity ranges from voxel grey values, through edge-related features (e.g. gradient), region features (e.g. grey profiles) and texture features (e.g. co-occurrence matrix), to object shape features. For the present master thesis, we chose simple features based on grey edge profiles, gradient, and object location values.

Our approach is carried out with normalized features, so that we can ensure consistency of the data within the region of the smoothing function (3.5.7) used in the training procedure (see figure 3.5). Thus, we obtain new image features

1 We use this term to express features that only have two possible values.


Figure 3.4: Grey value features in triangulated distributions to overcome non-continuity of features within a class.

$x_i = \dfrac{x_i' - a_i}{b_i}, \qquad a_i = \dfrac{1}{N} \sum_{n=1}^{N} x_i', \qquad b_i = \sqrt{\left( \dfrac{1}{N} \sum_{n=1}^{N} x_i'^2 \right) - a_i^2}$   (3.4.1)

where $a_i$ are the means and $b_i$ the standard deviations of the image features calculated over N samples in the image.

Figure 3.5: Effects in voxel classification without normalization (right image).

To speed up the procedure, we do not normalize the features over the whole training data corpus. A sub-sampling of the image is performed to carry out the normalization.
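A minimal sketch of this normalization step under the stated assumptions (features stored row-wise in a NumPy array; the sub-sampling step size is an illustrative parameter):

```python
import numpy as np

def normalize_features(x_raw, subsample_step=8):
    """Normalize image features following equation (3.4.1).

    x_raw: array of shape (N, M) holding the raw feature values x'_i.
    Means a_i and standard deviations b_i are estimated on a
    sub-sampled set of voxels to speed up the procedure.
    """
    sample = x_raw[::subsample_step]                   # image sub-sampling
    a = sample.mean(axis=0)                            # a_i
    b = np.sqrt((sample ** 2).mean(axis=0) - a ** 2)   # b_i
    return (x_raw - a) / b
```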


3.5 Model choice and Minimum Error Classification

In [13], Beyerlein refers to the possibility of extending the DMC explained previously in 2.2.5 to class specific model combination. Here, we explain a methodology to achieve this, based on [22] and the studies carried out by Keysers et al. [3] on the relationship between log-linear distributions of the ME family and Gaussian models. Keysers et al. proved that if a multivariate Gaussian model is introduced into Bayes' decision rule, it is possible to arrive at a ME model. Thereby, any discriminative training criterion used for the ME model can be used to discriminatively train the free parameters of a Gaussian model. Hence, we can use the Gaussian distribution model in the same way the log-linear combination of the ME family is used in the DMC. In this case, we have to calculate the free class dependent parameters derived from this approach.

The following derivations can be read in detail in appendix B. We model the probabilities $p(x|k)$ from Bayes' decision rule through multivariate Gaussian densities $\mathcal{N}(x|\mu_k, \Sigma_k)$, where $\Sigma_k$ corresponds to the class specific covariance matrices and $\mu_k$ to the mean of class $k$.

$p(k|x) = \dfrac{p(k)\, \mathcal{N}(x|\mu_k, \Sigma_k)}{\sum_{k'} p(k')\, \mathcal{N}(x|\mu_{k'}, \Sigma_{k'})}$   (3.5.1)

Expanding the equation, it can be shown (appendix B) that we arrive at an expression as follows:

$p(k|x) = \dfrac{\exp\left[ \alpha_k + \lambda_k^T x + x^T \Sigma_k^{-1} x \right]}{\sum_{k'} \exp\left[ \alpha_{k'} + \lambda_{k'}^T x + x^T \Sigma_{k'}^{-1} x \right]}$   (3.5.2)

where $^T$ denotes transposition.

This model corresponds to a ME model with second order features (please see appendix B for further explanations). Hence, we apply the same principles as in the DMC theory but with a Gaussian-based model. In this case the discriminant function (3.5.3) contains class specific free Λ parameters.

$\log \dfrac{p_\Lambda(k|x_n)}{p_\Lambda(k_n|x_n)} = (\alpha_k - \alpha_{k_n}) + \left( \lambda_k^T - \lambda_{k_n}^T \right) x + x^T \left( \Sigma_k^{-1} - \Sigma_{k_n}^{-1} \right) x$   (3.5.3)

To simplify, we can reformulate this equation as the new function (3.5.4), using the features defined in equations (3.5.5) and (3.5.6). A detailed explanation of the correspondence of the features and free Λ parameters with Gaussian distribution models can be found in appendix B.

$\log \dfrac{p_\Lambda(k|x_n)}{p_\Lambda(k_n|x_n)} = \sum_{i=0}^{M^2+M+1} (\lambda_{i,k} - \lambda_{i,k_n})\, f_i(x_n)$   (3.5.4)

where for $\alpha_k$ we decompose the features,

$f_{M^2+M+1} = 1$ with $\lambda_{M^2+M+1,k} = \log p(k)$, and $f_0 = 1$ with $\lambda_{0,k} = \alpha_k - \log p(k)$   (3.5.5)

and for the features $f_i$, $i = 1 \dots M^2 + M$, we can write

$f_i = x_i, \quad 1 \le i \le M$
$f_i = x_l\, x_j, \quad M+1 \le i \le M^2+M, \; i = lM + j$   (3.5.6)

The correspondence of the $\lambda_{i,k}$ parameters of the features $f_i$, $i = 1, \dots, M^2 + M$ to a Gaussian distribution can be followed in equation B.0.7. Observe that the second order features can be divided into two symmetric sets. This property is also common to the covariance matrix. Thus, the free parameters Λ for these features are also symmetric.

Now the problem consists of following the concepts explained for the DMC theory in 2.2.5. The further details have been worked out by Beyerlein [22] and are still to be published; unfortunately, they cannot be included in this documentation. The differentiable error measure (3.5.7) obtained in this case is:

$E_\eta(\Lambda) = \sum_{n=1}^{N} \sum_{k \neq k_n} \dfrac{\Gamma(k, k_n)}{\sum_{k'} \exp\left[ \sum_{i=0}^{M^2+M+1} \eta\, (\lambda_{i,k'} - \lambda_{i,k})\, f_i(x_n) \right]}$   (3.5.7)

where the denominator corresponds to the smoothing function applied in DMC.

Applying a gradient descent scheme, the set Λ can be optimized iteratively.
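The exact training procedure is unpublished, so the following is only a rough sketch of such a gradient scheme: it evaluates (3.5.7) with a constant stand-in for the error cost Γ(k, k_n) and estimates the gradient by finite differences (a real implementation would use the analytic gradient):

```python
import numpy as np

def smoothed_error(lam, feats, labels, eta=1.0, gamma=1.0):
    """Error measure E_eta(Lambda) of equation (3.5.7).

    lam:    (K, I) free parameters lambda_{i,k}
    feats:  (N, I) feature vectors f_i(x_n)
    labels: (N,)   correct class indices k_n
    gamma:  constant stand-in for the error cost Gamma(k, k_n)
    """
    err = 0.0
    for f, kn in zip(feats, labels):
        scores = lam @ f                    # sum_i lambda_{i,k} f_i(x_n)
        for k in range(lam.shape[0]):
            if k != kn:
                # denominator: sum_k' exp(eta * (score_k' - score_k))
                err += gamma / np.exp(eta * (scores - scores[k])).sum()
    return err

def train(lam, feats, labels, lr=0.01, steps=100, h=1e-5):
    """Plain gradient descent with a finite-difference gradient."""
    for _ in range(steps):
        base = smoothed_error(lam, feats, labels)
        grad = np.zeros_like(lam)
        for idx in np.ndindex(*lam.shape):
            shifted = lam.copy()
            shifted[idx] += h
            grad[idx] = (smoothed_error(shifted, feats, labels) - base) / h
        lam = lam - lr * grad
    return lam
```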

We would like to point out that the calculation of a $\lambda_{i,k'}$ is affected by the rest of the parameters $\lambda_{i,k}$ (see how the error measure (3.5.7) is defined). In our case a two-class classification problem is given. Hence class k0 is affected by k1 and vice versa. If the weight of one image feature for one class is positive, the weight of the same feature for the other class is negative. This yields two sets of class specific Λ with opposite signs. We apply this DMC to calculate the Λ corresponding to features of first order. Alternatively, we can apply this method to calculate the λ of second order features. Note that the number of free parameters then increases from (M + 2)K to (M² + M + 2)K, and more training data might be necessary.

For this training procedure, a representative number of samples of each class should be considered. This is an important issue due to the consideration of the log p(k) parameter in the combination model. This approach can estimate the importance of the prior probabilities for each class. If the system is independent of these prior probabilities, $\lambda_{0,k}$ will obtain a low value. To enforce class prior independence, it suffices to collect an equal number of samples for each class during the training. That is, $k_0/k_1 = 1$ neutralizes the effect of the prior probabilities (the log p(k) effect); the class prior weight for each of the classes will then be estimated with a value of zero ($\lambda_{0,k} = 0$).


3.6 Image Feature Selection

Image feature selection allows for the optimal integration of a given set of image features. This is done by eliminating those features with lower discrimination capability within a defined image classification problem. This reduces the dimensionality of the classifier, but more importantly, it reduces the number of image features that need to be calculated in a preprocessing step. A compromise between the number of features and the classification performance must be established. Intuitively, from pattern recognition theory, the more information given for the classification, the lower the classification error count. However, care should be taken not to reduce the performance by adding "noisy features"². At first sight, helpful features for the classification might have been chosen; in reality, it is likely that some are confusing or irrelevant for the classification. Besides, features might be correlated within the classifier.

Keeping these aspects in mind, a new question is brought up: how do we define the importance of a feature within our Discriminative Model Combination context? It is stated in [12] that optimized weights can be interpreted as the ability to provide information for correct classification (a quality measure); but the approach has not been used for model selection yet, especially not for feature selection in image processing or discriminatively trained multivariate Gaussian-based models. The answer to this question would add value to the scientific discipline that explains discriminative training, e.g. DMC.

Our solution to the problem performs the selection of image features based on the optimized weights in the model combination. After a number of iterations of the gradient scheme, a set of Λ for each feature and for each class is estimated. A ranking of the discrimination capability of these features $x_i$, $i = 1, \dots, M$ is given by the absolute value of the $\lambda_{i,k}$. A feature obtains a quality measure for a class with respect to the others (see how the error measure (3.5.7) is defined). In our case a two-class classification problem is given. Our criterion for feature selection is based on the absolute value of one of the class specific Λ sets, e.g. $\lambda_{i,k_1}$.

This criterion eliminates confusing and irrelevant features. We iteratively leave out the feature i with the lowest absolute $\lambda_{i,k_1}$ value after each iteration scheme. Every time a feature is removed from the model combination, the classification error is observed on the training data. The final set of features will contain the minimum number of features before the classification error increases. A sketch of this procedure is given below.
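The sketch outlines this selection loop; train_fn and error_fn are hypothetical callables standing in for the DMC training and the error count of the thesis software:

```python
import numpy as np

def select_features(feats, labels, train_fn, error_fn, min_features=1):
    """Backward elimination on the absolute lambda_{i,k1} values.

    train_fn(feats, labels) -> lam of shape (K, I), e.g. the DMC training
    error_fn(lam, feats, labels) -> classification error on training data
    Both callables are assumptions standing in for the thesis software.
    """
    active = list(range(feats.shape[1]))    # indices of remaining features
    history = []
    while len(active) >= min_features:
        lam = train_fn(feats[:, active], labels)
        history.append((list(active), error_fn(lam, feats[:, active], labels)))
        if len(active) == min_features:
            break
        # rank by |lambda_{i,k1}| and drop the least discriminative feature
        weakest = int(np.argmin(np.abs(lam[1])))    # row 1 = class k1
        del active[weakest]
    return history  # inspect where the error starts to increase
```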

This procedure does not necessarily eliminate dependent features (e.g. if redundancy is introduced). These kinds of features obtain their $\lambda_{i,k_1}$ values according to their discrimination capability within the model combination. Let us assume that a gradient magnitude feature of the observed voxel x is introduced twice into the model combination. Both identical features will obtain an equal quality measure since they have the same discrimination capability. Moreover, this set of features will be ordered according to its discrimination capability in the same way as if we had not doubled the feature. If these features are not discriminative for the given classification problem, they can be eliminated. In the case of correlated features which are very discriminant compared to the rest of the set, the problem is not resolved: these features obtain higher $\lambda_{i,k_1}$ values than the rest, and therefore a criterion based on the absolute $\lambda_{i,k_1}$ value does not eliminate them. However, feature redundancy can be observed when a feature is removed from the set and a constant error rate is achieved in the classification.

Notice that the criterion used has been validated for first order features. For second order features, the same criterion might be used; however, this should be investigated experimentally.

2 Features that are not homogeneous or vary within the target object.


Once a set of features is selected, these can be optimally integrated into a multivariate Gaussian distribution.

3.7 Classification

Image voxel classification into object and non-object is carried out using a classifier as proposed by Keysers et al. for multivariate Gaussian distributions [3]. The class specific parameters $\lambda_{i,k}$ estimated through our MCE approach are applied to the selected image features:

$x \mapsto r(x) = \arg\max_k \left\{ \lambda_{0,k} + \sum_i \lambda_{i,k}\, f_i(x) \right\}$   (3.7.1)

Here $i = 1, \dots, M$ and $f_i(x) = x_i$ when using first order features. For the case of second order features, $i = 1, \dots, M + M^2$ and $f_i(x) = x_l x_j$ where $i = lM + j$. The $\lambda_{0,k}$ corresponds to the logarithm of the class prior probability, $\log p(k)$.
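A minimal sketch of this decision rule for first order features, assuming the trained weights are available as NumPy arrays (all names are illustrative):

```python
import numpy as np

def classify_voxel(x, lam0, lam):
    """Decision rule (3.7.1) with first order features f_i(x) = x_i.

    x:    (M,)   normalized image features of the voxel
    lam0: (K,)   offsets lambda_{0,k} (log class prior weights)
    lam:  (K, M) trained feature weights lambda_{i,k}
    Returns the class index maximizing the weighted feature sum.
    """
    return int(np.argmax(lam0 + lam @ x))

# Example: two classes (k0, k1) and three selected features.
k = classify_voxel(np.array([0.2, -1.3, 0.7]),
                   lam0=np.array([-0.4, -1.1]),
                   lam=np.array([[0.1, 0.5, -0.2],
                                 [1.2, -0.3, 0.8]]))
```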


4 Implementation and evaluation of a GHT-based voxel classifier

4.1 Implementation

The discriminative voxel classifier is implemented within the shape finder framework at Philips Research Laboratories Aachen. This software is written in Visual Studio .NET and C#. The following components are necessary for a final working system:

1. Software to collect training data.

2. Software to train the free parameters applied to the image features. (Software for speech recognition was initially given on a Linux platform; changes were carried out in this software to apply it to image processing.)

3. Software for image voxel classification and visualization.

4. Software for error count.

5. User interface to apply a discriminative filter automatically and to write and train data.

Additionally, Perl tools were helpful to process training data.

4.2 Medical data

The GHT-based discriminative voxel classifier has been tested on CT images. The classifier aims at classifying the edges of the right femur as object voxels. A first validation was also carried out on Magnetic Resonance (MR) images for heart voxel classification. The system has been transferred and is being successfully used to e.g. normalize MR heart images.

The characteristics of the CT images are:

• Modality: CT images.

• Patient: each image has been taken from a different patient.

• Body region: Pelvis and lower extremities.

• Voxel extension: 0.94, 0.94 and 3 mm in x, y and z direction respectively.

• Image dimensions: 512, 512 in x and y direction, and a variable number between 54 and 92 slices for each image used.

The medical data characteristics are fixed during the whole master thesis. If there are further remarks, they will be mentioned in the corresponding section.

Additionally, a bit volume mask is given containing the edges of the object (ground truth). The ground truth has been generated manually, delineating the object in each of the slices of each image used. Other operators (e.g. thinning algorithms) are used to achieve higher accuracy of the delineations. The exactness of the object delineation will affect:


1. The performance of the classifier. If the training data is labeled wrongly, confusability is introduced into the classifier.

2. The evaluation of the classifier. If the test data is labeled wrongly, errors can be counted for correct classifications.

We only used one training image, with the purpose of at least having a smooth delineation of the right femur and achieving the largest number of voxels with correct class labels.

4.3 Evaluation

The performance of a classifier is usually measured by its error rate on a particular data set. A number of observations are classified, and the error rate is calculated as the ratio between the number of misclassifications and the total number of trials performed. For this purpose, the ground truth must be known. We would like to point out that not all misclassifications have the same cost in an application. For instance, deciding that a patient does not have cancer while being ill (false negative) has worse consequences than a wrong cancer diagnosis (false positive), since further clinical studies would be done in the latter case and the mistake could be found out later on. Therefore, it is sometimes more important to have good estimates of the probability of samples belonging to a certain class than to reduce the total error rate. This is also our case for image voxel classification into object or non-object. We are interested in reducing the number of edge voxels that do not form part of the object that we want to localize, that is, in reducing the false positives (voxels misclassified as object).

We represent the outcome of the classification by first giving the ground truth class of the voxel and then the estimated class label of the voxel, separated by an arrow, e.g. a voxel of class k0 misclassified as class k1: k0 → k1.

All evaluation parameters are given in percentages (%) to ensure independence of the size of the images and the object under investigation.

• Error rate or absolute misclassification rate (%).

Ratio between the number of misclassifications and the total number of voxels classified.

$\text{absolute error} = \dfrac{k_0 \to k_1 + k_1 \to k_0}{k_1 \to k_1 + k_1 \to k_0 + k_0 \to k_1 + k_0 \to k_0}$   (4.3.1)

Additionally, relative misclassification rates can be given:

• Misclassification rate into class object (%). Ratio between the number of class non-object voxels (ground truth) classified as object and the total number of class non-object voxels. The cost of this misclassification is higher in the case of the GHT: the computational complexity increases and the detection accuracy is decreased.

$\text{relative error: } k_0 \to k_1 = \dfrac{k_0 \to k_1}{k_0 \to k_1 + k_0 \to k_0}$   (4.3.2)

• Misclassification rate into class non-object (%). Ratio between the number of class object voxels (ground truth) classified as non-object and the total number of class object voxels.

$\text{relative error: } k_1 \to k_0 = \dfrac{k_1 \to k_0}{k_1 \to k_1 + k_1 \to k_0}$   (4.3.3)


The GHT is robust to partial occlusions. Therefore, it is sufficient if the object is partially extracted. The cost of this error only becomes important if too few object voxels finally remain. This would be especially significant if the number of remaining object voxels is much smaller than the number of remaining non-object voxels.

• Image compression rate (%). Ratio between the remaining voxels in the image (all voxels classified as class object) and the total initial number of voxels in the image.

$\text{compression} = \dfrac{k_1 \to k_1 + k_0 \to k_1}{\text{classified voxels}}$   (4.3.4)

• Efficiency of object classification (remaining object voxels, %). Ratio between the object voxels (ground truth) and the voxels classified as object (remaining voxels in the image). This ratio provides information about the efficiency of the filter. The larger the percentage of object voxels (ground truth), the better the final detection performance will be.

$\text{efficiency} = \dfrac{k_1 \to k_1}{k_1 \to k_1 + k_0 \to k_1}$   (4.3.5)

• Sensitivity. The sensitivity is complementary to the misclassification into non-object. This common statistical parameter is defined in our problem as:

$\text{sensitivity: } k_1 \to k_1 = \dfrac{k_1 \to k_1}{k_1 \to k_1 + k_1 \to k_0}$   (4.3.6)

• Specificity. The specificity is complementary to the misclassification into object. This common statistical parameter is defined in our problem as:

$\text{specificity: } k_0 \to k_0 = \dfrac{k_0 \to k_0}{k_0 \to k_0 + k_0 \to k_1}$   (4.3.7)
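For reference, the following sketch collects the validation parameters (4.3.1)-(4.3.7) from the four confusion counts; the helper is illustrative and not part of the thesis software:

```python
def evaluation_metrics(n00, n01, n10, n11):
    """Validation parameters of section 4.3 from the confusion counts.

    nij = number of voxels of ground-truth class k_i classified as k_j
    (k0 = non-object, k1 = object). All values are returned in percent.
    """
    total = n00 + n01 + n10 + n11
    return {
        "absolute error":    100.0 * (n01 + n10) / total,  # (4.3.1)
        "rel. error k0->k1": 100.0 * n01 / (n01 + n00),    # (4.3.2)
        "rel. error k1->k0": 100.0 * n10 / (n11 + n10),    # (4.3.3)
        "compression":       100.0 * (n11 + n01) / total,  # (4.3.4)
        "efficiency":        100.0 * n11 / (n11 + n01),    # (4.3.5)
        "sensitivity":       100.0 * n11 / (n11 + n10),    # (4.3.6)
        "specificity":       100.0 * n00 / (n00 + n01),    # (4.3.7)
    }
```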

The experiments presented here aim at validating a new discriminative training procedure for medical GHT-based voxel classification. The following points are addressed:

• Viability of a new discriminative training technique applied to voxel classification.

• Selection of image features given their importance in a voxel classification problem.

• Effect of the prior probabilities p(k) in the classification.

• The goodness of a chosen discriminative filter set-up compared to the current manually tuned preprocessing used for extracting the considered edges in the GHT.

The optimization procedure was always performed with 1000 iterations on the training data. The parameters ε and η are first optimized internally in the training software.


4.3.1 Selection of image features

The following experiment shows the results of image feature selection.

• Data collection: One training image is used to collect data for each of these experiments. The class k1 corresponds to the edge voxels from the left femur (target object to extract). All the voxels of class k1 were included during the training. The total number of voxels is given in the table, as well as the ratio k0/k1 between the classes. The goodness of the classifier is also constrained by the exactness of the manual delineation of the ground truth in the images.

• Features: Figure 4.1 describes the index numbers of the features used in this experiment.

Figure 4.1: Index numbers of the 29 features, defined relative to the voxel to classify (in the red square)

We introduced the gradient magnitude of the voxel to classify and the grey values of the 27-neighborhood voxels, as indicated in figure 4.1. The gradient magnitude is extracted after the application of a Sobel operator. Note that a total of 28 features are introduced into the classifier, which means a total of 56 λ parameters will be calculated (28 for each class); additionally, the log p(k) for both classes, object and non-object, are to be estimated.

• Feature selection: The selection of features is based on

1. Training the Λ for an initial set of i = 1, ..., M features (M = 29 features in this case).
2. Leaving out the feature i with the smallest absolute $\lambda_{i,k_1}$ value.
3. Retraining the Λ for the remaining features.

The classification error rate on the training data is given in % when an image feature is left out after each iteration (see table 4.1).


Image ID         | ct7    | ct9    | ct12   | ct15
Voxels (k0 + k1) | 214515 | 154043 | 112243 | 206243
Object voxels k1 | 64516  | 56246  | 56092  | 56246
Ratio k0/k1      | 2.32   | 1.85   | 2      | 2.66

Features | ct7: Error / Removed | ct9: Error / Removed | ct12: Error / Removed | ct15: Error / Removed
29 | 10.12 / 5  | 10.08 / 22 | 9.79 / 23  | 8.69 / 17
28 | 9.90 / 22  | 9.63 / 24  | 6.75 / 1   | 8.51 / 14
27 | 9.78 / 26  | 9.59 / 1   | 6.21 / 8   | 8.50 / 18
26 | 9.75 / 17  | 9.58 / 5   | 6.16 / 6   | 8.44 / 13
25 | 9.74 / 4   | 9.59 / 6   | 6.13 / 26  | 8.40 / 12
24 | 9.55 / 7   | 9.73 / 14  | 6.14 / 5   | 8.40 / 16
23 | 8.70 / 15  | 9.70 / 26  | 6.16 / 24  | 8.39 / 11
22 | 8.64 / 13  | 9.66 / 15  | 6.12 / 3   | 8.36 / 10
21 | 8.54 / 6   | 9.64 / 13  | 6.13 / 22  | 8.32 / 1
20 | 8.48 / 9   | 9.64 / 20  | 6.15 / 7   | 8.29 / 3
19 | 8.45 / 24  | 9.67 / 23  | 6.19 / 14  | 8.25 / 6
18 | 8.45 / 8   | 9.59 / 25  | 6.30 / 9   | 8.25 / 8
17 | 8.45 / 20  | 9.62 / 19  | 7.84 / 20  | 8.26 / 23
16 | 8.44 / 16  | 9.59 / 16  | 6.14 / 25  | 8.27 / 5
15 | 8.43 / 1   | 9.60 / 11  | 6.13 / 4   | 8.25 / 4
14 | 8.45 / 21  | 9.64 / 17  | 6.32 / 17  | 8.21 / 26
13 | 8.45 / 25  | 9.60 / 12  | 6.17 / 13  | 8.16 / 7
12 | 8.47 / 14  | 9.61 / 8   | 6.35 / 2   | 8.19 / 20
11 | 8.47 / 18  | 9.57 / 7   | 6.33 / 15  | 8.22 / 24
10 | 8.49 / 2   | 9.67 / 3   | 6.33 / 27  | 8.24 / 2
 9 | 8.52 / 3   | 9.62 / 10  | 6.34 / 11  | 8.19 / 15
 8 | 8.52 / 23  | 9.63 / 4   | 6.19 / 21  | 8.11 / 22
 7 | 8.55 / 12  | 9.59 / 18  | 6.16 / 16  | 7.90 / 9
 6 | 8.60 / 11  | 9.52 / 27  | 6.33 / 19  | 7.74 / 24
 5 | 8.59 / 27  | 9.59 / 9   | 6.17 / 12  | 7.44 / 19
 4 | 8.83 / 19  | 9.65 / 21  | 6.17 / 18  | 6.57 / 27
 3 | 9.39 / 10  | 9.43 / 2   | 6.44 / 10  | 6.18 / 0
 2 | 12.27 / 0  | 20.00 / 28 | 17.23 / 0  | 20.09 / 21
 1 | 69.72 / 28 | 25.33 / 0  | 56.09 / 28 | 76.31 / 28

Table 4.1: Feature selection given the initial features: gradient magnitude, prior class probability and 27-neighborhood grey values. "Error" is the classification error (%) on the training data; "Removed" is the index of the feature left out next.

The results of the feature selection reflect that adding information about the grey neighborhood does not improve the classification performance considerably. The classification performance does not vary considerably from one iteration to another, which denotes feature redundancy. Furthermore, the 27 grey neighborhood features are the least discriminative within this classification problem. Thereby, the neighborhood grey values can be iteratively left out. Finally, we found a compromise between the number of features and the relative error rate by selecting the last three most important features. These correspond to the gradient (1), the prior probability (parameter log p(k)) (2) and one of the grey neighborhood values (3).

Interestingly, if the gradient feature is introduced twice into this classifier, the three most discriminant features will correspond to the prior probability (log p(k)) and the doubly considered gradient in three of the four cases presented. Therefore, feature redundancy can only be removed by observing a constant error rate in the classification when one of the doubled features is left out.


We can visualize an example of the classification results on test data in figure 4.2. The weighted three most discriminative features from the training on image ct15 were used in this example on a test image.

Figure 4.2: Bone voxel classifier using gradient magnitude information and grey value of the voxel. Black means object and grey non-object.


4.3.2 Location features

The previous classifier does not discriminate between different bone structures. The features defined do not hold bone discrimination properties. Other features should be investigated to remove as many concurrent shapes as possible before the application of the GHT. These features should be related, for example, to the anatomy of the bone or to the absolute or relative position of the shape within a body region. Location features x, y and z of the voxel to classify were trained on a CT image only containing the lower extremities. These features correspond to the voxel index in x, y and z direction, respectively. Figure 4.3 illustrates the results of this experiment, and table 4.2 measures the classifier performance.

Figure 4.3: Voxel classification on a training image using location features x, y, z. The brighter a voxel is, the larger the probability of the voxel being classified as object.

Voxels | k1 voxels | Misclassified voxels | Error % | k0→k0 % | k0→k1 % | k1→k0 % | k1→k1 %
63504  | 780       | 5501                 | 8.66245 | 91.241  | 8.75901 | 99.1026 | 0.897436

Table 4.2: Results of classification using location features x, y, z.

The misclassified voxels correspond to the right femur (see figure 4.3). The test and training images provided did not contain the same body regions; therefore, the relative position of the right femur voxels in training and test is not similar. Thus, the approach did not perform well on test data.


4.3.3 Prior probability effect in classification

As we mentioned in 3.5, the log p(k) is one of the parameters to be estimated. We present here the effects of introducing different ratios of voxels from class k0 and class k1 during the training.

Firstly, we present an example of MR heart voxel classification on MR cardiac images.

• Training image (512x512x124).

– Constant number of voxels of class k1 = 30000.
– Variable number of voxels of class k0, from 100% of the voxels of class k1 (k0 = 30000 and k0/k1 = 1) down to 40% of 30000 (k0/k1 = 0.4).
– Initial features applied: grey value of the voxel to classify, grey mean of the 27-neighborhood, gradient and gradient mean of the 27-neighborhood, with triangulated features and 10% intervals (0 - 0.1, 0.1 - 0.2, ..., 0.9 - 1.0).
– Selected features: grey mean of the 27-neighborhood within the range [200-400] and log p(k).

Figure 4.4: Effects of different prior class probabilities during training (k0/k1 %) on a test image for MR-heart voxel classification. Green represents class k1 and red class k0.


• Test image (512x512x147).

– Number of voxels to be classified: 37677344.
– Number of heart voxels k1 = 2751098.
– Ratio k0/k1 = 12.69.

k0/k1 % (training) | Error % (test) | k0→k1 % | k1→k1 % | Compression %
40  | 28.3 | 30.5 | 99.9 | 36
60  | 21.2 | 22.8 | 99.2 | 28
80  | 9.2  | 9.5  | 94.7 | 16
100 | 6.4  | 4.8  | 73.3 | 10

Table 4.3: Effects of different class priors during training on a test image for MR-heart voxel classification.

The results can also be visualized in figure 4.4. We can observe that including different ratios of k0/k1 has an effect on the classification. The larger the number of zeros included during the training, the larger the probability of classifying a voxel as zero: more voxels of class k0 are classified as zero. Additionally, the probability of misclassification of a voxel of class k1 is increased. For the GHT, it is acceptable to consider only 73.3% of the object voxels. The outcome shows that the final number of voxels considered for the GHT is reduced to 10% for the case of a ratio of k0/k1 = 1.

The following experiment shows the effects of different class priors in training for femur voxel classification (see figure 4.5). The chosen ratios of k0/k1 are always larger than one, to consider a sufficient number of voxels from the different objects in the image. The number of voxels classified as one is reduced when incrementing the number of zeros in the training image. This observation is especially noticeable on voxels of class k1. Thereby, the visualization effects of the class prior are not so evident.

• Training image (512x512x86).

– Constant number of voxels of class k1 = 59750.
– Variable number of voxels of class k0, from 100% of the voxels of class k1 (k0/k1 = 1) up to 800% (k0/k1 = 8).
– Initial features: grey, gradient, and grey and gradient means of the 27-neighborhood.
– Selected features: gradient mean of the 27-neighborhood, grey and log p(k).

• Test image (same as training).

– Number of voxels to be classified: 24258016.
– Number of right femur voxels k1 = 59750.
– Ratio k0/k1 = 404.99.

k0/k1 % (training) | Error % | k0→k1 % | k1→k1 % | Compression %
100 | 27.37 | 27.39 | 79.75 | 27.52
200 | 26.62 | 26.63 | 75.18 | 26.61
300 | 26.75 | 26.76 | 75.63 | 26.88
500 | 26.47 | 26.47 | 72.18 | 26.59
800 | 26.46 | 26.44 | 66.64 | 26.54

Table 4.4: Effects of different class priors during training on the training image for the femur voxel classifier.


Figure 4.5: Effects of different prior class probabilities during training (k0/k1 % ratios) on left femur voxel classification. Green represents class k1 and red class k0.


4.3.4 Testing a GHT-based voxel classifier

The following experiment builds a discriminative right femur voxel classifier. More robust features are included in this case (e.g. means over the 27-neighborhood). All features are "binarized": the features are divided into bins (e.g. bins of '500' or '1000' units each, starting with '0' and ending with '3500'). A feature obtains a constant value, e.g. '50', in the bin corresponding to the feature value; otherwise the bin obtains a value of '-50'. The improvement of the results presented in the following experiments is mainly due to the definition of these more robust features.

The parameters of this experiment are as follows:

• Training image:

– Modality: CT image.
– Body region: lower extremities.
– Voxel extension: 0.94, 0.94 and 3 mm in x, y and z direction respectively.
– Image dimensions: 512, 512, 5 in x, y and z direction respectively.

• Data collection:

– One image was used for the training and selection of features.
– A number of 2390 voxels of class k1 are used.
– A number of 62724 voxels of class k0 are used.
– A ratio of k0/k1 = 26.24.

• Features:

An initial number of 85 binary image features are introduced into the classifier. These features define grey and gradient magnitude profiles:

– grey mean of the 27-neighborhood.
– gradient magnitude mean of the 27-neighborhood.

The profile length of the following features is set to 9 voxels, with the voxel to classify centered in the profile:

– grey profile extended along the x, y and z direction.
– gradient magnitude profile extended along the x, y and z direction.

• Features selected:

1. Mean grey of the 27-neighborhood voxels. Binary feature selected within the range [1500, 2000]. Lambda value λ_{i,k1} = 2554.
2. Grey z profile (as described previously). Binary feature selected within the range [1500, 2500]. Lambda value λ_{i,k1} = 2180.
3. Grey x profile (as described previously). Binary feature selected within the range [1500, 2500]. Lambda value λ_{i,k1} = 2157.

Table 4.5 shows the effect of feature reduction for this experiment. Error rates are given in this case since our major interest is observing how this classifier performs when leaving out image features.

The figure 4.6 illustrates the effect of feature selection on training images.

The described classifier has been tested on ten independent test images (table 4.6).

The classifier, using the weights of the three most important features, is applied to each of the voxels in the test images. The parameters of this test are as follows:


Features | Misclassified voxels | Error (%) | k0→k0 (%) | k0→k1 (%) | k1→k0 (%) | k1→k1 (%)
85 | 214  | 0.3 | 99.9 | 0.1 | 25.8 | 74.2
60 | 214  | 0.3 | 99.9 | 0.1 | 25.8 | 74.2
50 | 214  | 0.3 | 99.9 | 0.1 | 25.8 | 74.2
40 | 214  | 0.3 | 99.9 | 0.1 | 25.8 | 74.2
30 | 1829 | 2.9 | 97.4 | 2.6 | 25.8 | 74.2
20 | 1943 | 3.1 | 97.2 | 2.8 | 25.8 | 74.2
10 | 2431 | 3.8 | 97.2 | 2.8 | 25.8 | 74.2
 3 | 217  | 0.3 | 99.9 | 0.1 | 26.2 | 73.8
 2 | 406  | 0.6 | 99.9 | 0.1 | 26.3 | 73.7
 1 | 406  | 0.6 | 99.9 | 0.1 | 50.6 | 49.4

Table 4.5: Feature selection with binary gradient and grey profile features.

Figure 4.6: Bone voxel classifier using binary features of gradient and grey profiles. Green means object and red non-object.

• Test images:

– Ten test images.
– Modality: CT images.
– Body region: pelvis and lower extremities.
– Voxel extension: 0.94, 0.94 and 3 mm in x, y and z direction respectively.
– Image dimensions: 512, 512 in x and y direction respectively.
– The number of voxels is given in the table.

Figure 4.7 illustrates the classifier outcome and table 4.6 shows the classifier performance.


Figure 4.7: Bone voxel classifier on test image ct1 using binary features of gradient and grey profiles. Green means object and red non-object.

The training is performed with voxels coming from different bone structures in the image, apart from the right femur. However, the defined features do not provide enough information to distinguish between different bone anatomies. This classifier might face difficulties in setting the decision boundaries between the right femur boundary and the rest of the bones. This setup rather yields a bone classifier, and therefore the measured classification error depends on the bone voxels in the image.


Image | Voxels   | k1 voxels | Error (%) | k0→k0 (%) | k0→k1 (%) | k1→k1 (%) | k1→k0 (%)
ct1  | 24258016 | 172742 | 0.71 | 99.45 | 0.54 | 28.81 | 71.19
ct2  | 19612864 | 159779 | 0.81 | 99.37 | 0.63 | 27.22 | 72.78
ct3  | 22451568 | 238924 | 1.06 | 99.09 | 0.91 | 37.33 | 66.67
ct4  | 24774144 | 238092 | 0.96 | 99.22 | 0.78 | 20.71 | 69.29
ct5  | 19612864 | 156716 | 0.79 | 99.41 | 0.59 | 23.37 | 76.63
ct6  | 12903200 | 103834 | 0.80 | 99.56 | 0.44 | 16.60 | 83.40
ct7  | 21677376 | 163986 | 0.76 | 99.43 | 0.57 | 27.07 | 72.93
ct8  | 17548352 | 165863 | 0.95 | 99.27 | 0.73 | 27.59 | 72.41
ct9  | 17548352 | 156061 | 0.89 | 99.36 | 0.64 | 23.41 | 76.59
ct10 | 19612864 | 173838 | 0.89 | 99.35 | 0.65 | 24.83 | 75.17

Table 4.6: Classifier performance on test data. Features selected are the mean, x and z grey profiles. The k0→k0 column corresponds to the specificity and k1→k1 to the sensitivity of the system.

Finally, we compare the results of image compression for this approach and the manual tuning of grey and gradient thresholds. Table 4.7 presents the results for the current manual tuning set-up with a grey and gradient magnitude threshold within the values most frequently used in the system: [1000-5000].

Image | Error (%) | k0→k0 (%) | k0→k1 (%) | k1→k1 (%) | k1→k0 (%)
ct1  | 0.71 | 99.47 | 0.53 | 18.92 | 81.08
ct2  | 0.81 | 99.82 | 0.18 | 22.71 | 77.29
ct3  | 0.82 | 99.26 | 0.74 | 22.72 | 77.28
ct4  | 0.91 | 99.26 | 0.74 | 23.72 | 76.28
ct5  | 0.79 | 99.42 | 0.58 | 16.59 | 83.41
ct6  | 0.84 | 99.49 | 0.51 | 14.50 | 85.50
ct7  | 0.70 | 99.48 | 0.52 | 22.26 | 77.74
ct8  | 0.79 | 99.43 | 0.57 | 21.40 | 78.60
ct9  | 0.83 | 99.38 | 0.61 | 22.37 | 77.63
ct10 | 0.74 | 99.48 | 0.52 | 18.93 | 81.07

Table 4.7: Performance of the manual threshold approach using a threshold window of [1000-5000] of grey and gradient magnitude. The k0→k0 column corresponds to the specificity and k1→k1 to the sensitivity of the system.

The most interesting information for the GHT refers to the total number of considered edges (which affects the GHT computational complexity) and the relative percentage of those corresponding to the remaining object voxels (which affects the GHT accuracy). These comparison numbers are given in table 4.8. Additionally, sensitivity and specificity are given for both approaches in tables 4.7 and 4.6. Means and standard deviations are given for all validation parameters in table 4.9.

For similar compression rates in both approaches, the specific discriminative set-up provides slightly more information about the target object. The accuracy of the GHT is related to the number of remaining voxels of the target object. The percentage of remaining voxels of class k1 is slightly larger in all the cases tested here (slightly better efficiency); the sensitivity also supports this behaviour. Specificity and absolute error rates are very similar in both approaches, slightly better for the manual threshold. The results do not seem to be conclusive in favor of the discriminative voxel classifier.

Figure 4.8 illustrates the comparison of the manual tuning preprocessing and the discriminative voxel classifier.


Image | Compression % (standard) | Compression % (discriminative) | Efficiency % (standard) | Efficiency % (discriminative)
ct1  | 0.60 | 0.59 | 7.31  | 11.22
ct2  | 0.68 | 0.70 | 7.65  | 10.07
ct3  | 0.80 | 1.00 | 8.34  | 9.08
ct4  | 0.79 | 0.86 | 7.18  | 9.43
ct5  | 0.62 | 0.66 | 6.64  | 9.61
ct6  | 0.56 | 0.50 | 10.15 | 14.29
ct7  | 0.57 | 0.64 | 9.39  | 11.19
ct8  | 0.63 | 0.80 | 9.50  | 10.48
ct9  | 0.67 | 0.71 | 9.21  | 10.90
ct10 | 0.57 | 0.73 | 9.10  | 10.66

Table 4.8: Comparison of the performance of the discriminative GHT-based voxel classifier with the manual threshold preprocessing.

Method                    | Absolute error (%) | Sensitivity  | Specificity  | Compression | Efficiency
Manual threshold          | 0.79 ± 0.35        | 20.41 ± 2.91 | 99.45 ± 0.15 | 0.65 ± 0.08 | 8.45 ± 1.10
Discriminative classifier | 0.86 ± 0.12        | 25.69 ± 5.23 | 99.35 ± 0.46 | 0.71 ± 0.17 | 10.69 ± 1.39

Table 4.9: Mean and standard deviation of validation metrics for 10 CT images. Manual threshold approach within the range [1000-5000]. Discriminative voxel classifier with 27-neighborhood mean grey, grey z profile and grey x profile features.

Figure 4.8: Comparison of manual threshold preprocessing and discriminative GHT-based voxel classifier.


4.4 Discussion on the experimental results

The feature selection criterion chooses the features according to their discrimination capability. However, the discriminative training algorithm does not optimize the feature weights taking feature dependencies into account. The outcome of the feature weights corresponds to a ranking of the discriminative capability of the features within the classifier (as claimed in [12]). Thus, if two features are very discriminative and they are correlated, their feature weights will be similar. A criterion based on the absolute weight value of a feature does not eliminate feature redundancy. The introduction of features into the classifier that do not contribute additional information imposes an overhead and therefore affects the computation time. Thus, feature redundancy should also be taken into account and eliminated. A further criterion is proposed, based on the observation of constant error rates in the classification: two correlated features obtain similar weight values; if one of them is eliminated, a constant misclassification error will be observed. In this case, the feature can be regarded as redundant and consequently be removed.

The GHT-based voxel classifier yielded a bone classifier rather than a right femur classifier. The classifier performs according to the quality of the features introduced. The proposed approach might not be the best solution to the defined problem. Microscopic features of a voxel (e.g. a grey value) do not help to distinguish macroscopic properties of a structure (e.g. right vs. left femur). The features defined based on grey and gradient magnitude profiles are common to any bone structure. Thus, discrimination cannot be achieved if the image features are not discriminative themselves. During the course of this master thesis, the use of location features was also proposed. We would expect even better results for second order features, e.g. where a feature f = x·y would be considered. Such features should provide the discrimination information needed to distinguish between different bone structures depending on their localization in the image. However, location features are only possible if training and test data contain the objects in similar positions (e.g. images of the same body region). An ideal feature should, for example, contain shape information; that is, discriminative characteristics of the object should be directly introduced into the classifier.

We compared the manual threshold GHT preprocessing framework with a threshold window of [1000-5000] against a specific discriminative voxel classifier set-up given an initial set of image features. The discriminative filter showed a slightly better conservation of GHT-relevant information at image compression rates similar to those of the manual tuning approach. However, since the features introduced into the classifier are not discriminative, edges from concurrent structures have not been eliminated. In any case, the merit of the discriminative voxel classifier cannot be judged conclusively, since there is no extensive evaluation of the current manual threshold method, and the comparison results would vary if different image features or training parameters were chosen. Thus, further evaluation experiments should be performed: firstly, to evaluate the current status of the work already developed at PFLA; secondly, to investigate better features that make the new approach, and its systematic evaluation in this specific application, truly worthwhile; and finally, cross-validation experiments to ensure the generalization of the results and to prove the merit of this new method.

PFLA decided to apply the studies carried out on this method to other applications where the problem setting matches the advantages of this specific methodology more closely (e.g. feature selection for heart failure detection).

The prior class probability studies showed that variations in the ratio of object to non-object voxels included during the discriminative training affect the final classification outcome. The more object voxels included during training, the fewer remaining edges are considered for the GHT; however, the number of remaining object voxels is also reduced. In any case, it is acceptable for the GHT if the object is only partially extracted. This effect seems clear from the theory: the more non-object voxels introduced during training, the higher the probability of classifying a voxel as non-object. Nevertheless, this effect should be further investigated with other training set-ups (e.g. a different set of features).


5 Discriminative Training Methods for the GHT-Model Optimization

This chapter explains how to optimize and generate GHT shape models from scratch to detect anatomical objects in 3D medical images. Three different questions have been assessed during this thesis:

• Automatic procedure for the optimization of individual weights of a set of model points used by the GHT [4].

• Selection of a subset of the most discriminative model points.

• Shape model generation from scratch and its optimization based on the two points mentioned above.

In order to achieve a better understanding of the studies carried out, we will illustrate the main concepts through simple examples based on geometrical shapes. Thereafter we will present the results on CT medical images.

5.1 Scope of the current work: Object-based model

Presently, the generation of 3D shape models for pattern recognition tasks requires substantial user interaction. As of today, the shape models used are generated by the manual surface delineation of every single target object. Alternatively, internal structures can be included to enhance the object detection (e.g. heart chambers). The manual delineation of the target objects is used for building triangulated surface models (figure 5.1). The GHT relies on this shape information to localize the object in an image. Thus, the performance and complexity of the algorithm are constrained by the drawbacks of the shape model that currently exists in our framework:

• Considerable user interaction.

• Shape model confusability. The shape model can potentially match other shapes in the image; see figure 5.2, where a second maximum appears at the location of the left femur.

• High computational complexity, seen as the dependency on the number of model points. This problem is currently addressed by a random selection of model points [2].

• Shape adapted to a single image, so that no shape variability is integrated (in the case of the femur shape model at PFLA).

• Restriction of the flexibility of the GHT (e.g. if no rotation is considered).


Figure 5.1: Triangulated surface model for right femur detection.

A new technique to generate shape models is proposed to overcome the former shortcomings by

• Creation of a shape model from scratch.

• Shape model optimization based on a MCE training of model point specific weights.

• Selection of the most discriminative model points for a defined object detection task.

Shape model variability is incorporated from all training images used.


Figure 5.2: GHT framework. From top to bottom and from left to right: original image, preprocessed image by manual threshold [1000-5000], and HS after applying the femur-based GHT.

5.2 Overview of the system

Discriminative training for GHT-model optimization aims at selecting and weighting shape model points according to the importance of a model point for localizing the target object in the image [4].

This master thesis investigates how to put this idea into practice for the requirements demanded. Finally, a novel concept of "shape model" for discriminative 3D anatomical object detection is developed.

The input of the implemented system can be:

• The shape model of the object to detect. This optimizes the existing shape model but does not dispense with the user interaction needed for the manual object delineation. This was the initial proposal: to eliminate or reduce the confusability of those model points in the shape-based model that do not aid in the object detection task.

• A cloud of model points, randomly initialized or taken from the remaining edges of an image. This initialization approach, proposed during the fruitful discussions of this master thesis, reduces the user interaction to a minimum: only the ground truth location of the object in each training image is required.

Now, we define the GHT as a classification problem. From a probabilistic point of view, the HS of the GHT can be interpreted as a posterior class probability p(k|x) where

• k is one of the classes k ∈ {1, ..., K}, each representing an arbitrary transformation parameter set of the GHT. We simplify the approach by considering a class as one set of translation parameters k = (tx, ty, tz). Therefore, k also represents an object location in an image.

• x (observation) is the HS of an image


The best hypothesis selected by the classifier is the class with the highest posterior probability p(k|x):

x \mapsto r(x) = \arg\max_k \{ p(k|x) \}    (5.2.1)

In this case the p(k|x) are modeled as relative voting frequencies. Thus, the best hypothesis corresponds to the cell of the accumulation space with the highest number of votes. This hypothesized solution does not necessarily correspond to the "correct" solution cell kn. This way,

• kn = (tx0, ty0, tz0) are the translation parameters in the accumulator space which correspond to the location center of the target shape in the image (ground truth).

• Γ(kn, k) is the distance measure between the correct translation parameters of class kn and the parameters of the rival class k.

From here on, we extend the classifier explained above by

1. Splitting the HS into model point dependent contributions.

2. Using the ME model principle to log-linearly combine the previous model point contributions into a class posterior probability.

3. Training the model point contribution weights through an MCE technique (2.2.5).

4. Alternatively, select model points based on the estimated weights calculated previously. In this case, the model weight training should be repeated to integrate all models optimally.

After estimating the weights of every model point in the log-linear combination and, optionally, eliminating points, the Bayes decision rule (5.2.1) is applied. The class posterior probabilities are the log-linear combinations of the weighted model point class dependent contributions. The classification outcome is the estimated location center of the target shape.

In the forthcoming sections, we give a deeper insight into the approach, from data collection to data classification.

5.3 Automatic Shape model generation

A new technique for shape model generation is developed based on an MCE training of model point specific weights. We propose an automatic shape model generation from scratch to minimize user interaction.

A random cloud of model points is created, requesting only the location of the shape in a small set of training images and optionally a region of interest (e.g. a radius). Additionally, we include random gradient information in every model point of the random cloud. This leads to model points that, even when they correspond to a shape in an image, will only vote in the GHT procedure if the gradient directions of the edge and the model point match.
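To make this construction concrete, the following minimal Python sketch (our own illustration; NumPy, the function name and the sampling scheme are assumptions, not part of the PFLA software) generates such a cloud with random unit gradient directions:

```python
import numpy as np

def random_model_cloud(num_points, radius, rng=None):
    """Create a random cloud of model points inside a sphere of the
    given radius, each carrying a random unit gradient direction.
    (Illustrative sketch; not the PFLA implementation.)"""
    rng = np.random.default_rng() if rng is None else rng
    points = []
    # Rejection sampling: uniform points inside the sphere.
    while len(points) < num_points:
        p = rng.uniform(-radius, radius, size=3)
        if np.linalg.norm(p) <= radius:
            points.append(p)
    points = np.asarray(points)
    # Normalized Gaussian samples are uniform on the unit sphere,
    # giving each model point a random gradient direction.
    gradients = rng.normal(size=(num_points, 3))
    gradients /= np.linalg.norm(gradients, axis=1, keepdims=True)
    return points, gradients
```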

An alternative to the random cloud of model points is:

• Use the remaining image voxels after the preprocessing (with their corresponding image gradients) to initialize the cloud of model points.


This "shape model" is intended to be used to detect an arbitrary object in an image. The performance of the GHT is constrained by the exactness and exclusivity of the match between the model and the edges of the object to localize. Thereby, the use of a random shape model is rather unfeasible for the localization of any object; for this reason, only object-based shape models are used at present.

In this chapter, we present how to provide the necessary knowledge to the random model cloud so that it can be used to detect an anatomical object in an image.

5.4 Training Data Collection

The collected data are the numbers of votes of each cell in the HS coming from every individual model point (see section 5.5 for more precise information). During data collection, three different parameters are of particular importance:

• Number of model points M.

• Number K of considered classes extracted from the HS.

• Number of training images N.

The first two define the dimensionality of our classifier. The latter is important to guarantee shape variability and to avoid overfitting and overgeneralization, above all if the number of parameters is large. The extracted data should be representative in order to obtain optimality on independent data sets.

Another relevant aspect during data collection is the choice of the quantization steps of the transformation parameters used in the GHT algorithm. The quantization steps define the HS granularity. Note that the total number of possible cells K in a 3D problem is very large when the quantization of the transformation parameters is small. If the quantization steps are increased, the GHT accuracy is reduced due to the quantization error, but the algorithm becomes more robust to noise: matches of noisy edges to model points have a smaller effect. Another advantage of decreasing the size of the HS is the possibility of introducing more information about the HS into the classifier with fewer classes. Therefore, fewer cells k are used during the training procedure, alleviating possible overgeneralization and overfitting problems.
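As a small illustration of this trade-off, a translation vector could be mapped to an accumulator cell as follows (a hedged sketch; the function and the step values are our own assumptions):

```python
import numpy as np

# Assumed quantization steps (voxels per HS cell) along x, y, z;
# the experiments below use e.g. (20, 20, 12).
QUANT = np.array([20, 20, 12])

def cell_index(translation):
    """Map a 3D translation vector (in voxels) to its HS cell index.
    Coarser steps shrink the accumulator and increase robustness to
    noise at the price of quantization error."""
    return tuple(np.floor(np.asarray(translation) / QUANT).astype(int))
```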

Coming back to the concept of the GHT, the predefined preprocessing also plays an important role. Since the GHT is applied to image features, the shape model generation should only consider structures that can be rapidly and reliably extracted by the preprocessing. The number of votes after applying the GHT depends on the remaining edges in an image, and the shape model will be optimized according to them. Our application should therefore always be run with a fixed preprocessing for each image in the training as well as in the test data.

The implemented procedure to collect data in this application is as follows (a code sketch follows the list):

1. Run GHT with M given model points.

2. Choose the N-best solutions from the GHT that will be considered representative for the training. We obtain better results if not all of the HS is included as training data. A possible reason is the introduction of noise when adding cells from the HS: these cells might receive votes coming from noisy points remaining after the preprocessing.

3. Collect the number of votes in each of the N-best solutions coming from each model point.

4. Extract the ground truth translation parameters kn = (tx0, ty0, tz0), i.e. the location center of the target.

5. Repeat the procedure for N images.
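A hedged sketch of this collection loop is given below; `run_ght` is a hypothetical stand-in for the PFLA shape finder routine, and its interface is assumed:

```python
import numpy as np

def collect_training_data(images, ground_truths, model_points, n_best=100):
    """Per-image collection of per-model-point vote counts for the
    N-best HS cells. `run_ght` is a hypothetical stand-in for the PFLA
    shape finder routine; it is assumed to return a vote matrix of
    shape (num_cells, num_model_points) plus the cell coordinates."""
    samples = []
    for image, k_n in zip(images, ground_truths):
        votes, cells = run_ght(image, model_points)   # hypothetical call
        totals = votes.sum(axis=1)                    # total votes per cell
        best = np.argsort(totals)[::-1][:n_best]      # N-best solutions
        samples.append({
            "votes": votes[best],         # model point contributions
            "cells": cells[best],         # candidate translations
            "ground_truth": k_n,          # correct cell k_n
        })
    return samples
```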


5.5 Features: Model point dependent distributions

Given the classification problem explained in the overview (section 5.2), the class posterior probability p(k|x) can be decomposed into a set of M simpler posterior probability models pi(k|xn), i = 1, ..., M. These models correspond to the relative voting frequencies of each (or each group of) model points into one of the categorized classes derived from the HS.

Figure 5.3: Split HS: (a) Complete HS, (b,c) HS from individual model points

The relative voting frequencies of each model point into one cell of the HS are modeled as

p_i(k|x_n) = \frac{N(i, k, x_n)}{\sum_{k'} N(i, k', x_n)}    (5.5.1)

Here, N(i, k, xn) represents the number of votes from a model point (or region) i for hypothesis k if the features xn have been observed (n = 1, ..., N, the number of images). Alternatively, the probability distribution could be estimated by a multi-modal Gaussian mixture (see Appendix C).
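A minimal sketch of equation (5.5.1), assuming a vote-count matrix with rows indexed by model points and columns by the considered HS cells:

```python
import numpy as np

def relative_voting_frequencies(votes):
    """Equation (5.5.1): `votes` has shape (M, K), one row per model
    point i and one column per considered cell k. Each row is divided
    by the model point's total number of votes."""
    totals = votes.sum(axis=1, keepdims=True)
    return votes / np.maximum(totals, 1)   # guard against rows with no votes
```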

Our approach is carried out with normalized features, so that we can ensure consistency of the data within the region of the smoothing function (2.2.19) used in the training procedure. Thus, we obtain new model point contributions:

p_i(k|x_n) = \frac{p'_i(k|x_n) - a_i}{b_i}, \qquad
a_i = \frac{1}{N} \sum_{n=1}^{N} p'_i(k|x_n), \qquad
b_i = \sqrt{\left(\frac{1}{N} \sum_{n=1}^{N} p'_i(k|x_n)^2\right) - a_i^2}    (5.5.2)

where ai are the means and bi the standard deviations of the model point contributions, calculated over the N samples of the training corpus.
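The normalization can be sketched as follows; since the equation sums over the training samples only, this illustration keeps per-cell statistics and follows (5.5.2) literally:

```python
import numpy as np

def normalize_contributions(p_raw):
    """Equation (5.5.2), read literally: `p_raw` has shape (N, M, K)
    for N training samples, M model points and K cells; means a_i and
    standard deviations b_i are taken over the N samples."""
    a = p_raw.mean(axis=0)                           # a_i
    b = np.sqrt((p_raw ** 2).mean(axis=0) - a ** 2)  # b_i
    return (p_raw - a) / np.maximum(b, 1e-12)        # guard b_i > 0
```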


5.6 Model Choice

In order to make use of the Bayes decision rule in a pattern recognition problem, a model for the class posterior probabilities p(k|x) must be defined. Our model consists of the log-linear combination of the model point contributions into a probability distribution of the ME family (as described in the theory section 2.2.3).

p_\Lambda(k|x_n) = \exp\left( -\log Z(\Lambda, x_n) + \sum_{i=1}^{M} \lambda_i \log p_i(k|x_n) \right)    (5.6.1)

The value Z(Λ, xn) is a normalization term ensuring a proper probability distribution:

Z(\Lambda, x_n) = \sum_{k'} \exp\left[ \sum_{i=1}^{M} \lambda_i \log p_i(k'|x_n) \right]    (5.6.2)

Here, the coefficients Λ = (λ1, ..., λM)^T can be interpreted as the weights of the model points. Each model point i votes in the GHT with its corresponding weight. A framework without shape model optimization would thus apply uniform votes, e.g. Λ = (1, ..., 1)^T.
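A compact sketch of equations (5.6.1) and (5.6.2); the use of SciPy's logsumexp for the normalization term is our own choice:

```python
import numpy as np
from scipy.special import logsumexp

def log_linear_posterior(log_p, lam):
    """Equations (5.6.1)-(5.6.2): `log_p` has shape (M, K) and holds
    log p_i(k|x_n); `lam` is the weight vector Lambda of length M.
    Returns p_Lambda(k|x_n) for all K cells."""
    scores = lam @ log_p                        # sum_i lambda_i log p_i(k|x_n)
    return np.exp(scores - logsumexp(scores))   # division by Z(Lambda, x_n)
```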

5.7 Minimum Classification Error Training

For the previously presented model, the free parameters Λ need to be estimated. In the literature, these parameters are usually optimized without taking the information of competing classes into consideration. Applied to our case, this would mean that the weight of a model point is trained to maximize the likelihood of its voting occurrences for each cell of the HS. However, we are interested in how important a model point is for exclusively finding the object, not only in how likely it is to vote for the correct object location. Thus, the training procedure followed is a discriminative training method as proposed in [14]. The background of this theory is summarized in the previous sections (2.2.5); herein we only focus on specific questions relative to the GHT problem.

The underlying principle is based on the discrimination capability of each model point to aid in the GHT classification problem. A direct comparison of the numbers of votes of a model point between two different classes (cells in the HS) indicates its discrimination potential. In this specific problem, we are interested in the discrimination between the "correct" class kn and the rivals k. Thus, our approach optimizes the coefficients Λ with respect to a classification error rate of the discriminant function (5.7.1).

\log \frac{p_\Lambda(k|x_n)}{p_\Lambda(k_n|x_n)} = \sum_{i=1}^{M} \lambda_i \log \frac{p_i(k|x_n)}{p_i(k_n|x_n)}    (5.7.1)

Now assume we are given a set of training images n = 1, ..., N with correct class assignments kn and generate a feature sequence xn for each volume (data collection). By performing a preliminary classification with equal weights (i.e. λi = const ∀i), a set of rival classes k ≠ kn can be determined. In order to quantify the classification error for each rival class k, an appropriate distance measure Γ(kn, k) must be selected. In the case of a translation classification problem, where the solution is a 3D position vector, the Euclidean distance between the correct vector and its rival can be used. Alternatively, other distances can be considered; for instance, we also tried out the distance measure of the L-1 form (also called the Manhattan distance). An even simpler idea is to use a binary distance measure, which is '1' for the correct class and '0' for all others. This way,

\Gamma(k_n, k) = \sqrt[r]{\sum_{i=1}^{3} \left| t_{k_n,i} - t_{k,i} \right|^r}    (5.7.2)

where r = 1 and r = 2 yield the Manhattan (L-1) and Euclidean (L-2) distances respectively.
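Both distances can be computed with one small helper (an illustrative sketch; vectors are given in HS units):

```python
import numpy as np

def gamma_distance(t_correct, t_rival, r=2):
    """Equation (5.7.2): L-r distance between the correct and rival
    translation parameters (in HS units); r=1 gives the Manhattan and
    r=2 the Euclidean distance."""
    d = np.abs(np.asarray(t_correct, float) - np.asarray(t_rival, float))
    return float(np.sum(d ** r) ** (1.0 / r))
```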

The selected distance measure strongly depends on the target object and problem. For instance, binary distance measures should be preferred to L-distances if the goal is to discriminate among shapes regardless of their position, penalizing only the cost of misclassification. Our choice of an L-p distance measure aims at localizing the target object as close as possible to the ground truth, so that a subsequent segmentation can be performed.

In the next step, the model combination parameters (equation 5.6.1) should minimize the classification error count E(Λ) (equation 2.2.18) according to the definition of the discriminant function explained above (equation 5.7.1).

Through a gradient descent scheme, the Λ can be iteratively estimated (equation 2.2.21). Here, the discriminative training adjusts the weights Λ depending on:

1. The difference in the number of votes of the model point for the correct solution kn and the rival solutions k.

2. The distance measure Γ(k, kn) between the previous two solutions.

This iteration scheme reduces the weights λi of model points i which favor hypotheses with large Γ(kn, k). On the other hand, the weights of model points which favor good hypotheses (small Γ(kn, k)) are increased.
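Since the exact update rule (2.2.21) is defined in the theory chapter, the following is only a hedged sketch of one such gradient step, assuming a sigmoid smoothing of the distance-weighted error count; the function name and the constants eta and xi are our own:

```python
import numpy as np

def mce_step(lam, log_p, k_n, gamma, eta=1e-6, xi=1.0):
    """One hedged gradient-descent step for a single training sample.
    `log_p`: (M, K) log model point contributions; `k_n`: index of the
    correct cell; `gamma`: (K,) distances Gamma(k_n, k). The sigmoid
    and the constants eta, xi stand in for the smoothing function
    (2.2.19) and update rule (2.2.21) defined earlier in the thesis."""
    diff = log_p - log_p[:, [k_n]]        # per-point discriminant terms
    d = lam @ diff                        # discriminant (5.7.1) per cell
    s = 1.0 / (1.0 + np.exp(-xi * d))     # smoothed error indicator
    grad = diff @ (gamma * s * (1.0 - s) * xi)
    return lam - eta * grad               # penalize points favoring bad cells
```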

In direct connection with the discrimination potential of the method, we define a performance parameter that measures the degree of concentration of votes around the object location: the Smooth Error Rate (SER). See the evaluation section 6.3.

5.8 Model Point Selection

Through model point selection, we aim at reducing the computational complexity of the GHT, O(NmNeNp), while at the same time improving, or at least maintaining, the classification performance. We decide to eliminate model points that are not decisive for the classification, that is, model points that are not discriminative. The issue is then how to define the discrimination potential of a model point. Let us review the studies carried out in this master thesis; the concepts will be explained through a simple geometrical example implemented as a proof of concept (see figure 5.4).

The model points with the highest discrimination potential are those that only vote for the cell kn, or else for a cell k with a low loss Γ(k, kn) (1). Through our investigations, we realized that model points with a high number of votes in cells k but a low number of votes in the cell kn also obtain large absolute λ values (2).


Figure 5.4: Detection example of the lower part of a semicircle.

This fact can be directly observed in the iteration scheme for the calculation of the Λ (equation 2.2.21). Assume that a solution cell k is considered a good¹ hypothesis of the GHT and is therefore selected by the smoothing function (equation 2.2.19) during the training. In this case, the model point dependent contributions will be re-estimated according to the difference in the number of votes between the selected cell k and kn, as well as the distance measure Γ(k, kn).

For the first kind of model points (1), the gradient descent iteration scheme (equation 2.2.21) increments the value of their λ. For the second kind of model points (2), the λ value of the model point is decremented.

On the other hand, the model points with lower discrimination capability are those with more uniform contributions to the HS. The smaller the difference in votes between the cell k and the cell kn, the more gently the λ value varies.

Taking these considerations into account, we define a discriminative model point for the GHT as

1. a model point matching the object or its surroundings, but not matching other remaining shapes/edges in the image. The λ of these model points are positive.

2. a model point exclusively matching concurrent shapes. The λ of these points are negative.

Both of these model points obtain large absolute λ values.

Let us observe the geometrical example. We intend to localize the lower part of a semicircle with radius r in a 2D image, using an HM of a whole circle of the same radius. We introduce the upper part of a semicircle of radius r as a concurrent shape (figure 5.4). The translation parameter quantizations used in this experiment are 16 and 16 for the x and y coordinates respectively. Additionally, a gradient magnitude thresholding has been performed to restrict the number of considered edges in the image. The voting procedure can be seen in figure 5.5.

¹ The smoothing function selects a cell k according to its number of votes with respect to the whole HS. The number of considered cells k can be controlled through the parameter η (equation 2.2.19).


Figure 5.5: GHT procedure to find semicircles using a circle as HM.

The application of the standard GHT (SGHT), with uniform model point contributions, localizes the upper semicircle in our framework, whereas we aim at detecting the lower semicircle. Therefore, there is a cost error Γ(k, kn) = 9, which corresponds to the Euclidean distance in HS units between the cells k and kn. This cost also corresponds to the distance between the two semicircles.

Let us consider a circle HM optimized through our discriminative training algorithm. The model points in the HM matching the lower semicircle obtain positive λ values (coded in green in figure 5.6); these model points aid in the detection of the target. The upper part obtains negative λ values (in red); these model points aid in detecting the concurrent geometrical shape. A remark is necessary for the model points lying on the y axis: these model points have exchanged sign values due to the remaining edges on the horizontal axes of both semicircles. For example, the gradient of the lowest model point matches the remaining horizontal edges of the upper semicircle, so the GHT gives votes (in a wrong cell of the HS) due to this match. Of course, the gradient of this model point also matches the lower semicircle; it therefore also votes for the correct solution cell kn, but with fewer votes. These are the reasons (1) why the sign of the λ value of these model points differs from that of their neighbors, and (2) why they are the least discriminative model points (more uniformity in the voting procedure). Subsequent model point selection removes the model points on the horizontal axis and the negatively weighted model points. The latter are probably left out due to the initialization of the algorithm with weights equal to one (SGHT initialization).

We can also observe the discrimination and classification performance along with the reduction of model points in table 5.1.

Note that cells in the HS can receive negative votes, as is the case here. The SER directly depends on the number of positive and negative model points, as we can observe in the table.

Model points   Positive   Negative   Γ(k, kn)   SER
20             11         9          0          -1.06
18             10         8          0          -1.15
11             10         1          0          0.0398
9              8          1          0          -0.34
8              8          0          0          0.376

Table 5.1: DGHT performance along with model point reduction. Γ(k, kn) is the Euclidean distance.


Figure 5.6: Model point selection for the problem in figure 5.4. Red means negative λ value, green means positive λ value, and yellow means the model point was not selected.

5.9 Model Point selection from a cloud of model points

Now, let us take the example of the model point cloud optimization. We assume that the model points in the cloud that do not provide information for the classification get a low absolute weight, while the model points that are discriminative get a high absolute weight. Following this principle, the model points in the cloud that should be selected correspond to those with a large absolute weight value.
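Following this principle, the selection step itself reduces to a sort over the trained weights (an illustrative sketch; after pruning, the weights are retrained as described above):

```python
import numpy as np

def select_model_points(points, gradients, lam, keep):
    """Keep the `keep` model points with the largest absolute weights;
    the remaining weights are then retrained as described above."""
    order = np.argsort(np.abs(lam))[::-1][:keep]
    return points[order], gradients[order], lam[order]
```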

It is remarkable that a very negative weight applied to our features (model point dependent contributions to the HS) also means an improvement in the discrimination of our object detection task. Put differently, such a model point provides information for the classification in two different ways:

1. Related to the concurrent objects in the image. These model points cause a high number of votes in concurrent solution cells. The probability of deciding for such a cell as the best solution of the object detection task decreases along with the negative contribution of this model point to the solution k. In this way, the probability of a misclassification decreases and the SER decreases too.

2. Related to the target shape, that is, the cell kn or cells with low cost Γ(k, kn). The probability of classifying this cell as the best solution of the object detection increases when the number of votes coming from this model point in the present cell is low. Note the effect of normalization: a negative value of the normalized feature multiplied by the negative value of the model point weight equals a positive increment added to the cell. Consequently, the probability of classifying this cell as the best hypothesis of the object detection task increases.

This observation leads to a shape model whose model points do not only belong to the object to detect, but also to other shapes remaining in the preprocessed image.

5.10 Classification

The outcome of the designed classifier is the class k that maximizes the posterior probability p(k|x), as given in equation 5.2.1. The p(k|x) were modeled as log-linear combinations of model point dependent contributions of the maximum entropy family (5.6.1). This way we can write

x \mapsto r(x) = \arg\max_k \left\{ \sum_{i=1}^{M} \lambda_i\, p_i(k|x) \right\}    (5.10.1)

where M is the number of model points used and pi(k|x) are the relative voting frequencies of each model point i in cell k.

The class k selected by the classifier corresponds to the class which maximizes the sum over all weighted model point dependent distributions. The solution is the hypothesized location of the target object in the image (a translation vector t = (tx, ty, tz)).
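The decision rule (5.10.1) then reduces to an argmax over weighted vote sums, as the following minimal sketch shows:

```python
import numpy as np

def classify(p, lam):
    """Equation (5.10.1): `p` has shape (M, K) with the relative voting
    frequencies p_i(k|x); `lam` holds the trained model point weights.
    Returns the index of the winning cell, i.e. the hypothesized
    object location."""
    return int(np.argmax(lam @ p))
```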


5.11 Discriminative-GHT (DGHT)

During this master thesis, the GHT has been extended to detect anatomical objects discriminatively. We extended the shape models used in the GHT algorithm to hold model points weighted according to their discrimination potential. The GHT voting procedure is carried out with model point dependent contributions; that is, each model point i, i = 1, ..., M, in the shape model votes with its corresponding λi value estimated during the MCE algorithm.

We would like to point out that the HS of the DGHT can contain negative as well as positive votes. The hypothesized solution of the DGHT corresponds to the cell with the highest positive peak in the HS.

Figure 5.7 illustrates the discrimination potential of the DGHT in comparison with the SGHT for the semicircle example. The DGHT concentrates the peaks with the highest number of positive votes around the ground truth of the object to localize. This is a clear advantage for anatomical object detection, where many other concurrent shapes are present. Furthermore, confusing HM parts which do not match the target object but other features in the image are weighted with negative model point contributions. Thereby, they lead to negative accumulation cells in the HS where concurrent shapes are located. This reduces the probability of a wrong object detection due to shape model confusability.

Figure 5.7: HS after applying the SGHT and the DGHT, with the circle shape models shown under the HS images. Green means positive and red negative model points.

Applying the SGHT with uniform model point contributions, the highest peak corresponds to the concurrent object center. The SGHT distributes positive votes among both shape locations in the image: the upper half of the circle HM aids in locating the concurrent shape and the lower half in detecting the target.

On the contrary, applying the DGHT, the highest peak corresponds to the target semicircle. The DGHT is even more efficient than an SGHT with a shape model that does not introduce confusability (e.g. a shape model of a lower semicircle in this case; see figure 5.7). We can conclude from this example that the DGHT is more discriminative than the SGHT. Besides, the extension of the GHT also allows for the use of shape models that do not contain target information but other image features.

Table 5.2 shows the numerical results of the detection (L-1 distance error measure) and the discriminative properties (SER).

        Number of model points   Γ(k, kn)   SER
DGHT    21                       0          -1.06
GHT     21                       9          4.85

Table 5.2: Comparison of GHT and DGHT.


6 Implementation and evaluation of a learn-model for the GHT

6.1 Implementation

The DGHT is implemented within the shape finder software at PFLA. The whole working system consists of the following main blocks:

• Functionality to generate a random cloud of model points for the GHT. The user provides a radius and optionally a concentration scattering parameter. Shape-based models or model clouds containing the remaining edges of training images can also be introduced into the system.

• Software for data collection. Different parameters are requested from the user (e.g. GHT quantization parameters or the number of considered cells for the training).

• Software for model point weight optimization. Multi-modal Gaussian-based training software for speech recognition was provided at the beginning of this master thesis, and initial studies and modifications were carried out for it (see Appendix C). Through a more profound study during this master thesis, a better solution to the problem was found using models of the ME family. Training software for speech recognition was also provided at this point and the necessary modifications were carried out.

• Software for the GHT-based discriminative 3D anatomical object detection. This software includes model point selection functionality.

• Functionality for evaluation and visualization of the Hough space.

• Segmentation software based on the constrained deformable model by Jürgen Weese [23].

• User interface to apply the automatic selection of model points, collect training data and apply the DGHT.

Additionally, further software was developed for the generation of geometrical HMs and of images containing geometrical shapes (e.g. squares, circles, etc.). The DGHT and model point selection were also applied and investigated for some of these examples.

6.2 Medical data

All experiments are carried out on CT images, aiming at detecting the right femur. The same CT images are used as for the voxel classifier (see the characteristics in section 4.2). In addition to the ten CT images, a femur shape model containing 1024 model points with their gradient directions is given.


6.3 Evaluation

The performance of the DGHT can be given as:

1. A distance error measure to the ground truth in Hough space units. The distance of the solution cell after classification with respect to the ground truth is given as a performance parameter or error measurement; that is, an error measurement of zero indicates a perfect classification. The same distance measures are used in the optimization to penalize model points that vote for concurrent solutions.

\Gamma(k_n, k) = \sqrt[r]{\sum_{i=1}^{3} \left| t_{k_n,i} - t_{k,i} \right|^r}    (6.3.1)

where r = 1 and r = 2 yield the Manhattan (L-1) and Euclidean (L-2) distances respectively.

2. Segmentation success given the DGHT outcome as initial pose.

3. Visualization of the concentration of votes in the HS. In the case of the GHT optimization, the discrimination ability of the classifier is also of great interest. Thus, we also evaluate the HS to observe where the best solutions are located and whether they are concentrated around the ground truth, comparing the HS after running the SGHT and the DGHT. A good discrimination ability of the classifier would provide large advantages in difficult object detection tasks where many similar shapes are present. The detection accuracy should improve along with a more focused HS.

4. Smooth Error Rate (SER). The SER measures the degree of concentration of votes around the ground truth. It is normalized to ensure objectivity when comparing different data sets. It is notable that this parameter depends on the number of positive and negative model points. The SER does not provide precise information about possible concurrent peaks in the HS; for this purpose, only the visualization of the HS is really useful. The SER is measured as the weighted votes from all model points in every cell of the HS (p(k|x)) multiplied by the distance of the cell to the ground truth (Γ(kn, k)). The smaller the SER, the more concentrated the votes of the GHT are around the object location.

SER = \frac{1}{\sum_{k=1}^{K} p(k|x)} \sum_{k=1}^{K} \Gamma(k_n, k) \cdot p(k|x)    (6.3.2)
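Computationally, the SER is a distance-weighted average of the votes, as the following minimal sketch shows:

```python
import numpy as np

def smooth_error_rate(p, gamma):
    """Equation (6.3.2): `p` holds the (weighted) votes p(k|x) of all
    K cells and `gamma` the distances Gamma(k_n, k) to the ground
    truth. Smaller values mean votes concentrated around the object."""
    return float(np.dot(gamma, p) / np.sum(p))
```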


6.3.1 Shape model-based optimization

This master thesis aimed at reducing or eliminating the parts of the shape model that introduce confusability into the object detection task. This section presents the optimization results for the femur shape model. The experiment is carried out as follows:

• Preprocess the image. The remaining edges correspond to the voxels with grey and gradient magnitude values within the range [1000-5000].

• Collect data:

  – The number of considered cells in the HS is 100.
  – The L-1 distance error measure of each of the considered cells to the ground truth is given.
  – Model point dependent contributions of the 1024 model points are collected for each of the 100 selected cells in the HS.
  – Quantization of the HS: 20, 20, 12 in x, y and z respectively.
  – Three training images were used.

• Train the classifier. The number of optimization iterations is 1000, ε = 8388.6 and η = 64 · 10^{-8}.

• Run the DGHT. Evaluate and visualize the results.

Figure 6.1 illustrates the HS of the SGHT and the DGHT in one slice (longitudinal perspective), using a femur shape model containing 1024 model points.

Figure 6.1: HS using all 1024 model points of the femur shape model (longitudinal perspective): left GHT and right DGHT.

The results do not enhance the object detection accuracy: both images have very similar HS, and a second maximum still exists at the location of the left femur. The SGHT already performs very efficiently in this application (100% detection success). The DGHT does not outperform it, since the quality of the remaining edges used in the MCE algorithm does not provide enough discriminative information (e.g. the femur epiphysis¹ is faded). The most discriminative parts of the model do not match the remaining edges in the image; thus, they are not useful for detecting the right femur.

¹ femur head


Figure 6.2 shows the fifty most discriminative model points after applying the shape model optimization. As can be observed, these model points do not belong to the femur head. Detection results are also similar whether a random selection of fifty model points with the SGHT is carried out or a smart selection of fifty model points with the DGHT is applied. A more efficient preprocessing that keeps more object edges is necessary to achieve a better efficiency of the shape model-based optimization approach. In this case, the model points of the epiphysis would probably obtain the largest positive weights and could be selected.

Figure 6.2: Model point selection from the femur shape: the 50 most discriminative model points selected out of the 1024 contained in the shape.


6.3.2 Generation of a GHT-based "shape" model from scratch

We present here one of the experiments carried out to create optimized shape models from scratch for 3D anatomical object detection.

1. Create a random cloud of model points. For this specific experiment the model cloud is initialized with random edges coming from the heart boundary in several CT images (see figure 6.3).

Figure 6.3: Random cloud of 10000 model points.

2. Extract the location center of the anatomical object to detect in each of the training images (ground truth).

3. Preprocess the image. A grey and gradient magnitude threshold is applied to reduce the number of considered edges in the image. A very strict preprocessing removes object parts that might be discriminative; thus, the threshold window applied was large enough to still maintain enough information from the target object. A window of [1000-5000] has been applied for both the gradient and the grey value threshold.

4. Run data collection. After running the GHT, we select the k = 100 best solutions of the HS to train the classifier. Smaller numbers of solutions do not consider enough concurrent shapes, and too large numbers might consider too many noisy features in the image.

Better results were also observed for larger quantization parameters of the HS. This way, the system is more robust to noise, and more concurrent cells and relevant information can be introduced during the training without running into an overgeneralization problem. We used for these experiments a quantization of the HS of 20, 20, 12 voxels in the x, y, z coordinates respectively.

A total of three images was used to train the classifier (ct1, ct2 and ct3). As in any training procedure, the more training images are included, the more shape variability is captured.

5. Train the classifier. The number of optimization iterations is 1000, ε = 8388.6 and η = 64 · 10^{-8}. The L-1 distance measure to the ground truth (measured in HS units) and the model point contributions are given for each of the 100 cells considered.

The λ weight contributions are obtained for each of the model points in the model cloud.

6. Select a set of model points. The selection is carried out based on the absolute λ value of each model point:

   • Selection of the 1000 most discriminative model points and retraining of the classifier.
   • Selection of the 250 most discriminative model points and retraining of the classifier.

7. Run the DGHT for each set of cloud model points to observe the detection and discrimination performance of the classifier along with the reduction of model points.

8. Segmentation is performed taking the outcome of the DGHT as initial pose.

We evaluated the system on seven independent test images and on the training data. The application was applied to CT right femur detection.

Table 6.1 presents the results for different numbers of model points selected from an initial set of 10000 random model points. The selection of model points has been performed according to the largest absolute λ values. We measured the Euclidean distance to the ground truth (in HS units) for the training and test data. A measured distance error larger than 1.73 (≈ √3, i.e. an offset of one cell along each axis) is defined as a segmentation failure in this problem. The larger the distance error measure, the worse the performance of the classifier and the more difficult a posterior segmentation becomes.

Image    1000    500     250     100     50
ct1*     1       1       1       1       1
ct2*     1.41    1.41    1.41    1.41    1.41
ct3*     1.73    1.73    1.41    1.41    1.41
ct4      1.41    1.41    1.41    1.41    11.1
ct5      1.73    1.73    1.73    1.73    1.73
ct6      1       1       1       1       1.41
ct7      1       1       1       1       1
ct8      1.73    1.73    1.41    0       1.41
ct9      1.41    1.41    1       1       1
ct10     1.41    1.41    1.41    0       1

Table 6.1: Distance error Γ(k, kn) of the DGHT using 1000, 500, 250, 100 and 50 cloud model points. Images ct1, ct2 and ct3 (marked *) were used for training.

Table 6.2 presents the means and standard deviations of the Euclidean distances for the seven test images. The best results correspond to the cloud of 100 model points.

Model points         1000        500         250         100        50
Euclidean distance   1.38±0.30   1.38±0.30   1.28±0.26   0.88±0.6   2.66±3.46

Table 6.2: Mean and standard deviation of validation metrics for seven patients.

A successful segmentation was possible in six of the seven images tested, given the outcome of the DGHT using optimized shape models with 100 model points. In the failed case, only a small misplacement took place.


Theoretically, the GHT can only be applied using shape models containing the delineation of the object to detect.

We applied the GHT and the DGHT with 1000 model points from the random cloud (see figure 6.4). The GHT hypothesized solution is far away from the ground truth, with an L-2 distance measure of Γ(kn, k) = 4.24. The HS is more uniform in the case of the GHT.

The result can be clearly seen in figure 6.4: a SER of 8.89 for the GHT compared with a SER of 4.14 for the DGHT.

Figure 6.4: HS after applying the GHT and DGHT with 1000 random model points.


Figure 6.5: Model point selection (longitudinal perspective), from left to right and from top to bottom: the whole image with 1000 model points selected; then selections of 1000, 500, 250 and 100 model points respectively. Red means negatively and green positively weighted.

Figures 6.5 and 6.6 show the selection of model points. The center of the cloud is situated on the DGHT solution cell.

It is remarkable that model points with positive weights (in green) do not only match the discriminative parts of the target shape but also other image features that reference the center of this target shape. This can be observed in model points matching the hip or the remaining edges of the body surface. These model points contribute positive votes to detect the right femur around its correct position.

Model points with negative weights (in red) match the image features of other confusable objects but do not match the target object. These model points contribute negative votes to positions where concurrent shapes are located. Thus, the probability of a wrong detection in favor of concurrent locations is reduced.


Figure 6.6: Model point selection (radial perspective), from left to right and from top to bottom: selections of 1000, 500, 250 and 100 model points. Red means negatively and green positively weighted.

At first sight, there might also exist some model points that should be discriminative for detecting the femur. However, it should not be forgotten that the model cloud contains random gradients: the GHT only votes for those edges in the image whose gradients match the gradient of a model point. Therefore, it is also possible to find negative model points overlapping the target femur.

Figure 6.7 illustrates the HS along with the reduction of model points from the model cloud. The right femur is segmented given the outcome of the DGHT. The last two images (DGHT using 100 and 50 model points) show the segmentation outcome on the HS.


Figure 6.7: The HS of one slice after applying the DGHT with an optimized selection of model points (longitudinal perspective), from left to right and from top to bottom: 1000, 500, 250, 100 and 50 model points.


6.4 Discussion on the experimental results

The learn-shape methodology takes into consideration the edges remaining in an image after the preprocessing. More precisely, the learn-shape is based on the matches of the shape model to the remaining structures in the image. Therefore, a fixed preprocessing must always be applied.

The femur model optimization did not outperform a non-optimized femur model due to the quality of the remaining edges in the image. The most relevant information to distinguish the right femur from its most direct rival, i.e. the left femur, is removed during the preprocessing. Therefore, there is no possible match of the most discriminative part of the shape model to the object, and no votes are added to the HS. The shaft of the femur model potentially matches the remaining edges of both femurs. During the optimization procedure, the difference in the number of votes between two cells from the right and left femur locations is small, and the model point weights vary very smoothly. The most discriminative model points lie where the shaft has a more pronounced arch.

The generation of shape models from scratch incorporates other remaining image structures into the model to support the object detection. This way, more information is used to detect the right femur.

We introduce a new concept of weighted positive and negative model point contributions for the GHT. This approach constitutes a new discriminative object detection technique. The absolute value of the weight refers to the capability of a model point in the shape model to discriminate between different solutions. Based on this assumption, it is possible to iteratively exclude the model points with the smallest absolute weights until a stop criterion is reached (e.g. number of model points, distance error measure or segmentation failure), as sketched below.
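Such an exclusion loop could look as follows (a hedged sketch: `retrain` and `evaluate` are hypothetical stand-ins for the MCE training and the DGHT validation, and the pruning step size is an arbitrary choice):

```python
import numpy as np

def prune_until(points, gradients, lam, retrain, evaluate,
                min_points=50, max_error=1.73):
    """Hedged sketch of the iterative exclusion described above.
    `retrain` re-estimates the weights via MCE and `evaluate` returns
    the DGHT distance error on validation data; both are hypothetical
    stand-ins for the thesis software, and the ~10% step is arbitrary."""
    while len(lam) > min_points:
        step = max(1, len(lam) // 10)                  # drop ~10% per round
        keep = np.argsort(np.abs(lam))[::-1][:len(lam) - step]
        cand_pts, cand_grads = points[keep], gradients[keep]
        new_lam = retrain(cand_pts, cand_grads)        # hypothetical call
        if evaluate(cand_pts, cand_grads, new_lam) > max_error:
            break                                      # stop criterion reached
        points, gradients, lam = cand_pts, cand_grads, new_lam
    return points, gradients, lam
```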

The sign of the weight refers to which remaining features in the image (i.e. edges) a model point corresponds to:

1. a model point belonging to the object or its surroundings, but not to other remaining shapes/edges in the image. The λ of these model points are positive. These model points only vote to find the object in its correct position.

2. a model point exclusively matching concurrent shapes. The λ of these points are negative. These model points reduce the probability of an object detection failure in favor of concurrent shapes.

This way, the combination of both the sign and the absolute value of a model point contribution gives us useful information on which model points belong to the right femur and its surroundings but not to other concurrent structures, and vice versa. This constitutes a novel concept of "shape" (figure 6.8) composed of the most discriminative parts of

• the object: a "shape" with positive contributions to the HS.

• other concurrent objects: an "anti-shape" with negative contributions.

We also demonstrate a considerable improvement compared with a random shape model with uniform model point contributions: the HS discrimination capability is enhanced as well as the detection performance. This procedure makes the generation of shape models from scratch feasible and additionally allows for detecting objects discriminatively. The new procedure also outperforms the SGHT with femur shape models in detection accuracy: the second maximum peak corresponding to the left femur has been eliminated.

Figure 6.8: Shape and anti-shape.

We can observe that reducing the set of model points reduces the discrimination potential of the classifier (see the visualization of the HS in figure 6.7). The first assumption explaining this behavior is that a larger number of model points provides more information for the classification, which could help to obtain a more focused HS. The detection performance, measured as the Euclidean distance between the estimated object location given by the DGHT and the ground truth, is not necessarily affected by the reduction of model points: the reduction of the less useful information is possible. We reduce model points until the detection performance is no longer acceptable for a posterior segmentation.

We stopped the reduction of model points at an optimized random cloud of 100 model points. A successful segmentation was possible on six of the seven images tested, given the outcome of the DGHT as initial pose. In the failed case, only a small displacement took place.

The reduction of model points has been carried out by leaving out a subset of the least discriminative model points and retraining the classifier. Further validation experiments are recommended in which model points are removed iteratively, i.e. leaving out one model point at a time and retraining.

More training data provides more shape variability information. Therefore, better results can be expected on independent test data.

The results presented are a first step towards a more systematic validation (e.g. cross-validation experiments).


7 Conclusions and Outlook

The automation requirements were fulfilled during this master thesis. The GHT preprocessing has been optimized, so that fixed parameters (feature weights) are provided for a small selected subset of the most discriminative features.

The algorithm proposed to create shape models from scratch reduces the necessary user interaction to a minimum. The user is only requested to give the location of the shape in a set of training volumes and optionally a region of interest (e.g. a radius). This allows addressing the shape generation for new objects without extra user overhead. In addition, the generated model is extended to hold discrimination capability.

The GHT-preprocessing system allows for the optimal integration of image features into discriminative Gaussian density mixtures. These distributions can be used to model the posterior probabilities in the Bayes decision rule. Hence, they can be applied in classification problems (e.g. voxel classification into object and non-object). To our knowledge, there are no studies on how to discriminatively combine image features or how to train Gaussian distributions through discriminative procedures.

We have shown experimentally in this master thesis how Gaussian distributions can be discriminatively trained to estimate their means and covariances. These free parameters correspond to the class-specific weights of first-order image features (if pooled¹ Gaussian distributions are considered). The approach can be extended to second-order image features, as we explain in section 3.5. This would be a step further towards the sophistication of the system, which can be investigated in future work. This master thesis sketches throughout the different sections how our proposal should be put into practice for the specific case of image processing.

In direct connection to the MCE technique for Gaussian distributions, we introduce a methodology to select image features based on their discriminative capability. The studies on feature selection add value to the scientific discipline of discriminative training. Given the absolute values of the weighted features, their relevance for a given classification task is determined, as illustrated in the sketch below. Hence, our algorithm can be used to determine which features are more important to segment a given anatomical object in a medical image (e.g. in a certain body region and image modality). Image feature dependencies can only be eliminated by observing constant misclassification error rates.
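A minimal sketch of this selection criterion, assuming the trained weights are available as a (K, M) array (names are illustrative, not from the thesis code):

import numpy as np

def rank_features(lambdas, feature_names):
    """Rank features by their largest absolute trained weight over all classes.

    lambdas is assumed to be a (K, M) array of class-specific weights."""
    relevance = np.abs(lambdas).max(axis=0)
    order = np.argsort(relevance)[::-1]          # most discriminative first
    return [(feature_names[i], float(relevance[i])) for i in order]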

The classifier performance is directly related to the quality of the image features selected for a specific problem. This is the main reason for the low discrimination performance of the implemented system on our application, as we explain in the discussion of the experiments.

The goodness of the procedure for the specific GHT problem should be further studied through a more extensive systematic evaluation with different parameter settings and initial sets of features. In any case, the results presented give a first insight into the goodness of such an approach applied to GHT preprocessing. Definitions of discriminative features are necessary to make this specific approach worthwhile. Discrimination in our problem can probably only be achieved given macroscopic properties of the searched bone structure.

¹ Σ_k = Σ: the covariance matrix is calculated only once, instead of K times.


The problem of GHT preprocessing could probably be addressed more efficiently by discriminatively learning the posterior probabilities of the occurrences of semi-local parts, or clusters of neighboring keypoints in the object with distinctive appearance and geometric layout. Some researchers, e.g. Lazebnik et al. [16], Deselaers et al. [20] or Keysers et al. [24], have already investigated similar approaches, but using other (non-discriminative) optimization techniques to estimate (and model) the class posterior probabilities.

We focused our studies on image voxel classification for edge extraction for the GHT, but this approach could also be used in other classification problems with similar contexts. The most important prerequisite is the quality of the feature definition, always keeping the given problem in mind. The studies carried out in this master thesis on discriminative feature selection and their optimal integration can be of relevant interest for many other medical applications, for instance disease diagnosis given symptom features, or applied to bioinformatics (e.g. gene expression). The system can also be used on other image modalities. In fact, it is already being used to set MR heart voxel probabilities and to normalize cardiac MR images.

The learn-shape-from-scratch approach for the GHT introduces a novel technique to detect objects discriminatively. It entails a new concept of shape which incorporates the most discriminative information from all structures remaining in the image after preprocessing. The object shape parts ("shapes") are weighted positively and the concurrent shape parts ("anti-shapes") are weighted negatively to reduce the probability of a wrong detection.

The DGHT has been applied efficiently to right femur detection, outperforming an SGHT using a femur model. The DGHT increases detection accuracy at similar GHT computational complexity. As we showed in the experimental discussion, a focused HS is achieved around the object location. Additionally, shape variability is incorporated from all training images.

The results presented seem promising. However, more training data and crosslink validation are necessary to support the first evaluation experiments.

The system has been applied to right femur detection, but it can also address the detection of new anatomical objects discriminatively. The DGHT is advantageous for object detection problems where many concurrent shapes are present.

Further work can also be done to optimize the system by initializing the cloud of model points using the image features remaining after the preprocessing. This way, all the image information is included from the initial stage.

Additionally, the studies carried out on learn-shape models based on multivariate Gaussian distribution models can add scientific value to the discipline of multi-object detection (see appendix C). Discrimination properties can be introduced into "multi-shape models" based on the same principles of "shape" and "anti-shape" explained in this master thesis.

An integration of the two systems presented in this master thesis is possible: the learn-shape model algorithm would rely on the remaining edges, i.e. the output of the GHT-based discriminative voxel classifier.

To sum up, this master thesis automated the 3D GHT algorithm with respect to GHT preprocessing and shape model generation. The learn-shape model is a successful approach to discriminatively detect anatomical objects in 3D medical images. The voxel classifier requires further investigation into the definition of image features, or a redefinition of the problem using object patches towards direct object detection.


APPENDICES


A Sobel Edge Operator

The Sobel edge operator is applied before thresholding in the current GHT-preprocessing framework. Let us assume that the intensity of a given image is defined by f(x, y, z). Since an edge is typically defined as a discontinuity in the intensity of an image, information about the existence of an edge can be obtained from the intensity gradient:

\nabla f(x, y, z) = \left( \frac{\partial f}{\partial x},\; \frac{\partial f}{\partial y},\; \frac{\partial f}{\partial z} \right)    (A.0.1)

The gradient magnitude

m(x, y, z) = \sqrt{ \left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2 + \left(\frac{\partial f}{\partial z}\right)^2 }    (A.0.2)

is a good measure for the strength of an edge at a specific voxel. As an alternative to the L2 norm of the gradient vector, the L1 norm can be used to determine the gradient magnitude, i.e.

m_{L1}(x, y, z) = \left| \frac{\partial f}{\partial x} \right| + \left| \frac{\partial f}{\partial y} \right| + \left| \frac{\partial f}{\partial z} \right|    (A.0.3)

A discrete approximation of the partial derivative with respect to a certain direction can be determined by convolving the volume with a set of 3-D masks. A generalization of the well-known Sobel edge detection masks to 3-D is given by three 3 × 3 × 3 masks for the x-, y-, and z-directions, respectively. Each of these masks is built up from a set of three 2-D masks:

\begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \qquad \begin{pmatrix} -2 & 0 & 2 \\ -3 & 0 & 3 \\ -2 & 0 & 2 \end{pmatrix} \qquad \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}

Convolving a given volume with these masks provides a 3-D gradient vector for each voxel, from which the gradient magnitude and direction can be derived.
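For illustration, a minimal sketch of this 3-D Sobel gradient using NumPy/SciPy (an assumed implementation, not the thesis framework; the assignment of array axes to the x-, y- and z-directions is a convention of the sketch):

import numpy as np
from scipy.ndimage import convolve

# The three 2-D slices above, stacked into one 3x3x3 mask.
s_outer  = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
s_center = np.array([[-2, 0, 2], [-3, 0, 3], [-2, 0, 2]], dtype=float)
mask = np.stack([s_outer, s_center, s_outer])   # derivative along the last axis
masks = [mask.transpose(2, 1, 0),               # derivative along axis 0
         mask.transpose(0, 2, 1),               # derivative along axis 1
         mask]                                  # derivative along axis 2

def sobel_gradient_3d(volume):
    """Return the per-voxel gradient components and L2 magnitude (eq. A.0.2)."""
    components = [convolve(volume.astype(float), m) for m in masks]
    magnitude = np.sqrt(sum(c ** 2 for c in components))
    return components, magnitude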


B Gaussian distributions and Maximum Entropy models

This appendix is extracted from and based on the published paper [3], in which Keysers et al. prove how the means and standard deviations of multivariate Gaussian distributions can be discriminatively trained using the well-known algorithms of ME discriminative training.

We model the class-conditional probabilities p(x|k) through multivariate Gaussian densities N(x|\mu_k, \Sigma_k), where \Sigma_k denotes the class-specific covariance matrix, \mu_k the mean of class k, and T the transpose:

p(x|k) = N(x|\mu_k, \Sigma_k) = \frac{1}{\det(2\pi\Sigma_k)^{\frac{1}{2}}} \exp\left[ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right]    (B.0.1)

We can apply the logarithm to the last equation and expand the expression,

\log N(x|\mu_k, \Sigma_k) = -\frac{1}{2} \log\det(2\pi\Sigma_k) - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)
                          = -\frac{1}{2} \log\det(2\pi\Sigma_k) - \frac{1}{2} x^T \Sigma_k^{-1} x + \mu_k^T \Sigma_k^{-1} x - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k    (B.0.2)

and introduce it into the Bayes’ decision rule,

p(k|x) = \frac{p(k)\, N(x|\mu_k, \Sigma_k)}{\sum_{k'} p(k')\, N(x|\mu_{k'}, \Sigma_{k'})}
       = \frac{\exp\left[ \left( \log p(k) - \frac{1}{2}\mu_k^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\log\det(2\pi\Sigma_k) \right) + \left( \mu_k^T \Sigma_k^{-1} \right) x - x^T \left( \frac{1}{2}\Sigma_k^{-1} \right) x \right]}{\sum_{k'} \exp\left[ \left( \log p(k') - \frac{1}{2}\mu_{k'}^T \Sigma_{k'}^{-1} \mu_{k'} - \frac{1}{2}\log\det(2\pi\Sigma_{k'}) \right) + \left( \mu_{k'}^T \Sigma_{k'}^{-1} \right) x - x^T \left( \frac{1}{2}\Sigma_{k'}^{-1} \right) x \right]}    (B.0.3)

We group terms

p(k|x) = \frac{\exp\left[ \alpha_k + \lambda_k^T x + x^T \gamma_k\, x \right]}{\sum_{k'} \exp\left[ \alpha_{k'} + \lambda_{k'}^T x + x^T \gamma_{k'}\, x \right]}    (B.0.4)

On the other hand, we assume an ME model with second-order features, as follows:

f_{k,ij}(x, k_n) = \delta(k, k_n)\, x_i x_j, \quad i \ge j \ \text{(due to symmetry)}
f_{k,i}(x, k_n) = \delta(k, k_n)\, x_i    (B.0.5)
f_k(x, k_n) = \delta(k, k_n)


where the \delta(k, k_n) are defined so that the features only take their value when their corresponding class k_n is selected.

In this way

\delta(k, k_n) = \begin{cases} 1 & \text{if } k = k_n \\ 0 & \text{if } k \neq k_n \end{cases}    (B.0.6)

Thereby, we consider image features dependent only on one of the categorized classes, denoted as f_{k,i}.

Coming back to the ME theory, we introduce the features into the log-linear combination, arriving at

p_\Lambda(k|x) = \frac{\exp\left[ \alpha_k + \sum_{i=0}^{M} \lambda_{i,k}\, x_i + \sum_{i,j=0}^{M} \gamma_{k,ij}\, x_i x_j \right]}{\sum_{k'} \exp\left[ \alpha_{k'} + \sum_{i=0}^{M} \lambda_{i,k'}\, x_i + \sum_{i,j=0}^{M} \gamma_{k',ij}\, x_i x_j \right]}
               = \frac{\exp\left[ \alpha_k + \lambda_k^T x + x^T \gamma_k\, x \right]}{\sum_{k'} \exp\left[ \alpha_{k'} + \lambda_{k'}^T x + x^T \gamma_{k'}\, x \right]}    (B.0.7)

The previous equation B.0.7 can be simplified by denoting the features as follows:

f_{M^2+M+1} = 1 \quad \text{with} \quad \lambda_{M^2+M+1,k} = \log p(k)
f_0 = 1 \quad \text{with} \quad \lambda_{0,k} = \alpha_k - \log p(k)
f_i = x_i, \quad 1 \le i \le M    (B.0.8)
f_i = x_l x_j, \quad M+1 \le i \le M^2+M, \; i = lM + j

where the new \lambda_{i,k} for the features f_i = x_i and f_i = x_l x_j correspond to the parameters in equation B.0.7:

\lambda_{i,k} = \lambda_{k,i} \quad \text{for } f_i = x_i    (B.0.9)
\lambda_{i,k} = \gamma_{k,i} \quad \text{for } f_i = x_l x_j

The simplified formula (used in section 3.5) is denoted as:

p_\Lambda(k|x) = \frac{\exp\left[ \sum_{i=0}^{M^2+M+1} \lambda_{i,k}\, f_i(x) \right]}{\sum_{k'} \exp\left[ \sum_{i=0}^{M^2+M+1} \lambda_{i,k'}\, f_i(x) \right]}    (B.0.10)

Observing equations B.0.7 and B.0.4, we can identify corresponding terms in both. Following the same reasoning but with pooled (\Sigma_k = \Sigma) multivariate Gaussian distributions, we arrive at an ME model with first-order features. Hence, we can conclude that discriminatively trained multivariate Gaussian distributions can be used in the same way as a maximum entropy model with features of first or second order within our Discriminative Model Combination. This approach leads to class-specific free parameters and the possibility of second-order features.
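The identification of terms can be checked numerically. The sketch below (an assumed illustration, not part of [3] or the thesis code) computes the posterior once via the Gaussian form (B.0.3) and once via the grouped log-linear parameters \alpha_k, \lambda_k, \gamma_k of (B.0.4):

import numpy as np

def gaussian_posterior(x, priors, means, covs):
    """Bayes posterior p(k|x) from class-conditional Gaussians (eq. B.0.3)."""
    logp = []
    for p, mu, S in zip(priors, means, covs):
        Sinv = np.linalg.inv(S)
        d = x - mu
        logp.append(np.log(p)
                    - 0.5 * np.log(np.linalg.det(2 * np.pi * S))
                    - 0.5 * d @ Sinv @ d)
    logp = np.array(logp)
    e = np.exp(logp - logp.max())            # stable softmax
    return e / e.sum()

def loglinear_posterior(x, priors, means, covs):
    """Same posterior via the grouped terms alpha_k, lambda_k, gamma_k (eq. B.0.4)."""
    scores = []
    for p, mu, S in zip(priors, means, covs):
        Sinv = np.linalg.inv(S)
        alpha = (np.log(p) - 0.5 * mu @ Sinv @ mu
                 - 0.5 * np.log(np.linalg.det(2 * np.pi * S)))
        lam, gamma = Sinv @ mu, -0.5 * Sinv  # first- and second-order weights
        scores.append(alpha + lam @ x + x @ gamma @ x)
    scores = np.array(scores)
    e = np.exp(scores - scores.max())
    return e / e.sum()

For any priors, means, and covariances, the two functions return identical distributions, which is exactly the term identification between B.0.3 and B.0.4.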


C Discriminative multi-object detection using multivariate Gaussian distributions for Hough Model Optimization

Even though we could not validate a Hough model optimization based on multivariate Gaussian distributions, we describe the attempts made, mainly with a view to possible future work in the field. The main potential future application is discriminative multi-object detection.

We intend to explain the drawbacks of this approach for the given initial problem, the trials performed, and possible solutions to make the approach feasible. Despite the belief that the approach could be put into practice successfully, the final solution proposed in this thesis is simpler from the conceptual and implementation points of view. Besides, it fulfills our problem specifications more closely. Therefore, as Einstein said: "Make everything as simple as possible, but not simpler." [25]

The first studies on learn-shape models for the GHT were carried out modeling the model-point-dependent contributions as multivariate Gaussian distributions. Insight into the problem revealed some weaknesses in the problem definition. However, a better understanding of the optimization criterion led us to think that this approach can be applied to discriminative multi-object detection. We expose the main problems and possible solutions:

• Advantages: A Gaussian density model approach has the advantage of refining the calculation of the Λ for each considered class of a pattern recognition problem with respect to any error cost Γ(k_n, k). This allows for calculating the discrimination potential of a model point with respect to a specific class in the HS, i.e. the discriminative detection of different objects simultaneously. Additionally, the importance of second-order features can be estimated, as can the discrimination potential of a combination of two (or a group of) model points. However, certain drawbacks of this algorithm applied to shape model optimization were observed during our experiments:

• Problems:

– Overgeneralization: In our specific GHT problem, the number of classes k that must be included in the study is very high, especially if we consider a 3D framework and the inclusion of the whole HS. This has a direct influence on the number of Λ parameters to be calculated: an increment of K · M parameters occurs for a first-order feature problem with M model points and K classes extracted from the HS. Thus, the amount of necessary training data would have to be considerably larger to avoid an overgeneralization problem.

– Class-specific free parameter optimization: The calculation of a λ_{i,k'} parameter is affected by the rest of the parameters λ_{i,k}. In our problem specification, we aim at optimizing the importance of a set (or groups) of model points to localize an anatomical object as close as possible to its ground truth for a subsequent object segmentation. Therefore, we are not interested in optimizing the set of Λ with respect to k rival hypotheses of the GHT, but between k_n and its k rivals.

– Class consistency across different sets of training images: Another issue derived from the class-specific Λ approach is the requirement of data registration if more than one observation x (that is, more than one image) is included in the training corpus. This is strictly necessary to overcome overgeneralization and overfitting. Thus, a 3D CT/CT image registration would have to be performed prior to the training to obtain consistency in the class-specific training.

• Solutions:

– HS registration: We applied a simpler methodology to register the data. Every cell in the HS obtains its identification (ID) with reference to the class k_n (ground truth), e.g. according to the distance measure Γ(k_n, k) or correlatively, where k_n has ID '0' (see the sketch at the end of this appendix).

– Reduce the dimensionality of the HS: To overcome the overgeneralization intricacy, we first shrank the HS by using a coarser quantization of the accumulator space. Thereby, the dimensionality and the number of free parameters are reduced.

– HS cell clustering: Secondly, we tried to cluster cells in the HS. This way, a class k in our classification problem does not correspond to a single cell of the HS but to a region. A larger number of samples per class is achieved while the number of classes is reduced. At first sight, it seems that overgeneralization could be alleviated. However, clustering classes according to the distance to k_n did not perform properly. A first assumption is that we still encounter an overgeneralization problem. A second assumption points at the definition of the class: two cells in the HS with a common class k, k = Γ(k_n, k), do not necessarily obtain a similar number of votes, because the votes come from different edges/areas in the image. Thus, the optimization of a class-specific λ could become inconsistent, alternately incrementing and reducing the weight of a feature that is not invariant within a class. In terms of model point contributions, the invariance is measured in the number of votes for a given class. If two observations of the same class have very different model-point-dependent contributions, the weight of a model point would alternately increase and decrease. Note that we encountered a similar problem while defining image features for the image voxel classifier.

– HS cell clustering based on a distinctive layout: A clustering approach based on the proximity of cells in the HS could be studied. However, the invariance of votes from a model point cannot be guaranteed and strongly depends on the size of the cluster. A priori information could be introduced into the training by fixing cluster centers corresponding to concurrent shapes in the image. This approach is closer to patch-based object recognition [20].

– Coarse multi-object detection by drastic reduction of the HS: A possible alternative is shrinking the HS until the overgeneralization problem is overcome. The cells in this case are registered correlatively with a unique ID within an observation x. This implies a compromise between smoothing the HS by increasing its quantization and the total number of cells. The final solution of the classification problem is approximated to one of the defined areas k in the image.

As we can see, tackling the problem through this approach raises it to a higher-level scientific problem.
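As referenced in the "Solutions" list above, a minimal sketch of the HS registration idea (an assumed implementation; names are illustrative): every HS cell receives an ID given by its rounded Euclidean distance to the ground truth cell k_n, which therefore gets ID 0.

import numpy as np

def register_hs_cells(hs_shape, k_n):
    """Label every HS cell with its rounded Euclidean distance to k_n (ID 0)."""
    grid = np.indices(hs_shape)                   # one coordinate array per axis
    d2 = sum((g - c) ** 2 for g, c in zip(grid, k_n))
    return np.rint(np.sqrt(d2)).astype(int)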


D Acronyms

(CT) Computed Tomography

(DGHT) Discriminative-GHT

(DMC) Discriminative Model Combination

(GHT) Generalized Hough Transform

(HM) Hough Model

(HS) Hough Space

(HT) Hough Transform

(MR) Magnetic Resonance

(MCE) Minimum Classification Error

(ME) Maximum Entropy

(MP) Model Point

(PFLA) Philips Forschungslaboratorien Aachen (Philips Research Laboratories Aachen)

(SER) Smooth Error Rate

(SGHT) Standard GHT


E Mathematical symbols

This list describes important variables used in this master thesis. Some additional symbols, and the specific meanings for both of the implemented applications, are explained in the sections in which they are used.

x          Observation.

N          Number of observations.

x^M        Observation feature vector x^M = (x_1, ..., x_M).

M          Number of features.

K          Number of classes.

k          Class label, k ∈ {k_1, ..., k_K}.

k̂          Class estimated by the classifier.

x_n        Current observation from a total set of N observations.

k_n        Class label, out of the K possible classes, for the current observation n (ground truth).

Γ(k, k_n)  Cost of misclassifying a class k against the correct class k_n of the current observation n.

p(k|x_1, ..., x_M)  Class posterior probability of class k given the feature vector of observation x.

λ_{i,k}    Weight of feature i for a specific class k.

Λ          Set of all λ_{i,k}.

p_Λ(k|x_n) Weighted class posterior probability of class k given the current observation x_n.


Bibliography

[1] Ballard DH. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognit 1981;13:111-22.

[2] Schramm H, Ecabert O, Peters J, Philomin V, Weese J. Toward fully automatic object detection and segmentation. In: Reinhardt JM, Pluim JPW, editors. Medical Imaging 2006: Image Processing. Proc SPIE 2006 Mar;6144:11-20.

[3] Keysers D, Och FJ, Ney H. Maximum Entropy and Gaussian Models for Image Object Recognition. DAGM 2002, Zurich, Switzerland; 2002 Sep. p. 498-506.

[4] Schramm H. Automated Generation of Shape-Variant Hough Models for the Generalized Hough Transform. Patent filed in 2006; International Application Number PCT/IB2006/054912.

[5] Hough PVC. Method and means for recognizing complex patterns. U.S. Patent No. 3069654; 1962.

[6] O'Gorman F, Clowes MB. Finding picture edges through collinearity of feature points. Proc 3rd Int Joint Conf Artif Intell 1973;543-55.

[7] Duda RO, Hart PE. Use of the Hough transform to detect lines and curves in pictures. Communications of the Association for Computing Machinery 1972;15:11-15.

[8] Duda RO, Hart PE. Pattern Classification and Scene Analysis. New York: John Wiley & Sons; 1973.

[9] Dahmen J, Schlüter J, Ney H. Discriminative Training of Gaussian Mixture Densities for Image Object Recognition. In: Förstner W, Buhmann J, Faber A, Faber P, editors. 21. DAGM Symposium Mustererkennung, Bonn, Germany; 1999 Sep. p. 205-12.

[10] Jaynes ET. On the Rationale of Maximum Entropy Methods. Proc IEEE 1982 Sep;70(9):939-52.

[11] Kapur JN, Kesavan HK. Entropy Optimization Principles with Applications. New York: Academic Press; 1992.

[12] Beyerlein P. Discriminative Model Combination. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, CA, USA; 1997 Dec. p. 238-45.

[13] Beyerlein P. Diskriminative Modellkombination in Spracherkennungssystemen mit grossem Wortschatz [Discriminative model combination in large-vocabulary speech recognition systems]. PhD diss. RWTH Aachen, Computer Science Department, Aachen, Germany; 2001.

[14] Beyerlein P. Discriminative Model Combination. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Seattle, WA, USA; 1998 May. vol 1 p. 481-4.

[15] Darroch JN, Ratcliff D. Generalized Iterative Scaling for Log-Linear Models. Annals of Mathematical Statistics 1972;43(5):1470-80.

[16] Lazebnik S, Schmid C, Ponce J. A Maximum Entropy Framework for Part-Based Texture and Object Recognition. Proceedings of the IEEE International Conference on Computer Vision, Beijing, China; 2005 Oct. vol 1 p. 832-8.


[17] Xu L, Oja E, Kultanen P. A new curve detection method: Randomized Hough Transform (RHT). Pattern Recognit Lett 1990;11(5):331-8.

[18] Lee K, Street WN. Generalized Hough Transforms with Flexible Templates. Proceedings of the Int Conf on Artificial Intelligence (IC-AI), Las Vegas, NV; 2000 Jun. vol 3 p. 1133-9.

[19] Cootes TF, Taylor CJ, Cooper DH, Graham J. Active shape models - their training and application. Comput Vis Image Underst 1995;61(1):38-59.

[20] Deselaers T, Keysers D, Ney H. Discriminative Training for Object Recognition using Image Patches. CVPR 2005, International Conference on Computer Vision and Pattern Recognition, San Diego, CA; 2005 Jun. vol 2 p. 157-62.

[21] Deselaers T, Keysers D, Ney H. Improving a Discriminative Approach to Object Recognition using Image Patches. DAGM 27th Symposium on Pattern Recognition, Vienna, Austria; 2005. LNCS vol 3663 p. 326-33.

[22] Beyerlein P. DMC-based Density Parameter Estimation for Classification in Speech Recognition and Medical Informatics. 2006, to be published.

[23] Weese J, Kaus R, Lorenz C, Lobregt S, Truyen R, Pekar V. Shape constrained deformable models for 3D medical image segmentation. In: Insana MF, Leahy RM, editors. Information Processing in Medical Imaging. Heidelberg (Berlin): Springer; 2001. p. 380-7. (LNCS; vol 2082).

[24] Keysers D. Modeling of image variability for recognition. PhD diss. RWTH Aachen, Computer Science Department, Aachen, Germany; 2006.

[25] Einstein A (1879-1955). Physicist; quoted from "Brainy Quotes".

[26] Jaakkola T, Meila M, Jebara T. Maximum Entropy Discrimination. In: Advances in Neural Information Processing Systems 12. Cambridge, MA: MIT Press; 2000. p. 470-6.

[27] Macherey W, Keysers D, Dahmen J, Ney H. Improving Automatic Speech Recognition Using Tangent Distance. Eurospeech 2001, 7th European Conference on Speech Communication and Technology, Aalborg, Denmark; 2001 Sep. vol 3 p. 1825-8.

[28] Keysers D, Ney H. Linear Discriminant Analysis and Discriminative Log-linear Modeling. ICPR 2004, International Conference on Pattern Recognition, Cambridge, UK; 2004 Aug. vol 1 p. 156-9.

[29] Ney H. On the Probabilistic Interpretation of Neural Network Classifiers and Discriminative Training Criteria. IEEE Trans Pattern Anal Mach Intell 1995 Feb;17(2):107-19.

[30] Nigam K, Lafferty J, McCallum A. Using Maximum Entropy for Text Classification. IJCAI-99 Workshop on Machine Learning for Information Filtering, Stockholm, Sweden; 1999. p. 61-7.
