DLID: Deep Learning for Domain Adaptation by Interpolating between Domains

Sumit Chopra [email protected]
Suhrid Balakrishnan [email protected]
Raghuraman Gopalan [email protected]

AT&T Labs Research, Building 103, Florham Park, NJ 07302

Abstract

In many real world applications of machine learning, the distribution of the training data (on which the machine learning model is trained) is different from the distribution of the test data (where the learnt model is actually deployed). This is known as the problem of Domain Adaptation. We propose a novel deep learning model for domain adaptation which attempts to learn a predictively useful representation of the data by taking into account information from the distribution shift between the training and test data. Our key proposal is to successively learn multiple intermediate representations along an "interpolating path" between the train and test domains. Our experiments on a standard object recognition dataset show a significant performance improvement over the state-of-the-art.

1. Problem Motivation and Context

Oftentimes in machine learning applications, we have to learn a model to accomplish a specific task using training data drawn from one distribution (the source domain), and deploy the learnt model on test data drawn from a different distribution (the target domain). For instance, consider the task of creating a mobile phone application for "image search for products", where the goal is to look up product specifications and comparative shopping options from the internet, given a picture of the product taken with a user's mobile phone. In this case, the underlying object recognizer will typically be trained on a labeled corpus of images (perhaps scraped from the internet), and tested on images taken using the user's phone camera. The challenge here is that the distribution of training and test images is not the same.

Appeared in the proceedings of the ICML 2013 Workshop on Representation Learning, Atlanta, Georgia, USA, 2013.

A naively trained object recognizer, that is, one just trained on the training images and applied directly to the test images, cannot be expected to have good performance. Such issues of mismatched train and test sets occur not only in the field of Computer Vision (Duan et al., 2009; Jain & Learned-Miller, 2011; Wang & Wang, 2011), but also in Natural Language Processing (Blitzer et al., 2006; 2007; Glorot et al., 2011) and Automatic Speech Recognition (Leggetter & Woodland, 1995).

The problem of differing train and test data distributions is referred to as Domain Adaptation (Daume & Marcu, 2006; Daume, 2007). Two variations of this problem are commonly discussed in the literature. In the first variation, known as Unsupervised Domain Adaptation, no target domain labels are provided during training; one only has access to source domain labels. In the second version of the problem, called Semi-Supervised Domain Adaptation, besides access to source domain labels, we additionally assume access to a few target domain labels during training¹.

¹ Often, the number of such labelled target samples is not sufficient to train a robust model using target data alone.

Previous approaches to domain adaptation can broadly be classified into a few main groups. One line of research starts out assuming the input representations are fixed (the features given are not learnable) and seeks to address domain shift by modeling the source/target distributional difference via transformations of the given representation. These transformations lead to a different distance metric which can be used in the domain adaptation classification/regression task. This is the approach taken, for instance, in (Saenko et al., 2010) and the recent linear manifold papers of (Gopalan et al., 2011; Gong et al., 2012). Another set of approaches in this fixed representation view of the problem treats domain adaptation as conventional semi-supervised learning (Bergamo & Torresani, 2010; Dai et al., 2007; Yang et al., 2007; Duan et al., 2012). These works essentially construct a classifier using the labeled source data, and impose structural constraints on the classifier using unlabeled target data.

A second line of research focuses on directly learning a representation of the inputs that is somewhat invariant across domains. Various models have been proposed (Daume, 2007; Daume et al., 2010; Blitzer et al., 2006; 2007; Pan et al., 2009), including deep learning models (Glorot et al., 2011).

There are issues with both kinds of previous proposals. In the fixed representation camp, the type of projection or structural constraint imposed often severely limits the capacity/strength of the representations (linear projections, for example, are common). In the representation learning camp, existing deep models do not attempt to explicitly encode the distributional shift between the source and target domains.

In this paper we propose a novel deep learning model for the problem of domain adaptation which combines ideas from both of the previous approaches. We call our model DLID: Deep Learning for domain adaptation by Interpolating between Domains. By operating in the deep learning paradigm, we learn hierarchical non-linear representations of the source and target inputs. However, we explicitly define and use an "interpolating path" between the source and target domains while learning the representation. This interpolating path captures information about structures intermediate to the source and target domains. The resulting representation is rich (containing source-to-target path information) and allows us to handle the domain adaptation task extremely well.

There are multiple benefits to our approach compared to those proposed in the literature. First, we are able to train intricate non-linear representations of the input, while explicitly modeling the transformation between the source and target domains. Second, instead of learning a representation which is independent of the final task, our model can learn representations informed by the final classification/regression task. This is achieved by fine-tuning the pre-trained intermediate feature extractors using feedback from the final task. Finally, our approach can gracefully handle additional training data becoming available in the future: we would simply fine-tune our model with the new data, as opposed to having to retrain the entire model from scratch.

We evaluate our model on the domain adaptation problem of object recognition on a standard dataset (Saenko et al., 2010). Empirical results show that our model outperforms the state of the art by a significant margin; in some cases there is an improvement of over 40% over the best previously reported results. An analysis of the learnt representations sheds some light onto the properties that result in such excellent performance (Ben-David et al., 2007).

2. An Overview of DLID

At a high level, the DLID model is a deep neural network model designed specifically for the problem of domain adaptation. Deep networks have had tremendous success recently, achieving state-of-the-art performance on a number of machine learning tasks (Bengio, 2009). In large part, their success can be attributed to their ability to learn extremely powerful hierarchical non-linear representations of the inputs. In particular, breakthroughs in unsupervised pre-training (Bengio et al., 2006; Hinton et al., 2006; Hinton & Salakhutdinov, 2006; Ranzato et al., 2006) have been critical in enabling deep networks to be trained robustly.

As with other deep neural network models, DLID also learns its representation using unsupervised pre-training. The key difference is that in the DLID model, we explicitly capture information from an "interpolating path" between the source domain and the target domain. As mentioned in the introduction, our interpolating path is motivated by the ideas discussed in Gopalan et al. (2011); Gong et al. (2012).

In these works, the original high dimensional features are linearly projected (typically via PCA/PLS) to a lower dimensional space. Because these are linear projections, the source and target lower dimensional subspaces lie on the Grassmann manifold. Geometric properties of the manifold, like shortest paths (geodesics), present an interesting and principled way to transition/interpolate smoothly between the source and target subspaces. It is this path information on the manifold that is used by Gopalan et al. (2011); Gong et al. (2012) to construct more robust and accurate classifiers for the domain adaptation task. In DLID, we define a somewhat different notion of an interpolating path between source and target domains, but appeal to a similar intuition.

Figure 1 shows an illustration of our model. Let the set of data samples for the source domain S be denoted by D_S, and that of the target domain T by D_T. Starting with all the source data samples D_S, we generate intermediate sampled datasets, where for each successive dataset we gradually increase the proportion of samples randomly drawn from D_T and decrease the proportion of samples drawn from D_S. In particular, let p ∈ {1, …, P} be an index over the P datasets we generate. Then we have D_p = D_S for p = 1 and D_p = D_T for p = P. For p ∈ {2, …, P − 1}, datasets D_p and D_{p+1} are created so that the proportion of samples from D_T in D_p is less than in D_{p+1}. Each of these datasets can be thought of as a single point on a particular kind of interpolating path between S and T.
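Concretely, the sampling could look like the following sketch (Python; the helper name and the linear mixing schedule are our illustrative choices, since the text above only requires that the target proportion increase along the path):

```python
# A sketch of sampling the intermediate datasets D_1, ..., D_P.
# D_S and D_T are lists (or arrays) of samples; assumes P >= 2.
import numpy as np

def interpolating_datasets(D_S, D_T, P, seed=0):
    rng = np.random.default_rng(seed)
    datasets = []
    for p in range(1, P + 1):
        t = (p - 1) / (P - 1)                  # position along the path
        n_s = int(round((1 - t) * len(D_S)))   # samples kept from D_S
        n_t = int(round(t * len(D_T)))         # samples kept from D_T
        src = rng.choice(len(D_S), size=n_s, replace=False)
        tgt = rng.choice(len(D_T), size=n_t, replace=False)
        datasets.append([D_S[i] for i in src] + [D_T[j] for j in tgt])
    return datasets                            # D_1 = D_S, D_P = D_T
```

With P = 3 this reduces to a source-only, mixed, target-only sequence like the one used in the case study of Section 4.1.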

[Figure 1. Overview of the DLID model. (a) Schematic showing the interpolating path between the source domain S and target domain T, consisting of the intermediate domains D_1, …, D_4 that we create; each intermediate dataset is fed to an unsupervised trainer that learns the corresponding feature extractor F_{W_p}. (b) Once the feature extractors are trained, any input X^i is represented by concatenating the outputs Z^i_1, …, Z^i_4 of all the feature extractors; this representation is passed through the classifier/regressor.]

For each intermediate dataset D_p, we train a deep non-linear feature extractor F_{W_p}, with parameters W_p, in an unsupervised manner. Abstractly, the feature extractor F_{W_p} can be thought of as a parametric function that learns how to represent the inputs so that their most salient characteristics can be reconstructed from its outputs (imagine a non-linear denoising process). This provides a very powerful and flexible mechanism for representation learning, as we can impose structural constraints on the form of the feature extractors F_{W_p} and then learn their optimal form under these constraints. We use the Predictive Sparse Decomposition (PSD) algorithm (Kavukcuoglu et al., 2008) to train F_{W_p}. See Figure 1(a) for a schematic of this part of DLID. The details of the architecture of the F_{W_p} and PSD are discussed in the next section. Note that for any input sample X^i, each of the trained feature extractors generates a hierarchical non-linear representation Z^i_p = F_{W_p}(X^i), attuned to capturing salient information particular to the intermediate dataset D_p it is trained on. Once feature extractors corresponding to all points on the path are trained, any input sample is then represented by concatenating all of the outputs from the feature extractors together to create "path" features for the input. More precisely, the DLID path representation Z^i for an input sample X^i is given by

Z^i = [F_{W_1}(X^i) F_{W_2}(X^i) … F_{W_P}(X^i)] = [Z^i_1 Z^i_2 … Z^i_P].
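In code, this concatenation is straightforward; here is a minimal sketch, assuming each trained F_{W_p} is available as a callable that maps an input to a feature array:

```python
# Path representation: concatenate the outputs of all P trained
# feature extractors (each `f` below stands for one F_{W_p}).
import numpy as np

def path_features(x, extractors):
    # Z^i = [Z^i_1, Z^i_2, ..., Z^i_P]
    return np.concatenate([np.ravel(f(x)) for f in extractors])
```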

Our hope is that this path representation will be more effective at domain adaptation because it is constructed to capture information about incremental changes between the source and target domains.

After we create the path representation of the inputs, the last stage of DLID is the final classifier/regressor, which depends on the task at hand. This classifier/regressor is trained on the training pairs (Z^i, Y^i) generated from the source domain data D_S by minimizing an appropriate loss function (Figure 1(b)).

Note that the entire DLID model, from the intermediate feature extractors to the final classifier/regressor, is a deep neural network. This implies we can train all the parameters of the model together using standard gradient based techniques with the final loss (commonly referred to as fine-tuning). This ability to fine-tune in light of a final classification/regression task confers huge advantages on DLID over previous approaches: there is usually a significant performance advantage to fine-tuned features, as the final features are optimized to the task at hand, and new training data can be incorporated via fine-tuning.

Once the final classifier/regressor is fully trained, testing proceeds by forward propagating the input X^o through the DLID model to obtain the final class probabilities/regression estimates (the representation Z^o is implicit in this step).

The architecture of the feature extractors F_{W_p}, and the associated unsupervised algorithm for their (pre-)training, are crucial elements of the DLID model. The best choices of these elements are application domain and task dependent. For instance, when dealing with unstructured text, one might want to use a fully connected neural network trained using the de-noising auto-encoder regime of Glorot et al. (2011). For computer vision problems, convolutional neural networks seem a natural choice to capture the spatial correlation of images; pre-training of the convolutional networks could be done by a sparse coding algorithm (Ranzato et al., 2006; Kavukcuoglu et al., 2008). For speech applications, Time Delay Neural Networks (TDNN) might work best, since they can capture the temporal structure in an audio signal, and so on. In the next section we detail the architecture of the feature extractors we use for our object recognition case study.

3. Deep Representation Learning

We use convolutional networks for the feature extractors in our object recognition task (LeCun et al., 1998). Convolutional networks were one of the first deep networks that were reliably trainable, and they have had widespread success in a large number of computer vision applications. The architecture of our feature extractor is similar to the one proposed by Jarrett et al. (2009). Each stage of the hierarchical feature extractor is composed of four components cascaded together, namely, a Filter Bank Unit (FB), a Rectification Unit (R), a Contrast Normalization Unit (N), and a Boxcar Pooling Unit (P). Figure 2 illustrates this single stage cascade. Each component in the cascade has a different purpose; for their detailed description see Jarrett et al. (2009).

[Figure 2. A single stage feature extractor F^{S1}_{W_p}: FB − R − N − P, generated by cascading four basic units, namely Filter Bank, Rectification, Contrast Normalization, and Boxcar Pooling. A two stage feature extractor F^{S2}_{W_p} is built by cascading the outputs of one single stage feature extractor into another single stage feature extractor.]

We denote a single stage of the feature extractor by F^{S1}_{W_p}; it is composed of these four units cascaded together (F^{S1}_{W_p}: FB − R − N − P). The learnable parameters W_p of the feature extractor are the filter weights k_{ij} and associated gain coefficients g_j of the FB unit. We also use two-stage feature extractors F^{S2}_{W_p}, which we create by cascading two single stage feature extractors together. Thus, F^{S2}_{W_p}: FB1 − R − N − P − FB2 − R − N − P, with the filters and gain coefficients of both FB1 and FB2 being the learnable parameters. The hyper-parameters associated with the feature extractors, such as the size of the filters, the number of output feature maps generated by each stage, and the connections from input to output feature maps, are dependent on the dataset and should be chosen using cross validation. The values we use in our experiments are reported in Section 5.
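To make the cascade concrete, here is a minimal PyTorch sketch of one such stage (the module and parameter names are ours; rectification is taken to be the absolute value, as in Jarrett et al. (2009), and the contrast normalization unit is simplified to a plain divisive normalization over a local window):

```python
# A sketch of a single FB-R-N-P stage (simplified; see lead-in).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleStage(nn.Module):
    def __init__(self, in_maps, out_maps, filt, norm_win, pool_win, pool_step):
        super().__init__()
        self.fb = nn.Conv2d(in_maps, out_maps, filt)          # filter bank (FB), weights k_ij
        self.gain = nn.Parameter(torch.ones(out_maps))        # per-map gains g_j
        self.norm_win = norm_win
        self.pool = nn.AvgPool2d(pool_win, stride=pool_step)  # boxcar pooling (P)

    def forward(self, x):
        x = self.fb(x) * self.gain.view(1, -1, 1, 1)
        x = torch.abs(x)                                      # rectification (R)
        local = F.avg_pool2d(x, self.norm_win, stride=1,
                             padding=self.norm_win // 2)      # local means
        x = x / (local + 1e-6)                                # divisive normalization (N)
        return self.pool(x)
```

A two stage extractor is then simply the composition of two such modules.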

Having described the architecture of our convolutional network, we now describe the algorithm we use to train its parameters (the filter weights k_{ij} and the gain coefficients g_j of the FB units).

3.1. Unsupervised Pre-Training using PSD

As has become standard in training deep models, we train each layer of our network separately and in an unsupervised manner (using only unlabeled data). The unsupervised training algorithm we use per layer is called Predictive Sparse Decomposition (PSD) (Kavukcuoglu et al., 2008; Jarrett et al., 2009).
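For reference, the per-layer objective that PSD minimizes has, as we recall it from Kavukcuoglu et al. (2008), the following form (the weighting α and the exact encoder parameterization below should be checked against that paper):

min_{B, Z, W, D}  ||X − BZ||_2^2 + λ||Z||_1 + α||Z − F_W(X)||_2^2,   with  F_W(X) = D tanh(WX),

where B is a linear decoder (dictionary), Z is the sparse code of the input X, and F_W is a fast feed-forward encoder trained to predict the optimal code. After training, the encoder supplies the filter weights and gains of the FB unit.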

When training the single stage feature extractor F^{S1}_{W_p}, the input X to the PSD is an image patch equal to the size of the filters. When training the second stage of the two stage feature extractor F^{S2}_{W_p}, we first train the first stage filter bank FB1 and then use its outputs as the input to the second stage (this is the layer-wise unsupervised training we referred to earlier).

Also note that even though we train the filter bank on a collection of patches (of size equal to the size of the filters), the trained filters can be applied directly to the whole image to generate its feature representation. This is achieved by convolving the filters of the filter bank with the entire image.

4. DLID Case Study

Now that our DLID model is fully specified, in this section we describe the experiments we perform to gauge DLID's performance on an object recognition task across multiple domains.

We use the dataset provided in Saenko et al. (2010)², which consists of images of 31 object categories from three different domains. The categories consist of objects commonly used in a typical house/office environment, such as mugs, notebooks, (computer) monitors, etc. The three domains are Amazon (product images of the objects collected from the www.amazon.com website), Dslr (images of the objects taken with a Dslr camera), and Webcam (images of objects taken with a webcam). There are a total of 4110 images, with an average of 90 images per class for Amazon, 16 images per class for Dslr, and 26 images per class for Webcam. See Figure 3 for sample images from the three domains. The images highlight the difficulty of the object recognition task and the challenges of domain adaptation. Amazon images in particular are quite distinct from the other two domains: for the most part the Amazon objects are better lit, well framed, have little to no background clutter, and are far more visually appealing than images from the other two domains. This should come as no surprise, since the Amazon images are typically professionally taken photographs of various products for sales purposes. The images from Dslr and Webcam have inconsistent lighting, low resolution, and significantly more background clutter.

[Figure 3. Example images from the three domains (Amazon, Dslr, Webcam) for objects in the Monitor, Printer, and Headphones categories.]

² Available from the first author's website.

We minimally pre-process the images to create the input for DLID: 1) we convert the images from color to gray scale; 2) we re-scale all images to 90 × 90 pixels; 3) we subtract from each pixel the mean pixel value of that image and divide the result by the standard deviation of the pixel values (a form of image normalization); 4) we apply Contrast Normalization (N) to the resulting image, with a neighborhood window size of 9 × 9, which highlights the edges and texture in an image. After taking into account the border effects, these steps result in an image of size 82 × 82.
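In code, the pipeline above might look like the following sketch (PIL/numpy; the contrast normalization here uses a plain unweighted window where the original unit may use a weighted one, so outputs can differ slightly):

```python
# A sketch of the pre-processing: grayscale, resize, normalize, and
# local contrast normalization over a 9x9 window (valid region only,
# giving 90 - 9 + 1 = 82 pixels per side).
import numpy as np
from PIL import Image

def preprocess(path, size=90, norm_win=9):
    img = Image.open(path).convert('L').resize((size, size))   # steps 1-2
    img = np.asarray(img, dtype=np.float64)
    img = (img - img.mean()) / (img.std() + 1e-8)              # step 3
    h = norm_win // 2                                          # step 4
    out = np.empty((size - 2 * h, size - 2 * h))
    for i in range(h, size - h):
        for j in range(h, size - h):
            win = img[i - h:i + h + 1, j - h:j + h + 1]
            out[i - h, j - h] = (img[i, j] - win.mean()) / (win.std() + 1e-8)
    return out                                                 # 82 x 82
```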

4.1. Unsupervised Pre-training

We now walk through the unsupervised pre-training procedure for DLID. For clarity, we consider a running example with Amazon as the source domain (S) and Webcam as the target domain (T) for the remainder of this section. Starting with the 82 × 82 pre-processed images from Amazon and Webcam, we created three datasets D_1, D_2, and D_3. Dataset D_1 consisted of only the Amazon images, D_2 consisted of Amazon and Webcam images, and D_3 consisted of only Webcam images. In other words, we considered just one intermediate dataset on the interpolating path between S and T. For each of these datasets D_i, i ∈ {1, 2, 3}, we trained one single stage feature extractor F^{S1}_{W_i} and one two stage feature extractor F^{S2}_{W_i}.

Single Stage F^{S1}_{W_i}: There are a total of 64 filters of size 7 × 7 in the FB unit. Each filter is connected to the single input feature map (a patch of a grayscale pre-processed image), generating 64 output feature maps. The size of the neighborhood window of N (Contrast Normalization Unit) is set to 7 × 7. The window size of P (Boxcar Pooling Unit) is 8 × 8, with a step size of 4. The filters of each F^{S1}_{W_i} are trained using the PSD algorithm with the data D_i. The training inputs X^i to the algorithm are a collection of 7 × 7 patches sampled at random from various locations in the preprocessed images. The PSD sparsity coefficient is λ = 1.

Two Stage F^{S2}_{W_i}: The two stage feature extractor is composed of two layers cascaded together, with the first layer being identical to F^{S1}_{W_i}. The output of the first layer is passed as input to the second layer. The FB unit of the second layer uses filters of size 5 × 5. This unit takes as input a set of 64 feature maps and generates 256 output feature maps. Each output feature map is connected to 8 randomly chosen input feature maps, resulting in a total of 2048 (= 256 × 8) filters in the FB unit. The neighborhood window of N is 5 × 5, and of P is 4 × 4 with a step size of 2. As mentioned earlier, the training of two stage feature extractors is performed layer-wise. For the first stage, F^{S1}_{W_i}, we use the single stage training procedure described in the previous paragraph. For the second stage, we forward propagate the entire training dataset D_i through the pre-trained first layer feature extractor to generate input features for the second layer. With these inputs, training for the second stage follows the same procedure as that for the first layer. The only difference lies in the inputs used: each is now a patch with 64 feature maps obtained from the outputs generated by the first layer. The PSD sparsity coefficient is λ = 1.
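Under the simplified module sketched in Section 3, the two stages above could be instantiated as follows (one further simplification: full 64-to-256 connectivity in the second stage, whereas the random 8-of-64 connection table described above would require a custom or grouped convolution):

```python
import torch

# Instantiating the two stages with the stated hyper-parameters.
stage1 = SingleStage(in_maps=1,  out_maps=64,  filt=7,
                     norm_win=7, pool_win=8, pool_step=4)
stage2 = SingleStage(in_maps=64, out_maps=256, filt=5,
                     norm_win=5, pool_win=4, pool_step=2)

x = torch.randn(1, 1, 82, 82)   # one pre-processed image
z1 = stage1(x)                  # F^{S1} features, shape (1, 64, 18, 18)
z2 = stage2(z1)                 # F^{S2} features, shape (1, 256, 6, 6)
```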

We use a multinomial logistic regression classifier for the final object recognition task. The inputs to the classifier are the path representations of the images (the Z^i's, which concatenate the three feature extractor outputs when applied to the input images). The output is a class probability vector of size 31.

4.2. Fine-Tuning of Feature Extractor

As we mentioned before, since DLID is a deep neural network, we have the ability to fine-tune all the parameters (feature extractor as well as classifier parameters) with respect to the task (multi-class classification in our present case). We perform such fine-tuning by back-propagating gradients from the classifier all the way to the feature extractors. This is a significant advantage of DLID over the previously proposed solutions, since we are now able to learn intermediate representations of the input which by design work well with the task at hand.
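A minimal sketch of such a fine-tuning loop (PyTorch; `extractors`, `classifier`, the optimizer, and the learning rate are our placeholders):

```python
# Back-propagate the classification loss through the classifier and
# into the pre-trained feature extractors (nothing is frozen).
import torch
import torch.nn as nn

def fine_tune(extractors, classifier, loader, epochs=10, lr=1e-3):
    params = list(classifier.parameters())
    for f in extractors:
        params += list(f.parameters())
    opt = torch.optim.SGD(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()           # multinomial logistic loss
    for _ in range(epochs):
        for x, y in loader:                   # labeled training pairs
            z = torch.cat([f(x).flatten(1) for f in extractors], dim=1)
            loss = loss_fn(classifier(z), y)  # path features -> 31 classes
            opt.zero_grad()
            loss.backward()                   # gradients reach the extractors
            opt.step()
```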

5. Empirical Results

We depart slightly from the standard methodology for evaluating models on the Saenko et al. (2010) dataset. Instead of randomly sampling from the full dataset to generate 20 smaller train/test folds, we use the complete labeled source dataset as training data in the Unsupervised problem (the complete target dataset is the test data). We made this decision primarily to have more training data per domain adaptation task, and also for the convenience of having to train on only one dataset. For the Semi-supervised setting, we follow the protocol in Saenko et al. (2010) to select a small number of labeled target domain images in addition to the labeled source images for training. The rest of the target samples serve as test data.

To place our results in the context of the literature, we train a baseline model and a state-of-the-art model on our dataset. The state-of-the-art model is the Geodesic Flow Kernel method (GFK) (Gong et al., 2012), trained using the code provided by the authors.

For each of the three pairs of domains, Tables 1 and 2 compare the object recognition accuracy on the target domain for the Unsupervised and Semi-Supervised domain adaptation settings, respectively.


Table 1. Recognition accuracies on target domains in the unsupervised domain adaptation setting. A: Amazon, W: Webcam, D: Dslr.

Models                A → W    D → W    W → D
Baseline Saenko       21.76    55.40    60.84
GFK (PCA, PCA)        10.94    42.52    45.18
GFK (PLS, PCA)        12.20    39.50    43.78
Deep Feature Extraction
DL-S1                 16.98    58.74    71.08
DL-S2                 22.40    64.65    78.31
DL-S1-FT              18.11    62.14    76.71
DL-S2-FT              23.65    66.42    82.53
Deep Intermediate Feature Extraction
DLID-S1               17.10    60.00    73.89
DLID-S2               23.27    67.54    80.21
DLID-S1-FT            18.36    63.52    77.71
DLID-S2-FT            26.13    68.93    84.94

The methods we report results on are:

Baseline: We have a very simple, yet strong, baseline model. Starting with the features generated by Saenko et al. (2010) (SIFT-based features), we train a multinomial logistic regression classifier on the source domain data and test it on the target domain. This model is denoted by "Baseline Saenko" in the result tables.

GFK: The Geodesic Flow Kernel method (GFK) (Gong et al., 2012), using the code provided by the authors. We train both GFK (PCA, PCA) and GFK (PLS, PCA) (see their paper for details) on our dataset and report the results. GFK is understood to be the state-of-the-art.

DL: In order to quantify the benefit of intermediate representation learning, we also train a single stage and a two stage feature extractor (Section 3) using the inputs from all three domains, ignoring the notion of the interpolating path between the domains. The underlying philosophy behind this model is exactly the same as in Glorot et al. (2011); the only difference is in the form of the parametric feature extractor and the pre-training algorithm. The single stage and two stage versions of this model are denoted by DL-S1 and DL-S2, and their fine-tuned counterparts by DL-S1-FT and DL-S2-FT.

DLID: DLID-S1 refers to the DLID model with single stage feature extractors F^{S1}_{W_p}, pre-trained using the PSD algorithm (see Section 4). The same model with fine-tuned filters is denoted by DLID-S1-FT. Similarly, we denote the two stage feature extractor version of this model by DLID-S2, and its fine-tuned counterpart by DLID-S2-FT.

Table 2. Recognition accuracies on target domains in the semi-supervised domain adaptation setting.

Models                A → W    D → W    W → D
Baseline Saenko       34.93    67.92    70.62
GFK (PCA, PCA)        34.90    59.12    54.32
GFK (PLS, PCA)        34.33    57.69    55.06
Deep Feature Extraction
DL-S1                 38.90    65.38    76.54
DL-S2                 42.31    69.94    80.49
DL-S1-FT              43.87    70.10    83.21
DL-S2-FT              44.87    75.21    84.94
Deep Intermediate Feature Extraction
DLID-S1               39.74    65.95    78.52
DLID-S2               42.74    74.93    87.16
DLID-S1-FT            46.16    71.23    82.71
DLID-S2-FT            51.85    78.21    89.88

We make a number of observations based on our results. First, our DLID models perform significantly better than the baseline and the current state-of-the-art in both the unsupervised and semi-supervised settings. Second, the baseline using Saenko's features is quite strong; in fact, it gives better results than both versions of the GFK algorithm trained on the same dataset (note that GFK was trained using the code provided by the authors). Third, the single layer models perform worse than their two layer counterparts. This is in agreement with the literature and provides further justification for learning deep hierarchical representations instead of "shallow" representations. Finally, also in accordance with the existing deep learning literature, fine-tuning improves performance across the board. Thus the best performing model by far is DLID-S2-FT, which shows improvements over the baseline of up to 40% in some cases.

6. Analysis of Representations

So why does DLID work? We explore the reasons by performing an analysis of our learnt representations in the theoretical framework of Ben-David et al. (2007). In their paper, the authors bound the generalization error of classifying the target samples (in the Unsupervised setting) with five terms (Theorem 2 in Ben-David et al. (2007)). Of the five terms, two asymptote to zero as the number of source examples grows, and another term simply encodes the difficulty of the domain shift. Thus the only two terms in the bound which depend on the representation of the examples are: 1. the source generalization error (SGE), and 2. the A-distance between the source and target distributions. Intuitively, the error bound implies that a representation will have better predictive performance on the target domain if it satisfies two criteria simultaneously. First, the representation should enable high prediction accuracy on the source domain data (SGE is small). Second, the representation should make it difficult to distinguish between examples from the source and target distributions (the A-distance between the two domains is small).

We conjecture that the excellent performance of the DLID model is attributable to two key properties. First, by learning/fine-tuning the feature extractors we obtain predictively good representations, and thus improve the SGE term in the error bound. Second, by explicitly including the intermediate domains, the learnt representations are able to capture structures/concepts that contain information from the source, target, and intermediate domains. This results in a rich representation which in some manner bridges the gap between the source and target distributions, hence decreasing the A-distance between the two domains.

We examined both SGE and the A-distance for the unsupervised domain adaptation, non-fine-tuned results in Section 5. However, since the A-distance is NP-hard to approximate for arbitrary distributions, along the lines of Ben-David et al. (2007) we evaluate a proxy for it, called the proxy A-distance (PAD). PAD is estimated by training a classifier to distinguish between samples from the source and the target domain (Ben-David et al., 2007; Glorot et al., 2011). Concretely, PAD is defined as PAD = 2(1 − 2ε(S, T)), where ε(S, T) is the generalization error of the binary classifier separating the source and target examples. Glorot et al. (2011) have shown that the standard application of deep learning for feature extraction results in decreased SGE and increased PAD. Our experiments thus focus on how learnt intermediate representations fare in terms of these quantities.
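A sketch of the PAD computation (we use a linear SVM as the source-vs-target classifier and a 50/50 train/test split; these particular choices are ours, not necessarily those behind Table 3):

```python
# PAD = 2 (1 - 2 eps), where eps is the test error of a classifier
# trained to separate source representations from target ones.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def proxy_a_distance(Z_source, Z_target, seed=0):
    X = np.vstack([Z_source, Z_target])
    y = np.concatenate([np.zeros(len(Z_source)), np.ones(len(Z_target))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)
    eps = 1.0 - LinearSVC(dual=False).fit(X_tr, y_tr).score(X_te, y_te)
    return 2.0 * (1.0 - 2.0 * eps)
```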

Table 3. How representation affects the generalization error bound terms. A: Amazon, W: Webcam, D: Dslr. DL: deep learning model, DLID: intermediate deep learning model, S1: single stage model, S2: two stage model.

                      DL               DLID
Task     Repr.    SGE     PAD      SGE     PAD
A → W    S1       0.403   1.88     0.403   1.87
A → W    S2       0.355   2.00     0.336   1.79
D → W    S1       0.419   0.74     0.366   0.78
D → W    S2       0.380   0.71     0.303   0.78
W → D    S1       0.373   0.78     0.308   0.86
W → D    S2       0.289   0.70     0.265   0.85

The results are shown in Table 3; they lend support to our conjecture. According to the error bound, dominating points of pairs in the table (where both SGE and PAD for one model are lower than the corresponding SGE and PAD for another model) are expected to have better generalization performance. In every case where there is a dominating condition (for instance, in the case of DLID-S2 vs. DL-S2 for A → W), our results in the previous section agree. It is harder to argue for non-dominating points, but nonetheless several interesting observations can be made. First, in all cases DLID improves SGE over DL. This is important because, as shown in the experimental results in (Ben-David et al., 2007), when there is no dominating condition, SGE appears to be the more critical term in determining target generalization error. We attribute credit for this to the intermediate representations, in that including them helps define predictively useful structures in the representation. Second, in A → W, it appears DLID also improves PAD relative to DL. As stated previously, we believe this is due to creating a representation that contains information from both the source and target domains. We also note that this table corroborates the need to go deep: results from S2 are expected to be far superior to S1, which is borne out in our experiments.

[Figure 4. Schematic illustrating the quality of the learnt representations for the D → W domain pair. Shown are 2D projections of the learned representations from the DL (left box) and DLID (right box) models, of all the letter tray (black circles) and mobile phone (red diamonds) images from the target domain (Webcam).]

Finally, we also visually inspect the representations learnt by the DL and DLID models, using the t-SNE algorithm (van der Maaten & Hinton, 2008) to project them onto a 2D space. Figure 4 shows the result of projecting all the test/target domain images for just two (out of the 31) classes on the D → W adaptation task. In this case, the DLID representations appear to be far better suited for classification.

7. Conclusion and Future Work

We present a model for domain adaptation in which we learn the representation of the examples while trying to take domain shift into account. Our proposal is a contribution to deep learning methodology for domain adaptation which takes into account data distributions that are interpolated between the source and target data distributions. Our experiments on object recognition show excellent results, and we shine light on the inner workings of our proposal via an analysis in Ben-David et al. (2007)'s theoretical framework.


We envision many extensions to our work. One avenue would be to design a model that trains more directly with not just unsupervised pre-training, but perhaps also the proxy A-distance or a surrogate of it. Other application domains like NLP or speech seem ripe for exploration.

References

Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems (NIPS), 2007.

Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems (NIPS), 2006.

Bergamo, A. and Torresani, L. Exploiting weakly-labeled web images to improve object classification: A domain adaptation approach. Advances in Neural Information Processing Systems (NIPS), pp. 181–189, 2010.

Blitzer, J., McDonald, R., and Pereira, F. Domain adaptation with structural correspondence learning. Empirical Methods in Natural Language Processing (EMNLP), 2006.

Blitzer, J., Dredze, M., and Pereira, F. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. Association for Computational Linguistics (ACL), 2007.

Dai, W., Xue, G.R., Yang, Q., and Yu, Y. Transferring naive Bayes classifiers for text classification. Association for the Advancement of Artificial Intelligence (AAAI), 2007.

Daume, H. Frustratingly easy domain adaptation. Association for Computational Linguistics (ACL), 2007.

Daume, H. and Marcu, D. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research (JAIR), 26(1):101–126, 2006.

Daume, H., Kumar, A., and Saha, A. Co-regularization based semi-supervised domain adaptation. Advances in Neural Information Processing Systems (NIPS), 2010.

Duan, L., Tsang, I., Xu, D., and Maybank, S. Domain transfer SVM for video concept detection. Computer Vision and Pattern Recognition (CVPR), pp. 1375–1381, 2009.

Duan, L., Tsang, I., and Xu, D. Domain transfer multiple kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 34(3):465–479, 2012.

Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. International Conference on Machine Learning (ICML), pp. 97–110, 2011.

Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. Computer Vision and Pattern Recognition (CVPR), 2012.

Gopalan, R., Li, R., and Chellappa, R. Domain adaptation for object recognition: An unsupervised approach. International Conference on Computer Vision (ICCV), 2011.

Hinton, G.E. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Hinton, G.E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

Jain, V. and Learned-Miller, E. Online domain adaptation of a pre-trained cascade of classifiers. Computer Vision and Pattern Recognition (CVPR), pp. 577–584, 2011.

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? International Conference on Computer Vision (ICCV), 2009.

Kavukcuoglu, K., Ranzato, M., and LeCun, Y. Fast inference in sparse coding algorithms with applications to object recognition. Technical report, CBLL, Courant Institute, NYU, 2008.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Leggetter, C. and Woodland, P. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9(2):171–185, 1995.

Olshausen, B.A. and Field, D.J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.

Pan, S.J., Tsang, I.W., Kwok, J.T., and Yang, Q. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 99:1–12, 2009.

Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. Efficient learning of sparse representations with an energy-based model. Advances in Neural Information Processing Systems (NIPS), 2006.

Saenko, K., Kulis, B., Fritz, M., and Darrell, T. Adapting visual category models to new domains. European Conference on Computer Vision (ECCV), 2010.

van der Maaten, L.J.P. and Hinton, G.E. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

Wang, M. and Wang, X. Automatic adaptation of a generic pedestrian detector to a specific traffic scene. Computer Vision and Pattern Recognition (CVPR), pp. 3401–3408, 2011.

Yang, J., Yan, R., and Hauptmann, A. Cross-domain video concept detection using adaptive SVMs. ACM Multimedia (ACM MM), 2007.