Review Articles
Communications of the ACM 53, 12 (December 2010)
DOI: 10.1145/1859204.1859227

What are Bayesian networks and why are their applications growing across all fields?

BY ADNAN DARWICHE

Bayesian Networks

Bayesian networks have been receiving considerable attention over the last few decades from scientists and engineers across a number of fields, including computer science, cognitive science, statistics, and philosophy. In computer science, the development of Bayesian networks was driven by research in artificial intelligence, which aimed at producing a practical framework for commonsense reasoning.29 Statisticians have also contributed to the development of Bayesian networks, where they are studied under the broader umbrella of probabilistic graphical models.5,11

Interestingly enough, a number of other more specialized fields, such as genetic linkage analysis, speech recognition, information theory, and reliability analysis, have developed representations that can be thought of as concrete instantiations or restricted cases of Bayesian networks. For example, pedigrees and their associated phenotype/genotype information, reliability block diagrams, and hidden Markov models (used in many fields including speech recognition and bioinformatics) can all be viewed as Bayesian networks. Canonical instances of Bayesian networks also exist and have been used to solve standard problems that span domains such as computer vision, the Web, and medical diagnosis.

So what are Bayesian networks, and why are they widely used, either directly or indirectly, across so many fields and application areas? Intuitively, Bayesian networks provide a systematic and localized method for structuring probabilistic information about a situation into a coherent whole. They also provide a suite of algorithms that allow one to automatically derive many implications of this information, which can form the basis for important conclusions and decisions about the corresponding situation (for example, computing the overall reliability of a system, finding the most likely message that was sent across a noisy channel, identifying the most likely users that would respond to an ad, restoring a noisy image, and mapping genes onto a chromosome, among others). Technically speaking, a Bayesian network is a compact representation of a probability distribution that is usually too large to be handled using traditional specifications from probability and statistics, such as tables and equations. For example, Bayesian networks with thousands of variables have been constructed and reasoned about successfully, allowing one to efficiently represent and reason about probability distributions whose size is exponential in the number of variables (for example, in genetic linkage analysis,12 low-level vision,34 and networks synthesized from relational models4).

Key Insights

• Bayesian networks provide a systematic and localized method for structuring probabilistic information about a situation into a coherent whole, and are supported by a suite of inference algorithms.

• Bayesian networks have been established as a ubiquitous tool for modeling and reasoning under uncertainty.

• Many applications can be reduced to Bayesian network inference, allowing one to capitalize on Bayesian network algorithms instead of having to invent specialized algorithms for each new application.



For a concrete feel of Bayesian networks, Figure 1 depicts a small network over six binary variables. Every Bayesian network has two components: a directed acyclic graph (called a structure), and a set of conditional probability tables (CPTs). The nodes of a structure correspond to the variables of interest, and its edges have a formal interpretation in terms of probabilistic independence. We will discuss this interpretation later, but suffice it to say here that in many practical applications, one can often interpret network edges as signifying direct causal influences. A Bayesian network must include a CPT for each variable, which quantifies the relationship between that variable and its parents in the network. For example, the CPT for variable A specifies the conditional probability distribution of A given its parents F and T. According to this CPT, the probability of A = true given F = true and T = false is Pr(A = true | F = true, T = false) = .9900 and is called a network parameter.ᵃ

A main feature of Bayesian networks is their guaranteed consistency and completeness, as there is one and only one probability distribution that satisfies the constraints of a Bayesian network. For example, the network in Figure 1 induces a unique probability distribution over the 64 instantiations of its variables. This distribution provides enough information to attribute a probability to every event that can be expressed using the variables appearing in this network, for example, the probability of alarm tampering given no smoke and a report of people leaving the building.
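To make this semantics concrete, the following is a minimal, self-contained Python sketch that represents the structure and CPTs of the Figure 1 network and answers the query just mentioned by brute-force enumeration over the 64 instantiations. The CPT entries for S and A come from Figure 1; the priors on F and T and the CPTs for L and R are not given in the article, so the values marked ASSUMED are illustrative placeholders.

```python
from itertools import product

# Structure of the network in Figure 1: each variable is listed with its parents.
# Edges follow the text: F -> S, F -> A, T -> A, A -> L, L -> R.
parents = {
    "T": [], "F": [], "S": ["F"], "A": ["F", "T"], "L": ["A"], "R": ["L"],
}

# CPTs: map (parent values...) -> Pr(variable = true | parents); the false case
# is the complement.  Values marked ASSUMED are placeholders, not from the article.
cpt_true = {
    "T": {(): 0.02},                       # ASSUMED prior on Tampering
    "F": {(): 0.01},                       # ASSUMED prior on Fire
    "S": {(True,): 0.90, (False,): 0.01},  # from Figure 1
    "A": {(True, True): 0.5000, (True, False): 0.9900,   # from Figure 1
          (False, True): 0.8500, (False, False): 0.0001},
    "L": {(True,): 0.88, (False,): 0.001}, # ASSUMED
    "R": {(True,): 0.75, (False,): 0.01},  # ASSUMED
}

order = ["T", "F", "S", "A", "L", "R"]

def joint(assignment):
    """Chain rule: Pr(x1..x6) = product over variables of Pr(x | parents(x))."""
    p = 1.0
    for var in order:
        pa_vals = tuple(assignment[pa] for pa in parents[var])
        p_true = cpt_true[var][pa_vals]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

# The 64 instantiations of the six binary variables define a unique distribution.
worlds = [dict(zip(order, vals)) for vals in product([True, False], repeat=6)]
assert abs(sum(joint(w) for w in worlds) - 1.0) < 1e-9

def probability(event, evidence):
    """Pr(event | evidence) by summing the joint over consistent worlds."""
    def holds(w, cond):
        return all(w[v] == val for v, val in cond.items())
    p_evidence = sum(joint(w) for w in worlds if holds(w, evidence))
    p_both = sum(joint(w) for w in worlds
                 if holds(w, evidence) and holds(w, event))
    return p_both / p_evidence

# Probability of tampering given no smoke and a report of people leaving.
print(probability({"T": True}, {"S": False, "R": True}))
```

Enumerating all instantiations in this way is practical only for tiny networks; the algorithms discussed next avoid it.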

Another feature of Bayesian networks is the existence of efficient algorithms for computing such probabilities without having to explicitly generate the underlying probability distribution (which would be computationally infeasible for many interesting networks).

ᵃ Bayesian networks may contain continuous variables, yet our discussion here is restricted to the discrete case.


These algorithms, to be discussed in detail later, apply to any Bayesian network, regardless of its topology. Yet the efficiency of these algorithms, and their accuracy in the case of approximation algorithms, may be quite sensitive to this topology and the specific query at hand. Interestingly enough, in domains such as genetics, reliability analysis, and information theory, scientists have developed algorithms that are indeed subsumed by the more general algorithms for Bayesian networks. In fact, one of the main objectives of this article is to raise awareness about these connections. The more general objective, however, is to provide an accessible introduction to Bayesian networks, which allows scientists and engineers to more easily identify problems that can be reduced to Bayesian network inference, putting them in a position where they can capitalize on the vast progress that has been made in this area over the last few decades.

Causality and Independence

We will start by unveiling the central insight behind Bayesian networks that allows them to compactly represent very large distributions. Consider Figure 1 and the associated CPTs. Each probability that appears in one of these CPTs does specify a constraint that must be satisfied by the distribution induced by the network. For example, the distribution must assign the probability .01 to having smoke without fire, Pr(S = true | F = false), since this is specified by the CPT of variable S. These constraints, however, are not sufficient to pin down a unique probability distribution. So what additional information is being appealed to here?

The answer lies in the structure of a Bayesian network, which specifies additional constraints in the form of probabilistic conditional independencies. In particular, every variable in the structure is assumed to become independent of its non-descendants once its parents are known. In Figure 1, variable L is assumed to become independent of its non-descendants T, F, S once its parent A is known. In other words, once the value of variable A is known, the probability distribution of variable L will no longer change due to new information about variables T, F, and S. Another example from Figure 1: variable A is assumed to become independent of its non-descendant S once its parents F and T are known. These independence constraints are known as the Markovian assumptions of a Bayesian network. Together with the numerical constraints specified by CPTs, they are satisfied by exactly one probability distribution.

Does this mean that every time a Bayesian network is constructed, one must verify the conditional independencies asserted by its structure? This really depends on the construction method. I will discuss three main methods in the section entitled "How Are Bayesian Networks Constructed?" that include subjective construction, synthesis from other specifications, and learning from data.

The first method is the least systematic, but even in that case, one rarely thinks about conditional independence when constructing networks. Instead, one thinks about causality, adding the edge X → Y whenever X is perceived to be a direct cause of Y. This leads to a causal structure in which the Markovian assumptions read: each variable becomes independent of its non-effects once its direct causes are known. The ubiquity of Bayesian networks stems from the fact that people are quite good at identifying direct causes from a given set of variables, and at deciding whether the set of variables contains all of the relevant direct causes. This ability is all that one needs for constructing a causal structure.

The distribution induced by a Bayesian network typically satisfies additional independencies, beyond the Markovian ones discussed above. Moreover, all such independencies can be identified efficiently using a graphical test known as d-separation.29

Figure 1. A Bayesian network with some of its conditional probability tables (CPTs). The network contains the variables Tampering (T), Fire (F), Smoke (S), Alarm (A), Leaving (L), and Report (R). Two of its CPTs are:

  Fire   Smoke   θs|f
  true   true    .90
  false  true    .01

  Fire   Tampering   Alarm   θa|f,t
  true   true        true    .5000
  true   false       true    .9900
  false  true        true    .8500
  false  false       true    .0001

Figure 2. A hidden Markov model (HMM) and its corresponding dynamic Bayesian network (DBN). (a) The HMM state diagram has states a, b, c and outputs x, y, z; for example, state b transitions to state c with probability .30 and emits output z with probability .15. (b) The DBN has state variables S1, S2, S3, ..., Sn with corresponding output variables O1, O2, O3, ..., On.


According to this test, variables X and Y are guaranteed to be independent given variables Z if every path between X and Y is blocked by Z. Intuitively, a path is blocked when it cannot be used to justify a dependence between X and Y in light of our knowledge of Z. For an example, consider the path S ← F → A ← T in Figure 1 and suppose we know the alarm has triggered (that is, we know the value of variable A). This path can then be used to establish a dependence between variables S and T as follows. First, observing smoke increases the likelihood of fire, since fire is a direct cause of smoke according to this path. Moreover, the increased likelihood of fire explains away tampering as a cause of the alarm, leading to a decrease in the probability of tampering (fire and tampering are two competing causes of the alarm according to this path). Hence, the path can be used to establish a dependence between S and T in this case. Variables S and T are therefore not independent given A, due to the presence of this unblocked path. One can verify, however, that this path cannot be used to establish a dependence between S and T in case we know the value of variable F instead of A. Hence, the path is blocked by F.

Even though we appealed to the notion of causality when describing the d-separation test, one can phrase and prove the test without any appeal to causality; we only need the Markovian assumptions. The full d-separation test gives the precise conditions under which a path between two variables is blocked, guaranteeing independence whenever all paths are blocked. The test can be implemented in time linear in the size of the Bayesian network structure, without the need to explicitly enumerate paths as suggested previously.
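The sketch below implements the blocking rules just described for the network of Figure 1. For clarity it enumerates paths explicitly, so it is not the linear-time procedure mentioned above, but it applies the same rules: a chain or fork is blocked when its middle node is observed, and a collider is blocked unless it or one of its descendants is observed.

```python
# Edges of the Figure 1 network, (X, Y) meaning X -> Y.
edges = [("F", "S"), ("F", "A"), ("T", "A"), ("A", "L"), ("L", "R")]
nodes = {"T", "F", "S", "A", "L", "R"}

children = {n: {y for x, y in edges if x == n} for n in nodes}
parents  = {n: {x for x, y in edges if y == n} for n in nodes}

def descendants(n):
    """All nodes reachable from n via directed edges."""
    stack, seen = [n], set()
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def simple_paths(x, y):
    """All simple paths between x and y in the underlying undirected graph."""
    def extend(path):
        last = path[-1]
        if last == y:
            yield path
            return
        for nxt in children[last] | parents[last]:
            if nxt not in path:
                yield from extend(path + [nxt])
    yield from extend([x])

def blocked(path, z):
    """A path is blocked by Z if some intermediate triple on it is inactive."""
    for a, b, c in zip(path, path[1:], path[2:]):
        collider = a in parents[b] and c in parents[b]   # a -> b <- c
        if collider:
            # A collider is active only if b or one of its descendants is in Z.
            if b not in z and not (descendants(b) & z):
                return True
        elif b in z:   # a chain or fork with b observed blocks the path
            return True
    return False

def d_separated(x, y, z):
    """X and Y are independent given Z if every path between them is blocked."""
    return all(blocked(p, set(z)) for p in simple_paths(x, y))

# Examples from the text: the path S - F - A - T is unblocked given {A},
# but blocked given {F}.
print(d_separated("S", "T", {"A"}))  # False: S and T may be dependent given A
print(d_separated("S", "T", {"F"}))  # True: S and T are independent given F
```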

The d-separation test can be used to directly derive results that have been proven for specialized probabilistic models used in a variety of fields. One example is hidden Markov models (HMMs), which are used to model dynamic systems whose states are not observable, yet whose outputs are. One uses an HMM when interested in making inferences about these changing states, given the sequence of outputs they generate. HMMs are widely used in applications requiring temporal pattern recognition, including speech, handwriting, and gesture recognition, and various problems in bioinformatics.31 Figure 2a depicts an HMM, which models a system with three states (a, b, c) and three outputs (x, y, z). The figure depicts the possible transitions between the system states, which need to be annotated by their probabilities. For example, state b can transition to states a or c, with a 30% chance of transitioning to state c. Each state can emit a number of observable outputs, again with some probabilities. In this example, state b can emit any of the three outputs, with output z having a 15% chance of being emitted by this state.

This HMM can be represented by the Bayesian network in Figure 2b.32 Here, variable St has three values a, b, c and represents the system state at time t, while variable Ot has the values x, y, z and represents the system output at time t. Using d-separation on this network, one can immediately derive the characteristic property of HMMs: once the state of the system at time t is known, its states and outputs at times > t become independent of its states and outputs at times < t.
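As a small illustration of the HMM-to-DBN correspondence, the sketch below stores the transition and emission CPTs that every slice of the DBN in Figure 2b shares, and computes the probability of an output sequence with the standard forward recursion, which amounts to eliminating the state variables of the unrolled network in temporal order. Only the 30% transition from b to c and the 15% emission of z from b are given in the article; all other numbers are assumed placeholders.

```python
# States and outputs of the HMM in Figure 2a.
states, outputs = ["a", "b", "c"], ["x", "y", "z"]

# Transition CPT Pr(S_{t+1} | S_t) and emission CPT Pr(O_t | S_t), shared by
# every slice of the DBN in Figure 2b.  Each row sums to 1.
transition = {
    "a": {"a": 0.6, "b": 0.3, "c": 0.1},     # ASSUMED
    "b": {"a": 0.7, "b": 0.0, "c": 0.3},     # Pr(c | b) = .30 from Figure 2a
    "c": {"a": 0.2, "b": 0.5, "c": 0.3},     # ASSUMED
}
emission = {
    "a": {"x": 0.8, "y": 0.1, "z": 0.1},     # ASSUMED
    "b": {"x": 0.45, "y": 0.40, "z": 0.15},  # Pr(z | b) = .15 from Figure 2a
    "c": {"x": 0.1, "y": 0.2, "z": 0.7},     # ASSUMED
}
initial = {"a": 1.0 / 3, "b": 1.0 / 3, "c": 1.0 / 3}  # ASSUMED prior on S_1

def sequence_probability(observations):
    """Pr(O_1 .. O_n) via the forward recursion, i.e., summing out the state
    variables of the unrolled DBN in the order S_1, S_2, ..., S_n."""
    # forward[s] = Pr(O_1..O_t, S_t = s)
    forward = {s: initial[s] * emission[s][observations[0]] for s in states}
    for obs in observations[1:]:
        forward = {
            s2: sum(forward[s1] * transition[s1][s2] for s1 in states)
                * emission[s2][obs]
            for s2 in states
        }
    return sum(forward.values())

print(sequence_probability(["x", "z", "y", "x"]))
```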

We also note that the network in Figure 2b is one of the simplest instances of what is known as dynamic Bayesian networks (DBNs).9 A number of extensions have been considered for HMMs, which can be viewed as more structured instances of DBNs. When proposing such extensions, however, one has the obligation of offering a corresponding algorithmic toolbox for inference. By viewing these extended HMMs as instances of Bayesian networks, one immediately inherits the corresponding Bayesian network algorithms for this purpose.

How Are Bayesian Networks Constructed?

One can identify three main methods for constructing Bayesian networks.8 According to the first method, which is largely subjective, one reflects on one's own knowledge or the knowledge of others (typically, perceptions about causal influences) and then captures it in a Bayesian network. The network in Figure 1 is an example of this construction method. The network structure of Figure 3 depicts another example, yet the parameters of this network can be obtained from more formal sources, such as population statistics and test specifications. According to this network, we have a population that is 55% male and 45% female, whose members can suffer from a medical condition C that is more likely to occur in males. Moreover, two diagnostic tests are available for detecting this condition, T1 and T2, with the second test being more effective on females. The CPTs of this network also reveal that the two tests are equally effective on males.

The second method for constructing Bayesian networks is based on automatically synthesizing them from some other type of formal knowledge.

Figure 3. A Bayesian network that models a population, a medical condition, and two corresponding tests. The network contains the variables Sex (S), Condition (C), Test (T1), and Test (T2); C = yes means that the condition C is present, and Ti = +ve means the result of test Ti is positive. Its CPTs are:

  S      θs
  male   .55

  S       C     θc|s
  male    yes   .05
  female  yes   .01

  C     T1    θt1|c
  yes   +ve   .80
  no    +ve   .20

  S       C     T2    θt2|s,c
  male    yes   +ve   .80
  male    no    +ve   .20
  female  yes   +ve   .95
  female  no    +ve   .05


For example, in many applications that involve system analysis, such as reliability and diagnosis, one can synthesize a Bayesian network automatically from a formal system design. Figure 4 depicts a reliability block diagram (RBD) used in reliability analysis. The RBD depicts system components and the dependencies between their availability. For example, Processor 1 requires either of the fans for its availability, and each of the fans requires power for its availability. The goal here is to compute the overall reliability of the system (the probability of its availability) given the reliabilities of each of its components. Figure 4 also shows how one may systematically convert each block in an RBD into a Bayesian network fragment, while Figure 5 depicts the corresponding Bayesian network constructed according to this method. The CPTs of this figure can be completely constructed based on the reliabilities of the individual components (not shown here) and the semantics of the transformation method.8

The third method for constructing Bayesian networks is based on learning them from data, such as medical records or student admissions data. Consider Figure 3 and the data set depicted in the following table as an example.

  Sex S    Condition C   Test T1   Test T2
  male     no            ?         -ve
  male     ?             -ve       +ve
  female   yes           +ve       ?

Each row of the table corresponds to an individual and what we know about them. One can use such a data set to learn the network parameters given its structure, or to learn both the structure and its parameters. Learning parameters only is an easier task computationally. Moreover, learning either structure or parameters always becomes easier when the data set is complete, that is, when the value of each variable is known in each data record.

Since learning is an inductive process, one needs a principle of induction to guide the learning process. The two main principles for this purpose lead to the maximum likelihood and Bayesian approaches to learning (see, for example, the work in 5,8,17,22,27). The maximum likelihood approach favors Bayesian networks that maximize the probability of observing the given data set. The Bayesian approach, on the other hand, uses the likelihood principle in addition to some prior information which encodes preferences on Bayesian networks.

Suppose we are only learning network parameters. The Bayesian approach allows one to put a prior distribution on the possible values of each network parameter. This prior distribution, together with the data set, induces a posterior distribution on the values of that parameter. One can then use this posterior to pick a value for that parameter (for example, the distribution mean). Alternatively, one can decide to avoid committing to a fixed parameter value, while computing answers to queries by averaging over all possible parameter values according to their posterior probabilities.
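For the complete-data case, both estimation principles reduce to simple counting, as in the sketch below: the maximum likelihood estimate of a parameter is a relative frequency, and a Bayesian estimate with a Dirichlet prior can be obtained by adding pseudo-counts and taking the posterior mean. The records are hypothetical (the data set shown earlier contains missing values, which call for more elaborate methods such as EM).

```python
from collections import Counter

# A small complete data set over the variables of Figure 3 (hypothetical records).
records = [
    {"S": "male",   "C": "no",  "T1": "-ve", "T2": "-ve"},
    {"S": "male",   "C": "yes", "T1": "+ve", "T2": "+ve"},
    {"S": "female", "C": "no",  "T1": "-ve", "T2": "-ve"},
    {"S": "female", "C": "yes", "T1": "+ve", "T2": "+ve"},
    {"S": "female", "C": "no",  "T1": "+ve", "T2": "-ve"},
]

def learn_cpt(child, child_values, parent_vars, data, pseudo_count=0.0):
    """Estimate Pr(child | parents) from complete data by counting.

    pseudo_count = 0 gives the maximum likelihood estimate; a positive value
    acts as a Dirichlet prior, and the returned numbers are posterior means
    (the Bayesian estimate discussed in the text)."""
    family = Counter()   # counts of (parent values, child value)
    context = Counter()  # counts of parent values alone
    for r in data:
        pa = tuple(r[p] for p in parent_vars)
        family[(pa, r[child])] += 1
        context[pa] += 1
    cpt = {}
    for pa in context:
        for v in child_values:
            num = family[(pa, v)] + pseudo_count
            den = context[pa] + pseudo_count * len(child_values)
            cpt[(pa, v)] = num / den
    return cpt

# Maximum likelihood vs. Bayesian (posterior-mean) estimates of Pr(T1 | C).
print(learn_cpt("T1", ["+ve", "-ve"], ["C"], records))
print(learn_cpt("T1", ["+ve", "-ve"], ["C"], records, pseudo_count=1.0))
```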

It is critical to observe here that the term "Bayesian network" does not necessarily imply a commitment to the Bayesian approach for learning networks. This term was coined by Judea Pearl28 to emphasize three aspects: the often subjective nature of the information used in constructing these networks; the reliance on Bayes conditioning when reasoning with Bayesian networks; and the ability to perform causal as well as evidential reasoning on these networks, which is a distinction underscored by Thomas Bayes.1

Figure 4. A reliability block diagram (top), with a systematic method for mapping its blocks into Bayesian network fragments (bottom). The diagram contains blocks for Power Supply, Fan 1, Fan 2, Processor 1, Processor 2, and Hard Drive. Each block B that depends on Sub-System 1 and Sub-System 2 is mapped to a network fragment in which the node "Block B Available" is connected to the nodes "Sub-System 1 Available" and "Sub-System 2 Available" through OR and AND nodes.

Figure 5. A Bayesian network generated automatically from a reliability block diagram. The network contains the nodes Power Supply (E), Fan 1 (F1), Fan 2 (F2), Hard Drive (D), Processor 1 (P1), Processor 2 (P2), and System (S). Variables E, F1, F2, P1, P2, and D represent the availability of the corresponding components, and variable S represents the availability of the whole system. Variables Ai correspond to logical ANDs, and variables Oi correspond to logical ORs.



These learning approaches are meant to induce Bayesian networks that are meaningful independently of the tasks for which they are intended. Consider, for example, a network that models a set of diseases and a corresponding set of symptoms. This network may be used to perform diagnostic tasks, by inferring the most likely disease given a set of observed symptoms. It may also be used for prediction tasks, where we infer the most likely symptom given some diseases. If we concern ourselves with only one of these tasks, say diagnostics, we can use a more specialized induction principle that optimizes the diagnostic performance of the learned network. In machine learning jargon, we say we are learning a discriminative model in this case, as it is often used to discriminate among patients according to a predefined set of classes (for example, has cancer or not). This is to be contrasted with learning a generative model, which is to be evaluated based on its ability to generate the given data set, regardless of how it performs on any particular task.

We finally note that it is not uncommon to assume some canonical network structure when learning Bayesian networks from data, in order to reduce the problem of learning structure and parameters to the easier problem of learning parameters only. Perhaps the most common such structure is what is known as naive Bayes: C → A1, ..., C → An, where C is known as the class variable and variables A1, ..., An are known as attributes. This structure has proven to be very popular and effective in a number of applications, in particular classification and clustering.14
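A naive Bayes network is small enough to use directly as a classifier: the posterior on the class variable is proportional to its prior times the product of the attribute CPTs. The sketch below illustrates this with hypothetical variables and parameter values.

```python
# Naive Bayes structure C -> A1, ..., C -> An, used here for classification.
# All variable names and numbers are hypothetical.
prior = {"spam": 0.3, "ham": 0.7}                       # Pr(C)
attribute_cpts = {                                      # Pr(Ai = true | C)
    "contains_link":  {"spam": 0.70, "ham": 0.20},
    "known_sender":   {"spam": 0.10, "ham": 0.80},
    "all_caps_title": {"spam": 0.40, "ham": 0.05},
}

def class_posterior(attributes):
    """Pr(C | a1..an) from the chain rule of the naive Bayes network:
    Pr(C, a1..an) = Pr(C) * prod_i Pr(ai | C), then normalized."""
    scores = {}
    for c in prior:
        p = prior[c]
        for attr, value in attributes.items():
            p_true = attribute_cpts[attr][c]
            p *= p_true if value else 1.0 - p_true
        scores[c] = p
    total = sum(scores.values())
    return {c: p / total for c, p in scores.items()}

print(class_posterior({"contains_link": True, "known_sender": False,
                       "all_caps_title": True}))
```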

Canonical Bayesian Networks

A number of canonical Bayesian networks have been proposed for modeling some well-known problems in a variety of fields. For example, genetic linkage analysis is concerned with mapping genes onto a chromosome, utilizing the fact that the distance between genes is inversely proportional to the extent to which genes are linked (two genes are linked when it is more likely than not that their states are passed together from a single grandparent, instead of one state from each grandparent). To assess the likelihood of a linkage hypothesis, one uses a pedigree with some information about the genotype and phenotype of the associated individuals. Such information can be systematically translated into a Bayesian network (see Figure 6), where the likelihood of a linkage hypothesis corresponds to computing the probability of an event with respect to this network.12 By casting this problem in terms of inference on Bayesian networks, and by capitalizing on state-of-the-art algorithms for this purpose, the scalability of genetic linkage analysis was advanced considerably, leading to the most efficient algorithms for exact linkage computations on general pedigrees (for example, the SUPERLINK program initiated by Fishelson and Geiger12).

    Figure 6. A Bayesian network generated automatically from a pedigree that contains threeindividuals.

    Variables GP ij and GM ij represent the states of gene j for individual i , which are inherited fromthe father and mother of i , respectively. Variable P ij represents the phenotype determined bygene j of individual i . Variables SP ij and SM ij are meant to model recombination events, whichoccur when the states of two genes inherited from a parent do not come from a single grand-parent. The CPTs of these variables contain parameters which encode linkage hypotheses.

    GP 11 GP 21

    GP 12 GP 22

    GP 13 GP 23

    GM 11 GM 21

    SP 31

    SP 32

    SP 33

    SM 31

    SM 32

    SM 33

    GP 31

    GP 32

    GP 33

    GM 31

    GM 32

    GM 33

    P 31

    P 32

    P 33

    GM 12 GM 22

    GM 13 GM 23

    P 11 P 21

    P 12 P 22

    P 13 P 23

    Father: 1 Mother: 2

    Child: 3

    Figure 7. A Bayesian network that models a noisy channel with input ( U 1,,U 4, X 1,, X 3) andoutput ( Y 1,,Y 7).

    Bits X 1,, X 3 are redundantcoding bits and their CPTscapture the coding protocol.The CPTs of variables Y i capture the noise model.This network can be used torecover the most likely sentmessage U 1,,U 4 given areceived message Y 1,,Y 7.

    U 1

    Y 1

    U 2

    Y 2

    U 3

    Y 3

    U 4

    Y 4 Y 5

    X 1

    Y 6

    X 2

    Y 7

    X 3



Canonical models also exist for modeling the problem of passing information over a noisy channel, where the goal is to compute the most likely message sent over such a channel, given the channel output.13 For example, Figure 7 depicts a Bayesian network corresponding to a situation where seven bits are sent across a noisy channel (four original bits and three redundant ones).

Another class of canonical Bayesian networks has been used in various problems relating to vision, including image restoration and stereo vision. Figure 8 depicts two examples of images that were restored by posing a query to a corresponding Bayesian network. Figure 9a depicts the Bayesian network in this case, where we have one node Pi for each pixel i in the image; the values pi of Pi represent the gray levels of pixel i. For each pair of neighboring pixels, i and j, a child node Cij is added with a CPT that imposes a smoothness constraint between the pixels. That is, the probability Pr(Cij = true | Pi = pi, Pj = pj) specified by the CPT decreases as the difference in gray levels |pi - pj| increases. The only additional information needed to completely specify the Bayesian network is a CPT for each node Pi, which provides a prior distribution on the gray levels of each pixel i. These CPTs are chosen to give the highest probability to the gray level vi appearing in the input image, with the prior probability Pr(Pi = pi) decreasing as the difference |pi - vi| increases. By simply adjusting the domain and the prior probabilities of nodes Pi, while asserting an appropriate degree of smoothness using variables Cij, one can use this model to perform other pixel-labeling tasks such as stereo vision, photomontage, and binary segmentation.34 The formulation of these tasks as inference on Bayesian networks is not only elegant, but has also proven to be very powerful. For example, such inference is the basis for almost all top-performing stereo methods.34
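The sketch below shows how the two kinds of CPTs for the restoration network of Figure 9a might be constructed for one pixel and one pair of neighbors. The article only requires that Pr(Pi = pi) decrease with |pi - vi| and that Pr(Cij = true | pi, pj) decrease with |pi - pj|; the exponential decay used here is just one assumed choice, and the decay rates are arbitrary.

```python
import math

GRAY_LEVELS = range(0, 256)

def pixel_prior(observed_value, decay=0.05):
    """Prior CPT for node Pi: highest probability at the observed gray level,
    decreasing as |p - observed_value| grows (decay rate is ASSUMED)."""
    weights = {p: math.exp(-decay * abs(p - observed_value))
               for p in GRAY_LEVELS}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

def smoothness_cpt(decay=0.1):
    """CPT for node Cij: Pr(Cij = true | Pi = pi, Pj = pj) decreases as the
    gray levels of the two neighboring pixels grow further apart."""
    return {(pi, pj): math.exp(-decay * abs(pi - pj))
            for pi in GRAY_LEVELS for pj in GRAY_LEVELS}

# The evidence asserted when restoring an image is Cij = true for every pair
# of neighbors; MPE inference then returns a labeling that stays close to the
# observed image while remaining smooth.
prior_for_pixel = pixel_prior(observed_value=128)
smoothness = smoothness_cpt()
print(prior_for_pixel[128], prior_for_pixel[255], smoothness[(100, 101)])
```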

Canonical Bayesian network models have also been emerging in recent years in the context of other important applications, such as the analysis of documents and text. Many of these networks are based on topic models that view a document as arising from a set of unknown topics, and provide a framework for reasoning about the connections between words, documents, and underlying topics.2,33 Topic models have been applied to many kinds of documents, including email, scientific abstracts, and newspaper archives, allowing one to utilize inference on Bayesian networks to tackle tasks such as measuring document similarity, discovering emergent topics, and browsing through documents based on their underlying content instead of traditional indexing schemes.

What Can One Do with a Bayesian Network?

Similar to any modeling language, the value of Bayesian networks is mainly tied to the class of queries they support. Consider the network in Figure 3 as an example and the following queries: Given a male that came out positive on both tests, what is the probability he has the condition? Which group of the population is most likely to test negative on both tests? Considering the network in Figure 5: What is the overall reliability of the given system? What is the most likely configuration of the two fans given that the system is unavailable? What single component can be replaced to increase the overall system reliability by 5%? Consider Figure 7: What is the most likely channel input that would yield the channel output 1001100? These are example questions that would be of interest in these application domains, and they are questions that can be answered systematically using three canonical Bayesian network queries.8 A main benefit of using Bayesian networks in these application areas is the ability to capitalize on existing algorithms for these queries, instead of having to invent a corresponding specialized algorithm for each application area.

Figure 8. Images from left to right: input, restored (using Bayesian network inference), and original.

Figure 9. Modeling low-level vision problems using two types of graphical models: (a) a Bayesian network, with pixel nodes Pi, Pj and a constraint node Cij between each pair of neighboring pixels; (b) a Markov random field (MRF).



Probability of Evidence. This query computes the probability Pr(e), where e is an assignment of values to some variables E in the Bayesian network; e is called a variable instantiation or evidence in this case. For example, in Figure 3, we can compute the probability that an individual will come out positive on both tests using the probability-of-evidence query Pr(T1 = +ve, T2 = +ve). We can also use the same query to compute the overall reliability of the system in Figure 5, Pr(S = avail). The decision version of this query is known to be PP-complete. It is also related to another common query, which asks for computing the probability Pr(x | e) for each network variable X and each of its values x. This is known as the node marginals query.

Most Probable Explanation (MPE). Given an instantiation e of some variables E in the Bayesian network, this query identifies the instantiation q of the remaining variables Q (those not in E) that maximizes the probability Pr(q | e). In Figure 3, we can use an MPE query to find the most likely group, dissected by sex and condition, that will yield negative results for both tests, by taking the evidence e to be T1 = -ve, T2 = -ve and Q = {S, C}. We can also use an MPE query to restore images as shown in Figures 8 and 9. Here, we take the evidence e to be Cij = true for all i, j and Q to include Pi for all i. The decision version of MPE is known to be NP-complete and is therefore easier than the probability-of-evidence query under standard assumptions of complexity theory.
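Since the CPTs of Figure 3 are given (up to complements), both of these queries can be spelled out directly. The sketch below computes the probability of evidence Pr(T1 = +ve, T2 = +ve) and the MPE over {S, C} given two negative tests by enumerating the 16 instantiations of the network; production systems would instead use the algorithms discussed later.

```python
from itertools import product

# The network of Figure 3: Sex (S) is a root, Condition (C) depends on S,
# Test T1 depends on C, and Test T2 depends on S and C.  All parameters are
# taken from the CPTs shown in Figure 3 (complements filled in).
P_S = {"male": 0.55, "female": 0.45}
P_C_given_S = {"male": {"yes": 0.05, "no": 0.95},
               "female": {"yes": 0.01, "no": 0.99}}
P_T1_given_C = {"yes": {"+ve": 0.80, "-ve": 0.20},
                "no":  {"+ve": 0.20, "-ve": 0.80}}
P_T2_given_SC = {("male", "yes"):   {"+ve": 0.80, "-ve": 0.20},
                 ("male", "no"):    {"+ve": 0.20, "-ve": 0.80},
                 ("female", "yes"): {"+ve": 0.95, "-ve": 0.05},
                 ("female", "no"):  {"+ve": 0.05, "-ve": 0.95}}

def joint(s, c, t1, t2):
    """Chain rule over the network of Figure 3."""
    return (P_S[s] * P_C_given_S[s][c] *
            P_T1_given_C[c][t1] * P_T2_given_SC[(s, c)][t2])

worlds = list(product(P_S, ["yes", "no"], ["+ve", "-ve"], ["+ve", "-ve"]))

# Probability of evidence: Pr(T1 = +ve, T2 = +ve).
p_evidence = sum(joint(s, c, t1, t2) for s, c, t1, t2 in worlds
                 if t1 == "+ve" and t2 == "+ve")
print("Pr(T1=+ve, T2=+ve) =", p_evidence)

# MPE: most likely instantiation of the remaining variables {S, C} given the
# evidence T1 = -ve, T2 = -ve (maximizing Pr(q, e) also maximizes Pr(q | e)).
candidates = [(s, c) for s in P_S for c in ["yes", "no"]]
mpe = max(candidates, key=lambda sc: joint(sc[0], sc[1], "-ve", "-ve"))
print("MPE given both tests negative:", mpe)
```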

Maximum a Posteriori Hypothesis (MAP). Given an instantiation e of some variables E in the Bayesian network, this query identifies the instantiation q of some variables Q, a subset of the variables not in E, that maximizes the probability Pr(q | e). Note the subtle difference with MPE queries: Q is a subset of the non-evidence variables instead of being all of them. MAP is a more difficult problem than MPE since its decision version is known to be NP^PP-complete, while MPE is only NP-complete. As an example of this query, consider Figure 5 and suppose we are interested in the most likely configuration of the two fans given that the system is unavailable. We can find this configuration using a MAP query with the evidence e being S = un_avail and Q = {F1, F2}.

Figure 10. An arithmetic circuit for the Bayesian network B ← A → C. The circuit alternates addition and multiplication nodes; inputs labeled with θ variables (θa, θb|a, θc|a) correspond to network parameters, while those labeled with λ variables (λa, λb, λc) capture evidence. Probability-of-evidence, node-marginals, and MPE queries can all be answered using linear-time traversal algorithms of the arithmetic circuit.

    Figure 11. Two networks that represent the same set of conditional independencies.

    (a) (b)

    Earthquake?(E )

    Earthquake?(E )

    Radio?(R)

    Radio?(R)

    Alarm?(A)

    Alarm?(A)

    Call?(C )

    Call?(C )

    Burglary?(B)

    Burglary?(B)

    Figure 12. Extending Bayesian networks to account for interventions.

    (a) (b)

    Earthquake?

    (E )

    Earthquake?

    (E )

    Radio?(R)

    Radio?(R)

    Alarm?(A)

    Alarm?(A)

    Call?(C )

    Call?(C )

    Tampering(T )

    Tampering(T )

    Burglary?

    (B)

    Burglary?

    (B)



One can use these canonical queries to implement more sophisticated queries, such as the ones demanded by sensitivity analysis. This is a mode of analysis that allows one to check the robustness of conclusions drawn from Bayesian networks against perturbations in network parameters (for example, see Darwiche8). Sensitivity analysis can also be used for automatically revising these parameters in order to satisfy some global constraints that are imposed by experts, or derived from the formal specifications of tasks under consideration. Suppose, for example, that we compute the overall system reliability using the network in Figure 5 and it turns out to be 95%. Suppose we wish this reliability to be no less than 99%: Pr(S = avail) ≥ 99%. Sensitivity analysis can be used to identify components whose reliability is relevant to achieving this objective, together with the new reliabilities they must attain for this purpose. Note that component reliabilities correspond to network parameters in this example.

How Well Do Bayesian Networks Scale?

Algorithms for inference on Bayesian networks fall into two main categories: exact and approximate algorithms. Exact algorithms make no compromises on accuracy and tend to be more expensive computationally. Much emphasis was placed on exact inference in the 1980s and early 1990s, leading to two classes of algorithms based on the concepts of elimination10,24,36 and conditioning.6,29 In their pure form, the complexity of these algorithms is exponential only in the network treewidth, which is a graph-theoretic parameter that measures the resemblance of a graph to a tree structure. For example, the treewidth of trees is 1 and, hence, inference on such tree networks is quite efficient. As the network becomes more and more connected, its treewidth increases and so does the complexity of inference. For example, the network in Figure 9 has a treewidth of n, assuming an underlying image with n × n pixels. This is usually too high, even for modest-size images, to make these networks accessible to treewidth-based algorithms.
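The following is a minimal sketch of elimination-style inference: factors are multiplied and variables summed out one at a time, and only the factors mentioning the variable being eliminated are ever combined. It is shown on a two-variable fragment of Figure 1 (with an assumed prior on Fire); choosing a good elimination order, which is what keeps intermediate factors small on low-treewidth networks, is not addressed here.

```python
from itertools import product

class Factor:
    """A factor maps instantiations of its variables to numbers."""
    def __init__(self, variables, table):
        self.variables = list(variables)   # e.g. ["S", "F"]
        self.table = dict(table)           # e.g. {("true", "true"): 0.90, ...}

def multiply(f, g, domains):
    variables = f.variables + [v for v in g.variables if v not in f.variables]
    table = {}
    for values in product(*(domains[v] for v in variables)):
        row = dict(zip(variables, values))
        fv = tuple(row[v] for v in f.variables)
        gv = tuple(row[v] for v in g.variables)
        table[values] = f.table[fv] * g.table[gv]
    return Factor(variables, table)

def sum_out(f, var):
    variables = [v for v in f.variables if v != var]
    table = {}
    for values, p in f.table.items():
        row = dict(zip(f.variables, values))
        key = tuple(row[v] for v in variables)
        table[key] = table.get(key, 0.0) + p
    return Factor(variables, table)

def eliminate(factors, domains, order):
    """Sum out the variables in `order`, combining only the factors that
    mention the variable being eliminated."""
    factors = list(factors)
    for var in order:
        mentioning = [f for f in factors if var in f.variables]
        factors = [f for f in factors if var not in f.variables]
        combined = mentioning[0]
        for f in mentioning[1:]:
            combined = multiply(combined, f, domains)
        factors.append(sum_out(combined, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f, domains)
    return result

# Tiny example: the fragment F -> S with the CPT values from Figure 1 and an
# ASSUMED prior Pr(F = true) = 0.01; eliminating F yields the marginal on S.
domains = {"F": ["true", "false"], "S": ["true", "false"]}
prior_F = Factor(["F"], {("true",): 0.01, ("false",): 0.99})
cpt_S = Factor(["S", "F"], {("true", "true"): 0.90, ("false", "true"): 0.10,
                            ("true", "false"): 0.01, ("false", "false"): 0.99})
marginal_S = eliminate([prior_F, cpt_S], domains, order=["F"])
print(marginal_S.table)   # Pr(S = true) = 0.01*0.90 + 0.99*0.01
```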

The pure forms of elimination and conditioning algorithms are called structure-based, as their complexity is sensitive only to the network structure. In particular, these algorithms will consume the same computational resources when applied to two networks that share the same structure, regardless of what probabilities are used to annotate them. It has long been observed that inference algorithms can be made more efficient if they also exploit the structure exhibited by network parameters, including determinism (0 and 1 parameters) and context-specific independence (independence that follows from network parameters and is not detectable by d-separation3). Yet algorithms for exploiting parametric structure have only matured in the last few years, allowing one to perform exact inference on some networks whose treewidth is quite large (see the survey8). Networks that correspond to genetic linkage analysis (Figure 6) tend to fall in this category,12 and so do networks that are synthesized from relational models.4

One of the key techniques for exploiting parametric structure is based on compiling Bayesian networks into arithmetic circuits, allowing one to reduce probabilistic inference to a process of circuit propagation;7 see Figure 10. The size of these compiled circuits is determined by both the network topology and its parameters, leading to relatively compact circuits in some situations where the parametric structure is excessive, even if the network treewidth is quite high (for example, Chavira et al.4). Reducing inference to circuit propagation also makes it easier to support applications that require real-time inference, as in certain diagnosis applications.25
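The circuit of Figure 10 computes the network polynomial of B ← A → C, namely f = Σa λa θa (Σb λb θb|a)(Σc λc θc|a). The sketch below evaluates that polynomial in one bottom-up pass: setting an evidence indicator λx to 1 when x is consistent with the evidence and to 0 otherwise makes f evaluate to the probability of evidence. The parameter values are assumed for illustration.

```python
# Evaluating an arithmetic circuit for the network B <- A -> C (Figure 10).
# Parameter values below are ASSUMED; each conditional distribution sums to 1.
theta_A = {"a0": 0.4, "a1": 0.6}
theta_B_given_A = {("b0", "a0"): 0.7, ("b1", "a0"): 0.3,
                   ("b0", "a1"): 0.2, ("b1", "a1"): 0.8}
theta_C_given_A = {("c0", "a0"): 0.1, ("c1", "a0"): 0.9,
                   ("c0", "a1"): 0.5, ("c1", "a1"): 0.5}

def evaluate_circuit(evidence):
    """One bottom-up pass over the circuit of Figure 10 (sums of products).
    `evidence` maps a variable name ("A", "B", "C") to one of its values."""
    # Evidence indicators: lambda_x = 1 iff value x is consistent with evidence.
    lam = {x: (1.0 if evidence.get(x[0].upper(), x) == x else 0.0)
           for x in ["a0", "a1", "b0", "b1", "c0", "c1"]}
    total = 0.0
    for a in ["a0", "a1"]:
        sum_b = sum(lam[b] * theta_B_given_A[(b, a)] for b in ["b0", "b1"])
        sum_c = sum(lam[c] * theta_C_given_A[(c, a)] for c in ["c0", "c1"])
        total += lam[a] * theta_A[a] * sum_b * sum_c
    return total

print(evaluate_circuit({}))                       # no evidence: evaluates to 1.0
print(evaluate_circuit({"B": "b1"}))              # Pr(B = b1)
print(evaluate_circuit({"B": "b1", "C": "c0"}))   # Pr(B = b1, C = c0)
```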

Around the mid-1990s, a strong belief started forming in the inference community that the performance of exact algorithms must be exponential in treewidth; this was before parametric structure was being exploited effectively. At about the same time, methods for automatically constructing Bayesian networks started maturing to the point of yielding networks whose treewidth was too large to be handled efficiently by the exact algorithms of the time. This led to a surge of interest in approximate inference algorithms, which are generally independent of treewidth. Today, approximate inference algorithms are the only choice for networks that have a high treewidth yet lack sufficient parametric structure; the networks used in low-level vision applications tend to have this property. An influential class of approximate inference algorithms is based on reducing the inference problem to a constrained optimization problem, with loopy belief propagation and its generalizations as one key example.29,35 Loopy belief propagation is actually the common algorithm of choice today for handling networks with very high treewidth, such as the ones arising in vision or channel coding applications. Algorithms based on stochastic sampling have also been pursued for a long time and are especially important for inference in Bayesian networks that contain continuous variables.8,15,22 Variational methods provide another important class of approximation techniques19,22 and are key for inference on some Bayesian networks, such as the ones arising in topic models.2

Causality, Again

One of the most intriguing aspects of Bayesian networks is the role they play in formalizing causality. To illustrate this point, consider Figure 11, which depicts two Bayesian network structures over the same set of variables. One can verify using d-separation that these structures represent the same set of conditional independencies. As such, they are representationally equivalent, as they can induce the same set of probability distributions when augmented with appropriate CPTs.


Note, however, that the network in Figure 11a is consistent with common perceptions of causal influences, yet the one in Figure 11b violates these perceptions due to edges A → E and A → B. Is there any significance to this discrepancy? In other words, is there some additional information that can be extracted from one of these networks, which cannot be extracted from the other? The answer is yes, according to a body of work on causal Bayesian networks, which is concerned with a key question:16,30 how can one characterize the additional information captured by a causal Bayesian network and, hence, what queries can be answered only by Bayesian networks that have a causal interpretation?

According to this body of work, only causal networks are capable of updating probabilities based on interventions, as opposed to observations. To give an example of this difference, consider Figure 11 again and suppose that we want to compute the probabilities of various events given that someone has tampered with the alarm, causing it to go off. This is an intervention, to be contrasted with an observation, where we know the alarm went off but without knowing the reason. In a causal network, interventions are handled as shown in Figure 12a: by simply adding a new direct cause for the alarm variable. This local fix, however, cannot be applied to the non-causal network in Figure 11b. If we do, we obtain the network in Figure 12b, which asserts the following (using d-separation): if we observe that the alarm did go off, then knowing it was not tampered with is irrelevant to whether a burglary or an earthquake took place. This independence, which is counterintuitive, does not hold in the causal structure and represents one example of what may go wrong when using a non-causal structure to answer questions about interventions.
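The difference between observing and intervening can be seen on a tiny causal network with two causes of an alarm. In the sketch below, an intervention is modeled by replacing the alarm's CPT with one that ignores its causes, the usual "graph surgery" device, which has the same effect here as the article's device of adding an explicit tampering cause and setting it to true. All parameter values are assumed.

```python
from itertools import product

# A tiny causal network: Burglary (B) and Earthquake (E) are causes of Alarm (A).
# All parameter values are ASSUMED for illustration.
P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}

def joint(b, e, a, alarm_cpt=P_A_given_BE):
    p_a = alarm_cpt[(b, e)]
    return P_B[b] * P_E[e] * (p_a if a else 1.0 - p_a)

worlds = list(product([True, False], repeat=3))

def prob_B_given_observed_alarm():
    """Pr(B = true | A = true): conditioning on an observation."""
    num = sum(joint(b, e, a) for b, e, a in worlds if a and b)
    den = sum(joint(b, e, a) for b, e, a in worlds if a)
    return num / den

def prob_B_given_do_alarm():
    """Pr(B = true | do(A = true)): an intervention severs A from its causes,
    modeled here by replacing A's CPT with one that forces the alarm on."""
    forced = {pa: 1.0 for pa in P_A_given_BE}
    num = sum(joint(b, e, a, forced) for b, e, a in worlds if a and b)
    den = sum(joint(b, e, a, forced) for b, e, a in worlds if a)
    return num / den

print(prob_B_given_observed_alarm())  # observing the alarm raises belief in burglary
print(prob_B_given_do_alarm())        # forcing the alarm leaves it at the prior 0.01
```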

Causal structures can also be used to answer more sophisticated queries, such as counterfactuals. For example, the probability that "the patient would have been alive had he not taken the drug" requires reasoning about interventions (and sometimes might even require functional information, beyond standard causal Bayesian networks30). Other types of queries include ones for distinguishing between direct and indirect causes and for determining the sufficiency and necessity of causation.30 Learning causal Bayesian networks has also been treated,16,30 although not as extensively as the learning of general Bayesian networks.

Beyond Bayesian Networks

Viewed as graphical representations of probability distributions, Bayesian networks are only one of several models for this purpose. In fact, in areas such as statistics (and now also in AI), Bayesian networks are studied under the broader class of probabilistic graphical models, which include other instances such as Markov networks and chain graphs (for example, Edwards11 and Koller and Friedman22). Markov networks correspond to undirected graphs, and chain graphs have both directed and undirected edges. Both of these models can be interpreted as compact specifications of probability distributions, yet their semantics tend to be less transparent than Bayesian networks. For example, both of these models include numeric annotations, yet one cannot interpret these numbers directly as probabilities even though the whole model can be interpreted as a probability distribution. Figure 9b depicts a special case of a Markov network, known as a Markov random field (MRF), which is typically used in vision applications. Comparing this model to the Bayesian network in Figure 9a, one finds that smoothness constraints between two adjacent pixels Pi and Pj can now be represented by a single undirected edge Pi - Pj instead of two directed edges and an additional node, Pi → Cij ← Pj. In this model, each edge is associated with a function f(Pi, Pj) over the states of the adjacent pixels. The values of this function can be used to capture the smoothness constraint for these pixels, yet they do not admit a direct probabilistic interpretation.

Bayesian networks are meant to model probabilistic beliefs, yet the interest in such beliefs is typically motivated by the need to make rational decisions. Since such decisions are often contemplated in the presence of uncertainty, one needs to know the likelihoods and utilities associated with various decision outcomes.


A classical example in this regard concerns an oil wildcatter that needs to decide whether or not to drill for oil at a specific site, with an additional decision on whether to request seismic soundings that may help determine the geological structure of the site. Each of these decisions has an associated cost. Moreover, their potential outcomes have associated utilities and probabilities. The need to integrate these probabilistic beliefs, utilities, and decisions has led to the development of influence diagrams, which are extensions of Bayesian networks that include three types of nodes: chance, utility, and decision.18 Influence diagrams, also called decision networks, come with a toolbox that allows one to compute optimal strategies: ones that are guaranteed to produce the highest expected utility.20,22

Bayesian networks have also been extended in ways that are meant to facilitate their construction. In many domains, such networks tend to exhibit regular and repetitive structures, with the regularities manifesting in both CPTs and network structure. In these situations, one can synthesize large Bayesian networks automatically from compact high-level specifications. A number of concrete specifications have been proposed for this purpose. For example, template-based approaches require two components for specifying a Bayesian network: a set of network templates whose instantiation leads to network segments, and a specification of which segments to generate and how to connect them together.22,23 Other approaches include languages based on first-order logic, allowing one to reason about situations with varying sets of objects (for example, Milch et al.26).

The Challenges Ahead

Bayesian networks have been established as a ubiquitous tool for modeling and reasoning under uncertainty. The reach of Bayesian networks, however, is tied to their effectiveness in representing the phenomena of interest, and to the scalability of their inference algorithms. To further improve the scope and ubiquity of Bayesian networks, one therefore needs sustained progress on both fronts. The main challenges on the first front lie in increasing the expressive power of Bayesian network representations, while maintaining the key features that have proven necessary for their success: modularity of representation, transparent graphical nature, and efficiency of inference. On the algorithmic side, there is a need to better understand the theoretical and practical limits of exact inference algorithms based on the two dimensions that characterize Bayesian networks: their topology and parametric structure.

With regard to approximate inference algorithms, the main challenges seem to be in better understanding their behavior to the point where we can characterize the conditions under which they are expected to yield good approximations, and provide tools for practically trading off approximation quality with computational resources. Pushing the limits of inference algorithms will immediately push the envelope with regard to learning Bayesian networks, since many learning algorithms rely heavily on inference. One cannot emphasize enough the importance of this line of work, given the extent to which data is available today, and the abundance of applications that require the learning of networks from such data.

References

1. Bayes, T. An essay towards solving a problem in the doctrine of chances. Phil. Trans. 3 (1963), 370-418. Reproduced in W.E. Deming.
2. Blei, D.M., Ng, A.Y. and Jordan, M.I. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993-1022.
3. Boutilier, C., Friedman, N., Goldszmidt, M. and Koller, D. Context-specific independence in Bayesian networks. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (1996), 115-123.
4. Chavira, M., Darwiche, A. and Jaeger, M. Compiling relational Bayesian networks for exact inference. International Journal of Approximate Reasoning 42, 1-2 (May 2006), 4-20.
5. Cowell, R., Dawid, A., Lauritzen, S. and Spiegelhalter, D. Probabilistic Networks and Expert Systems. Springer, 1999.
6. Darwiche, A. Recursive conditioning. Artificial Intelligence 126, 1-2 (2001), 5-41.
7. Darwiche, A. A differential approach to inference in Bayesian networks. Journal of the ACM 50, 3 (2003).
8. Darwiche, A. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.
9. Dean, T. and Kanazawa, K. A model for reasoning about persistence and causation. Computational Intelligence 5, 3 (1989), 142-150.
10. Dechter, R. Bucket elimination: A unifying framework for probabilistic inference. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (1996), 211-219.
11. Edwards, D. Introduction to Graphical Modeling. Springer, 2nd edition, 2000.
12. Fishelson, M. and Geiger, D. Exact genetic linkage computations for general pedigrees. Bioinformatics 18, 1 (2002), 189-198.
13. Frey, B., ed. Graphical Models for Machine Learning and Digital Communication. MIT Press, Cambridge, MA, 1998.
14. Friedman, N., Geiger, D. and Goldszmidt, M. Bayesian network classifiers. Machine Learning 29, 2-3 (1997), 131-163.
15. Gilks, W., Richardson, S. and Spiegelhalter, D. Markov Chain Monte Carlo in Practice: Interdisciplinary Statistics. Chapman & Hall/CRC, 1995.
16. Glymour, C. and Cooper, G., eds. Computation, Causation, and Discovery. MIT Press, Cambridge, MA, 1999.
17. Heckerman, D. A tutorial on learning with Bayesian networks. Learning in Graphical Models. Kluwer, 1998, 301-354.
18. Howard, R.A. and Matheson, J.E. Influence diagrams. Principles and Applications of Decision Analysis, Vol. 2. Strategic Decision Group, Menlo Park, CA, 1984, 719-762.
19. Jaakkola, T. Tutorial on variational approximation methods. Advanced Mean Field Methods. D. Saad and M. Opper, eds. MIT Press, Cambridge, MA, 2001, 129-160.
20. Jensen, F.V. and Nielsen, T.D. Bayesian Networks and Decision Graphs. Springer, 2007.
21. Jordan, M., Ghahramani, Z., Jaakkola, T. and Saul, L. An introduction to variational methods for graphical models. Machine Learning 37, 2 (1999), 183-233.
22. Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009.
23. Koller, D. and Pfeffer, A. Object-oriented Bayesian networks. In Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence (1997), 302-313.
24. Lauritzen, S.L. and Spiegelhalter, D.J. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B 50, 2 (1988), 157-224.
25. Mengshoel, O., Darwiche, A., Cascio, K., Chavira, M., Poll, S. and Uckun, S. Diagnosing faults in electrical power systems of spacecraft and aircraft. In Proceedings of the 20th Innovative Applications of Artificial Intelligence Conference (2008), 1699-1705.
26. Milch, B., Marthi, B., Russell, S., Sontag, D., Ong, D. and Kolobov, A. BLOG: Probabilistic models with unknown objects. In Proceedings of the International Joint Conference on Artificial Intelligence (2005), 1352-1359.
27. Neapolitan, R. Learning Bayesian Networks. Prentice Hall, Englewood, NJ, 2004.
28. Pearl, J. Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the Cognitive Science Society (1985), 329-334.
29. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
30. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
31. Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
32. Smyth, P., Heckerman, D. and Jordan, M. Probabilistic independence networks for hidden Markov probability models. Neural Computation 9, 2 (1997), 227-269.
33. Steyvers, M. and Griffiths, T. Probabilistic topic models. Handbook of Latent Semantic Analysis. T.K. Landauer, D.S. McNamara, S. Dennis, and W. Kintsch, eds. 2007, 427-448.
34. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M.F. and Rother, C. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Trans. Pattern Anal. Mach. Intell. 30, 6 (2008), 1068-1080.
35. Yedidia, J., Freeman, W. and Weiss, Y. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory 51, 7 (2005), 2282-2312.
36. Zhang, N.L. and Poole, D. A simple approach to Bayesian network computations. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence (1994), 171-178.

Adnan Darwiche ([email protected]) is a professor and former chair of the computer science department at the University of California, Los Angeles, where he also directs the Automated Reasoning Group.

© 2010 ACM 0001-0782/10/1200 $10.00