Site Management | WordPress Blog

Bayes Nets

Links to other web pages: Basic Concepts | Causality | Classifiers | Inference | Learning | Reading List | Related Topics

 

Links in this web page: Conferences | Bayes Nets Construction | Bayes Nets Structure Learning | Database | Groups | Introduction | Journals | References | Researchers | Resources | Software

 

Introduction

Bayes Nets or Bayesian networks [79] are graphical representation for probabilistic relationships among a set of random variables. Given a finite set  of discrete random variables where each variable  may take values from a finite set, denoted by . A Bayesian network is an annotated directed acyclic graph (DAG) G that encodes a joint probability distribution over. The nodes of the graph correspond to the random variables. The links of the graph correspond to the direct influence from one variable to the other. If there is a directed link from variable  to variable , variable  will be a parent of variable . Each node is annotated with a conditional probability distribution (CPD) that represents, where  denotes the parents of  in. The pair (, CPD) encodes the joint distribution. A unique joint probability distribution over  from  is factorized as:

Figure 2 is an example of a Bayesian network from Cooper’s paper [16], which is hypothetically about the medical domain with 5 variables. The corresponding CPDs are in Table 1.

Figure 2  A simple example of Bayesian network

 

P(X1=no)=0.8

P(X1 = yes)=0.2

P(X2=absent | X1=no)=0.95

P(X2=absent | X1=yes)=0.75

P(X2=present | X1=no)=0.05

P(X2=present | X1=yes)=0.25

P(X3=absent | X1=no)=0.99995

P(X3=absent | X1=yes)=0.997

P(X3=absent | X1=no)=0.00005

P(X3=absent | X1=yes)=0.003

P(X4=absent | X2=absent, X3=absent)=0.95

P(X4=absent | X2=absent, X3=present)=0.5

P(X4=absent | X2=present, X3=absent)=0.9

P(X4=absent | X2=present, X3=present)=0.25

P(X4=present | X2=absent, X3=absent)=0.05

P(X4=present | X2=absent, X3=present)=0.5

P(X4=present | X2=present, X3=absent)=0.1

P(X4=present | X2=present, X3=present)=0.75

P(X5=absent | X3=absent)=0.98

P(X5=absent | X3=present)=0.4

P(X5=present | X3=absent)=0.02

P(X5=present | X3=present)=0.6

Table 1  The probabilities associated with Figure 1

Causal Bayesian networks

A causal Bayesian network [78] of a domain is similar as the normal Bayesian network. The difference is in the explanation of the links in the Bayesian networks. In the normal Bayesian networks, the links between variables can be explained as correlation or association. In a causal Bayesian network, the links mean that the parent variables will causally influence the values of the child variables. The causal influence in this thesis is defined based on manipulation criteria (see Section 3.7.2 of Spirtes et al [96] for a detailed discussion of manipulation): Suppose there are two variables A and B in the domain; If we can manipulate the variables in the domain, set the value of variable A as a1 or a2 and measure its effect on variable B, then the probability distribution of variable B will change under the conditions of the different values of variable A –  (the operator  is from Pearl’s book [78]).

Bayesian network construction

One way to construct Bayesian networks is from domain knowledge. There are three main steps to construct a Bayesian network by domain knowledge: 1) determine the number and the meanings of the variables in the interested domain; 2) determine the direct influence relationships among variables in the domain; and 3) determine the conditional probabilities given the structure of the Bayesian networks from step 2. Some methods have been proposed to facilitate the process to construct Bayesian networks with causal domain knowledge [22,49,74].

  In many domains, however, domain knowledge is not sufficient to construct a complete Bayesian network. In this case, Bayesian networks can be learned from data when the data is available. In literature, many methods have been proposed for this purpose [17,47,79,96]. The Bayesian network learning problem can be categorized as 1) parameter learning problem when the structure is known, and 2) structure learning problem when the structure is unknown. Generally the parameter learning will be a part of the structure learning problem and used as an inner loop of structure learning in the score-and-search-based approach. In this thesis, we are interested in the Bayesian network structure learning problem.

Bayesian network structure learning

The task in Bayesian network structure learning is to find a structure of Bayesian network that describes the observed data the most, which is proved to be a NP-complete problem [11]. Many heuristics have been proposed to learn Bayesian network structure. In literature, there are two categories of approaches for Bayesian network structure learning. The first one is the score-and-search-based approach [13,17,47]. The methods in this category start from an initial structure (generated randomly or from domain knowledge) and move to the neighbors with the best score in the structure space determinately or stochastically until a local maximum of the selected criteria is reached. The greedy learning process can re-start several times with different initial structures to improve the result. Another category of the Bayesian network structure learning approach is the constraint-based approach [80,95]. The methods under this category start to test the statistical significance of the pairs of variables conditioning on other variables to induce conditional independence. The pairs of variables which pass some threshold are deemed as directly connected in the Bayesian networks. The complete Bayesian network structure is constructed from the induced conditional independence and dependence information. In the later sub-sections, we will discuss the score-and-search-based approach and the constraint-based approach.

Score-and-search-based approach

There are three main issues in the score-and-search-based approach for Bayesian network structure learning: the structure space, the search strategy and the model selection criterion.

 Structure space

The search space in Bayesian network structure learning is all the possible structures of directed acyclic graphs (DAGs) given the number of variables in the domain. Robinson [87,88] derived the efficiently computable recursive function to determine the number of possible DAGs that contain n nodes:

The numbers of Bayesian networks with small number of variables are shown in Table 2. From Table 2, we know that the number of DAGs is super-exponential in the number of nodes. It is impossible to enumerate all possible structures and score them, even only with a reasonably-sized number of nodes n in the Bayesian network. Then heuristic-based methods have been proposed to find a local maximum in the structure space. The idea is that a good heuristic will speed up the search process in order to reach a local maximum. The representative methods are K2 algorithm [17], greedy search, etc.

Number of variables in DAG

Number of the possible DAGs

1

1

2

3

3

25

4

543

5

29,281

6

3,781,503

7

1,138,779,265

8

78,370,2329,343

9

1,213,442,454,842,881

10

4,175,098,976,430,598,100

Table 2  Number of DAGs with specified number of variables

Besides the basic structure space described above, Checking [12] has considered the Markov-equivalent structures as the search space. A set of Bayesian networks are equivalent if they imply exactly the same set of independence statements among variables. Markov-equivalent set of Bayesian network structures can be represented with complete partial DAG (CPDAG). The edges in a CPDAG can be either directed, denoting that the edge direction is the same in all equivalent Bayesian networks, or undirected, denoting that either direction is possible in some equivalent Bayesian networks. Since our interest is the direct influence relationships among variables, the Markov-equivalent structures cannot model the exact direct influence relationship and we will not consider it in this report.

 Search methods

In the score-and-search-based approach, any search methods from artificial intelligence, such as brute-force, depth-first, width-first, best-first or simulated annealing, can be used. Here we only consider some basic and widely-used methods.

 Exhaustive search

The brute-force approach to Bayesian network structure learning is to enumerate all possible DAGs, score each one and select the best one. This provides a "gold standard" to evaluate other algorithms.

As mentioned in last sub-section, the number of possible DAGs is super exponential to the number of the nodes. It is impossible to enumerate all possible DAGs for N > 5 and score them in a reasonable time. But one can evaluate any reasonably-sized set of structures in this way (e.g., the nearest neighbors of the best structure at the current stage).

 K2 algorithm

Since we cannot enumerate all the possible DAGs for Bayesian network structure learning, we have to try the heuristic methods. Since DAGs are acyclic and the parents of the variables are before children in causal ordering, knowing the ordering of the variables can reduce the structure space. If we know a total ordering on the nodes, finding the best structure amounts to picking the best set of parents for each node independently. This is what the K2 algorithm [17] does. K2 algorithm is a greedy search algorithm that works as follows. Suppose we know the total ordering of the nodes. Initially each node has no parents. The algorithm then incrementally adds the parent whose addition most increases the score of the resulting structure. When no addition of a single parent can increase the score, it stops adding parents to the node. Since an ordering of the nodes is known beforehand, the search space under this constraint is much smaller than the entire space. And we do not need to check for cycles, since the total ordering guarantees that there is no cycle in the deduced structures. Furthermore, based on some appropriate assumptions, we can choose the parents for each node independently.

If the ordering is unknown, we can search over orderings. The space of orderings is much smaller and more regular than the space of the structures, and has a smoother posterior “landscape”. As a result, the search over ordering is more efficient than searching over DAGs. Refer to Friedman and Koller’s work [32] for details of search over ordering.

 Greedy search

If we know nothing about the structure, we can treat the structure learning problem as a general optimization problem in a discrete space. The intuitive way to solve such a problem is greedy search. Greedy search starts at a specific point (an initial structure) in the structure space, considers all nearest neighbors of the current point, and moves to the neighbor that has the highest score; if no neighbors have higher score than the current point (i.e., we have reached a local maximum), the algorithm stops.

The greedy search process for score-and-search-based approach is as follows:

Input of the algorithm:

1)      Observational data set;

Output of the algorithm:

1)      A Bayesian network.

1.      Generate the initial Bayesian network, evaluate it and set it as the current Bayesian network

2.      Evaluate the neighbors of the current Bayesian network

3.      If the best score of the neighbors is better than the score of the current Bayesian network, set the neighbor with the best score as the current Bayesian network and return to Step 2;

4.      Otherwise, stop the learning process

Three different ways to generate the initial Bayesian network structures are often used in practice: 1) the structure without any arcs – it means every node is independent of any other nodes, 2) the structure with complete arcs – it means there are links between any pairs of nodes; and 3) randomly generated structures.

The score can be calculated with any reasonable score function for Bayesian network.

A common definition of "neighbor" is all structures that can be generated from the current structures by adding, deleting or reversing a single arc, subject to the acyclicity constraint.

When the algorithm stops, it always reaches a local maximum. The local maximum reached by the algorithm is essentially dependent on the initial structure. If a good initial structure is chosen, we can reach a good model in a short time. If a bad initial structure is chosen, we will reach a reasonable good model in a very long time, or can’t reach a reasonable good model. Although we know the initial structure is essential, we have no enough knowledge to justify which initial model is good. Instead of choosing one good initial structure, the alternative way is to restart the algorithm with different initial structures and choose the best model in the reached local maximums.

Other methods have been proposed to solve the local maximum problem, such as simulated annealing and best-first search. In the simulated annealing, the process does not always go to the neighbors of the current structure with the highest score, but in a stochastic way. It goes to the neighbor with the highest score with a probability  or does a random walk to a neighbor with probability. The probability  may change over time. Following a stochastic way, the process may escape from a local maximum and find a global maximum.

In the best-first approach, the space of the possible network structure will be searched systematically using a heuristic measure to determine the next best structure to examine.

Although the idea in greedy search is intuitive, it can reach good model in reasonable time compared with other methods. Chickering [12] has shown that, for a fixed amount of computational time, greedy search with random restarts produces better models than does either simulated annealing or best-first search. There is also a theoretic justification of greedy search method over the Markov equivalent class by Chickering [13].

 Model selection criteria

In Bayesian network structure learning, a scoring function evaluates how well a given network G matches the data D. Given a scoring function, the best Bayesian network is the one that maximizes this scoring function.

An ad-hoc scoring function is based on the maximum likelihood (ML) principle: Selecting the structure which predicts the data D with the highest probability

where denotes the hypothesis of the Bayesian structure,  is the vector of parameters ,  is the vector of parameters for the distribution    of variable  and  is the maximum likelihood configuration of .

But the number of parameters  (here and in the following used to measure the complexity of the model) can be very large depending on G. Thus an application of the maximum-likelihood principle might lead to an overfitting to Bayesian networks that match the given data well, but have low generalization quality for new data. Therefore, a penalty in the scoring function for the complexity of the model is needed.

The scoring functions that are most frequently used to learn Bayesian networks fulfill this condition: Bayesian Information Criterion (BIC) and Bayesian score combine the likelihood with some penalty relating to the complexity of the model.

The Bayesian Information Criterion (BIC) [90] is defined as

where D is the data, is the maximum likelihood (ML) estimate of the parameters, is the hypothesis of the Bayesian structure, d is the number of parameters in the Bayesian network, and N is the number of data cases. The BIC criterion has several properties. First, it does not depend on the prior, so we do not need to specify the prior. Second, the criterion is quite intuitive. Namely, it contains a term measuring how well the parameterized model predicts the data  and a term that punishes the complexity of the model . Third, the BIC criterion is exactly minus the Minimum description Length (MDL) criterion [44,61]. Although BIC is intuitive and often used in practice, it tends to choose models that are too simple due to the heavy penalty on the complexity of the model.

The Bayesian score for measuring the quality of a Bayesian network G is its posterior probability given the database:

where  is a normalization constant which does not depend on the Bayesian network structure G. Since  is a constant and will not affect the ordering of the different models, the relative posterior probability  is often used for model selection. This criterion has two components: the prior and the marginal likelihood. The prior can be specified by experts or just set uniformly to all possible structures. The marginal likelihood integrates out the parameters of the model.

The Bayesian score is originally discussed by Cooper and Herskovits [17] as BD metric and further developed by Heckerman et al [47] as BDe metric. Here only BDe metric is covered. The BDe metric (Bayesian metric with Dirichlet priors and equivalence) [47] evolves from the search for a network with the largest posterior probability given priors over network structures G and parameters. It was derived with the complete data under the following assumptions: multinomial sample assumption, parameter independence assumption, parameter modularity assumption, likelihood equivalence assumption, and structure possibility assumption. The BDe metric is based on the concept of sets of likelihood equivalent network structures, where all members in a set of equivalent network are given the same score.

The BDe metric also provides a simple way of identifying the virtual counts  by having the user specify a prior Bayesian network Sp for variables X and an equivalent sample size:

The Bayesian score is a more accurate criterion, but it needs more computation in practice. BIC can be derived as a large sample approximation to the marginal likelihood, the second component of Bayesian score. In practice, the sample size does not need to be very large for the approximation to be good.

Except the criteria mentioned above, some other criteria, such as cross-validation criterion, entropy score, and minimum message length (MML) [101], are also used for Bayesian network structure selection, and are not covered here.

Constraint-based approach

Constraint-based methods view the structure learning problem differently. Since a Bayesian network structure encodes many dependencies and (conditional) independencies of the underlying model, the algorithms of this approach try to discover the dependencies and conditional independencies from the data, and then use these dependencies and conditional independencies to infer the Bayesian network structure.

The dependency and conditional independency relationships are measured by using some kind of Conditional Independence (CI) test. The conditional independence tests that are used in practice are statistical tests on the data set. In order to use the results to reconstruct the structure, several assumptions have to be made: Causal Sufficiency assumption, Causal Markov assumption, and Faithfulness assumption.

Causal sufficiency assumption: There exist no common unobserved (also known as hidden or latent) variables in the domain that are parent of one or more observed variables of the domain.

Causal Markov assumption: Given a Bayesian network model G, any variable is independent of all its non-descendants in G given its parents.

Faithfulness assumption: A Bayesian network structure G and a probability distribution P generated by G are faithful to one another if and only if every conditional independence relationship valid in P is entailed by the Causal Markov assumption in G.

With these assumptions in place, one can ascertain the existence of an edge between two variables, or the direction of that link, though the latter is only being possible in certain cases. The output of constraint-based methods will be a partial DAG (PDAG) to represent the whole Markov equivalence class. The representative algorithms in this category are SGS algorithm [94], CI algorithm [80], and PC algorithm [96].

Hybrid methods