Site Management | WordPress Blog
Links to other web pages: Basic Concepts | Causality | Classifiers | Inference | Learning | Reading List | Related Topics
Links in this web page: Conferences | Bayes Nets Construction | Bayes Nets Structure Learning | Database | Groups | Introduction | Journals | References | Researchers | Resources | Software
Bayes Nets or Bayesian networks [79] are graphical representation
for probabilistic relationships among a set of random variables. Given a finite
set
of discrete random
variables where each variable
may take values from a
finite set, denoted by
. A Bayesian network is an annotated directed acyclic graph
(DAG) G that encodes a joint probability distribution over
. The nodes of the graph correspond to the random variables
. The links of the graph correspond to the direct influence
from one variable to the other. If there is a directed link from variable
to variable
, variable
will be a parent of
variable
. Each node is annotated with a conditional probability
distribution (CPD) that represents
, where
denotes the parents of
in
. The pair (
, CPD) encodes the joint distribution
. A unique joint probability distribution over
from
is factorized as:
![]()
Figure 2 is an example of a Bayesian network from Cooper’s paper [16], which is hypothetically about the medical domain with 5 variables. The corresponding CPDs are in Table 1.

Figure 2 A simple example of Bayesian network
|
P(X1=no)=0.8 |
P(X1 = yes)=0.2 |
|
P(X2=absent | X1=no)=0.95 P(X2=absent | X1=yes)=0.75 |
P(X2=present | X1=no)=0.05 P(X2=present | X1=yes)=0.25 |
|
P(X3=absent | X1=no)=0.99995 P(X3=absent | X1=yes)=0.997 |
P(X3=absent | X1=no)=0.00005 P(X3=absent | X1=yes)=0.003 |
|
P(X4=absent | X2=absent, X3=absent)=0.95 P(X4=absent | X2=absent, X3=present)=0.5 P(X4=absent | X2=present, X3=absent)=0.9 P(X4=absent | X2=present, X3=present)=0.25 |
P(X4=present | X2=absent, X3=absent)=0.05 P(X4=present | X2=absent, X3=present)=0.5 P(X4=present | X2=present, X3=absent)=0.1 P(X4=present | X2=present, X3=present)=0.75 |
|
P(X5=absent | X3=absent)=0.98 P(X5=absent | X3=present)=0.4 |
P(X5=present | X3=absent)=0.02 P(X5=present | X3=present)=0.6 |
Table 1 The probabilities associated with Figure 1
Causal Bayesian networks
A causal Bayesian network [78] of a domain is similar as the
normal Bayesian network. The difference is in the explanation of the links in
the Bayesian networks. In the normal Bayesian networks, the links between
variables can be explained as correlation or association. In a causal Bayesian
network, the links mean that the parent variables will causally influence the
values of the child variables. The causal influence in this thesis is defined
based on manipulation criteria (see Section 3.7.2 of Spirtes et al [96] for a detailed discussion of
manipulation): Suppose there are two variables A and B in the domain; If we can
manipulate the variables in the domain, set the value of variable A as a1
or a2 and measure its effect on variable B, then the probability
distribution of variable B will change under the conditions of the different
values of variable A –
(the operator
is from Pearl’s
book [78]).
One way to construct Bayesian networks is from domain knowledge. There are three main steps to construct a Bayesian network by domain knowledge: 1) determine the number and the meanings of the variables in the interested domain; 2) determine the direct influence relationships among variables in the domain; and 3) determine the conditional probabilities given the structure of the Bayesian networks from step 2. Some methods have been proposed to facilitate the process to construct Bayesian networks with causal domain knowledge [22,49,74].
In many domains, however, domain knowledge is not sufficient to construct a complete Bayesian network. In this case, Bayesian networks can be learned from data when the data is available. In literature, many methods have been proposed for this purpose [17,47,79,96]. The Bayesian network learning problem can be categorized as 1) parameter learning problem when the structure is known, and 2) structure learning problem when the structure is unknown. Generally the parameter learning will be a part of the structure learning problem and used as an inner loop of structure learning in the score-and-search-based approach. In this thesis, we are interested in the Bayesian network structure learning problem.
The task in Bayesian network structure learning is to find a structure of Bayesian network that describes the observed data the most, which is proved to be a NP-complete problem [11]. Many heuristics have been proposed to learn Bayesian network structure. In literature, there are two categories of approaches for Bayesian network structure learning. The first one is the score-and-search-based approach [13,17,47]. The methods in this category start from an initial structure (generated randomly or from domain knowledge) and move to the neighbors with the best score in the structure space determinately or stochastically until a local maximum of the selected criteria is reached. The greedy learning process can re-start several times with different initial structures to improve the result. Another category of the Bayesian network structure learning approach is the constraint-based approach [80,95]. The methods under this category start to test the statistical significance of the pairs of variables conditioning on other variables to induce conditional independence. The pairs of variables which pass some threshold are deemed as directly connected in the Bayesian networks. The complete Bayesian network structure is constructed from the induced conditional independence and dependence information. In the later sub-sections, we will discuss the score-and-search-based approach and the constraint-based approach.
There are three main issues in the score-and-search-based approach for Bayesian network structure learning: the structure space, the search strategy and the model selection criterion.
The search space in Bayesian network structure learning is all the possible structures of directed acyclic graphs (DAGs) given the number of variables in the domain. Robinson [87,88] derived the efficiently computable recursive function to determine the number of possible DAGs that contain n nodes:
![]()
The numbers of Bayesian networks with small number of variables are shown in Table 2. From Table 2, we know that the number of DAGs is super-exponential in the number of nodes. It is impossible to enumerate all possible structures and score them, even only with a reasonably-sized number of nodes n in the Bayesian network. Then heuristic-based methods have been proposed to find a local maximum in the structure space. The idea is that a good heuristic will speed up the search process in order to reach a local maximum. The representative methods are K2 algorithm [17], greedy search, etc.
|
Number of variables in DAG |
Number of the possible DAGs |
|
1 |
1 |
|
2 |
3 |
|
3 |
25 |
|
4 |
543 |
|
5 |
29,281 |
|
6 |
3,781,503 |
|
7 |
1,138,779,265 |
|
8 |
78,370,2329,343 |
|
9 |
1,213,442,454,842,881 |
|
10 |
4,175,098,976,430,598,100 |
Table 2 Number of DAGs with specified number of variables
Besides the basic structure space described above, Checking [12] has considered the Markov-equivalent structures as the search space. A set of Bayesian networks are equivalent if they imply exactly the same set of independence statements among variables. Markov-equivalent set of Bayesian network structures can be represented with complete partial DAG (CPDAG). The edges in a CPDAG can be either directed, denoting that the edge direction is the same in all equivalent Bayesian networks, or undirected, denoting that either direction is possible in some equivalent Bayesian networks. Since our interest is the direct influence relationships among variables, the Markov-equivalent structures cannot model the exact direct influence relationship and we will not consider it in this report.
In the score-and-search-based approach, any search methods from artificial intelligence, such as brute-force, depth-first, width-first, best-first or simulated annealing, can be used. Here we only consider some basic and widely-used methods.
The brute-force approach to Bayesian network structure learning is to enumerate all possible DAGs, score each one and select the best one. This provides a "gold standard" to evaluate other algorithms.
As mentioned in last sub-section, the number of possible DAGs is super exponential to the number of the nodes. It is impossible to enumerate all possible DAGs for N > 5 and score them in a reasonable time. But one can evaluate any reasonably-sized set of structures in this way (e.g., the nearest neighbors of the best structure at the current stage).
Since we cannot enumerate all the
possible DAGs for Bayesian network structure learning, we have to try the
heuristic methods. Since DAGs are acyclic and the parents of the variables are
before children in causal ordering, knowing the ordering of the variables can
reduce the structure space. If we know a total ordering on the nodes, finding
the best structure amounts to picking the best set of parents for each node
independently. This is what the
If the ordering is unknown, we can search over orderings. The space of orderings is much smaller and more regular than the space of the structures, and has a smoother posterior “landscape”. As a result, the search over ordering is more efficient than searching over DAGs. Refer to Friedman and Koller’s work [32] for details of search over ordering.
If we know nothing about the structure, we can treat the structure learning problem as a general optimization problem in a discrete space. The intuitive way to solve such a problem is greedy search. Greedy search starts at a specific point (an initial structure) in the structure space, considers all nearest neighbors of the current point, and moves to the neighbor that has the highest score; if no neighbors have higher score than the current point (i.e., we have reached a local maximum), the algorithm stops.
The greedy search process for score-and-search-based approach is as follows:
Input
of the algorithm:
1) Observational data set;
Output
of the algorithm:
1)
A
Bayesian network.
1.
Generate
the initial Bayesian network, evaluate it and set it as the current Bayesian
network
2.
Evaluate
the neighbors of the current Bayesian network
3.
If the
best score of the neighbors is better than the score of the current Bayesian
network, set the neighbor with the best score as the current Bayesian network
and return to Step 2;
4.
Otherwise,
stop the learning process
Three different ways to generate the initial Bayesian network structures are often used in practice: 1) the structure without any arcs – it means every node is independent of any other nodes, 2) the structure with complete arcs – it means there are links between any pairs of nodes; and 3) randomly generated structures.
The score can be calculated with any reasonable score function for Bayesian network.
A common definition of "neighbor" is all structures that can be generated from the current structures by adding, deleting or reversing a single arc, subject to the acyclicity constraint.
When the algorithm stops, it always reaches a local maximum. The local maximum reached by the algorithm is essentially dependent on the initial structure. If a good initial structure is chosen, we can reach a good model in a short time. If a bad initial structure is chosen, we will reach a reasonable good model in a very long time, or can’t reach a reasonable good model. Although we know the initial structure is essential, we have no enough knowledge to justify which initial model is good. Instead of choosing one good initial structure, the alternative way is to restart the algorithm with different initial structures and choose the best model in the reached local maximums.
Other methods have been proposed to
solve the local maximum problem, such as simulated annealing and best-first
search. In the simulated annealing, the process does not always go to the
neighbors of the current structure with the highest score, but in a stochastic
way. It goes to the neighbor with the highest score with a probability
or does a random walk
to a neighbor with probability
. The probability
may change over time.
Following a stochastic way, the process may escape from a local maximum and
find a global maximum.
In the best-first approach, the space of the possible network structure will be searched systematically using a heuristic measure to determine the next best structure to examine.
Although the idea in greedy search is intuitive, it can reach good model in reasonable time compared with other methods. Chickering [12] has shown that, for a fixed amount of computational time, greedy search with random restarts produces better models than does either simulated annealing or best-first search. There is also a theoretic justification of greedy search method over the Markov equivalent class by Chickering [13].
In Bayesian network structure learning, a scoring function evaluates how well a given network G matches the data D. Given a scoring function, the best Bayesian network is the one that maximizes this scoring function.
An ad-hoc scoring function is based on the maximum likelihood (ML) principle: Selecting the structure which predicts the data D with the highest probability
![]()
where
denotes the hypothesis of the Bayesian structure,
is the vector of
parameters
,
is the vector of
parameters for the distribution
of variable
and
is the maximum
likelihood configuration of
.
But the number of parameters
(here and in the
following used to measure the complexity of the model) can be very large
depending on G. Thus an application of the maximum-likelihood principle might
lead to an overfitting to Bayesian networks that match the given data well, but
have low generalization quality for new data. Therefore, a penalty in the
scoring function for the complexity of the model is needed.
The scoring functions that are most frequently used to learn Bayesian networks fulfill this condition: Bayesian Information Criterion (BIC) and Bayesian score combine the likelihood with some penalty relating to the complexity of the model.
The Bayesian Information Criterion (BIC) [90] is defined as
![]()
where D is the data,
is the maximum likelihood (ML) estimate of the parameters,
is the hypothesis of the Bayesian structure, d is the number
of parameters in the Bayesian network, and N is the number of data cases. The
BIC criterion has several properties. First, it does not depend on the prior,
so we do not need to specify the prior. Second, the criterion is quite
intuitive. Namely, it contains a term measuring how well the parameterized
model predicts the data
and a term that
punishes the complexity of the model
. Third, the BIC criterion is exactly minus the Minimum
description Length (MDL) criterion [44,61]. Although BIC is intuitive
and often used in practice, it tends to choose models that are too simple due
to the heavy penalty on the complexity of the model.
The Bayesian score for measuring the quality of a Bayesian network G is its posterior probability given the database:
![]()
where
is a normalization
constant which does not depend on the Bayesian network structure G. Since
is a constant and will
not affect the ordering of the different models, the relative posterior
probability
is often used for
model selection. This criterion has two components: the prior and the marginal
likelihood. The prior can be specified by experts or just set uniformly to all
possible structures. The marginal likelihood integrates out the parameters of
the model.
The Bayesian score is originally
discussed by Cooper and Herskovits [17] as BD metric and further
developed by Heckerman et al [47] as BDe metric. Here only BDe
metric is covered. The BDe metric (Bayesian metric with Dirichlet priors and
equivalence) [47] evolves from the search for a
network with the largest posterior probability given priors over network
structures G and parameters
. It was derived with the complete data under the following
assumptions: multinomial sample assumption, parameter independence assumption,
parameter modularity assumption, likelihood equivalence assumption, and
structure possibility assumption. The BDe metric is based on the concept of
sets of likelihood equivalent network structures, where all members in a set of
equivalent network are given the same score.
The BDe metric also provides a
simple way of identifying the virtual counts
by having the user
specify a prior Bayesian network Sp for variables X and an equivalent sample
size
:
![]()
The Bayesian score is a more accurate criterion, but it needs more computation in practice. BIC can be derived as a large sample approximation to the marginal likelihood, the second component of Bayesian score. In practice, the sample size does not need to be very large for the approximation to be good.
Except the criteria mentioned above, some other criteria, such as cross-validation criterion, entropy score, and minimum message length (MML) [101], are also used for Bayesian network structure selection, and are not covered here.
Constraint-based methods view the structure learning problem differently. Since a Bayesian network structure encodes many dependencies and (conditional) independencies of the underlying model, the algorithms of this approach try to discover the dependencies and conditional independencies from the data, and then use these dependencies and conditional independencies to infer the Bayesian network structure.
The dependency and conditional independency relationships are measured by using some kind of Conditional Independence (CI) test. The conditional independence tests that are used in practice are statistical tests on the data set. In order to use the results to reconstruct the structure, several assumptions have to be made: Causal Sufficiency assumption, Causal Markov assumption, and Faithfulness assumption.
Causal sufficiency assumption: There exist no common unobserved (also known as hidden or latent) variables in the domain that are parent of one or more observed variables of the domain.
Causal Markov assumption: Given a Bayesian network model G, any variable is independent of all its non-descendants in G given its parents.
Faithfulness assumption: A Bayesian network structure G and a probability distribution P generated by G are faithful to one another if and only if every conditional independence relationship valid in P is entailed by the Causal Markov assumption in G.
With these assumptions in place, one can ascertain the existence of an edge between two variables, or the direction of that link, though the latter is only being possible in certain cases. The output of constraint-based methods will be a partial DAG (PDAG) to represent the whole Markov equivalence class. The representative algorithms in this category are SGS algorithm [94], CI algorithm [80], and PC algorithm [96].