An artificial neural network (ANN), usually called a neural network (NN), is a mathematical model or computational model that is inspired by the structure and/or functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs or to find patterns in data.

Learning

What has attracted the most interest in neural networks is the possibility of learning. Given a specific task to solve, and a class of functions $F$, learning means using a set of observations to find $f^* \in F$ which solves the task in some optimal sense.

This entails defining a cost function $C : F \to \mathbb{R}$ such that, for the optimal solution $f^*$, $C(f^*) \le C(f)\ \forall f \in F$ (i.e., no solution has a cost less than the cost of the optimal solution).

The cost function $C$ is an important concept in learning, as it is a measure of how far away a particular solution is from an optimal solution to the problem to be solved. Learning algorithms search through the solution space to find a function that has the smallest possible cost.

For applications where the solution is dependent on some data, the cost must necessarily be a function of the observations, otherwise we would not be modelling anything related to the data. It is frequently defined as a statistic to which only approximations can be made. As a simple example, consider the problem of finding the model $f$ which minimizes $C = E\left[(f(x) - y)^2\right]$, for data pairs $(x, y)$ drawn from some distribution $\mathcal{D}$. In practical situations we would only have $N$ samples from $\mathcal{D}$ and thus, for the above example, we would only minimize $\hat{C} = \frac{1}{N}\sum_{i=1}^{N}\left(f(x_i) - y_i\right)^2$. Thus, the cost is minimized over a sample of the data rather than the entire data set.
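To make this concrete, here is a minimal Python sketch; the data, the noisy linear target, and the candidate model are illustrative assumptions, not part of the text above. It evaluates the empirical cost over $N$ samples, which stands in for the expected cost we cannot compute:

```python
import numpy as np

def empirical_cost(f, xs, ys):
    """Mean squared error over the N observed samples only."""
    return np.mean((f(xs) - ys) ** 2)

# Hypothetical data: N = 100 samples standing in for draws from D.
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=100)
ys = 2.0 * xs + rng.normal(scale=0.1, size=100)  # noisy linear target (assumed)

f = lambda x: 2.0 * x             # one candidate model from the function class
print(empirical_cost(f, xs, ys))  # approximates the expected cost E[(f(x) - y)^2]
```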

When $N \to \infty$ some form of online machine learning must be used, where the cost is partially minimized as each new example is seen. While online machine learning is often used when $\mathcal{D}$ is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online machine learning is frequently used for finite datasets.
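A minimal sketch of the online setting, under assumed details (a scalar model $f(x) = ax$, a per-example squared cost, and a fixed step size): each incoming example triggers one small corrective step rather than a full minimization.

```python
import numpy as np

rng = np.random.default_rng(1)
a, lr = 0.0, 0.05   # scalar model f(x) = a*x and an assumed step size

for _ in range(10_000):                  # examples arrive one at a time
    x = rng.uniform(-1, 1)
    y = 2.0 * x + rng.normal(scale=0.1)  # the stream's (assumed) relationship
    grad = 2 * (a * x - y) * x           # d/da of (a*x - y)^2 for this example
    a -= lr * grad                       # partial minimization per example
print(a)  # drifts toward the minimizer of the expected cost (about 2.0)
```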

See also: Optimization (mathematics), Estimation theory, and Machine learning

Choosing a cost function

While it is possible to define some arbitrary, ad hoc cost function, frequently a particular cost will be used, either because it has desirable properties (such as convexity) or because it arises naturally from a particular formulation of the problem (e.g., in a probabilistic formulation the posterior probability of the model can be used as an inverse cost). Ultimately, the cost function will depend on the desired task. An overview of the three main categories of learning tasks is provided below.

Learning paradigms

There are three major learning paradigms, each corresponding to a particular abstract learning task. These are supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning

In supervised learning, we are given a set of example pairs $(x, y)$, $x \in X$, $y \in Y$, and the aim is to find a function $f : X \to Y$ in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data and it implicitly contains prior knowledge about the problem domain.

A commonly used cost is the mean-squared error, which tries to minimize the average squared error between the network's output, $f(x)$, and the target value $y$ over all the example pairs. When one tries to minimize this cost using gradient descent for the class of neural networks called multilayer perceptrons, one obtains the common and well-known backpropagation algorithm for training neural networks.
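As an illustrative sketch of this combination, the following trains a small one-hidden-layer multilayer perceptron by gradient descent on the mean-squared error; the layer sizes, the sine target, and the learning rate are assumptions for the example, and the backward pass spells out the chain-rule (backpropagation) gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
Y = np.sin(3 * X)                        # an assumed target function

W1 = rng.normal(scale=0.5, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)
lr = 0.1                                 # assumed learning rate

for step in range(2000):
    # forward pass
    H = np.tanh(X @ W1 + b1)             # hidden layer
    out = H @ W2 + b2                    # network output f(x)
    err = out - Y                        # d(cost)/d(out), up to a constant factor
    # backward pass: the chain rule, i.e. backpropagation
    dW2 = H.T @ err / len(X); db2 = err.mean(axis=0)
    dH = err @ W2.T * (1 - H ** 2)       # gradient through tanh
    dW1 = X.T @ dH / len(X); db1 = dH.mean(axis=0)
    # gradient-descent step on the mean-squared error
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

H = np.tanh(X @ W1 + b1)
print(np.mean((H @ W2 + b2 - Y) ** 2))   # final mean-squared error
```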

Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition). This can be thought of as learning with a "teacher," in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.

Unsupervised learning

In unsupervised learning, some data $x$ is given along with the cost function to be minimized, which can be any function of the data $x$ and the network's output, $f$.

The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables).

As a trivial example, consider the model $f(x) = a$, where $a$ is a constant and the cost $C = E\left[(x - f(x))^2\right]$. Minimizing this cost will give us a value of $a$ that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the mutual information between x and y, whereas in statistical modeling, it could be related to the posterior probability of the model given the data. (Note that in both of those examples those quantities would be maximized rather than minimized).
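A sketch of this trivial example with assumed data values; gradient descent on $C$ drives the constant $a$ to the sample mean:

```python
import numpy as np

x = np.array([1.0, 4.0, 7.0])    # some given data (assumed values)
a = 0.0                           # the constant model f(x) = a
lr = 0.1                          # assumed step size

for _ in range(200):
    grad = 2 * np.mean(a - x)     # derivative of E[(x - a)^2] with respect to a
    a -= lr * grad

print(a, x.mean())                # both are (approximately) 4.0: the mean
```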

Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression and filtering.

Reinforcement learning

In reinforcement learning, data $x$ are usually not given, but generated by an agent's interactions with the environment. At each point in time $t$, the agent performs an action $y_t$ and the environment generates an observation $x_t$ and an instantaneous cost $c_t$, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost; i.e., the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.

More formally, the environment is modeled as a Markov decision process (MDP) with states $s_1, \dots, s_n \in S$ and actions $a_1, \dots, a_m \in A$ with the following probability distributions: the instantaneous cost distribution $P(c_t \mid s_t)$, the observation distribution $P(x_t \mid s_t)$ and the transition $P(s_{t+1} \mid s_t, a_t)$, while a policy is defined as a conditional distribution over actions given the observations. Taken together, the two define a Markov chain (MC). The aim is to discover the policy that minimizes the cost; i.e., the MC for which the cost is minimal.
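The following toy sketch (the two-state dynamics, costs, and policy are invented for illustration) rolls out the Markov chain induced by fixing a policy in an MDP, and estimates the long-term average cost by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-state, 2-action MDP: P[s, a] is the next-state distribution.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
cost = np.array([[0.0, 1.0], [2.0, 0.5]])    # instantaneous cost c(s, a)
policy = np.array([[0.8, 0.2], [0.3, 0.7]])  # P(action | state), held fixed

def rollout(steps=10_000):
    s, total = 0, 0.0
    for _ in range(steps):
        a = rng.choice(2, p=policy[s])   # sample an action from the policy
        total += cost[s, a]              # accumulate instantaneous cost
        s = rng.choice(2, p=P[s, a])     # sample the state transition
    return total / steps                 # estimated long-term average cost

print(rollout())  # Monte Carlo estimate of this policy's long-term cost
```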

ANNs are frequently used in reinforcement learning as part of the overall algorithm.

Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks.

See also: dynamic programming and stochastic control

A Boltzmann machine is a type of stochastic recurrent neural network invented by Geoffrey Hinton and Terry Sejnowski. Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They were one of the first examples of a neural network capable of learning internal representations, and are able to represent and (given sufficient time) solve difficult combinatorial problems. However, due to a number of issues discussed below, Boltzmann machines with unconstrained connectivity have not proven useful for practical problems in machine learning or inference. They are still theoretically intriguing, however, due to the locality and Hebbian nature of their training algorithm, as well as their parallelism and the resemblance of their dynamics to simple physical processes. If the connectivity is constrained, the learning can be made efficient enough to be useful for practical problems.

They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function.

Structure

A Boltzmann machine, like a Hopfield network, is a network of units with an "energy" defined for the network. It also has binary units, but unlike Hopfield nets, Boltzmann machine units are stochastic. The global energy, $E$, in a Boltzmann machine is identical in form to that of a Hopfield network:

$$E = -\left(\sum_{i<j} w_{ij}\, s_i\, s_j + \sum_i \theta_i\, s_i\right)$$

Where:

▪ $w_{ij}$ is the connection strength between unit $j$ and unit $i$.

▪ $s_i$ is the state, $s_i \in \{0, 1\}$, of unit $i$.

▪ $\theta_i$ is the threshold of unit $i$.

The connections in a Boltzmann machine have two restrictions:

▪ No unit has a connection with itself.

▪ $w_{ij} = w_{ji}$. (All connections are symmetric.)
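The global energy above and the two restrictions translate directly into code. A minimal sketch, with assumed array names and random parameters, that evaluates $E$ for a given binary state vector:

```python
import numpy as np

def boltzmann_energy(w, theta, s):
    """E = -(sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i) for binary states s."""
    upper = np.triu(w, k=1)              # count each symmetric pair once (i < j)
    return -(s @ upper @ s + theta @ s)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
w = (w + w.T) / 2                        # enforce w_ij = w_ji
np.fill_diagonal(w, 0.0)                 # no unit connects to itself
theta = rng.normal(size=4)
s = np.array([1, 0, 1, 1])
print(boltzmann_energy(w, theta, s))
```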

Probability of a unit's state

The difference in the global energy that results from a single unit $i$ being 0 (off) versus 1 (on), written $\Delta E_i$, is given by:

$$\Delta E_i = \sum_j w_{ij}\, s_j + \theta_i$$

This can be expressed as the difference of energies of two states:

$$\Delta E_i = E_{i=\text{off}} - E_{i=\text{on}}$$

We then substitute the energy of each state with its relative probability according to the Boltzmann factor (the property of a Boltzmann distribution that the energy of a state is proportional to the negative log probability of that state):

$$\Delta E_i = -k_B T \ln(p_{i=\text{off}}) - \left(-k_B T \ln(p_{i=\text{on}})\right)$$

where $k_B$ is Boltzmann's constant and becomes lumped into the artificial notion of temperature $T$. We then rearrange terms and consider that the probabilities of the unit being on and off must sum to one:

$$\frac{\Delta E_i}{T} = \ln(p_{i=\text{on}}) - \ln(p_{i=\text{off}})$$

$$\frac{\Delta E_i}{T} = \ln(p_{i=\text{on}}) - \ln(1 - p_{i=\text{on}})$$

$$\frac{\Delta E_i}{T} = \ln\left(\frac{p_{i=\text{on}}}{1 - p_{i=\text{on}}}\right)$$

$$-\frac{\Delta E_i}{T} = \ln\left(\frac{1 - p_{i=\text{on}}}{p_{i=\text{on}}}\right)$$

$$-\frac{\Delta E_i}{T} = \ln\left(\frac{1}{p_{i=\text{on}}} - 1\right)$$

$$\exp\left(-\frac{\Delta E_i}{T}\right) = \frac{1}{p_{i=\text{on}}} - 1$$

We can now finally solve for $p_{i=\text{on}}$, the probability that the $i$-th unit is on:

$$p_{i=\text{on}} = \frac{1}{1 + \exp\left(-\frac{\Delta E_i}{T}\right)}$$

where the scalar T is referred to as the temperature of the system.

This relation is the source of the logistic function found in probability expressions in variants of the Boltzmann machine.

Equilibrium state

The network is run by repeatedly choosing a unit and setting its state according to the above formula. After running for long enough at a certain temperature, the probability of a global state of the network will depend only upon that global state's energy, according to a Boltzmann distribution. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability distribution of global states has converged. If we start running the network from a high temperature, and gradually decrease it until we reach a thermal equilibrium at a low temperature, we are guaranteed to converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.
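A sketch of that procedure: repeatedly pick a unit at random, compute its energy gap $\Delta E_i$, and switch it on with the logistic probability derived above, while lowering $T$ on an assumed geometric cooling schedule:

```python
import numpy as np

def delta_E(w, theta, s, i):
    """Energy gap Delta E_i = sum_j w_ij s_j + theta_i (w has a zero diagonal)."""
    return w[i] @ s + theta[i]

def anneal(w, theta, steps=20_000, T_start=10.0, T_end=0.1):
    rng = np.random.default_rng(0)
    n = len(theta)
    s = rng.integers(0, 2, size=n)                      # random initial state
    for t in range(steps):
        T = T_start * (T_end / T_start) ** (t / steps)  # geometric cooling (assumed)
        i = rng.integers(n)                             # pick a unit at random
        z = np.clip(delta_E(w, theta, s, i) / T, -50, 50)
        s[i] = rng.random() < 1.0 / (1.0 + np.exp(-z))  # logistic unit update
    return s  # a low-energy state, distributed near the Boltzmann distribution

rng = np.random.default_rng(1)
w = rng.normal(size=(6, 6)); w = (w + w.T) / 2
np.fill_diagonal(w, 0.0)
theta = rng.normal(size=6)
print(anneal(w, theta))
```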

If we want to train the network so that the chance of it converging to a global state follows an external distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities get the lowest energies. This is done by the following training procedure.

Training

The units in the Boltzmann Machine are divided into "visible" units, $V$, and "hidden" units, $H$. The visible units are those which receive information from the "environment", i.e. our training set is a set of binary vectors over the set $V$. The distribution over the training set is denoted $P^+(V)$.

On the Boltzmann Machine side, as noted above, the distribution over the global states converges as we reach a thermal equilibrium. We denote the converged distribution, after we marginalize out the hidden units, as $P^-(V)$.

Our goal is to approximate the "real" distribution $P^+(V)$ using the $P^-(V)$ which will be produced (eventually) by the machine. To measure how similar the two distributions are, we use the Kullback–Leibler divergence, $G$:

$$G = \sum_v P^+(v) \ln\left(\frac{P^+(v)}{P^-(v)}\right)$$

Where the sum is over all the possible states of $V$. $G$ is a function of the weights, since they determine the energy of a state, and the energy determines $P^-(v)$, as given by the Boltzmann distribution. Hence, we can use a gradient descent algorithm over $G$, so a given weight, $w_{ij}$, is changed by subtracting the partial derivative of $G$ with respect to the weight.

There are two phases to Boltzmann machine training, and we switch iteratively between them. One is the "positive" phase, where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to $P^+$). The other is the "negative" phase, where the network is allowed to run freely, i.e. no units have their state determined by external data. Surprisingly enough, the gradient with respect to a given weight, $w_{ij}$, is given by the very simple equation (proved in Ackley et al.):

$$\frac{\partial G}{\partial w_{ij}} = -\frac{1}{R}\left[p_{ij}^+ - p_{ij}^-\right]$$

Where:

▪ $p_{ij}^+$ is the probability of units $i$ and $j$ both being on when the machine is at equilibrium in the positive phase.

▪ $p_{ij}^-$ is the probability of units $i$ and $j$ both being on when the machine is at equilibrium in the negative phase.

▪ $R$ denotes the learning rate.

This result follows from the fact that at thermal equilibrium the probability $P^-(s)$ of any global state $s$ when the network is free-running is given by the Boltzmann distribution (hence the name "Boltzmann machine").

Remarkably, this learning rule is fairly biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (or synapse biologically speaking) does not need information about anything other than the two neurons it connects. This is far more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.

The training of a Boltzmann machine does not use the EM algorithm, which is heavily used in machine learning. Minimizing the KL divergence is equivalent to maximizing the log-likelihood of the data. Therefore, the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated before the maximization of the expected value of the complete-data likelihood during the M-step.

Training the biases is similar, but uses only single node activity:

$$\frac{\partial G}{\partial \theta_i} = -\frac{1}{R}\left[p_i^+ - p_i^-\right]$$
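A toy sketch of the two-phase training loop, under assumed sizes, a two-vector training set, and deliberately short Gibbs chains standing in for true equilibrium statistics (genuine equilibrium would require far longer runs, which is exactly the scaling problem discussed in the next section):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_units(w, theta, s, free_idx, sweeps, T=1.0):
    """Gibbs-update the units in free_idx; all other units stay clamped."""
    for _ in range(sweeps):
        for i in free_idx:
            p_on = 1.0 / (1.0 + np.exp(-(w[i] @ s + theta[i]) / T))
            s[i] = rng.random() < p_on
    return s

# Hypothetical machine: units 0-2 are visible, 3-4 hidden; toy training set.
n, visible, hidden = 5, [0, 1, 2], [3, 4]
data = np.array([[1, 0, 1], [0, 1, 1]])
w = np.zeros((n, n)); theta = np.zeros(n)
R = 0.1  # learning rate (assumed)

for epoch in range(200):
    pos = np.zeros((n, n)); neg = np.zeros((n, n))
    for v in data:                        # positive phase: visibles clamped
        s = np.zeros(n, dtype=int); s[visible] = v
        s = sample_units(w, theta, s, hidden, sweeps=20)
        pos += np.outer(s, s)             # crude estimate of p+_ij
    for _ in range(len(data)):            # negative phase: free-running
        s = rng.integers(0, 2, size=n)
        s = sample_units(w, theta, s, range(n), sweeps=20)
        neg += np.outer(s, s)             # crude estimate of p-_ij
    w += R * (pos - neg) / len(data)      # descend the gradient of G
    np.fill_diagonal(w, 0.0)              # keep the no-self-connection restriction
    theta += R * (np.diag(pos) - np.diag(neg)) / len(data)  # single-unit bias rule

print(np.round(w, 2))
```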

Problems

The Boltzmann machine would theoretically be a rather general computational medium. For instance, if trained on photographs, the machine would theoretically model the distribution of photographs, and could use that model to, for example, complete a partial photograph.

Unfortunately, there is a serious practical problem with the Boltzmann machine, namely that the learning seems to stop working correctly when the machine is scaled up to anything larger than a trivial machine. This is due to a number of effects, the most important of which are:

▪ the time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths

▪ connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to random walk until the activities saturate.

Restricted Boltzmann Machine

Although learning is impractical in general Boltzmann machines, it can be made quite efficient in an architecture called the "Restricted Boltzmann Machine" or "RBM", which allows no intralayer connections: hidden units are not connected to one another, and neither are visible units. After training one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBMs makes it possible to train many layers of hidden units efficiently, and as each new layer is added the overall generative model gets better.
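Because the bipartite structure leaves hidden units conditionally independent given the visibles (and vice versa), whole layers can be sampled in one parallel step. The sketch below uses one-step contrastive divergence, a training shortcut that is not described in this text but is commonly paired with RBMs; the sizes, learning rate, and placeholder data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_v, n_h, lr = 6, 4, 0.1                   # assumed sizes and learning rate
W = rng.normal(scale=0.1, size=(n_v, n_h))
b_v, b_h = np.zeros(n_v), np.zeros(n_h)
data = rng.integers(0, 2, size=(20, n_v))  # placeholder binary training data

for epoch in range(100):
    for v0 in data:
        ph0 = sigmoid(v0 @ W + b_h)        # hidden probabilities given the data
        h0 = (rng.random(n_h) < ph0).astype(int)
        pv1 = sigmoid(h0 @ W.T + b_v)      # one parallel reconstruction step
        v1 = (rng.random(n_v) < pv1).astype(int)
        ph1 = sigmoid(v1 @ W + b_h)
        # CD-1 update: data-driven minus reconstruction-driven statistics
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        b_v += lr * (v0 - v1)
        b_h += lr * (ph0 - ph1)
```

After training, the hidden probabilities `sigmoid(data @ W + b_h)` can serve as the data for a higher-level RBM, as the stacking description above suggests.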

There is an extension to the Restricted Boltzmann Machine that affords using real-valued data rather than binary data. Along with higher-order Boltzmann Machines, it is outlined here [1].


Hopfield net


A Hopfield net is a form of recurrent artificial neural network invented by John Hopfield. Hopfield nets serve as content-addressable memory systems with binary threshold units. They are guaranteed to converge to a local minimum, but convergence to one of the stored patterns is not guaranteed.

Structure

Figure: A Hopfield net with four nodes.

The units in Hopfield nets are binary threshold units, i.e. the units only take on two different values for their states, and the value is determined by whether or not the unit's input exceeds its threshold. Hopfield nets can either have units that take on values of 1 or -1, or units that take on values of 1 or 0. So, the two possible definitions for unit $i$'s activation, $a_i$, are:

(1) $a_i \leftarrow \begin{cases} +1 & \text{if } \sum_j w_{ij} s_j > \theta_i, \\ -1 & \text{otherwise.} \end{cases}$

(2) $a_i \leftarrow \begin{cases} 1 & \text{if } \sum_j w_{ij} s_j > \theta_i, \\ 0 & \text{otherwise.} \end{cases}$

Where:

▪ $w_{ij}$ is the strength of the connection weight from unit $j$ to unit $i$ (the weight of the connection).

▪ $s_j$ is the state of unit $j$.

▪ $\theta_i$ is the threshold of unit $i$.

The connections in a Hopfield net typically have the following restrictions:

▪ $w_{ii} = 0,\ \forall i$ (no unit has a connection with itself)

▪ $w_{ij} = w_{ji},\ \forall i, j$ (connections are symmetric)

The requirement that weights be symmetric is typically used, as it will guarantee that the energy function decreases monotonically while following the activation rules, and the network may exhibit some periodic or chaotic behaviour if non-symmetric weights are used. However, Hopfield found that this chaotic behaviour is confined to relatively small parts of the phase space, and does not impair the network's ability to act as a content-addressable associative memory system.

Hopfield nets have a scalar value associated with each state of the network referred to as the "energy", E, of the network, where:

$$E = -\frac{1}{2}\sum_{i,j} w_{ij}\, s_i\, s_j + \sum_i \theta_i\, s_i$$

This value is called the "energy" because the definition ensures that if units are randomly chosen to update their activations the network will converge to states which are local minima in the energy function (which is considered to be a Lyapunov function). Thus, if a state is a local minimum in the energy function it is a stable state for the network. Note that this energy function belongs to a general class of models in physics, under the name of Ising models; these in turn are a special case of Markov networks, since the associated probability measure, the Gibbs measure, has the Markov property.
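The energy and the threshold rule make for a very short sketch (the function names and the ±1 convention are assumed):

```python
import numpy as np

def hopfield_energy(w, theta, s):
    """E = -1/2 * sum_ij w_ij s_i s_j + sum_i theta_i s_i, for s in {-1, +1}."""
    return -0.5 * s @ w @ s + theta @ s

def update_unit(w, theta, s, i):
    """Binary threshold rule for unit i; with symmetric w and a zero diagonal,
    this step can never increase the energy."""
    s[i] = 1 if w[i] @ s > theta[i] else -1
    return s
```

With symmetric weights and a zero diagonal, each such update can only keep or lower $E$, which is why $E$ serves as a Lyapunov function for the dynamics.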

Running

At each step, pick a node at random. The node's behavior is then deterministic: it moves to a state to minimize the energy of itself and its neighbors. (In contrast, the Boltzmann machine has a stochastic update rule.)

Training

Training a Hopfield net involves lowering the energy of states that the net should "remember". This allows the net to serve as a content-addressable memory system, that is to say, the network will converge to a "remembered" state if it is given only part of the state. The net can be used to recover from a distorted input the trained state that is most similar to that input. This is called associative memory because it recovers memories on the basis of similarity. For example, if we train a Hopfield net with five units so that the state (1, 0, 1, 0, 1) is an energy minimum, and we give the network the state (1, 0, 0, 0, 1), it will converge to (1, 0, 1, 0, 1). Thus, the network is properly trained when the states which the network should remember are local minima of the energy function.
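The five-unit example from this paragraph, sketched with the states mapped to the ±1 convention ($s \mapsto 2s - 1$) and thresholds assumed to be zero: one pattern is stored by the Hebbian outer-product rule, then recalled from the corrupted cue:

```python
import numpy as np

pattern = np.array([1, -1, 1, -1, 1])     # (1, 0, 1, 0, 1) mapped to +/-1
w = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(w, 0.0)                  # no self-connections

s = np.array([1, -1, -1, -1, 1])          # the cue (1, 0, 0, 0, 1): one unit wrong
rng = np.random.default_rng(0)
for _ in range(50):                       # asynchronous random updates
    i = rng.integers(5)
    s[i] = 1 if w[i] @ s > 0 else -1      # thresholds assumed to be zero

print(s)  # converges back to the stored pattern [1, -1, 1, -1, 1]
```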


One of the earliest recurrent neural networks reported in the literature was the auto-associator independently described by Anderson (Anderson, 1977) and Kohonen (Kohonen, 1977) in 1977. It consists of a pool of neurons with connections between each pair of units $i$ and $j$, $i \ne j$.

Figure: The auto-associator network. All neurons are both input and output neurons, i.e., a pattern is clamped, the network iterates to a stable state, and the output of the network consists of the new activation values of the neurons.

All connections are weighted. In 1982, Hopfield (Hopfield, 1982) brought together several earlier ideas concerning these networks and presented a complete mathematical analysis based on Ising spin models (Amit, Gutfreund, & Sompolinsky, 1986). It is for this reason that this network, which we will describe in this chapter, is generally referred to as the Hopfield network.

The Hopfield network consists of a set of $N$ interconnected neurons which update their activation values asynchronously and independently of other neurons. All neurons are both input and output neurons. The activation values are binary. Originally, Hopfield chose activation values of 1 and 0, but using values +1 and -1 presents some advantages discussed below. We will therefore adhere to the latter convention.

The state of the system is given by the activation values $y = (y_k)$. The net input $s_k(t+1)$ of a neuron $k$ at cycle $t+1$ is a weighted sum

$$s_k(t+1) = \sum_{j \ne k} y_j(t)\, w_{jk} + \theta_k$$

A simple threshold function is applied to the net input to obtain the new activation value $y_k(t+1)$ at time $t+1$:

$$y_k(t+1) = \operatorname{sgn}\big(s_k(t+1)\big)$$

Hopfield network as associative memory

A primary application of the Hopfield network is an associative memory. In this case, the weights of the connections between the neurons have to be set such that the states of the system corresponding with the patterns which are to be stored in the network are stable. These states can be seen as 'dips' in energy space. When the network is cued with a noisy or incomplete test pattern, it will restore the incorrect or missing data by iterating to a stable state which is in some sense 'near' to the cued pattern.

The weights are typically set via the Hebb rule:

$$w_{jk} = \begin{cases} \sum_p x_j^p\, x_k^p & \text{if } j \ne k, \\ 0 & \text{otherwise,} \end{cases}$$

i.e., if $x_j^p$ and $x_k^p$ are equal, $w_{jk}$ is increased, otherwise decreased by one (note that, in the original Hebb rule, weights only increase). It appears, however, that the network gets saturated very quickly, and that about $0.15N$ memories can be stored before recall errors become severe. There are two problems associated with storing too many patterns:

1. the stored patterns become unstable;

2. spurious stable states appear (i.e., stable states which do not correspond with stored patterns).

The first of these two problems can be solved by an algorithm proposed by Bruce et al. (Bruce, Canning, Forrest, Gardner, & Wallace, 1986):

Algorithm 1. Given a starting weight matrix $W = [w_{jk}]$, for each pattern $x^p$ to be stored and each element $x_k^p$ in $x^p$ define a correction $\varepsilon_k$ such that

$$\varepsilon_k = \begin{cases} 0 & \text{if } y_k \text{ is stable and } x^p \text{ is clamped,} \\ 1 & \text{otherwise.} \end{cases}$$

Now modify $w_{jk}$ by $\Delta w_{jk} = y_j y_k (\varepsilon_j + \varepsilon_k)$ if $j \ne k$. Repeat this procedure until all patterns are stable.

It appears that, in practice, this algorithm usually converges. There exist cases, however, where the algorithm remains oscillatory (try to find one)!
