Notes 20110324 CIS 6050 Neural Networks
From SnOwy - Ed's Wiki Notebook
Continuing on Radial Basis Functions (RBFs)
- trying to turn a non-linear problem into a linear problem so we can solve it more easily
Architecture
- let xi be each of the input values
- a weight layer wij connects each input xi to each basis function φj
- these are not really weights -- the layer is only a copy operation, so that each basis function φj sees each input value
- each basis function φj therefore receives the whole input vector
- additionally, a bias feeds into the output layer too
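A minimal sketch of the forward pass through the architecture described above. The notes do not fix the form of φj, so Gaussian basis functions with one shared width `sigma` are an assumption here, and the names `gaussian_basis` and `rbf_forward` are hypothetical.

```python
import math

def gaussian_basis(x, mu, sigma):
    """Gaussian basis function: exp(-||x - mu||^2 / (2 sigma^2)). Assumed form."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def rbf_forward(x, centers, sigma, weights):
    """Compute the outputs y_k for one input vector x.

    centers: list of centre vectors mu_j, one per hidden unit
    weights: weights[k] = [w_k0, w_k1, ..., w_kM], where w_k0 is the bias weight
    """
    # Copy layer: every hidden unit sees the whole input vector.
    # phi[0] = 1 plays the role of the bias feeding the output layer.
    phi = [1.0] + [gaussian_basis(x, mu, sigma) for mu in centers]
    # Output layer: a plain weighted sum, so it is linear in the weights.
    return [sum(w * p for w, p in zip(w_k, phi)) for w_k in weights]

# Two hidden units, one output unit (all values chosen only for illustration).
centers = [[0.0, 0.0], [1.0, 1.0]]
weights = [[0.5, 1.0, -1.0]]
y = rbf_forward([0.0, 0.0], centers, sigma=1.0, weights=weights)
```

Only the hidden layer is non-linear; the outputs depend linearly on `weights`, which is what makes the second training step cheap.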
Training
- step one: determine the parameters of the basis functions (unsupervised -- uses only the input data)
- step two: once the basis functions are set, the weight layer wkj is determined -- uses both input and output data
Step Two
- the second step is the easier, cheaper, faster step
- involves solving a set of linear equations
- the error at the output is normally a sum of squares ...
- $E = \frac{1}{2} \sum_n \sum_k \{ y_k(x^n) - t_k^n \}^2$
- for all input patterns ...
- sum the output values for each output node ...
- achieved output for node k given xn
- subtract desired output for node k given input n
- xn -- input pattern
- tn -- desired output value
- (input-output training pair)
- the error function is a quadratic function of the weights
- its minimum can be found as the solution of a set of linear equations (a single matrix equation)
- the formal solution for the weights is $W^T = \Phi^\dagger T$
- $W^T$ is the transpose of the matrix of weights wkj
- $\Phi^\dagger$ is the pseudoinverse of Φ, the matrix of basis function results ...
- φj(xn)
- T is a matrix of desired outcomes ...
- tkn for given inputs xn (the target outputs, not the achieved network outputs)
- normally solved with singular value decomposition (SVD)
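A rough sketch of step two on a toy problem, to make the linear-equations point concrete. To stay within the standard library it solves the normal equations (ΦᵀΦ)Wᵀ = ΦᵀT by Gaussian elimination instead of using SVD; that is a swapped-in technique, and it can be poorly conditioned where SVD is not. The Gaussian basis functions, the width `sigma`, the hand-picked centres, and all function names are assumptions for illustration.

```python
import math

def design_matrix(inputs, centers, sigma):
    """Row n holds [1, phi_1(x^n), ..., phi_M(x^n)]; the leading 1 is the bias."""
    rows = []
    for x in inputs:
        row = [1.0]
        for mu in centers:
            sq = sum((a - b) ** 2 for a, b in zip(x, mu))
            row.append(math.exp(-sq / (2.0 * sigma ** 2)))  # assumed Gaussian form
        rows.append(row)
    return rows

def solve_linear(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting.
    b may have several columns (one per output unit)."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [[v / aug[i][i] for v in aug[i][n:]] for i in range(n)]

def fit_output_weights(phi, targets):
    """Least-squares weights from (Phi^T Phi) W^T = Phi^T T; returns W^T."""
    m, k = len(phi[0]), len(targets[0])
    gram = [[sum(row[i] * row[j] for row in phi) for j in range(m)]
            for i in range(m)]
    rhs = [[sum(row[i] * t[c] for row, t in zip(phi, targets)) for c in range(k)]
           for i in range(m)]
    return solve_linear(gram, rhs)

def sum_of_squares_error(phi, w_t, targets):
    """E = 1/2 * sum over patterns n and outputs k of (y_k(x^n) - t_k^n)^2."""
    err = 0.0
    for row, t in zip(phi, targets):
        for k in range(len(t)):
            y_k = sum(row[j] * w_t[j][k] for j in range(len(row)))
            err += (y_k - t[k]) ** 2
    return 0.5 * err

# XOR: not linearly separable in the inputs, but it is after the hidden layer.
inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [[0.0], [1.0], [1.0], [0.0]]
centers = [[0.0, 0.0], [1.0, 1.0]]  # assumed to come from step one
phi = design_matrix(inputs, centers, sigma=math.sqrt(0.5))
w_t = fit_output_weights(phi, targets)
err = sum_of_squares_error(phi, w_t, targets)
```

On this toy problem three weights fit the four patterns exactly, so `err` is zero up to rounding. No iterative training is involved: the weights come from a single linear solve.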
Creating the Basis Functions and Optimizations thereof
how do we make Φ?
- training the hidden layer
- its operation can be described from several different theoretical viewpoints ...
- regularization theory
- noisy interpolation theory
- kernel regression
- function approximation
- estimation of posterior class probabilities
- each of the above suggests the same thing --
- the basis function parameters should be chosen to represent the probability density of the input data
- as a result, the training procedure for the hidden layer is an unsupervised optimization of the basis function parameters
- the basis function centers μj can be regarded as prototypes for input vectors
- benefit: in general, training heteroassociative networks requires a lot of labelled data ...
- with the RBF network, however, we can use large amounts of unlabelled data to train the first layer
- and get away with relatively small amounts of labelled data to train the second layer
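One way to picture the unsupervised first step is to cluster the unlabelled inputs and use the cluster means as the centres μj. The notes do not name a specific algorithm, so k-means is an assumption here (it is one common choice), as are the function names and the deterministic seeding with the first k points.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_centers(data, k, iterations=20):
    """Pick k basis-function centres from unlabelled input vectors."""
    centers = [list(p) for p in data[:k]]  # assumption: seed with the first k points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:  # an empty cluster leaves its centre where it was
                dim = len(members[0])
                centers[j] = [sum(p[d] for p in members) / len(members)
                              for d in range(dim)]
    return centers

# Six unlabelled points forming two obvious groups (made-up data).
data = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
centers = kmeans_centers(data, k=2)
```

No target values appear anywhere in this step; the centres act as prototypes for the input vectors, and the labels are only needed later for the output weights.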
Problems
- to train the RBF network, we start with an input layer of some size
- this connects to a hidden layer that is much larger by contrast
- this size difference allows us to spread the data out so that it is more likely to be linearly separable
- if the hidden layer grows too large, then clusters don't emerge; instead, each input gets its own unit
- this is a problem (pathological case)
- this occurs if the basis functions (the hidden nodes) are required to fill the problem space
- the number of basis functions increases exponentially with the dimensions of the problem
- this requires long training times, a large number of training patterns, and many hidden units
- this is particularly costly when some input variables have a high variance
- but have little effect on determining the output
- input variables with this property are not uncommon
- when basis functions are selected using only input data, there is no way to identify whether such variables are relevant to the output
- contrast with back propagation, which smooths out exemplars
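A small illustration of the exponential growth mentioned above. It assumes the problem space is filled with a regular grid of centres, with the same number of basis functions along every input dimension; the function name and the choice of 10 per dimension are hypothetical.

```python
def grid_basis_count(units_per_dimension, dimensions):
    """Number of basis functions needed to tile the input space with a regular grid."""
    return units_per_dimension ** dimensions

# 10 centres along each axis, for a growing number of input dimensions.
counts = {d: grid_basis_count(10, d) for d in (1, 2, 3, 6, 10)}
for d, n in counts.items():
    print(f"{d:2d} input dimensions -> {n} basis functions")
```

Each extra input dimension multiplies the count by the same factor, whether or not that input has any influence on the output.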