Notes 20110212 CIS 6050 Neural Networks
From SnOwy - Ed's Wiki Notebook
Instructor: Dr. Calvert
Contents |
Self-Organizing Maps (SOM)
- competitive learning --
- output neurons compete to determine which one(s) remain active
- usually the one(s) with the largest activation
- can be one or more neurons which remain active
- winner-take-all
- two layers (inputs, clusters)
- hidden layer (weights) a one or two dimensional array (lattice)
- input patterns activate hidden nodes (clusters)
- creates a topographic map
- clusters which are near each other in the map are similar to each other
- neighbouring nodes are similar to the active node
- distant nodes are not as similar to active node
- this clustering method pushes the clusters close together
- whereas most other clustering methods do not rearrange the clusters
- SOM can be viewed non-linear generalization of principle component analysis
Processing Model
- given input pattern
- calculate hidden node activation -- inputs are filtered through weights
- hidden node with largest activation is the winner
- nodes near winning neuron are identified
- topological neighbourhood is identified
- requires a distance from the winner to describe size of neighbourhood
- adjust weight values of winner and neighhbours
- makes it more likely that they will respond to similar input patterns
- creates regions in the hidden layer based on similarity of clusters
- by adjusting a neighbourhood of weights, createes more than a single cluster
- clusters can intersect
SOM Processing
- see Kohonen's website
- competition -- input filters through weights
- input pattern
- x = [x1, x2, x3 ... xm]T
- input length is m
- weights wj = [wj1, wj2, ... wj3], j - 1 ... l
- initially wj random
- inputs are filtered through SOM layers
Competition
Thanks to Richard Schwarting for this section -- I was absent for it!
/o o o o o o
/1 1 o o o o hidden layer, 2D (*** could actually be 3D)
/2 1 o o o o
-----------
^
| weights, completely connected
|
[o o .. o] input layer
1 2 m
[o o .. o] hidden layer, 1D
1 2 l
^
| weights, completely connected
|
[o o .. o] input layer
1 2 m
activation is the hidden layer is calculated using the Inner Product
multiplication of vectors; weights transposed with input vector
W[j]^T * x x: input So, we have [ w w w ] [ x x x ] => [ y ], want to maximise y
non-linear step: where we choose which function remains activated:
The node with the greatest activation is the centre of the topological neighbourhood
we want to maximise the inner product (activation)
(to make it more likely to happen in the future, to create a larger activation)
maximising the inner product is mathematically equivalent to _minimising_the_distance_ between X and W[j]
x = [ 0.5 0.1 0.7 ] -> want w[j] to approach x
*** input vector & 'weight to winning node' will become more simiar
consequently, the weights will model the input that they are patterning for
the resulting weights represent a _prototype_ for a cluster
the result of the activation is the node which *most closely* matches the input pattern
so similar-but-not-the-same patterns are clustered together :D
Dimensions
- for some reason, no one was clear about the dimensions, so I'll clarify it here.
- the number of nodes is independent from the number of inputs -- however, the number of weight values is equal to that of the inputs.
- for a 1D map: you will have this many weight values: |nodes|×|inputs|
- for a 2D map: you will have this many weight values: |rows|×|columns|×|inputs|
Cooperation
- the neighbourhood around the winning node is identified
- identified using hi,j = exp( - dij2 / 2σ2)
- j is one of a set of cooperating nodes (neighbourhood)
- i is the winning node
- dij is the distance between i, j
- σ parameter controls width of the neighbourhood
- controls how nodes near winning node are modified by weight changes
- effect of weight changes decrease with distance from winning node
- hi,j is the activation (?) -- NO
- h is actually a neighbourhood function -- we use it as a multiplicative mask later for how much to adapt each weight
- dj,i -- (?)
- hj,i symmetric about maximum point
- decreases with dj,i
- size of the neighbourhood shrinks with time
- this allows large areas to form early during training
- eventually becomes more defined
- controlled by adjusting σ during training
- popular decay function is σ(n) = σ0 exp( - n / Γ)
- where n = 0, 1, 2 ...
- n represents time
- σ0 initial width parameter
- Γ1 is a constant
- σ decreases as n increases
Weight Updating (Adaption)
- notice we've changed the subscription i and j here --
- here, we'll call j the winning node
- this is where we update the weights
- wj(n + 1) = wj(n) + α × hji(x)(n) × (x - wj(n))
- where wj(n + 1) is the new weight -- time + 1
- α is a learning rate parameter
- can change with n -- can decrease with time
- similar to back propagation learning, high α may cause system to oscillate
- hji(x) -- neighbourhood nodes near winner will change
- x - wj(n) -- difference between input pattern and weight vector
- to the winning node
- gradually making the new weights look like the input patterns
- this really actually pushes everything toward the input patterns
- the only real challange to this learning algorithm is calculating the neighbourhood
- with repeated presentations
- the weights tend to follow the distribution
- of the input vectors
Changing the Learning Rate with respect to Time
- to adjust learning rate α so that it decreases with time ...
- we use α(n) = α0 × exp (- n / Γ2)
- n = 0, 1, 2 ...
- where α0 -- initial learning rate
- Γ2 -- is a constant
- exponential decays might not be optimal -- but are sufficient
Γ
- probably between 0, 1 -- look it up.