From SnOwy - Ed's Wiki Notebook
Back Propagation (BP) Advantages
- able to store many patterns
- contrast: Hopfield -- can store few patterns
- can store arbitrarily complex non-linear mappings
- will not store conflicting patterns
Limitations
- long training time
- contrast: Hopfield -- able to train quickly
- offline encoding -- training phase is different to testing phase
- stochastic training -- no systematic procedure guarantees that a given mapping will be learned
Learning Rule Improvements
- momentum -- the first and arguably best training innovation
- the most common method to accelerate convergence
- from signal processing
- add a momentum term to the weight change during training (ΔWij)
- ΔWij = α·bi·dj
- becomes ...
- ΔWij(n+1) = αbidj + γΔWij(n)
- where γ ∈ [0,1]
- where n is the index for presentation (time -- exemplars -- data points)
- γ is the momentum constant: the fraction of the previous weight change carried into the next one
- α -- learning rate → low [0.05, 0.1]
- γ -- momentum rates → [0.8, 0.9]
- in engineering terms, momentum acts as a filter
- filters out high-frequency variations in the error surface
- this is a low-pass filter -- rapid fluctuations in the weight change are smoothed out and the slowly varying trend is retained
- useful for weight spaces with long ravines with sharp curvature
- [figure: two paths down a ravine in weight space; left: no momentum -- right: high momentum]
- moving in the correct direction, but slowly, due to high-frequency oscillations
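The momentum rule above can be tried on a toy "ravine". This is a minimal sketch, not the notes' own experiment: the quadratic error surface, its curvatures (0.2 along the ravine, 36 across it), the start point and the step count are all assumptions chosen for illustration. The gradient step stands in for the α·bi·dj term, and `descend` is a hypothetical helper name.

```python
import math

def descend(alpha, gamma, steps=100, curvature=(0.2, 36.0), start=(1.0, 1.0)):
    """Minimise E(w) = 0.5 * sum(c_i * w_i**2) and return the final distance to the minimum.

    The update mirrors dW(n+1) = alpha * (downhill step) + gamma * dW(n).
    The surface and all default values are illustrative assumptions.
    """
    w = list(start)
    dw = [0.0, 0.0]  # previous weight change, zero before the first step
    for _ in range(steps):
        for i in range(2):
            grad = curvature[i] * w[i]                 # dE/dw_i for the quadratic surface
            dw[i] = -alpha * grad + gamma * dw[i]      # gradient step plus momentum term
            w[i] += dw[i]
    return math.hypot(w[0], w[1])

plain = descend(alpha=0.05, gamma=0.0)          # no momentum
with_momentum = descend(alpha=0.05, gamma=0.9)  # high momentum

print(f"distance to minimum, no momentum:   {plain:.4f}")
print(f"distance to minimum, with momentum: {with_momentum:.4f}")
```

With these assumed values the run without momentum bounces across the steep direction and creeps along the shallow one, while the run with γ = 0.9 ends much closer to the minimum after the same number of steps. Other surfaces or rates may behave differently.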
A Simple Example
- two-bit parity (XOR)
- input → output
- (0,0) → 0
- (0,1) → 1
- (1,0) → 1
- (1,1) → 0
- requires two inputs
- one output
- input -- two nodes
- hidden -- two nodes
- output -- one node
- with learning rate of α=0.25, it took 6587 presentations of the data to learn (~1500 epochs)
- where one epoch is a single complete presentation of the entire training set
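The 2-2-1 network and the distinction between presentations and epochs can be sketched as below. This is an illustrative implementation under assumptions the notes do not state: sigmoid units, bias weights, squared error, online (per-pattern) updates and uniform random initial weights. The function names are hypothetical. Convergence is not guaranteed for every seed, and the number of presentations needed will generally differ from the figure quoted above.

```python
import math
import random

XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_network(rng):
    # Two hidden units and one output unit; each has two weights plus a bias (assumed).
    hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
    output = [rng.uniform(-1, 1) for _ in range(3)]
    return hidden, output

def forward(net, x):
    hidden, output = net
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in hidden]
    y = sigmoid(output[0] * h[0] + output[1] * h[1] + output[2])
    return h, y

def train(net, alpha=0.25, epochs=1500):
    """Online back propagation; returns the number of pattern presentations."""
    hidden, output = net
    presentations = 0
    for _ in range(epochs):              # one epoch = all four patterns once
        for x, target in XOR_DATA:
            h, y = forward(net, x)
            # Error signals, computed before any weight is changed
            delta_out = (target - y) * y * (1.0 - y)
            delta_hid = [delta_out * output[j] * h[j] * (1.0 - h[j]) for j in range(2)]
            # Output-layer update
            for j in range(2):
                output[j] += alpha * delta_out * h[j]
            output[2] += alpha * delta_out
            # Hidden-layer update
            for j in range(2):
                hidden[j][0] += alpha * delta_hid[j] * x[0]
                hidden[j][1] += alpha * delta_hid[j] * x[1]
                hidden[j][2] += alpha * delta_hid[j]
            presentations += 1
    return presentations

net = init_network(random.Random(0))
print(train(net, alpha=0.25, epochs=1500))   # 6000 presentations = 1500 epochs * 4 patterns
for x, target in XOR_DATA:
    print(x, target, round(forward(net, x)[1], 3))
```

The final loop prints the network's output for each pattern so the result of a given run can be inspected. Whether the outputs are close to the targets depends on the seed and the number of epochs.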
A More Complex Example
- NETtalk
- converts English text (ASCII) to phonemes
- phonemes = sounds when spoken
- text to speech conversion
- pronunciation follows rules but many exceptions
Sequential Information
- problem: back propagation does not process sequential information
- -- it keeps no history (context) of past events
- context is needed for text conversion
- "a" in "cake" vs "a" in "cat"
- "i" and "th" in "with"
- problem solved by presenting each letter together with its six surrounding letters (three on each side) to the network at one time
- creates a context in the inputs
- sliding window in time
- input[0] = char[-3]
- input[1] = char[-2]
- input[2] = char[-1]
- input[3] = char[0]
- input[4] = char[1]
- input[5] = char[2]
- input[6] = char[3]
- input layer (7 windows) → hidden layer → output layer
- called windowing
- each of the seven character input windows is a 29 component binary vector
- input = 〈 26-letters, space, two punctuation chars 〉
- only one of the 29 inputs per window is set to one {1.0}
- indicates which letter is active in that window
- 7 windows * 29 characters = 203 binary inputs
- there were 26 outputs
- output = 〈 23-phonemes, three features for stress and syllable boundaries 〉
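The input encoding described above (seven windows, one active unit out of 29 per window) can be sketched as follows. The notes do not say which two punctuation characters were used, so "." and "," are placeholders, and `encode_window` is a hypothetical helper name.

```python
import string

# Assumed 29-symbol alphabet: 26 letters, space, and two punctuation marks.
# "." and "," are placeholders; the notes do not name the punctuation characters.
ALPHABET = string.ascii_lowercase + " " + ".,"
WINDOW = 7

def encode_window(window):
    """Turn a 7-character window into a 7 * 29 = 203 component binary vector."""
    if len(window) != WINDOW:
        raise ValueError(f"expected {WINDOW} characters, got {len(window)}")
    vector = []
    for ch in window.lower():
        slot = [0.0] * len(ALPHABET)      # one 29-component block per character
        slot[ALPHABET.index(ch)] = 1.0    # exactly one unit per block is set to 1.0
        vector.extend(slot)
    return vector

v = encode_window("the cat")
print(len(v))   # 203
print(sum(v))   # 7.0, one active unit in each of the seven blocks
```

A character outside the assumed alphabet raises `ValueError` here. A real system would need a policy for such characters, which the notes do not describe.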
Training
- involved moving the seven character window through the data set ...
"the cat ran"
"the cat"
"he cat "
"e cat r"
" cat ra"
"cat ran"
- the system was trained for 50 epochs on a 1024-word vocabulary
- it could then produce "accurate" speech on testing data
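Moving the seven-character window through the text can be sketched in a few lines. `sliding_windows` is a hypothetical helper, and the handling of text shorter than the window (no windows are produced) is an assumption, not something the notes specify.

```python
def sliding_windows(text, size=7):
    """Return every run of `size` consecutive characters, moving one character at a time."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]

for window in sliding_windows("the cat ran"):
    centre = window[len(window) // 2]   # the letter whose sound is being predicted
    print(repr(window), "centre:", repr(centre))
```

For "the cat ran" this yields five windows, from 'the cat' to 'cat ran'. The first and last three letters of the text never sit at the centre of a window, so a fuller version would need padding at both ends.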
Problems
- window may be too short for large words
- some words are spelled identically but pronounced differently
- bow
- tear
- house
- bass
- lead
- live
- address
- advise
- these words require a context derived from the surrounding text (other words)
- ... to identify how they should be pronounced
- ... and even that might not be sufficient
- -- I went to the store and bought a bass