“You can talk for years about the theory of tying your shoes and still not be able to do it.”
-Terry Meeks, 1984
In their paper “Nonlinear signal processing using neural networks: Prediction and system modeling”, Alan Lapedes and Robert Farber explored training neural networks to predict the behavior of various non-linear data sets. As I continue to explore deep learning, I thought it would be good for me to review some of their work. Lapedes and Farber’s early work on training neural networks still applies to the disciplines of data science and deep learning. In the process of reviewing their work, I also hope to repeat some of it using modern tools, including brushing up on my Python skills, which I haven’t used for a few years. I am also looking forward to working with Google’s TensorFlow and the Keras Python framework.
Something to Predict:
Before we start to dig into neural networks, let’s set the stage a bit with an overview of what we’re trying to accomplish. The goal of Lapedes and Farber’s work is to develop a mechanism that can predict the behavior of a nonlinear system. The first nonlinear system explored by Lapedes and Farber in their paper is a sequence of numbers called “The Feigenbaum Map”. The Feigenbaum Map is familiar to many people who have explored chaotic and/or dynamical systems, and is often shown with its famous bifurcation diagram. The map itself is given by the following formula:
Feigenbaum Map: x(t+1) = r * x(t) * ( 1 – x(t) )
This fun little formula has many interesting behaviors. The key point for our exploration is that with r fixed at 4, the Feigenbaum map displays some very non-linear, seemingly chaotic behavior. Consider the sequence of points generated by the map with r = 4 over time.
It may be obvious at this point that the sequence of numbers generated by the map seems fairly random. To compute the sequence, you pick a number between 0 and 1. (I picked 0.23 above.) The next value in the sequence then becomes four times that number, multiplied by the value of 1 minus the number, or x(t+1) = 4 * x(t) * ( 1 – x(t) ) .
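The recurrence is easy to sketch in a few lines of Python (the function name is my own; the starting value 0.23 matches the one used above):

```python
def feigenbaum(x0, n):
    """Return n successive values of the map, starting from x0,
    using the r = 4 recurrence x(t+1) = 4 * x(t) * (1 - x(t))."""
    values = [x0]
    for _ in range(n - 1):
        x = values[-1]
        values.append(4.0 * x * (1.0 - x))
    return values

# Starting from 0.23, the next values are 0.7084, 0.82627776, 0.5741712933, ...
sequence = feigenbaum(0.23, 12)
```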
Now, imagine that we don’t know the formula, and all we know are the data points. Getting from the data points back to the formula is very difficult. There are a number of ways to do it, including polynomial regression. However, even polynomial regression requires that you somehow suspect that the underlying system involves a polynomial of some degree. When you have a sequence of data that is fairly evenly distributed, that can be very difficult. Lapedes and Farber were able to show that a neural network could be trained to very accurately predict the behavior of the sequence. Let’s take a look at how they did it.
The network that Lapedes and Farber used to “learn” the Feigenbaum Map looked like this:
The above diagram shows the connections between the nodes. It does not show, however, the weights associated with each connection, the activation functions in the nodes, or the biases associated with each activation function. One interesting aspect of this simple network that Lapedes and Farber used is the “short circuit” between the input and output nodes. Lapedes and Farber simply state in their paper that they “…chose a network architecture with 5 hidden units as illustrated.” They don’t go into detail as to why they chose to include the connection directly from the input to the output node. We will return to this observation later. For now, however, we will just go with it.
The process for working with neural networks typically is as follows:
- Obtain a set of data that includes known inputs and outputs. This is sometimes referred to as “The Training Set”.
- Construct a neural network with a number of input and output nodes that matches the training set. (For our example, we use one input node and one output node, because our data has a single expected output based on a single input value.)
- Pass in the training set data and make adjustments to the network so that the network output matches the training set output to within some limit, or continue training for a set period of time.
- Once the network has been trained, you feed it new data for which you don’t know the expected values. The network then makes predictions based on its training.
This is just one simple way to use neural networks. More common uses also include using neural networks to categorize and recognize patterns. We will visit those topics in future blog posts.
Training and Back Propagation of Error:
So, what was that step 3 again? Let’s consider a very simple neural network:
This network has the following parts:
- Input (I): This is the input value for the network from our training set.
- Weight (w): This is the weight that we multiply the input by.
- Bias (b): The bias is used to tune the activation function.
- Activation Function (f): The activation function takes the input, weight, and bias as its inputs. For our purposes, we can use the sigmoid function: f(x) = 1 / ( 1 + e^(-x) )
- Output (O): The output of the network.
- Target (T): This is what we WANT the network to return. It is our training value.
- Error (E): This is how far off the network output was from our desired target.
We can think of the output of the network as follows:
O = f( (I*w) – b ) (formula 1)
And we can think of the error of a single pattern as:
E = ( T – O )^2 (formula 2)
Another way to think of this is just trying to minimize the distance between O and T. We square the difference because we want our Error to be positive, and we will try and get the positive value as close to zero as possible.
So, substituting our expression for O into the error, we have the following:
E = ( T – f( (I*w) – b ) )^2 (formula 3)
Now, the cool part. If our function, f, is a continuous, differentiable function, we can use a technique called gradient descent to minimize the error. To do that, we simply take the partial derivatives of the above equation with respect to w and b. Using the chain rule, the partial derivative with respect to the weight, w, looks like this:
∂E/∂w = -2 * ( T – O ) * f'( (I*w) – b ) * I
Similarly, the partial derivative of the error function with respect to the bias looks like this:
∂E/∂b = 2 * ( T – O ) * f'( (I*w) – b )
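To make the math concrete, here is a minimal sketch in plain Python (no frameworks; the starting values, learning rate, and step count are arbitrary choices of mine) that trains the single-node network on one input/target pair using exactly these two partial derivatives:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_single_node(i, t, w=0.5, b=0.5, lr=0.5, steps=2000):
    """Gradient descent on E = (T - f(i*w - b))^2 for a single input/target pair."""
    for _ in range(steps):
        o = sigmoid(i * w - b)
        fprime = o * (1.0 - o)                 # derivative of the sigmoid at this point
        dE_dw = -2.0 * (t - o) * fprime * i    # dE/dw from the chain rule
        dE_db = 2.0 * (t - o) * fprime         # dE/db (note the sign flip)
        w -= lr * dE_dw                        # step against the gradient
        b -= lr * dE_db
    return w, b

w, b = train_single_node(i=0.23, t=0.7084)
output = sigmoid(0.23 * w - b)  # should land very close to the target 0.7084
```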
In more complex neural networks, we can generalize the equations over the various nodes. And, we can walk our way back up the network as we compute the various partial derivatives. In this way, we are back-propagating the partial derivatives of the error function with respect to the weights and biases of the network.
In the case of the simple one-node network in figure 1, we can actually visualize what we are doing as follows. Formula 3 is a function of two variables, w and b. The values of I and T are the input and target values provided by our training set. We are trying to pick values of w and b such that E is as small as possible. If we consider the function of w and b and the output E as a 3-dimensional graph, what we are trying to do is find the lowest point on the surface. (In this diagram, the Z values represent the output of our function, E. The X and Y axes correspond to our input values of w and b.)
Our function is not as smooth as above, and it changes with every input-output pair. That said, the graph does give an intuitive feel for what we are trying to do.
Keep in mind that we are trying to find the minimum of the above graph. Because our error function is continuous and differentiable, we can use gradient descent to find which direction is the steepest “down”. The partial derivatives that we calculated in the above equations point in the direction of steepest ascent; stepping in the opposite direction moves us toward the minimum.
You can think of it like placing a marble on the graph, and seeing which way it would roll down to the lowest point. In the above graph, the marble would roll right to the bottom. This is a trivial case. However, what if our error graph looked like this:
In this case, we might literally get stuck in one of the wrinkles of the surface and not be at the global minimum. We will incorporate the idea of using a “momentum” that will help us “roll out” of a local minimum until we settle at (what we hope) is the actual true minimum of the function. The other technique that we can use is to pick up the marble and place it at a different spot on the graph. We would do this by simply randomizing the weights and biases. I like to think of that technique like pounding on the side of a pinball machine.
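The momentum idea can be sketched as follows. This is an illustrative toy of my own (the test function, learning rate, and momentum coefficient are made-up choices), not Lapedes and Farber’s implementation:

```python
def minimize_with_momentum(grad, w, lr=0.1, mu=0.9, steps=200):
    """Gradient descent with momentum: the velocity term remembers past
    gradients, like a marble that keeps some of its speed, which can carry
    it through shallow wrinkles in the error surface."""
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(w)  # accumulate velocity from the gradient
        w = w + v                  # move by the velocity, not the raw gradient
    return w

# Minimize a simple bowl, f(w) = (w - 2)^2, whose gradient is 2*(w - 2).
w_min = minimize_with_momentum(lambda w: 2.0 * (w - 2.0), w=0.0)
# w_min settles near the true minimum at w = 2
```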
OK. Enough of this trivial case. Let’s go back to Lapedes and Farber’s first neural network. It has considerably more weights and biases than our single node network. Lapedes and Farber describe in their paper how to calculate all of the partial derivatives with respect to weights and biases. That was then, this is now.
TensorFlow is a relatively new framework that takes care of a lot of the coding around the partial derivatives and the gradient descent algorithm required to train neural networks. It offers many different activation functions and a number of different optimization algorithms in addition to gradient descent. Another useful framework written on top of TensorFlow is Keras. The Keras framework makes constructing and training neural networks even easier.
When I did my neural network research in graduate school, I wrote everything in C. Back then, we didn’t have many of today’s languages and platforms like Java or .NET. C was the “new hotness.”
Much of the programming work done today in the field of “Data Science” is done in Python. There are several Python packages, including “numpy”, that make mathematical and scientific programming easier. My original neural network code was several pages long. The code for constructing and training a network similar to Lapedes and Farber’s is much shorter today thanks to TensorFlow and Keras. Let’s take a look:
# *****************************************
# A simple neural net using
# Keras and TensorFlow for
# training a neural network to learn
# the Feigenbaum Map
#
# June 1, 2017
# Miles R. Porter, Painted Harmony Group
# This code is free to use and distribute
# *****************************************
import numpy
import pandas
from keras.models import Model
from keras.layers import Dense, Input
from matplotlib import pyplot

# load dataset
dataframe = pandas.read_csv("logistics.csv", delim_whitespace=True, header=None)
dataset = dataframe.values

# split into input (X) and output (Y) variables
X = dataset[:, 0]
Y = dataset[:, 1]

# Set up the network
def baseline_model():
    inputs = Input(shape=(1,))
    x = Dense(30, activation='sigmoid')(inputs)
    predictions = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=inputs, outputs=predictions)
    model.summary()
    model.compile(optimizer='rmsprop', loss='mean_squared_error', metrics=['accuracy'])
    return model

# Do the work
model = baseline_model()
results = model.fit(X, Y, epochs=1000, batch_size=2, verbose=2)
print("Training is complete.\n")

# Make predictions
print("The prediction is:\n")
newdata = numpy.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
prediction = model.predict(newdata, batch_size=1, verbose=1)
print(prediction)

# Plot the predicted and expected results
expected = newdata * 4.0 * (1.0 - newdata)
pyplot.plot(prediction, color="blue")
pyplot.plot(expected, color="green")
pyplot.show()
The above program references a logistics.csv file. That file is just a two-column file containing a sequence of numbers that starts with 0.23 and follows the Feigenbaum Map like so…
0.23 0.7084
0.7084 0.82627776
0.82627776 0.5741712933
0.5741712933 0.977994477
0.977994477 0.08608511987
0.08608511987 0.314697888
...
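Assuming that format (each row holds x(t) and x(t+1), whitespace-separated), a file like this can be generated with a few lines of Python; the row count is my own choice:

```python
# Write pairs (x(t), x(t+1)) of the Feigenbaum map to a whitespace-delimited
# file, matching the format read by pandas.read_csv(..., delim_whitespace=True).
x = 0.23
with open("logistics.csv", "w") as f:
    for _ in range(100):
        x_next = 4.0 * x * (1.0 - x)
        f.write("%.10f %.10f\n" % (x, x_next))
        x = x_next
```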
After we have trained our network on the sample data, we can run some new values through it. Because we are training to a known function, we can compute just how accurate the predictions are:
If we play around with the network parameters, like the number of nodes, and the number of passes through the training set (epochs), we can improve the accuracy:
In fact, if we play around with the number of layers in the network, we can start to get some pretty amazing accuracy…
Is this perfect? No. But often these types of predictions don’t need to be. The important thing here is that we are able to train a neural network to estimate the values of a seemingly random sequence. In a sense, the network is able to find hidden meaning in what would appear at first to be random noise. Again, it is true that you could use techniques like polynomial regression to discover or closely model the sequence. After all, the underlying polynomial expression is simply f(x) = 4x(1-x). Keep in mind, however, that our neural network doesn’t know that the underlying polynomial is a second-degree polynomial… or even that there is an underlying polynomial.
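To see the comparison, here is a quick sketch using numpy’s polyfit. Note that it only recovers the map because we tell it up front that the degree is 2, which is exactly the knowledge the neural network did not need:

```python
import numpy

# Sample (x, 4x(1-x)) pairs from the true map, which is -4x^2 + 4x + 0.
x = numpy.linspace(0.05, 0.95, 50)
y = 4.0 * x * (1.0 - x)

# Fit a degree-2 polynomial; coeffs come back highest degree first.
coeffs = numpy.polyfit(x, y, 2)
# coeffs is approximately [-4.0, 4.0, 0.0]
```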
A valid argument against this approach is that the training algorithm could get stuck in a local minimum and never actually converge. In that respect, you may NOT be able to train a neural network to learn a particular non-linear system, especially if you make bad choices in the parameters that define the network. That said, there is still significant value in neural networks. While they may not be able to predict every sequence, or find logic where there is none, they CAN find logic in systems that seem random. An algorithm doesn’t need to be perfect in order to provide value. If it is a cloudy day, don’t you bring an umbrella, even though it may not rain?
A number of mathematical theorems address the convergence of neural network training, including “A Generalized Convergence Theorem for Neural Networks” by Jehoshua Bruck and Joseph W. Goodman.
A Little Black Magic Never Hurts:
In looking at the networks we trained above, you may notice a couple of things. First, Lapedes and Farber’s network had an extra connection from the input node to the output node. Second, Lapedes and Farber used a neural network with a single hidden layer of 5 nodes. Why 5, and not 3 or maybe 7? The networks that I have trained above have either 10 or 30 hidden nodes and no direct connections between the input and output layers. There is some black magic involved in knowing how to construct networks this way. As we have also seen above, it is possible to add multiple hidden layers to neural networks for various purposes. As I continue to explore the topic of deep learning, I plan to blog about how to use various network topologies to learn different training sets. It is also possible to vary the structure of the network as part of training. I hope to touch on that as well.
Further Work and Looking Ahead:
The problem of “learning” the Feigenbaum map is just the beginning of what neural networks can do. Lapedes and Farber go further in their paper, exploring training neural nets on the Mackey-Glass equation and modeling adaptive control systems (Lapedes and Farber, pp. 19–20).
Also, as I have mentioned before, much of the new work involving neural networks involves classification of data. This includes not only simple time series data (for example, classifying ECG signals in an attempt to uncover heart arrhythmias) but also scenarios like classifying sounds and audio signals, photographs, movie clips, etc. In other cases, we might be interested in discovering when observed data diverges from a trained network’s prediction. For instance, suppose you trained a neural network to estimate the oil pressure of an engine based on the engine RPM. It would then be possible to continuously estimate the oil pressure from the RPM, and raise an alert when the actual pressure significantly diverged from the network’s prediction.
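That monitoring idea might be sketched like this. Everything here is hypothetical: predict_pressure stands in for a trained network, and the readings, units, and threshold are made up for illustration:

```python
def pressure_alerts(rpm_readings, actual_pressures, predict_pressure, threshold=5.0):
    """Flag readings where the measured pressure diverges from the model's
    prediction by more than the threshold."""
    alerts = []
    for rpm, actual in zip(rpm_readings, actual_pressures):
        predicted = predict_pressure(rpm)
        if abs(actual - predicted) > threshold:
            alerts.append((rpm, actual, predicted))
    return alerts

# Stand-in linear "model" predicting 20, 40, and 60 for the three readings;
# only the 3000 RPM reading (20.0 measured vs 60.0 predicted) is flagged.
alerts = pressure_alerts([1000, 2000, 3000], [21.0, 41.0, 20.0],
                         predict_pressure=lambda rpm: rpm * 0.02)
```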
A good overview of deep learning in general is “Deep Learning” from the MIT Technology Review.
Another great resource is the Udacity.com course on deep learning.
And the book Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
I hope you have enjoyed this post, and will check back for more interesting topics in software engineering, IoT, and deep learning.
I was going to start this blog post with a song reference, but I just found out that one of my teachers in high school recently died. Terry Meeks taught humanities, was a writer, a musician and a seeker of truth and beauty. Though I haven’t seen her in years, the memories of her in front of the class teaching us how to write, to love music, and to appreciate art and literature are still vivid. She was one of a kind.
I remember doing a presentation in her class that involved bouncing laser beams off speaker cones to create a sort of spirograph laser show. I remember her being as interested in the trigonometry of sine and cosine waves as she was in enjoying the Bach Brandenburg Concerto… which she was probably hearing for the thousandth time. Mrs. Meeks was always open to attempts to bridge the gap between art and science.
Mrs. Meeks (Terry, as she instructed us to call her when we graduated from high school) had a big impact on me. From her, I learned a little about how to write. I also learned to love the search for beauty and truth.