The 4th Ave Train Spotter is a piece of technology I helped develop with the good people at Cohub, a software company based out of Nashville, TN. My dad, Steve, and brother, Elliott, both help run the company. Cohub’s office is located in the Wedgewood-Houston neighborhood of Nashville, where railroad tracks cross over two roads, 4th Avenue and Chestnut Street. The trains passing through can be quick, or they can take their time. It’s common for a train to sit parked at one of the intersections for 45+ minutes, not letting any cars by. My dad detailed this problem and gave a high-level overview of our solution in this Cohub blog post. I don’t wish to repeat what he’s already written. Instead, this post is going to dive into the train detection technology that I built for the app. You can find all the detection code (not including the iOS app and web stuff, which I didn’t develop) in this GitHub repo.
Framing the Problem
Here’s a screenshot of Google Maps showing the area around Cohub, with the two railroad crossings circled:
On the outside of Cohub’s office, we’ve mounted two AXIS Q1786-LE network cameras, one pointed at each intersection. Here’s the view looking at Chestnut when a train is present:
And here’s the view when the train’s gone:
I’ve drawn a box around the crossing gate. This is the gate that lowers when a train is on the way or passing through. We use the fact that it lowers ahead of the train coming to give our users advance warning of intersection blockage.
The story is similar for the Fourth Ave intersection. Here’s the view with a train present:
And here’s the view without a train:
I’ve once again drawn boxes around features of the crossing that give us a heads up that a train’s coming. In this case, it’s two sets of warning lights. These exist on the Chestnut side, too, but they’re harder to see from our angle.
If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.
One dumb way to solve our train problem would be to pay someone to look at the intersections 24 hours a day, and every time a train is coming, based on the crossing gate going down or warning lights blinking, update the app manually to let everyone know. When the train rolls in, update the app again to let everyone know the train’s arrived. It would only take a second for a person to recognize when a train is coming, but they’d have to be there, vigilant, at all times. This smells like a job for computer vision. If we could have a pair of “robot eyeballs” looking at the intersection instead, perhaps we could automate these train notifications. It would go something like this:
- Get an image of the intersection.
- Is there indication that a train is coming? If yes, send a notification. If not, do nothing.
- Is there a train? If yes, send a notification. If not, do nothing.
- Go back to the first step.
That’s the essence of what our solution does. We have four convolutional neural network (CNN) based models doing this task automatically: at each intersection, one model detects trains and another detects “signals” that a train is coming (i.e. crossing gate down, warning lights blinking). CNNs are the state of the art for image classification, and Keras makes it easy to build them, so that’s the route we took.
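The loop described above can be sketched in a few lines of Python. This is only an illustration, not the code from the repo: `get_image`, `signal_detected`, `train_detected`, and `notify` are hypothetical callables standing in for the camera, the two kinds of model, and the notification service.

```python
import time
from typing import Callable, Optional


def run_detection_loop(
    get_image: Callable[[], object],
    signal_detected: Callable[[object], bool],
    train_detected: Callable[[object], bool],
    notify: Callable[[str], None],
    interval_s: float = 1.0,
    max_iterations: Optional[int] = None,
) -> None:
    """Poll the camera and send a notification for each positive check."""
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        image = get_image()             # step 1: grab a frame
        if signal_detected(image):      # step 2: is a train on the way?
            notify("train coming")
        if train_detected(image):       # step 3: is a train present?
            notify("train present")
        iteration += 1
        time.sleep(interval_s)          # step 4: wait, then start over
```

Passing `max_iterations=None` keeps the loop running indefinitely, which is what a deployed "robot eyeball" would do; a finite value is handy for trying the loop out with canned frames.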
Deep neural networks tend to perform better and better the more data you give them for training. Fortunately for us, data is abundant. In addition to monitoring train activity live, we also use the cameras to harvest data for training. Using AXIS’s API, we can grab images from the cameras whenever we want. Getting negatives (i.e. no train) is easy, as usually there’s not a train. Getting positives isn’t too bad, either, as the cameras have built-in motion detection. You can configure them to make a recording whenever some movement happens in a specified area of the shot, so we use this feature to record train events. You might be thinking, “Couldn’t you just use that feature to solve your problem?” Sort of. It works for data collection, but it produces an intolerable number of false positives for production use. For example, a passing bus or garbage truck is likely to set off the motion detection. So, we use these cameras to get thousands of examples of shots with and without the train from each intersection, manually sifting out the false positives. These images form our dataset.
Model Design and Training
Our system uses four models, as noted earlier. Each is responsible for one task:
- Fourth Ave train detection.
- Fourth Ave signal detection (detects blinking warning lights).
- Chestnut St train detection.
- Chestnut St signal detection (detects lowered crossing gate).
Simplicity is a primary goal for our models. I don’t want to spend tons of time tuning hyperparameters and running training after training to get something that works, so we constrain the problem and our solution in two key ways:
- We always take the shots for training and inference at the same angle and zoom level.
- We prefer more models, each good at a highly specific task, rather than fewer models with greater generality.
These are not general purpose models. Take the train detection models. One is a “Fourth Avenue at this specific angle and zoom level train detector” and the other is a “Chestnut Street at this specific angle and zoom level train detector.” I belabor this point to make it clear that these models won’t work outside of the context we trained them in. We have trained them for highly specific tasks to make training simpler and to maximize accuracy. If we zoom out one of the cameras even a little bit, the predictions at runtime go haywire. We accept this fragility for the sake of accuracy. I’ll touch on this issue again later.
Train Detection Model Architecture
I’ll attempt to give the intuition behind the various choices made in designing this architecture, but it should be noted that designing deep neural networks involves a lot of experimentation. Neural networks, viewed as predictive models, typically exhibit low bias but are prone to high variance. They’re capable of modeling highly complex functions, but this comes at the cost of interpretability and the risk of overfitting. For the most part, my reason for doing one thing vs another is simply, “it was shown to increase accuracy empirically.” I won’t be able to provide analytical explanations for all design choices.
Below is a summary of the layers of the model. All convolutional layers use a kernel size of 5x5, zero padding (the input is padded with zeros so that a convolution leaves the height and width unchanged), and a stride of 1. All max pooling layers use a pool size of 2x2, no padding, and a stride of 2.
Note: The pixels are 3 channels, RGB, not grayscale. I have omitted the channel dimension in the input shape column below for brevity.
| Layer Type | Input Shape | # Filters / Units | Activation | Dropout Fraction |
|---|---|---|---|---|
| Max pool | (216, 384, 8) | — | — | — |
| Dropout | (108, 192, 8) | — | — | 0.2 |
| Conv | (108, 192, 8) | 16 | ReLU | — |
| Max pool | (108, 192, 16) | — | — | — |
| Dropout | (54, 96, 16) | — | — | 0.2 |
| Conv | (54, 96, 16) | 32 | ReLU | — |
| Max pool | (54, 96, 32) | — | — | — |
| Dropout | (27, 48, 32) | — | — | 0.2 |
| Conv | (27, 48, 32) | 64 | ReLU | — |
| Max pool | (27, 48, 64) | — | — | — |
| Dropout | (13, 24, 64) | — | — | 0.2 |
| Flatten | (13, 24, 64) | — | — | — |
- Conv layers: I chose 5x5 for the filter/kernel size because it boosted performance over the typical 3x3 configuration. As far as CNNs go, 384x216 is a fairly large input size. My intuition in trying 5x5 filters was that using a larger filter makes sense since we are dealing with larger images than usual. Increasing the number of filters as you get deeper in the network is also a typical pattern that I borrowed after seeing others do it. The intuition goes that the early layers learn simple features, while later ones learn more complex, abstract features. Adding filters in these later layers increases the network’s capacity to learn those complex features. Why powers of 2? I’m a computer engineer, after all: was there really another choice? ;) I used zero padding because it’s easier to see how the input shape of each layer changes if the only layers performing dimensionality reduction are the pooling layers. I chose ReLU because, well, that’s what everyone does. I’m standing on the shoulders of giants before me who figured out this was a good activation function.
- Max pooling layers: I experimented with different pool sizes and strides, but ultimately, I didn’t note much of a difference in performance between the experiments. I stuck with Keras’s default, 2x2 pools with stride of 2, for simplicity.
- Dropout layers: Overfitting is the single biggest obstacle to training these models. Dropout helps reduce overfitting. I experimented with other forms of regularization as well, including batch normalization and L2 regularization on the conv layers. Overall, dropout provided the most significant impact on reducing overfitting, while the impact of the other forms of regularization was negligible.
- Dense layers: The number of neurons in the penultimate dense layer was the product of experimentation. 64 provides good performance while keeping training times reasonable. The final dense layer has a single neuron with sigmoid activation since this model is making a binary classification.
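To see how the shapes in the table come about, here is a small shape calculator in plain Python. It is a sketch of the arithmetic only, not the model itself: it assumes each block is a stride-1 convolution with zero padding followed by a 2x2, stride-2 max pool, as described above, and the helper names are made up for this example.

```python
def conv_same(shape, filters):
    """Stride-1 conv with zero padding: height and width are unchanged."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool_2x2(shape):
    """2x2 max pool, stride 2, no padding: height and width halve, rounded down."""
    height, width, channels = shape
    return (height // 2, width // 2, channels)


def trace_shapes(input_shape, filters_per_block):
    """Return the shape after each conv + pool block (dropout keeps the shape)."""
    shapes = []
    shape = input_shape
    for filters in filters_per_block:
        shape = max_pool_2x2(conv_same(shape, filters))
        shapes.append(shape)
    return shapes


# 216x384 RGB input and the filter counts 8, 16, 32, 64 from the table.
train_shapes = trace_shapes((216, 384, 3), [8, 16, 32, 64])
for shape in train_shapes:
    print(shape)

height, width, channels = train_shapes[-1]
print("flattened length:", height * width * channels)
```

The rounding down in the pooling step is why 27 rows become 13 in the last block. The flattened length is the number of values handed to the dense layers.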
Signal Detection Model Architecture
Detecting the signals is an easier classification task than detecting the trains. As such, the model architecture isn’t as deep. The conv layers use 3x3 filters instead of 5x5. Again, the intuition is based on image size. The input images to the signal detection model are 170x180 for Chestnut St and 130x130 for Fourth Ave. Both sizes are significantly smaller than those of images going into the train detection model. Padding and stride for the conv layers are the same as the train detection model, as are the pool size and stride for the max pooling layers. The table below is for the Chestnut input size of 170x180 (width x height), which appears as (180, 170) because shapes are listed as (height, width, channels).
| Layer Type | Input Shape | # Filters / Units | Activation | Dropout Fraction |
|---|---|---|---|---|
| Max pool | (180, 170, 8) | — | — | — |
| Dropout | (90, 85, 8) | — | — | 0.2 |
| Conv | (90, 85, 8) | 16 | ReLU | — |
| Max pool | (90, 85, 16) | — | — | — |
| Dropout | (45, 42, 16) | — | — | 0.2 |
| Flatten | (45, 42, 16) | — | — | — |
I won’t go into the details of how I chose each parameter for this model because the reasoning is pretty much the same as for the train detection model.
Not all the images in the train and signal datasets are the size we want them to be before we feed them into their respective models. The first step of training is to resize them to conform to the input shape shown in each table above. Then, we rescale the pixel values to be between 0 and 1. Why? This is a common technique employed to speed up gradient descent and thus reduce training time.
Pixel values are often unsigned integers in the range between 0 and 255. Although these pixel values can be presented directly to neural network models in their raw format, this can result in challenges during modeling, such as in the slower than expected training of the model.
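As a minimal illustration of the rescaling step, the snippet below maps 8-bit pixel values to the range 0 to 1 using plain Python lists. It is a sketch rather than the repo’s preprocessing code, and the tiny image is made up. In a Keras pipeline, one common way to get the same effect is the `rescale=1./255` argument of `ImageDataGenerator`.

```python
def rescale_pixels(image):
    """Map 8-bit pixel values (0-255) to floats in the range [0, 1]."""
    return [[[value / 255.0 for value in pixel] for pixel in row] for row in image]


# A made-up 1x2 RGB image: one black pixel and one white pixel.
tiny_image = [[(0, 0, 0), (255, 255, 255)]]
print(rescale_pixels(tiny_image))
```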
The dataset is then split into training and validation sets, so that we can monitor performance during training. For training the train detection model, I use some minor data augmentation that I observed reduces overfitting: height and width shifting and horizontal flips. Data augmentation is easy with Keras’s ImageDataGenerator class. I use the SGD optimizer for both models. I experimented with Adam and a few other adaptive learning rate optimizers, but SGD outperformed them all. For the train detection model, I ran tons of experiments with different learning rates before converging on 4e-3 as the best value. For the signal detection model, I was able to use a more aggressive learning rate of 1e-2. Both models use a momentum value of 0.9 and a decay equal to the learning rate divided by the number of epochs. I typically train the train detection model for 50 epochs, with a patience of 8, meaning if validation loss doesn’t improve over 8 consecutive epochs, training stops. This is done using Keras’s EarlyStopping callback. For the signal detection model, 10 epochs with a patience of 3 is sufficient. I train both models with a batch size of 32.
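Two of the training settings above, learning rate decay and patience, are easier to reason about with a little arithmetic. The sketch below is plain Python, not the training script. It assumes the decay is the time-based kind used by the legacy `decay` argument of Keras’s SGD, where the rate shrinks with the number of batch updates, and it mimics patience in the simplest way (any decrease in validation loss counts as an improvement). The validation losses are invented for the example.

```python
def decayed_learning_rate(initial_lr, decay, iteration):
    """Time-based decay: lr = initial_lr / (1 + decay * iteration)."""
    return initial_lr / (1.0 + decay * iteration)


def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


initial_lr, epochs = 4e-3, 50    # train detection values from the text
decay = initial_lr / epochs      # learning rate divided by number of epochs
print(decayed_learning_rate(initial_lr, decay, iteration=1000))

# Made-up validation losses, checked with the signal model's patience of 3.
losses = [0.50, 0.30, 0.31, 0.32, 0.33, 0.20]
print(early_stopping_epoch(losses, patience=3))
```

With these made-up losses, training stops before the final, better epoch is ever reached, which is the trade-off a small patience value makes.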
Below are some example training results for each model generated using Keras’s CSVLogger callback. Overfitting, characterized by improvement in training loss with little to no improvement in validation loss, is not a problem for the signal models, but the train models do begin to overfit as the epochs drag on, though it’s not too significant. All models achieve high validation accuracy in the end.
Note: I am not using a testing set here. I just use training and validation splits. For this problem, I’ve found that validation accuracy/loss is a suitable metric for performance, and that a separate testing set isn’t necessary to judge the quality of a model.
Here’s the training script used for all models. The best weights for the different models are saved after training in the saved_models directory of the repo. The format of the filename is
Seeing What the Model Sees
I mentioned earlier that these models are sensitive to changes in camera angle and zoom level. A technique called Grad-CAM can give us some insight into why this happens.
From the Grad-CAM paper: “We propose a class-discriminative localization technique called Gradient-weighted Class Activation Mapping (Grad-CAM) that can be used to generate visual explanations from any CNN-based network without requiring architectural changes.”
In other words, Grad-CAM can help “explain” why a CNN makes a particular classification. I’ll use it here to produce a heatmap overlaid on an image from the dataset. My implementation lives in vision/gradcam.py in the repo and is largely copied from this pyimagesearch tutorial. The more red an area of the image is, the stronger the activation of that region in the CNN. The more blue an area is, the weaker the activation. Thus, the “hotter” areas are the ones that contributed the most to the final classification made by the network. Here’s an example image of a train at Chestnut, its associated Grad-CAM heatmap, and the heatmap overlaid on the original image. The model said this image contained a train with a probability of 99.95%.
As you can see, it’s not the train that’s lighting up! It’s the tracks underneath the train. The CNN hasn’t learned the train; it’s learned some proxy for the train on the railroad tracks. Whatever it’s learned, it enables very accurate predictions, so we’re OK with the fact that it hasn’t learned to classify trains more generally; it works for our use case. If we wanted to expand to different intersections and create a more general model, I would certainly reach for Grad-CAM to help me see if the network is really learning to identify trains, or if it’s doing something funky like you see here.
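For readers curious how such a heatmap is computed, here is a sketch of the step at the heart of Grad-CAM, written in plain Python. It is not the code in vision/gradcam.py. It assumes you already have, for one conv layer, the feature maps (`activations`) and the gradients of the class score with respect to them (`gradients`), both as nested lists indexed `[channel][row][col]`. Obtaining those depends on the framework and is left out.

```python
def grad_cam_heatmap(activations, gradients):
    """Combine feature maps into one heatmap, weighting each by its mean gradient."""
    n_rows = len(activations[0])
    n_cols = len(activations[0][0])

    # One weight per channel: the gradient averaged over all spatial positions.
    weights = [
        sum(sum(row) for row in channel_grad) / (n_rows * n_cols)
        for channel_grad in gradients
    ]

    # Weighted sum over channels, then ReLU to keep only positive evidence.
    heatmap = [
        [
            max(0.0, sum(w * fmap[r][c] for w, fmap in zip(weights, activations)))
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]

    # Scale to [0, 1] so the map can be colored and laid over the image.
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[value / peak for value in row] for row in heatmap]
    return heatmap
```

The resulting map has the spatial size of the chosen conv layer, so it is typically resized to the input image size before being colored and blended with the original.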
The Detection System
The detection system is deployed onto two Raspberry Pis, one for each intersection. Each deployment has three core components:
- Detector
- Web Publisher
- Event Tracker
These all exist as separate Python classes in the core module. The system is launched at boot by running a script in rc.local. That script invokes spotter.py, which starts up the detection system.
spotter.py spawns three additional processes with Python’s multiprocessing package. Each core component runs in one of those processes.
The detector is where the train and signal detection models live. The detector pulls an image from the appropriate webcam every N seconds, where N is a runtime configurable parameter. It crops out the signal region(s) and runs them through the signal detection model. It runs the whole image through the train detection model. This produces two floating point values indicating the probability of “signal on” and “train present”, respectively. These two values, along with a scaled down version of the camera image, are then published over an IPC ZMQ socket.
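A single detector update might look roughly like the sketch below. It is an illustration under several assumptions, not the class from the core module: the image is a plain list of pixel rows, `signal_box` is a hypothetical `(top, left, height, width)` tuple, the two models are any callables that return a probability, and the message field names are invented. The scaled down image and the ZMQ publishing are left out.

```python
def crop(image, top, left, height, width):
    """Cut a rectangular region out of an image stored as a list of pixel rows."""
    return [row[left:left + width] for row in image[top:top + height]]


def detect(image, signal_box, signal_model, train_model):
    """Run one detector update and return the two probabilities as a message."""
    signal_region = crop(image, *signal_box)
    return {
        "signal": float(signal_model(signal_region)),  # cropped region only
        "train": float(train_model(image)),            # whole frame
    }
```

Keeping the crop coordinates fixed only works because the cameras always shoot from the same angle and zoom level, as noted earlier.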
The web publisher is very simple. It subscribes to the updates from the detector and uses the requests library to POST the probability values to a web server.
The event tracker is used to record information about train events to a filesystem (on a USB drive connected to each Pi, in our case). It subscribes to updates from the detector and begins recording an event when the probability of a train being present exceeds a runtime configurable threshold. It saves the camera image, signal and train prediction values, and timestamp. An event is considered over when the train prediction value drops below the threshold for five consecutive updates from the detector. This event data can be used for later analysis. For example, events that start and end quickly are almost always false positives. We can use the images saved from these transient events for additional training to make the models more robust.
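The start and end rules for an event amount to a small state machine. The sketch below shows one way to express them in plain Python. It is not the repo’s event tracker: the class, its default threshold, and the choice to keep the trailing low readings in the recorded event are all assumptions made for the example, and it stores only probabilities rather than images and timestamps.

```python
class EventTracker:
    """Track train events from a stream of train probabilities."""

    def __init__(self, threshold=0.5, end_after=5):
        self.threshold = threshold   # hypothetical default; configurable at runtime
        self.end_after = end_after   # consecutive low updates that close an event
        self.in_event = False
        self.low_streak = 0
        self.current = []            # probabilities recorded for the open event
        self.events = []             # completed events

    def update(self, train_prob):
        """Feed one detector update into the tracker."""
        if train_prob > self.threshold:
            # A high reading opens an event or keeps the current one alive.
            self.in_event = True
            self.low_streak = 0
            self.current.append(train_prob)
        elif self.in_event:
            # A low reading during an event counts toward closing it.
            self.low_streak += 1
            self.current.append(train_prob)
            if self.low_streak >= self.end_after:
                self.events.append(self.current)
                self.in_event = False
                self.low_streak = 0
                self.current = []
```

Requiring several consecutive low readings keeps a single shaky prediction in the middle of a long train from splitting one event into two. The length of each recorded event also gives a simple way to pick out the short, transient events that tend to be false positives.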
Why Split Them Up?
I split each component into its own process mostly because I wanted to learn about ZeroMQ and multiprocessing with Python. However, there are some other, non-educational benefits:
- This scheme makes better use of the multiple cores on each Pi. Though the web publisher and event tracker are very light on resources, if their needs increased in the future, this design would scale better than running everything in one process.
- Splitting them up and using a single type of message to transfer data between them makes for loose coupling. This makes it easier to change one component without worrying too much about the impact on the other components.
- Technically, if we ever wanted to distribute these components across different hosts on a network (probably will never be necessary), we could do so easily with this design by simply switching the socket to use TCP instead of IPC.
Seeing It in Action
The video below shows the detection system successfully catching a train parked at the Chestnut St intersection. I’m SSH’d into each of the Raspberry Pis. You’ll see that the Pi for Fourth logs nothing (no train), while the Pi for Chestnut is logging an ongoing event. I then show the camera feeds for each intersection and the results in the browser app.
I started learning about computer vision after my dad asked me to help out with this project last year. I was itching to learn something new outside of my primary domain of expertise (FPGAs, computer hardware). I became fascinated with computer vision and machine learning, and I’m hoping to start a new career in something related to these fields. So, a big thanks to my dad, Steve Roche, for getting me involved in this project, tending to the network, setting up and maintaining the cameras, and so much more.
Another big thanks to my brother, Elliott Roche, and friend, Jimmy Baker, for building the web-side stuff and user-facing applications.
I’d like to thank Adrian Rosebrock of pyimagesearch for the excellent tutorials on his website and in his book, Deep Learning for Computer Vision with Python, both of which helped my learning immensely. Similarly, the content on Jason Brownlee’s Machine Learning Mastery has also been instrumental in my learning process. Thanks to Victor Zhou for his blog posts on neural networks and CNNs, which were the best introductions to each topic I found while starting out on this journey.