Robust CV

Background

Over the course of the past 10-15 years, machine learning, and more specifically computer vision (CV) has had a major impact on many industries and the consensus is that it will continue to disrupt and revolutionize more and more facets of everyday life. Already, CV systems have shown promise and even superhuman performance in areas ranging from driving to medical diagnosis. They are able to do this by leveraging massive amounts of data and complex algorithms that can be trained to complete a specific task, recently, however, a serious problem has emerged. Look at the following example: me annotaed To a human, theses look identical. Same animal, same pose, same lighting, etc. But to a state of the art neural net, the image on the right looks like a gibbon. This net, which routinely outperforms humans, is more than 99% sure that the picture on the right is a gibbon. What could be causing this?

CV systems don't 'see' the same way humans do. If asked, "how do you identify a stop sign?", most humans would likely answer something along the lines of shape, color, and the word 'stop' written on it. Neural networks don't operate the same way, they look at features that aren't necessarily salient to the task but allow the model to pictures easily. In practice, this means that they often rely heavily on texture and other aspects that a human wouldn't consider the most relevant features for identifying an object. Clearly, this has some advantages, as evidenced by the system's performance on any number of standardized tasks, however, there are also significant drawbacks. The texture can be subtly changed or 'perturbed' in such a way that it fools the system into thinking that a picture is something it is obviously not. This minor perturbation doesn't affect a human's ability to recognize a picture, and often isn't even noticeable, but it absolutely destroys a computer's ability to make sense of an image.

These perturbations can be introduced to the system in two main ways: targeted or natural. Targeted adversarial perturbations are difficult to defend against especially if the attacker has access to the original model. There are many mature techniques for attacking theses systems and an effective defense against them is still an open problem in the field. Applications of these attacks can be dangerous in the right situation. Many self-driving cars rely at least partially on computer vision, which we know is vulnerable, so if a creative attacker managed to perturb a stop sign in a specific way, they could cause the car to perceive the sign as a 50 mph and accelerate instead of stopping. me annotaed There are many other applications of this approach but targeted attacks are largely out of scope for this writing.

Our approach is a training method for defense against natural adversarial examples, which are broadly defined as images that aren't altered post-photography, but that cause significant and unexpected problems for CV applications. These are slightly easier to correct for as they are not designed to cause problems. me annotaed If we can identify and reproduce the problem, we can simply retrain the network with these examples as well as the original training set. This is called adversarial training. Adversarial training has been used for defending against targeted attacks as well but its efficacy is somewhat limited.

In this repo, we develop an image recognition system based on res-net and the tiny image-net dataset that is somewhat resistant to random natural perturbations. We show that our training method is effective at improving the accuracy of our underlying model when testing on cross-validated tiny image-net data which consists of 100,000 64x64 RGB images spanning 200 classes for training and 10,000 similar images for validation.

Approach

Data Preparation

As mentioned above, we are using the tiny-imagenet dataset. All of our data is dynamically augmented in two main ways. The first and most important modification is generating and applying perturbations. This process will be described in-depth in a future section. The second augmentation method is more conventional data augmentation such as applying simple transformations to the images in the data loader. We use TensorFlow's built-in ImageDataGenerator class to flip, shift, and zoom the images in a pseudo-random fashion as the data is transferred from the directory to the model.

Baseline Model

We originally implemented as relatively shallow CNN consisting of 3 layers each of which had a convolutional layer, a Relu non-linearity, and a max-pooling function, followed by a fully connected layer for the output. We quickly realized that this architecture was not going to perform as well as we needed. We began researching state of the art models that had proven capable of superhuman performance. We settled on res-net, due to a combination of its impressive performance and relatively simple architecture. Due to our limited time and computational resources, a model that we could train and fine-tune quickly was essential . Our implementation of Res-Net performed well on the public validation set with XX.X% accuracy on 10,000 images. Below is a graphical representation of our implementation of Res-Net:

Adversarial Training

To defend against, adversarial examples found in nature, we first looked at what sort of images caused problems in the 'Natural Adversarial Examples' paper. We concluded that many of the examples that were misclassified could be broadly sorted into 3 types of problem images. The first type is just an out of distribution example. If the model hasn't ever seen a squid, it is not going to be able to recognize one at test time. We would love for the model to just admit it has no idea what it's looking at instead of making a high confidence guess but this is slightly out of scope. We aim to improve performance, not allow for elegant failure. Hardening against out of distribution examples is often just improving the dataset by making it bigger. Another type is a known object with a background or overlay of a texture that was confusing to the classifier. Here are some examples of that:

As mentioned above, ANNs are partial to texture, so this type of example causes some problems that are hard to avoid. The third broad category is known objects but where the picture is taken from a slightly different angle. The training set might have a front view of a car, but the test picture might be from a traffic camera where the angle is different.

This third type is much broader, consisting of any obstruction or modification that isn't covered by the first two. Some common examples are and object that is only partially in the frame or the object is zoomed in on and the net can't recognize it. This is a fairly common type of perturbation that occurs often in real-life photography and video systems and can often happen for a few consecutive frames in a video in which a known object is partially obstructed. For example, a video where a person is walking might have several frames where the person is standing in such a way that only half of their body is visible. A human would have no problem working with this footage but computers often do.

In order to train against these types of images, develop a list of functions that, when given an image, will generate a new image that is mean to look like a perturbation that might occur naturally. We do this because we want the network to associate these confusing images with the correct class, and the easiest way to forge an association is simply retraining, but data collection and labeling is expensive and difficult normally. Collecting and labeling a significant amount of adversarial data would be nearly impossible in the time frame allotted so we needed to automate the process of sourcing adversarial images.

We developed a collection of functions that perform the following effects to images:

flip
zoom
rotate
scale
shift
sheer
brightness
dropout
partial delete
mesh overlay
blur

Each of these is meant to mimic a situation that could arise in the real world. Some are obvious like shift and zoom, but others are less so. Sheering doesn't seem like it would be an issue in the real world but camera distortion and subsequent rectification gone wrong can lead to this time of transformation on many devices and so we must replicate it. We also use partial deletion and mesh overlay to train against specific image examples that we felt represented a significant proportion of the errors in the original set. These are when an object is partially obscured but still fully in frame, which is different from shifted or zoomed out of frame, and when a picture is over-represented by a texture that's not pertinent to the object being classified. There were several examples of the latter including viewing an object through a screen or on a background of a pattern like a stylized table. Examples of some can be seen below:

All of these effects are applied to random pictures and to random degrees in the training set. For example, dropout was implemented such that a random percent of pixels between 5% and 15% would be deleted at random locations. Similar randomization procedures are implemented for all functions in an attempt to cover all reasonable natual perturbations.

We built a pipeline to generate hundreds of perturbed images and store them remotely. We then froze every layer of our network except for the logits, post norm, and global pool layers. We also unfroze block4 which is one the res-net bottleneck blocks toward the end of the network. This allows the network to retain its ability to recognize things it already is able to recognize but also retrain certain layers that are working suboptimally.

We chose which layers to unfreeze manually, by looking at the res-net architecture and the saliency maps and layer activations for each layer. When we found a layer that visually resembled the type of perturbation that we were trying to train against, or found a layer that had similar structural elements, we unfroze it. For example, when training against shifts and out of frame objects, we unfroze the very last layers because we want to train the network to recognize a fully developed image, which isn't known until the later layers, as its class regardless of its location in the frame. However, when we trained against dropout we unfroze both early and later layers because the ability to recognize the objects detected by the earlier layers (while resistant to dropout) is important to recognize the objects they form in the later layers. If we ignored the earlier layers with this perturbation we might still be susceptible to this type of problem

We also implemented our own version of res-net with self-attention which was inspired by a paper called 'Squeeze-and-Excitation Networks'. The idea behind this architecture is that it is more resistant to pictures where the object to be classified is obscured or in front of a texture that confuses the classifier because the residuals are combined with a squeeze and excitation layer which is meant to preserve salient features from the earlier layers. We implemented this ourselves because we couldn't find a pre-trained network that worked the way we wanted, however because of this, our ability to train the network in a timely fashion was hobbled. Even with downsampled training images and a GPU we were unable to achieve a high enough accuracy score to be work talking about more.

Results

As mentioned above, our baseline model ran at 72% accuracy on average and topped out at 90% on our default dataset of tiny-imagenet-200 images. During our adversarial training, we built a new dataset that consisted of roughly half normal images and half images that were synthetically perturbed. This is the dataset that we partitioned and then used to train and test our hardened network. The original training and validation data for the original res-net trained on unperturbed is shown below:

The network's accuracy saturates around 85% and is still rather volatile from epoch to epoch after about 15 epochs, presumably due at least partially to natural adversarial examples in the original training set. Due to our limited time and computational resources, we were not able to train our network from scratch and are required to get our training metrics from the original sources. While our network did improve performance against adversarial examples in particular, we were not able to train as long as we would have liked and so our results are a bit inconclusive. The plot of our data is shown here:

However, as mentioned, we felt out data lacked the robustness and therefore reliability it might have gained from more time training. To account for this we've also included the graph of another taining run. Below is a graph of data points from the training of a similar network using similar methodology to ours. The scale and unfortunately small performance gain made the use of a graph generated from our data redundant.

Our findings largely agree with those mentioned in the original paper from Hendrycks. Our adversarial training, did show some improvement but does not make a compelling case to continue this methodology. Our model scored 54% accuracy on average on our dataset consisting of half synthetically perturbed images and achieving a best of 75%. This represents an 8% improvement on the heavily perturbed dataset from our original benchmark of 46% accuracy on validation data. Validation data consisted of a 15% partition of the half perturbed dataset.

In future work, we intend to combine the self-attention model with the perturbations and develop a more robust method for making perturbed images. This should combine the best aspects of both because we believe that the two augmentation methods don't train for the same type of adversarial example. The two methods should complement each other in a way that leads to a robust network, reasonably resistant to many types of sup-optimal images that could be found in nature.

As previously mentioned, we also implemented a res-net model with self-attention that showed some level of promise in several papers including the original however due to training difficulties we don't have anything substantial to report for this network.

Robust Computer vision

Dante Everaert, McClain Thiel, Greg Yannett

For CS182: Designing, Visualizing and Understanding Deep Neural Networks

Background

Approach

Data Preparation

Baseline Model

Adversarial Training

Results

Tools

Lessons Learned

Team Contributions

References