
Post-Hurricane Structure Damage Assessment Leveraging Aerial Imagery and Convolutional Neural Networks

This project is an effort to automate structure damage assessments based on post-hurricane aerial imagery. A labeled training dataset of images of structures from the Houston area just after Hurricane Harvey was acquired from the University of Washington Disaster Data Science Lab. Several neural network architectures were trained and evaluated, from basic architectures to deep networks via transfer learning. Models were trained on Google Cloud virtual machines leveraging multiple GPUs to enable fast and efficient iteration. The final model achieves an accuracy of 0.9775 on test data. This model could be used by local, state, and federal natural disaster responders to quickly develop damage assessment maps based on a single fly-over by imaging aircraft after a hurricane.

View the presentation, full report, or check out the summary below.

<h2>Why Post-Hurricane Damage Assessment?</h2>

Natural disasters, especially hurricanes and extreme storms, cause widespread destruction and loss of life and cost the U.S. billions of dollars each year. Effective response after a natural disaster strikes is critical to reducing harm and facilitating long-term recovery, but manual damage assessments are time- and resource-intensive. A model that performs automated damage assessment based on remote sensing imagery quickly captured by a few planes flying over a disaster zone would greatly reduce the effort, and increase the speed, with which useful data (the location and extent of damaged structures) could be put in the hands of responders.

<h1> Project Summary </h1>

The following sections summarize key steps in my process. Check out the Jupyter notebooks to see how it was done!

<h2>Data Wrangling + Exploratory Data Analysis</h2>

For code/more detail, see: EDA Notebook

<h3> Data Source </h3>

Data for this project was sourced from a labeled training dataset made publicly available by Quoc Dung Cao and Youngjun Choe at the University of Washington’s Disaster Data Science Lab. The dataset consists of images of structures cropped from aerial imagery collected in the days after Hurricane Harvey in Houston and other nearby cities. There are 12,000 images total: 8,000 for training, 2,000 for validation, and 2,000 for testing.

<h3> Visual Inspection </h3>

The images are in color (RGB format). The figure below compares 3 randomly selected damage images to 3 randomly selected no damage images. Viewing these examples, along with others, revealed visible differences between the two classes.

[Figure: examples of damage and no damage images]

<h3> Summary Images </h3>

In classification problems based on structured (tabular) data, it is common to explore summary statistics for each feature (column). For this image classification problem, I used the numeric RGB color values in the images to calculate summary statistics for each pixel (feature) and then plotted them as images. Plotting visual summaries of the two classes makes it possible to identify differences between them that could inform decisions about neural network architecture.
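
As an illustration of how these summary images can be computed, here is a minimal sketch; the folder layout, file extension, and image size are assumptions rather than the project's exact code:

```python
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

def pixel_stats(class_dir, size=(128, 128)):
    """Stack every image in a class folder and compute per-pixel mean and standard deviation."""
    imgs = [np.asarray(Image.open(p).convert("RGB").resize(size), dtype=np.float32)
            for p in Path(class_dir).glob("*.jpeg")]          # assumed file extension
    stack = np.stack(imgs)                                    # shape: (n_images, H, W, 3)
    return stack.mean(axis=0), stack.std(axis=0)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, label in zip(axes, ["damage", "no_damage"]):          # assumed class folder names
    mean_img, _ = pixel_stats(f"train_another/{label}")       # assumed training folder name
    ax.imshow(mean_img.astype(np.uint8))
    ax.set_title(f"Mean pixel value: {label}")
    ax.axis("off")
plt.show()
```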

<h4> Mean Value by Class </h4>

The figures below depict each pixel’s mean value across all images by class. For both damage and no damage images, we see that a structure tends to be located in the center of the image. For damage images, the pixel values around the structure tend to be lower than in the no damage images. Perhaps this is because damage images tend to have flood water around the structures, which is a different color than unflooded ground.

[Figure: mean pixel value by class]

<h4>Standard Deviation by Class </h4>

The figures below are similar to those above, but instead of depicting the mean pixel value they depict the standard deviation for each pixel by class. Standard deviation around the edges of the image appears to be (slightly) greater for the no damage images. Perhaps this is because visible ground around the structures creates more variation between images in that class.

[Figure: standard deviation of pixel values by class]

<h3> Geographic Distribution </h3>

I investigated the geographic distribution of the training data by class. This is possible because the GPS coordinates are contained in the filename of each image file. The figure below depicts the location of each image in the training dataset and whether it is damaged or not damaged.

[Figure: locations of training images by class]

We can see that the training data appears to come from 3 distinct areas. In one area, all structures in the training dataset are damaged, while in the other two there is a mix. Within those two there are clear spatial patterns of where flooding/damage occurred and where it didn't.

This is not surprising given that the hurricane (and flooding) impacted areas differently based on topography, urban form, etc. However, it does indicate that the examples from each class often come from entirely different areas. There is a danger that when training the neural network models, they may learn to identify differences that have more to do with differences in the appearance of structures between those areas, rather than differences that are actually due to hurricane damage.
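
For reference, a minimal sketch of how this map can be produced, assuming the filenames encode coordinates as `<lon>_<lat>.jpeg` (see the EDA notebook for the exact convention):

```python
from pathlib import Path

import matplotlib.pyplot as plt

def parse_coords(class_dir):
    """Parse (lon, lat) from filenames like '-95.06_29.83.jpeg' (assumed naming convention)."""
    lons, lats = [], []
    for p in Path(class_dir).glob("*.jpeg"):
        lon, lat = p.stem.split("_")[:2]
        lons.append(float(lon))
        lats.append(float(lat))
    return lons, lats

for label, color in [("no_damage", "tab:blue"), ("damage", "tab:red")]:   # assumed folder names
    lons, lats = parse_coords(f"train_another/{label}")
    plt.scatter(lons, lats, s=4, c=color, label=label)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.legend()
plt.show()
```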

<h2>Modeling</h2>

This section describes the general approach used to train and iterate on neural network models:

Code for uploading data to a Google Cloud bucket and then transferring it to a virtual machine.
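
As a rough illustration of that step, a sketch using the google-cloud-storage Python client (bucket name, prefix, and paths are placeholders; the linked notebook is the authoritative version):

```python
from pathlib import Path

from google.cloud import storage

def upload_dir(local_dir, bucket_name, prefix):
    """Upload every file under local_dir to gs://<bucket_name>/<prefix>/..."""
    bucket = storage.Client().bucket(bucket_name)
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            blob = bucket.blob(f"{prefix}/{path.relative_to(local_dir)}")
            blob.upload_from_filename(str(path))

def download_prefix(bucket_name, prefix, local_dir):
    """On the VM: pull everything under the prefix back down to local disk."""
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        dest = Path(local_dir) / blob.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(dest))
```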

For code/more detail on model architecture, see: Modeling Notebook

<h3>Baseline Model</h3>

I first created a simple baseline model with three convolution layers, three dense layers, and an output layer with 2 nodes (corresponding to the two classes). Some initial modeling decisions were made here and carried through to the other models.
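
A minimal sketch of a baseline along these lines (128x128 RGB inputs match the architecture table later in this section; the filter and node counts here are placeholders, not necessarily the notebook's values):

```python
from tensorflow.keras import layers, models

baseline = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),   # scale pixel values to [0, 1]
    layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(2, activation="softmax"),                    # two output nodes: no damage / damage
])
baseline.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
```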

The baseline model trained on data with no image augmentation achieved a validation accuracy of 0.94650. When the pipeline was updated to include image augmentation, validation accuracy increased to 0.95650.
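
The augmentation itself (rotate, flip, and zoom, per the model comparison below) can be expressed as Keras preprocessing layers placed in front of the model; a sketch, with the specific ranges being assumptions:

```python
from tensorflow.keras import layers, models

# Random flip / rotate / zoom applied on the fly during training
augmentation = models.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.1),   # rotation fraction is an assumed value
    layers.RandomZoom(0.1),       # zoom fraction is an assumed value
])
# Prepended to the model, e.g.: models.Sequential([augmentation, baseline])
```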

This was surprisingly good performance for such a simple model, suggesting that there are clear differences between the damage and no damage classes that are relatively easy for the network to learn. However, there was still substantial room for improvement.

<h3>Model Improvement and Refinement</h3>

<h4>Reducing Overfitting</h4>

During training of the baseline model, training accuracy regularly exceeded 0.99 while validation accuracy was substantially lower. This suggested that the model was overfitting, even with the variation in training data introduced by image augmentation. To address this, I added dropout layers after each dense layer, as reflected in the final architecture below.

<h4>Improving Convergence</h4>

While most of the models performed relatively well, a recurring issue was that models would reach high accuracy after only a few training epochs but never fully converge. The figure below summarizes training and validation accuracy by epoch for a model that demonstrates this trend:

[Figure: training and validation accuracy by epoch for a model that fails to converge]

Model convergence was ultimately achieved through a combination of several updates to the model architecture, including the batch normalization layers that appear in the final architecture below.

As summarized in the figure below, the updated model converges much better than before:

[Figure: training and validation accuracy by epoch for the updated model]

<h4>Model Architecture</h4>

Several combinations of hyperparameters were tested, including smaller and larger convolution filters and more or fewer nodes in the dense layers. The model below achieved a validation accuracy of 0.9735, a substantial improvement over the baseline model.

| Layer | Output Shape | # of Params |
|---|---|---|
| Rescaling | (128, 128, 3) | 0 |
| Convolution (filters=32, kernel_size=3, strides=1) | (128, 128, 32) | 896 |
| Max Pooling (pool_size=2, strides=2) | (64, 64, 32) | 0 |
| Batch Normalization | (64, 64, 32) | 128 |
| Activation (ReLU) | (64, 64, 32) | 0 |
| Convolution (filters=64, kernel_size=3, strides=2) | (32, 32, 64) | 18,496 |
| Max Pooling (pool_size=2, strides=2) | (16, 16, 64) | 0 |
| Batch Normalization | (16, 16, 64) | 256 |
| Activation (ReLU) | (16, 16, 64) | 0 |
| Convolution (filters=64, kernel_size=3, strides=2) | (8, 8, 64) | 36,928 |
| Max Pooling (pool_size=2, strides=2) | (4, 4, 64) | 0 |
| Batch Normalization | (4, 4, 64) | 256 |
| Activation (ReLU) | (4, 4, 64) | 0 |
| Flatten | 1024 | 0 |
| Dense (512 nodes, ReLU activation) | 512 | 524,800 |
| Dropout (rate=0.3) | 512 | 0 |
| Dense (256 nodes, ReLU activation) | 256 | 131,328 |
| Dropout (rate=0.2) | 256 | 0 |
| Dense (128 nodes, ReLU activation) | 128 | 32,896 |
| Dropout (rate=0.1) | 128 | 0 |
| Dense (2 nodes, Softmax activation) | 2 | 258 |
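
For reference, the table above maps fairly directly onto Keras layers. A sketch of the reconstruction (filter counts and strides are inferred from the output shapes and parameter counts, so this is an approximation of the notebook's code rather than a copy):

```python
from tensorflow.keras import layers, models

def conv_block(filters, strides):
    """Convolution -> max pooling -> batch normalization -> ReLU, as in the table above."""
    return [layers.Conv2D(filters, 3, strides=strides, padding="same"),
            layers.MaxPooling2D(pool_size=2, strides=2),
            layers.BatchNormalization(),
            layers.Activation("relu")]

model = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),
    *conv_block(32, strides=1),    # -> (64, 64, 32) after pooling
    *conv_block(64, strides=2),    # -> (16, 16, 64) after pooling
    *conv_block(64, strides=2),    # -> (4, 4, 64) after pooling
    layers.Flatten(),
    layers.Dense(512, activation="relu"), layers.Dropout(0.3),
    layers.Dense(256, activation="relu"), layers.Dropout(0.2),
    layers.Dense(128, activation="relu"), layers.Dropout(0.1),
    layers.Dense(2, activation="softmax"),
])
```
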
<h3>Deep Network Leveraging Transfer Learning</h3>

After training and evaluating the models described above, I next leveraged transfer learning to see whether a pre-trained deep learning model could produce better and/or more stable results. Out of the many deep learning models available, I decided to use ResNet50 because it:

  1. Offers a good balance of accuracy and training efficiency
  2. Was used in the baseline model for the xView2 competition, which suggested that it would perform well for a similar aerial imagery classification task.

Starting with the same hyperparameter settings for the convolution and dense layers as the best performing model from the previous section, several iterations based on different combinations of convolution filter size, dense layer node counts, dropout rate, and learning rate were tested. The best performing deep network achieved a validation accuracy of 0.9765, slightly better than the standard convolutional network.
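
A sketch of the transfer learning setup, assuming an ImageNet-pretrained ResNet50 base that is frozen, with a small dense head on top (the head size, dropout rate, and learning rate shown here are placeholders, not the tuned values):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                       input_shape=(128, 128, 3))
base.trainable = False                              # freeze the pre-trained convolutional base

inputs = layers.Input(shape=(128, 128, 3))
x = tf.keras.applications.resnet50.preprocess_input(inputs)
x = base(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(256, activation="relu")(x)         # head size is a placeholder
x = layers.Dropout(0.2)(x)                          # dropout rate is a placeholder
outputs = layers.Dense(2, activation="softmax")(x)

transfer_model = models.Model(inputs, outputs)
transfer_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),   # learning rate is a placeholder
                       loss="sparse_categorical_crossentropy",
                       metrics=["accuracy"])
```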

<h3>Comparing Models</h3>

The table below summarizes a selected subset of the iterations evaluated (not all are included, for brevity). Note that all models except the initial baseline included image augmentation in their pipeline (rotate, flip, and zoom).

| Model | Validation Accuracy |
|---|---|
| Baseline (no image augmentation) | 0.9365 |
| Baseline | 0.9440 |
| With Max Pooling (kernel=5) & Dropout Layers | 0.9500 |
| With Max Pooling (kernel=10) & Dropout Layers | 0.9230 |
| With Max Pooling (kernel=3) & Dropout Layers (dense layers with 50% fewer nodes) | 0.9735 |
| Transfer Learning (with Max Pooling kernel=5) | 0.9765 |
| Transfer Learning (Max Pooling kernel=3) | 0.9735 |

While the transfer learning model did perform slightly better than the best performing standard model, the difference in validation accuracy was only 0.003, which corresponds to 6 images in the validation set. This is well within the variation that would be expected if a different set of images had been randomly selected for the validation set.

During testing it was noted that the standard model is significantly smaller (11 MB vs. 500 MB) and that prediction with it is much faster and more computationally efficient. Because the validation accuracies were essentially the same, the smaller, more efficient standard model was selected as the final model.

<h2>Selected Model Performance</h2>

The final model was then evaluated on test data (an additional 2,000 images kept separate from the training and validation sets). The final model achieved a test accuracy of 0.9775, even higher than its validation accuracy (0.9735). Similar performance on the validation and test sets suggests that the model would generalize to additional unseen data (see the caveats in the Conclusion below).

A confusion matrix of the model's predictions on test data reveals that false positives were twice as common as false negatives (0 = no damage, 1 = damage):

[Figure: confusion matrix of predictions on test data]
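
The matrix was produced along these lines (a sketch; `model` and `test_ds` stand in for the trained final model and the un-shuffled test dataset with integer labels):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Collect true labels and predictions in the same (un-shuffled) order
y_true = np.concatenate([labels.numpy() for _, labels in test_ds])
y_pred = np.argmax(model.predict(test_ds), axis=1)

cm = confusion_matrix(y_true, y_pred)                # rows: true class, columns: predicted class
ConfusionMatrixDisplay(cm, display_labels=["no damage", "damage"]).plot()
plt.show()
```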

Examining the misclassified images reveals some insights about the model:

False positives tend to have ground surfaces that are mistaken for flood water, or debris around the structure that is mistaken for damage. False positives also tend to be rural structures:

[Figure: examples of false positives]

False negatives appear to be mostly large or non-residential structures, structures with a lot of variation in the surrounding ground surface, and/or structures with no obvious flood water:

[Figure: examples of false negatives]

<h2>Conclusion</h2>

Some final thoughts on potential improvements and how the model could be used going forward are included below:

<h3>Caveats and Potential Improvements</h3>

Training data improvements: as noted in the exploratory analysis, the damage and no damage examples are geographically clustered, so the model may partly be learning regional differences in structure appearance rather than actual damage. Training data covering a wider range of areas would help address this.

<h3>Using the Model </h3>

This damage classification model could be inserted as a step in a fully automated pipeline:

  1. Ingest/clean aerial imagery
  2. Crop images of structures based on MS Building Footprints data
  3. Classify images (this model)
  4. Plot locations/damage assessment on interactive map

This pipeline would be much faster than crowdsourced or manual review: the final model classified 2,000 structures in seconds on a desktop computer.
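
Steps 3 and 4 of that pipeline reduce to batch inference plus writing coordinates and predictions out for mapping. A rough sketch, reusing the assumed `<lon>_<lat>.jpeg` filename convention and a placeholder model path:

```python
import csv
from pathlib import Path

import numpy as np
import tensorflow as tf
from PIL import Image

model = tf.keras.models.load_model("final_model.keras")      # placeholder path to the saved final model

paths = sorted(Path("cropped_structures").glob("*.jpeg"))     # assumed folder of cropped structure images
imgs = np.stack([np.asarray(Image.open(p).convert("RGB").resize((128, 128)), dtype=np.float32)
                 for p in paths])
preds = np.argmax(model.predict(imgs), axis=1)                # 0 = no damage, 1 = damage

with open("damage_assessment.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["lon", "lat", "damaged"])
    for p, pred in zip(paths, preds):
        lon, lat = p.stem.split("_")[:2]                      # assumed <lon>_<lat> filename convention
        writer.writerow([lon, lat, int(pred)])
```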

<h2>Credits</h2>

Thanks to Shmuel Naaman for mentorship and advice on modeling approach and architecture.