HAI (HIL-based Augmented ICS) Security Dataset

The HAI dataset was collected from a realistic industrial control system (ICS) testbed augmented with a Hardware-In-the-Loop (HIL) simulator that emulates steam-turbine power generation and pumped-storage hydropower generation.

Click here to find out more about the HAI dataset.

You can also download the HAI dataset from Kaggle.

Please e-mail us if you have any questions about the dataset.

Background

HAI Testbed

The testbed consists of four processes: a boiler process, a turbine process, a water treatment process, and an HIL simulation.

HAI Dataset

Four versions of the HAI dataset have been released thus far. Each dataset consists of several CSV files, and each file satisfies time continuity. The quantitative summary of each version is given in the table below.

HAIEnd is a companion dataset that records tag values for the internal logic of the boiler DCS. It is collected simultaneously, during the same experiment, with the HAI dataset of the same version.

The version numbering follows a date-based scheme: the version number indicates the release date of the HAI dataset. HAI 20.07 is the bug-fixed version of HAI v1.0, which was released in February 2020.

<table align=center>
<thead align=center>
<tr bgcolor='#bbbbbb'> <th rowspan=2>Version</th> <th rowspan=2>Data Points<br>(points/sec)</th> <th colspan=3>Normal Dataset</th> <th colspan=4>Attack Dataset</th> </tr>
<tr> <th>Files<br>(CSV)</th> <th>Interval<br>(hours)</th> <th>Size<br>(MB)</th> <th>Files<br>(CSV)</th> <th>Attack Count</th> <th>Interval<br>(hours)</th> <th>Size<br>(MB)</th> </tr>
</thead>
<tbody>
<tr bgcolor='#dddddd'> <td rowspan=5 align=right><a href="https://github.com/icsdataset/hai/tree/master/hai-23.05">HAI 23.05</a><br><b><a href="https://github.com/icsdataset/hai/tree/master/haiend-23.05">HAIEnd 23.05</a></b></td> <td align=right rowspan=5>86<br><b>225</b></td> <td align=right>hai-train1<br><b>end-train1</b></td> <td align=right>78</td> <td align=right>154.9<br><b>250.5</b></td> <td align=right>hai-test1<br><b>end-test1</b></td> <td align=right>14</td> <td align=right>15</td> <td align=right>29.8<br><b>48.2</b></td> </tr>
<tr> <td align=right>hai-train2<br><b>end-train2</b></td> <td align=right>81</td> <td align=right>161.3<br><b>260.7</b></td> <td align=right>hai-test2<br><b>end-test2</b></td> <td align=right>38</td> <td align=right>64</td> <td align=right>126.8<br><b>204.8</b></td> </tr>
<tr bgcolor='#dddddd'> <td align=right>hai-train3<br><b>end-train3</b></td> <td align=right>35</td> <td align=right>69.4<br><b>112.7</b></td> <td rowspan=2 colspan=4> </td> </tr>
<tr> <td align=right>hai-train4<br><b>end-train4</b></td> <td align=right>55</td> <td align=right>109.2<br><b>176.0</b></td> </tr>
<tr> <td align=right><b>Total</b></td> <td align=right><b>249</b></td> <td align=right>494.8<br><b>799.9</b></td> <td align=right><b>Total</b></td> <td align=right><b>52</b></td> <td align=right><b>79</b></td> <td align=right>156.6<br><b>253.0</b></td> </tr>
<tr bgcolor='#dddddd'> <td rowspan=7 align=center><b><a href="https://github.com/icsdataset/hai/tree/master/hai-22.04">HAI 22.04</a></b></td> <td rowspan=7 align=right>86</td> <td align=right>train1</td> <td align=right>26</td> <td align=right>50.7</td> <td align=right>test1</td> <td align=right>7</td> <td align=right>24</td> <td align=right>48.2</td> </tr>
<tr> <td align=right>train2</td> <td align=right>56</td> <td align=right>108.9</td> <td align=right>test2</td> <td align=right>17</td> <td align=right>23</td> <td align=right>44.5</td> </tr>
<tr bgcolor='#dddddd'> <td align=right>train3</td> <td align=right>35</td> <td align=right>66.7</td> <td align=right>test3</td> <td align=right>10</td> <td align=right>17.3</td> <td align=right>33.4</td> </tr>
<tr> <td align=right>train4</td> <td align=right>24</td> <td align=right>45.7</td> <td align=right>test4</td> <td align=right>24</td> <td align=right>36</td> <td align=right>69.5</td> </tr>
<tr bgcolor='#dddddd'> <td align=right>train5</td> <td align=right>66</td> <td align=right>125.6</td> <td rowspan=2 colspan=4> </td> </tr>
<tr> <td align=right>train6</td> <td align=right>72</td> <td align=right>136.8</td> </tr>
<tr> <td align=right><b>Total</b></td> <td align=right><b>279</b></td> <td align=right><b>534.4</b></td> <td align=right><b>Total</b></td> <td align=right><b>58</b></td> <td align=right><b>100.3</b></td> <td align=right><b>195.6</b></td> </tr>
<tr bgcolor='#dddddd'> <td rowspan=6 align=center><b><a href="https://github.com/icsdataset/hai/tree/master/hai-21.03">HAI 21.03</a></b></td> <td rowspan=6 align=right>78</td> <td align=right>train1</td> <td align=right>60</td> <td align=right>100</td> <td align=right>test1</td> <td align=right>5</td> <td align=right>12</td> <td align=right>22</td> </tr>
<tr> <td align=right>train2</td> <td align=right>63</td> <td align=right>116</td> <td align=right>test2</td> <td align=right>20</td> <td align=right>33</td> <td align=right>62</td> </tr>
<tr bgcolor='#dddddd'> <td align=right>train3</td> <td align=right>229</td> <td align=right>246</td> <td align=right>test3</td> <td align=right>8</td> <td align=right>30</td> <td align=right>56</td> </tr>
<tr> <td rowspan=2 colspan=3> </td> <td align=right>test4</td> <td align=right>5</td> <td align=right>11</td> <td align=right>20</td> </tr>
<tr bgcolor='#dddddd'> <td align=right>test5</td> <td align=right>12</td> <td align=right>26</td> <td align=right>48</td> </tr>
<tr> <td align=right><b>Total</b></td> <td align=right><b>352</b></td> <td align=right><b>471</b></td> <td align=right><b>Total</b></td> <td align=right><b>50</b></td> <td align=right><b>112</b></td> <td align=right><b>205</b></td> </tr>
<tr> <td rowspan=3 align=center><b><a href="https://github.com/icsdataset/hai/tree/master/hai-20.07">HAI 20.07</a></b><br>(HAI 1.0)</td> <td rowspan=3 align=right>59</td> <td align=right>train1</td> <td align=right>86</td> <td align=right>127</td> <td align=right>test1</td> <td align=right>28</td> <td align=right>81</td> <td align=right>119</td> </tr>
<tr bgcolor='#dddddd'> <td align=right>train2</td> <td align=right>91</td> <td align=right>98</td> <td align=right>test2</td> <td align=right>10</td> <td align=right>42</td> <td align=right>62</td> </tr>
<tr> <td align=right><b>Total</b></td> <td align=right><b>177</b></td> <td align=right><b>225</b></td> <td align=right><b>Total</b></td> <td align=right><b>38</b></td> <td align=right><b>123</b></td> <td align=right><b>181</b></td> </tr>
</tbody>
</table>
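Because each CSV file is stated to satisfy time continuity, a quick sanity check after downloading is to confirm that consecutive timestamps are evenly spaced. The sketch below is only an illustration: the helper name `find_time_gaps` is hypothetical, the one-second step is an assumption suggested by the "points/sec" column, and the timestamp format is the one described under Data fields (adjust `fmt` if your files differ).

```python
from datetime import datetime, timedelta


def find_time_gaps(timestamps, step=timedelta(seconds=1),
                   fmt="%Y-%m-%d %H:%M:%S"):
    """Return (previous, current) pairs whose spacing differs from `step`.

    `step` and `fmt` are assumptions, not values confirmed by the dataset
    manual; pass different ones if your files use another layout.
    """
    parsed = [datetime.strptime(value, fmt) for value in timestamps]
    return [(a, b) for a, b in zip(parsed, parsed[1:]) if b - a != step]


# Illustrative timestamps only (not taken from a real file).
example = [
    "2019-09-26 13:00:00",
    "2019-09-26 13:00:01",
    "2019-09-26 13:00:03",  # one second is missing before this row
]
for previous, current in find_time_gaps(example):
    print("gap between", previous, "and", current)
```

An empty result means the file is continuous under the assumed sampling step.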

Data fields

The time-series data in each CSV file satisfies time continuity. The first column gives the observation time in the “yyyy-MM-dd hh:mm:ss” format, and the remaining columns give the recorded SCADA data points. The last four columns are labels that indicate whether an attack occurred. Of these four, the attack column applies to all processes, and the other three apply to the corresponding control processes.

Refer to the latest technical manual for details on each column.

Starting with HAI 22.04, the per-process attack labels (attack_P1, attack_P2, attack_P3) have been excluded, because they can be replaced by the attack targets (controllers and points) provided for each dataset version.

<div align="center">
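<table>
<tr> <th>time</th> <th>P1_B2004</th> <th>P2_B2016</th> <th>...</th> <th>P4_HT_LD</th> <th>attack</th> <th>attack_P1</th> <th>...</th> <th>attack_P3</th> </tr>
<tr> <td>20190926 13:00:00</td> <td>0.09830</td> <td>1.07370</td> <td>...</td> <td>0</td> <td>0</td> <td>0</td> <td>...</td> <td>0</td> </tr>
<tr> <td>20190926 13:00:01</td> <td>0.09830</td> <td>1.07410</td> <td>...</td> <td>0</td> <td>1</td> <td>0</td> <td>...</td> <td>1</td> </tr>
<tr> <td>20190926 13:00:02</td> <td>0.09830</td> <td>1.07380</td> <td>...</td> <td>0</td> <td>1</td> <td>0</td> <td>...</td> <td>1</td> </tr>
<tr> <td>20190926 13:00:03</td> <td>0.09830</td> <td>1.07360</td> <td>...</td> <td>0</td> <td>1</td> <td>1</td> <td>...</td> <td>1</td> </tr>
<tr> <td>20190926 13:00:04</td> <td>0.09830</td> <td>1.07430</td> <td>...</td> <td>0</td> <td>1</td> <td>1</td> <td>...</td> <td>1</td> </tr>
</table>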
</div>
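The column layout described above (time first, data points next, attack labels last) can be separated programmatically. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, the label columns are recognised purely by the `attack` naming shown in the sample (matched case-insensitively as a precaution), and the small inline CSV reuses a few sample values with the elided columns left out. Versions without per-process labels simply yield a shorter label list.

```python
import csv
import io


def split_columns(header):
    """Split a header into (time column, data point columns, label columns).

    Assumption: label columns are named "attack" or start with "attack_".
    """
    label_cols = [c for c in header[1:]
                  if c.lower() == "attack" or c.lower().startswith("attack_")]
    data_cols = [c for c in header[1:] if c not in label_cols]
    return header[0], data_cols, label_cols


def read_hai_csv(file_obj):
    """Read an open CSV file into timestamps, feature rows, and label rows."""
    reader = csv.reader(file_obj)
    header = next(reader)
    _, data_cols, label_cols = split_columns(header)
    index = {name: i for i, name in enumerate(header)}
    times, features, labels = [], [], []
    for row in reader:
        times.append(row[0])  # kept as text; parse it if you need datetimes
        features.append([float(row[index[c]]) for c in data_cols])
        labels.append({c: int(float(row[index[c]])) for c in label_cols})
    return times, data_cols, features, labels


# Shortened illustration of the sample rows; real files have many more columns.
SAMPLE = """time,P1_B2004,P2_B2016,P4_HT_LD,attack,attack_P1,attack_P3
20190926 13:00:00,0.09830,1.07370,0,0,0,0
20190926 13:00:01,0.09830,1.07410,0,1,0,1
20190926 13:00:03,0.09830,1.07360,0,1,1,1
"""

times, data_cols, features, labels = read_hai_csv(io.StringIO(SAMPLE))
print(data_cols)
print(labels[1])
```

To read a real file, pass `open(path, newline="")` instead of the `io.StringIO` object.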

Getting the dataset

Type git clone, then paste the URL below.

$ git clone https://github.com/icsdataset/hai

To unzip multiple gzip files, you can use:

$ gunzip *.gz
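If you prefer not to unpack the archives, a gzip-compressed CSV can also be read directly from Python. This is a small sketch with the standard `gzip` and `csv` modules; the function name `iter_gzipped_csv` and any file name you pass to it are placeholders, not part of the dataset tooling.

```python
import csv
import gzip


def iter_gzipped_csv(path):
    """Yield the rows of a gzip-compressed CSV file one at a time.

    The file is decompressed on the fly, so nothing is written to disk.
    """
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle):
            yield row
```

For example, `next(iter_gzipped_csv("train1.csv.gz"))` would return the header row of a file with that (hypothetical) name.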

From HAI 22.04, use git lfs pull to download the actual file contents managed by Git LFS.

$ git lfs pull
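If a data file looks unexpectedly small after cloning, it may still be a Git LFS pointer rather than the real content. A pointer is a short text file that begins with a `version https://git-lfs.github.com/spec/v1` line. The sketch below checks for that prefix; the helper name `is_lfs_pointer` is hypothetical.

```python
# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` starts like a Git LFS pointer."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

If this returns True for a dataset file, running git lfs pull should replace the pointer with the actual data.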

Performance Metric

We strongly recommend evaluating your anomaly detection model with the eTaPR (Enhanced Time-series Aware Precision and Recall) metric, so that performance comparisons with other studies are fair. Got something to suggest? Let us know!
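Time-series-aware evaluation looks at anomalies as contiguous intervals rather than isolated points, so it is often handy to turn a per-row 0/1 label sequence into index ranges before inspecting detections. The helper below is only an illustrative preprocessing sketch with a hypothetical name. It is not the eTaPR computation and does not use the eTaPR package; refer to the eTaPR paper and tooling for the actual scoring.

```python
def labels_to_ranges(labels):
    """Convert a 0/1 sequence into inclusive (start, end) index ranges of 1s."""
    ranges = []
    start = None
    for i, value in enumerate(labels):
        if value and start is None:
            start = i                      # a new run of 1s begins
        elif not value and start is not None:
            ranges.append((start, i - 1))  # the run ended on the previous row
            start = None
    if start is not None:                  # the sequence ends inside a run
        ranges.append((start, len(labels) - 1))
    return ranges


print(labels_to_ranges([0, 1, 1, 0, 0, 1]))  # [(1, 2), (5, 5)]
```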

Projects using the dataset

Here are some projects and experiments that use or feature the dataset in interesting ways. Got something to add? Let us know!

Year 2023

  1. A comparative study of time series anomaly detection models for industrial control systems
  2. CPS-GUARD: Intrusion detection for cyber-physical systems and IoT devices using outlier-aware deep autoencoders
  3. Detection of cyberattacks and anomalies in cyber-physical systems: approaches, data sources, evaluation
  4. Machine learning in industrial control system (ICS) security: current landscape, opportunities and challenges
  5. Monitoring industrial control systems via spatio-temporal graph neural networks
  6. Time-series anomaly detection via contextual discriminative contrastive learning

Year 2022

  1. A hybrid algorithm incorporating vector quantization and one-class support vector machine for industrial anomaly detection
  2. Anomalous behaviour detection for cyber defence in modern industrial control systems
  3. Benchmarking machine learning based detection of cyber attacks for critical infrastructure
  4. Can industrial intrusion detection be SIMPLE?
  5. Deep Analysis Net with Causal Embedding for Coal-fired power plant Fault Detection and Diagnosis (DANCE4CFDD)
  6. Frequency-based representation of massive alerts and combination of indicators by heterogeneous intrusion detection systems for anomaly detection
  7. Improving method of anomaly detection performance for industrial IoT environment
  8. IPAL: breaking up silos of protocol-dependent and domain-specific industrial intrusion detection systems
  9. Learning sparse latent graph representations for anomaly detection in multivariate time series
  10. Mad: Self-supervised masked anomaly detection task for multivariate time series
  11. Multivariate time series anomaly detection with few positive samples
  12. Residual size is not enough for anomaly detection: improving detection performance using residual similarity in multivariate time series
  13. Towards building intrusion detection systems for multivariate time-series data
  14. Variational restricted Boltzmann machines to automated anomaly detection

Year 2021

  1. Research on improvement of anomaly detection performance in industrial control systems
  2. E-sfd: Explainable sensor fault detection in the ics anomaly detection system
  3. Stacked-autoencoder based anomaly detection with industrial control system
  4. Improved mitigation of cyber threats in iiot for smart cities: A new-era approach and scheme
  5. Revitalizing self-organizing map: Anomaly detection using forecasting error patterns
  6. Cluster-based deep one-class classification model for anomaly detection
  7. Measurement data intrusion detection in industrial control systems based on unsupervised learning
  8. A machine learning approach for anomaly detection in industrial control systems based on measurement data
  9. Probabilistic attack sequence generation and execution based on mitre att&ck for ics datasets

Year 2020

  1. Anomaly detection in time-series data environment
  2. Detecting anomalies in time-series data using unsupervised learning and analysis on infrequent signatures
  3. Expansion of ICS testbed for security validation based on MITRE ATT&CK techniques
  4. Expanding a programmable cps testbed for network attack analysis
  5. Co-occurrence based security event analysis and visualization for cyber physical systems

Competitions

Since 2020, we have held two AI competitions using the HAI dataset. The competition website shares the baseline code and the winners' code.

Contributors

Shin, Hyeok-Ki; Lee, Woomyo; Choi, Seungoh; Hwang, Won-Seok; Yun, Jeong-Han; Min, Byung-Gil; Kim, HyoungChun

The Affiliated Institute of ETRI, Daejeon, South Korea.

License

This work is licensed under a Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).

Citation

If you publish work that uses the HAI datasets, the HAICon competitions, or eTaPR, please cite the sources below:

HAI 22.04, HAI 23.05, HAIEnd 23.05

@misc{github,
    author = {Shin, Hyeok-Ki and Lee, Woomyo and Choi, Seungoh and Yun, Jeong-Han and Min, Byung-Gil},
    title = {HAI security datasets},
    year = {2023},
    url = {https://github.com/icsdataset/hai}
}

HAI 21.03, HAICon 2020, HAICon 2021

@inproceedings{10.1145/3474718.3474719,
    author = {Shin, Hyeok-Ki and Lee, Woomyo and Yun, Jeong-Han and Min, Byung-Gil},
    title = {Two ICS Security Datasets and Anomaly Detection Contest on the HIL-Based Augmented ICS Testbed},
    year = {2021},
    isbn = {9781450390651},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3474718.3474719},
    doi = {10.1145/3474718.3474719},
    abstract = {Security datasets with various operating characteristics and abnormal situations of industrial control system (ICS) are essential to develop artificial intelligence (AI)-based control system security technology. In this study, we built a hardware-in-the-loop (HIL)-based augmented ICS (HAI) testbed and developed ICS security datasets. Here, we introduce the second dataset (HAI 21.03), which was developed with the user feedback of the first released version (HAI 20.07). All HAI datasets are publicly available at https://github.com/icsdataset/hai. HAI 21.03 was expanded by adding data points and normal/attack scenarios to HAI 20.07. We also held an AI-based anomaly detection contest (HAICon 2020) utilizing the HAI datasets developed so far, giving many AI researchers an opportunity to discuss and share ideas for ICS anomaly detection research. This paper presents the results of the HAICon 2020. The results of the top teams in the competition can be used as a performance comparison criterion when using HAI 21.03. },
    booktitle = {Cyber Security Experimentation and Test Workshop},
    pages = {36–40},
    numpages = {5},
    keywords = {security dataset, testbed, artificial intelligence, hardware-in-the-loop, industrial control system, anomaly detection},
    location = {Virtual, CA, USA},
    series = {CSET '21}
}

HAI 20.07

@inbook{10.5555/3485754.3485755,
    author = {Shin, Hyeok-Ki and Lee, Woomyo and Yun, Jeong-Han and Kim, HyoungChun},
    title = {HAI 1.0: HIL-Based Augmented ICS Security Dataset},
    year = {2020},
    publisher = {USENIX Association},
    address = {USA},
    abstract = {Datasets are paramount to the development of AI-based technologies. However, the available cyber-physical system (CPS) datasets are insufficient. In this paper, we introduce the HIL-based augmented ICS security (HAI) dataset 1.0 (https://github.com/icsdataset/hai), the first CPS dataset collected using the HAI testbed. The HAI testbed comprises three physical control systems, namely GE turbine, Emerson boiler, and FESTO water treatment systems, combined through a dSPACE hardware-in-the-loop (HIL) simulator. We built an environment to remotely and automatically manipulate all components of a feedback control loop. Using this environment, we collected the HAI dataset 1.0 while repeatedly running a large number of benign and malicious scenarios for a long period with minimal human effort. We will continue to improve the HAI testbed and release new versions of the HAI dataset.},
    booktitle = {Proceedings of the 13th USENIX Conference on Cyber Security Experimentation and Test},
    articleno = {1},
    numpages = {1}
}

eTaPR

@inproceedings{10.1145/3477314.3507024,
    author = {Hwang, Won-Seok and Yun, Jeong-Han and Kim, Jonguk and Min, Byung Gil},
    title = {"Do You Know Existing Accuracy Metrics Overrate Time-Series Anomaly Detections?"},
    year = {2022},
    isbn = {9781450387132},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3477314.3507024},
    doi = {10.1145/3477314.3507024},
    booktitle = {Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing},
    pages = {403–412},
    numpages = {10},
    location = {Virtual Event},
    series = {SAC '22}
}

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as <a href="https://g.co/datasetsearch">Google Dataset Search</a>.

<div itemscope itemtype="http://schema.org/Dataset"> <table> <tr> <th>property</th> <th>value</th> </tr> <tr> <td>name</td> <td><code itemprop="name">HIL-based Augmented ICS Security Dataset</code></td> </tr> <tr> <td>keywords</td> <td><code itemprop="keywords">ICS, CPS, AI Dataset, Anomaly Detection</code></td> </tr> <tr> <td>alternateName</td> <td><code itemprop="alternateName">HAI, HAIEnd</code></td> </tr> <tr> <td>url</td> <td><code itemprop="url">https://github.com/icsdataset/hai</code></td> </tr> <tr> <td>sameAs</td> <td><code itemprop="sameAs">https://github.com/icsdataset/hai</code></td> </tr> <tr> <td>description</td> <td><code itemprop="description">The HAI security dataset was collected from a realistic Industrial Control System (ICS) testbed augmented with a Hardware-In-the-Loop (HIL) simulator that emulates steam-turbine power generation and pumped-storage hydropower generation.</code></td> </tr> <tr> <td>provider</td> <td> <div itemscope itemtype="http://schema.org/Organization" itemprop="provider"> <table> <tr> <th>property</th> <th>value</th> </tr> <tr> <td>name</td> <td><code itemprop="name">The Affiliated Institute of ETRI, South Korea</code></td> </tr> <tr> <td>sameAs</td> <td><code itemprop="sameAs">https://github.com/icsdataset</code></td> </tr> </table> </div> </td> </tr> <tr> <td>license</td> <td> <div itemscope itemtype="http://schema.org/CreativeWork" itemprop="license"> <table> <tr> <th>property</th> <th>value</th> </tr> <tr> <td>name</td> <td><code itemprop="name">CC BY 4.0</code></td> </tr> <tr> <td>url</td> <td><code itemprop="url">https://creativecommons.org/licenses/by/4.0/</code></td> </tr> </table> </div> </td> </tr> <tr> <td>citation</td> <td> <code itemprop="citation">https://dl.acm.org/doi/abs/10.1145/3474718.3474719</code> <code itemprop="citation">https://dl.acm.org/doi/abs/10.5555/3485754.3485755</code> <code itemprop="citation">https://dl.acm.org/doi/10.1145/3357384.3358118</code> </td> </tr> </table> </div>