Simulation Framework for Customs Trade Selection

Using your collected declaration data, fit the provided models and find the best trade selection strategy for customs. This framework supports general import declarations.

Implementation of:

How to Use

  1. Set up your Python environment: e.g., Anaconda Python 3.8 [Guide]
$ conda activate py38
  2. Clone the repository:
$ git clone https://github.com/Seondong/Customs-Fraud-Detection.git
  3. Install the requirements:
$ pip install -r requirements.txt
  4. Run the code: refer to main.py for the hyperparameters; the .sh files in the ./bash directory show how to run the code effectively.
$ python main.py --data synthetic --train_from 20130101 --test_from 20130115 --valid_length 7 --test_length 7 --numweeks 100 --final_inspection_rate 10 --sampling hybrid --subsamplings xgb/random --weights 0.9/0.1

The example command simulates the customs targeting system on a synthetic dataset. The initial training period starts on Jan 1, 2013 (--train_from) and spans 14 days. The last seven days of the training set are held out for validation (--valid_length). With the trained model, customs selection begins on Jan 15 (--test_from). The first testing period spans seven days, which corresponds to a batch setting (--test_length). After testing, the inspected items are labeled and added to the training set. The simulation terminates after 100 testing periods (--numweeks). The target inspection rate is set to 10% (--final_inspection_rate), which means that 10% of the goods are inspected and subject to duties. The hybrid selection strategy combines xgb and random in a 9:1 ratio (--sampling, --subsamplings, --weights). In other words, 9% of the total items are selected by XGBoost for inspection, and the remaining 1% are inspected at random.
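The date and rate arithmetic described for the example command can be sketched in a few lines of Python. This is only an illustration of the description, not code taken from main.py: the helper names `parse_day`, `initial_windows`, and `split_inspection_rate` are hypothetical, and treating each period as a half-open date range (start included, end excluded) is an assumption.

```python
from datetime import date, datetime, timedelta
from typing import Dict, Tuple


def parse_day(text: str) -> date:
    """Parse a YYYYMMDD string such as the value passed to --train_from."""
    return datetime.strptime(text, "%Y%m%d").date()


def initial_windows(train_from: str, test_from: str,
                    valid_length: int, test_length: int
                    ) -> Dict[str, Tuple[date, date]]:
    """Return the first train/valid/test periods as (start, end) pairs.

    Assumption: end dates are exclusive, and the validation period is
    the tail of the training period, as the description states.
    """
    train_start = parse_day(train_from)
    test_start = parse_day(test_from)
    valid_start = test_start - timedelta(days=valid_length)
    if valid_start <= train_start:
        raise ValueError("validation period does not fit inside training period")
    return {
        "train": (train_start, test_start),
        "valid": (valid_start, test_start),
        "test": (test_start, test_start + timedelta(days=test_length)),
    }


def split_inspection_rate(rate_percent: float, subsamplings: str,
                          weights: str) -> Dict[str, float]:
    """Split an overall inspection rate across slash-separated strategies."""
    names = subsamplings.split("/")
    values = [float(w) for w in weights.split("/")]
    if len(names) != len(values):
        raise ValueError("need exactly one weight per subsampling strategy")
    total = sum(values)
    return {name: rate_percent * w / total for name, w in zip(names, values)}


if __name__ == "__main__":
    for name, (start, end) in initial_windows("20130101", "20130115", 7, 7).items():
        print(name, start, "to", end, "(exclusive)")
    print(split_inspection_rate(10, "xgb/random", "0.9/0.1"))
```

With the values from the example command, the training period covers Jan 1 to Jan 14, the validation period covers Jan 8 to Jan 14, the first testing period covers Jan 15 to Jan 21, and the 10% inspection rate splits into 9% for xgb and 1% for random.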

Data Format

As a reference, synthetic import declarations are provided in the data/ directory. Users are expected to preprocess their own import declarations into a similar format. Currently, the framework supports single-item declarations in which the target labels (the illicitness of the item and the revenue from inspection) are marked for each item. To run the code with real datasets, please refer to the data/ directory. [README]

sgd.id  sgd.date  importer.id  tariff.code  ...  cif.value total.taxes illicit revenue
SGD1    13-01-02  IMP826164    8703241128   ...  280964700
SGD2    13-01-02  IMP837219    8703232926   ...  266140326200
SGD3    13-01-02  IMP117406    8517180000   ...  302275561200
SGD4    13-01-02  IMP435108    8703222900   ...  416051400
SGD5    13-01-02  IMP717900    8545200000   ...  2395493971980
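Before running the simulation on your own data, it may help to check that a preprocessed file carries the columns shown in the table above. The sketch below is a hypothetical helper, not part of the repository: it assumes the declarations are stored as CSV, that `illicit` is a 0/1 flag, and that `revenue` is a non-negative number. The two sample rows are made up for illustration and do not come from the synthetic dataset.

```python
import csv
import io
from typing import Dict, List

# Columns visible in the table above; the elided "..." columns are not checked.
REQUIRED_COLUMNS = [
    "sgd.id", "sgd.date", "importer.id", "tariff.code",
    "cif.value", "total.taxes", "illicit", "revenue",
]


def check_declarations(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text and verify the label columns of every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["illicit"] not in ("0", "1"):  # assumed binary label
            raise ValueError(f"line {line_no}: illicit must be 0 or 1")
        if float(row["revenue"]) < 0:  # assumed non-negative
            raise ValueError(f"line {line_no}: revenue must be non-negative")
        rows.append(row)
    return rows


# Made-up rows, for illustration only.
SAMPLE = """sgd.id,sgd.date,importer.id,tariff.code,cif.value,total.taxes,illicit,revenue
X1,13-01-02,IMP000001,8703241128,1000,100,0,0
X2,13-01-02,IMP000002,8517180000,2000,300,1,50
"""

if __name__ == "__main__":
    checked = check_declarations(SAMPLE)
    print(f"{len(checked)} rows passed the format check")
```

Only the standard library is used here so the check can run before the project requirements are installed; the same idea carries over to a pandas DataFrame if you load the file that way.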

Available Selection Strategies:

Stand-alone strategies:

$ python main.py --sampling random --data synthetic --train_from 20130101 --test_from 20130115 --valid_length 7 --test_length 7 --numweeks 100 --final_inspection_rate 10
$ python main.py --sampling DATE --data real-t --train_from 20150101 --test_from 20150115 --valid_length 7 --test_length 7 --numweeks 300 --initial_inspection_rate 10 --final_inspection_rate 5 --inspection_plan fast_linear_decay --initial_masking importer
$ python main.py --sampling ssl_ae --data real-n --train_from 20130101 --test_from 20130131 --valid_length 7 --test_length 14 --numweeks 100 --initial_inspection_rate 10 --final_inspection_rate 5 --semi_supervised 1

Supervised strategies (use labeled data only):

Semi-supervised strategies (use unlabeled data together, --semi_supervised 1):

Hybrid strategies:

$ python main.py --sampling hybrid --subsamplings xgb/risky/random --weights 0.7/0.2/0.1 --data synthetic --train_from 20130101 --test_from 20130115 --valid_length 7 --test_length 7 --numweeks 100 --final_inspection_rate 10 
$ python main.py --sampling adahybrid --subsamplings DATE/random --weights 0.9/0.1 --data synthetic --train_from 20130101 --test_from 20130115 --valid_length 7 --test_length 7 --numweeks 100 --final_inspection_rate 10 
$ python main.py --prefix rada-bal-s  --drift pot --mixing reinit --data synthetic --ada_algo ucb --ada_discount decay --ada_lr 3 --ada_epsilon 0.1 --ada_decay 0.9 --sampling rada --subsamplings xgb/random --weights 0.9/0.1 --mode scratch --train_from 20130101 --test_from 20130115 --test_length 7 --valid_length 7 --final_inspection_rate 10 --epoch 10 --numweeks 300
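The idea behind a weighted hybrid of a score-based strategy and random sampling can be sketched as follows. This is a simplified illustration and not the repository's implementation: the function `hybrid_select` and its parameters are hypothetical, the scores stand in for the output of any fitted model, and rounding the budget to whole items is an assumption.

```python
import random
from typing import Dict, List, Sequence


def hybrid_select(scores: Sequence[float], inspection_rate: float,
                  score_weight: float, seed: int = 0) -> Dict[str, List[int]]:
    """Pick item indices to inspect: part by highest score, the rest at random.

    inspection_rate: fraction of all items to inspect (0.10 for 10%).
    score_weight: share of the inspection budget given to the score-based
    strategy (0.9 for a 9:1 split with random sampling).
    """
    if not 0 <= inspection_rate <= 1 or not 0 <= score_weight <= 1:
        raise ValueError("inspection_rate and score_weight must be in [0, 1]")

    n_total = len(scores)
    n_inspect = round(n_total * inspection_rate)
    n_by_score = round(n_inspect * score_weight)
    n_random = n_inspect - n_by_score

    # Rank all items from most to least suspicious.
    ranked = sorted(range(n_total), key=lambda i: scores[i], reverse=True)
    by_score = ranked[:n_by_score]

    # Draw the random share from items the score-based part did not pick.
    rng = random.Random(seed)
    at_random = rng.sample(ranked[n_by_score:], n_random)
    return {"by_score": by_score, "random": at_random}


if __name__ == "__main__":
    # Hypothetical scores for 100 items; a higher score means more suspicious.
    demo_scores = [i / 100 for i in range(100)]
    picks = hybrid_select(demo_scores, inspection_rate=0.10, score_weight=0.9)
    print("by score:", picks["by_score"])
    print("random:  ", picks["random"])
```

With 100 items, a 10% inspection rate, and a 0.9 weight, nine items are chosen by score and one at random, which matches the 9% and 1% split described for the example command. Drawing the random share only from the unselected items keeps the two groups disjoint, so the inspection budget is not spent twice on the same item.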

Research

Please refer to the attached literature for further study. Some of the papers are uploaded in the ./literatures directory.

Citation

If you find this code useful, please cite the original papers:

@inproceedings{kimtsai2020date,
  title={DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection},
  author={Kim, Sundong and Tsai, Yu-Che and Singh, Karandeep and Choi, Yeonsoo and Ibok, Etim and Li, Cheng-Te and Cha, Meeyoung},
  booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  year={2020}
}

@article{kim2021customs,
  title={Active Learning for Human-in-the-Loop Customs Inspection},
  author={Sundong Kim and Tung-Duong Mai and Sungwon Han and Sungwon Park and Thi Nguyen Duc Khanh and Jaechan So and Karandeep Singh and Meeyoung Cha},
  journal={IEEE Transactions on Knowledge and Data Engineering},
  year={2022}
}

@inproceedings{mai2021drift,
  title={{Customs fraud detection in the presence of concept drift}},
  author={Tung-Duong Mai and Kien Hoang and Aitolkyn Baigutanova and Gaukhartas Alina and Sundong Kim},
  booktitle={Proc. of the International Conference on Data Mining Workshops},
  year={2021},
  pages={370--379}
}

Contribution

We welcome contributions such as designing new selection strategies, automating feature engineering that adapts to different feature sets, donating anonymized import declaration datasets, and packaging the software (PyPI). To collaborate with us, please contact Sundong Kim (sundong@gist.ac.kr).