DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection

License: CC BY-NC-SA 4.0

Implementation of our KDD paper DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection.

[Slides] [Presentation (20 min)] [Presentation (5 min)] [Promotional video]

DATE is a model that classifies and ranks illegal trade flows, prioritizing those that would contribute the most to overall customs revenue when caught.

News

We released a new repository for simulating a customs targeting system. Dozens of selection strategies are provided along with DATE. Please check out our new code.

Preliminaries

For background, we suggest that readers look at the repository below. It provides stepping stones toward the DATE model for Customs administrations and officials who want to build their capacity to use machine learning in their daily work. The repository covers prerequisite knowledge and practice for machine learning, so that the Customs community can better understand the cutting-edge algorithms in the DATE model.

Machine Learning for Customs Fraud Detection

Overview of the Transaction-level Import Data

An Import Declaration is a statement made by the importer (owner of the goods), or their agent (licensed customs broker), to provide information about the goods being imported. The Import Declaration collects details on the importer, how the goods are being transported, the tariff classification and customs value.

Synthetic Data

To illustrate the expected format, we provide synthetic import declarations in the data/ directory. Users are expected to preprocess their own import declarations into a similar format.

sgd.id  sgd.date  importer.id  tariff.code  ...  cif.value  total.taxes  illicit  revenue
SGD1    13-01-02  IMP826164    8703241128   ...  2809       647          0        0
SGD2    13-01-02  IMP837219    8703232926   ...  266140     3262         0        0
SGD3    13-01-02  IMP117406    8517180000   ...  302275     5612         0        0
SGD4    13-01-02  IMP435108    8703222900   ...  4160       514          0        0
SGD5    13-01-02  IMP717900    8545200000   ...  239549     397          1        980
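
As a rough aid for preparing your own data, the following sketch checks that a file has the columns shown in the sample and that the label and value fields are well formed. It assumes a comma-separated file and a binary illicit label (0 or 1). The function name check_declarations and the two sample rows are hypothetical and are not taken from data/.

```python
import csv
import io

# Columns visible in the sample above; the columns hidden behind "..." are not checked.
REQUIRED_COLUMNS = [
    "sgd.id", "sgd.date", "importer.id", "tariff.code",
    "cif.value", "total.taxes", "illicit", "revenue",
]


def check_declarations(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Assumption: illicit is a binary label.
        if (row["illicit"] or "").strip() not in ("0", "1"):
            problems.append(f"line {line_no}: illicit must be 0 or 1")
        for column in ("cif.value", "total.taxes", "revenue"):
            try:
                float(row[column])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: {column} is not numeric")
    return problems


# Hypothetical rows for illustration only.
SAMPLE = """sgd.id,sgd.date,importer.id,tariff.code,cif.value,total.taxes,illicit,revenue
X1,13-01-02,IMP000001,8703241128,1000,100,0,0
X2,13-01-02,IMP000002,8517180000,2000,150,2,0
"""

print(check_declarations(SAMPLE))  # ['line 3: illicit must be 0 or 1']
```

An empty list means the file passed these basic checks. It does not guarantee that the preprocessing scripts will accept the file.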

Model Architecture

DATE consists of three stages. The first stage pre-trains a tree-based classifier to generate cross features for each transaction. The second stage is a dual attentive mechanism that learns both the interactions among cross features and the interactions among importers, HS codes, and cross features. The third stage performs dual-task learning by jointly optimizing illicitness classification and revenue prediction. The overall architecture is depicted in the figure below.
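
To make the third stage more concrete, here is a minimal sketch of a dual-task objective. It assumes the classification term is binary cross-entropy, the revenue term is mean squared error, and the two are combined with a weight alpha. The exact loss used in the repository may differ; the function name dual_task_loss and the default alpha are hypothetical.

```python
import math


def dual_task_loss(p_illicit, y_illicit, rev_pred, rev_true, alpha=1.0):
    """Binary cross-entropy on illicitness plus alpha-weighted MSE on revenue.

    p_illicit: predicted probabilities of being illicit
    y_illicit: ground-truth labels (0 or 1)
    rev_pred, rev_true: predicted and true revenue values
    alpha: weight of the regression term (placeholder default)
    """
    eps = 1e-12  # guards against log(0)
    n = len(y_illicit)
    bce = -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for p, y in zip(p_illicit, y_illicit)
    ) / n
    mse = sum((a - b) ** 2 for a, b in zip(rev_pred, rev_true)) / n
    return bce + alpha * mse
```

A larger alpha makes revenue errors count more relative to classification errors, which is why the weight has to be tuned to the scale of the revenue values.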

Requirements

To run this code fully, you will need these repositories. We have been running our code in Python 3.7.

If you encounter a CUDA version mismatch, please refer to the issue.

How to Install

  1. Set up your Python environment, e.g., Anaconda Python 3.7 Guide
$ source activate py37
  2. Clone the repository
$ git clone https://github.com/Roytsai27/Dual-Attentive-Tree-aware-Embedding.git
  3. Install requirements
$ pip install -r requirements.txt
# Please install the Ranger optimizer by following its instructions.
  4. Run the code
$ python preprocess_data.py; python generate_loader.py; python train.py
  5. Check the DATE_manual to grasp how the model works. The manual provides a step-by-step execution of the DATE model and a detailed explanation of its sub-modules.
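
The run command above chains the three scripts with `;`, so later scripts still start even if an earlier one fails. As an alternative, the following sketch runs the same scripts in order and stops at the first failure. The function name run_pipeline and its parameters are hypothetical; only the three script names come from this README.

```python
import subprocess
import sys

# Script names as given in the run command above.
STEPS = ["preprocess_data.py", "generate_loader.py", "train.py"]


def run_pipeline(steps=STEPS, extra_train_args=(), runner=subprocess.run):
    """Run each script with the current interpreter; stop at the first failure."""
    for script in steps:
        cmd = [sys.executable, script]
        if script == "train.py":
            # Optional hyperparameters are forwarded to the training script only.
            cmd += list(extra_train_args)
        # With check=True, subprocess.run raises CalledProcessError on a non-zero exit.
        runner(cmd, check=True)
```

Calling `run_pipeline(extra_train_args=["--epoch", "10"])` from the repository root would run all three steps and pass the extra arguments to train.py.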

How to Train the Model

  1. Run preprocess_data.py. This script preprocesses the raw customs data and dumps a preprocessed file for training the XGB model in step 2.
  2. Run generate_loader.py. This script trains and evaluates the XGB model and the XGB+LR model. It also dumps a pickle file for training the DATE model in step 3.
  3. Run train.py. This script trains and evaluates the DATE model. You can tune the hyperparameters by adding arguments after train.py, e.g., python3 train.py --epoch 10 --l2 1e-6.
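
The tree-based model in step 2 is what supplies the cross features mentioned under Model Architecture. One common way to obtain such features, assumed here rather than confirmed by this README, is to record which leaf a transaction lands in for each tree and one-hot encode those leaf indices. The sketch below uses two hand-written toy trees in place of a trained XGB model; the tree functions, thresholds, and the example transaction are all hypothetical.

```python
# Toy stand-ins for trained trees: each maps a transaction to a leaf index.
# A real run would read leaf indices from the trained XGB model instead.
def tree_a(tx):
    if tx["cif.value"] < 5000:
        return 0 if tx["total.taxes"] < 500 else 1
    return 2


def tree_b(tx):
    return 0 if tx["tariff.code"].startswith("8703") else 1


TREES = [(tree_a, 3), (tree_b, 2)]  # (tree function, number of leaves)


def leaf_one_hot(tx, trees=TREES):
    """Concatenate one one-hot vector per tree, marking the leaf the transaction reaches."""
    features = []
    for tree, n_leaves in trees:
        one_hot = [0] * n_leaves
        one_hot[tree(tx)] = 1
        features.extend(one_hot)
    return features


# Hypothetical transaction for illustration only.
tx = {"cif.value": 1000, "total.taxes": 600, "tariff.code": "8703241128"}
print(leaf_one_hot(tx))  # [0, 1, 0, 1, 0]
```

Each leaf corresponds to a conjunction of split conditions, so a 1 in the vector can be read as "this combination of conditions holds", which is what makes it a cross feature.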

Important: With default settings, the model runs on the provided synthetic data.

Hyperparameters:

--epoch: number of epochs
--l2: L2 regularization strength
--dim: dimension of the hidden layers
--use_self: whether to use leaf-wise self-attention
--alpha: adaptive weight that balances the scale and importance of the regression loss
--lr: learning rate
--head_num: number of heads for self-attention
--act: activation function (Relu or Mish)
--device: device name for training; to train on CPU, use "cpu"
--output: CSV file in which the performance output is saved
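
For readers who want to see how such options fit together, here is a sketch of an argparse parser that mirrors the flags listed above. Only the flag names come from this README. The types, the accepted values for --use_self and --act, and every default are placeholder assumptions and may not match train.py.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (defaults are placeholders)."""
    parser = argparse.ArgumentParser(description="Sketch of the options listed above")
    parser.add_argument("--epoch", type=int, default=5, help="number of epochs")
    parser.add_argument("--l2", type=float, default=0.01, help="L2 regularization strength")
    parser.add_argument("--dim", type=int, default=16, help="dimension of the hidden layers")
    # Assumption: the switch is passed as 0 or 1.
    parser.add_argument("--use_self", type=int, choices=[0, 1], default=1,
                        help="use leaf-wise self-attention (1) or not (0)")
    parser.add_argument("--alpha", type=float, default=1.0, help="weight of the regression loss")
    parser.add_argument("--lr", type=float, default=0.005, help="learning rate")
    parser.add_argument("--head_num", type=int, default=4, help="number of attention heads")
    # Assumption: activation names are matched case-insensitively.
    parser.add_argument("--act", type=str.lower, choices=["relu", "mish"], default="relu",
                        help="activation function")
    parser.add_argument("--device", type=str, default="cpu", help="device name for training")
    parser.add_argument("--output", type=str, default="results.csv",
                        help="CSV file for the performance output")
    return parser


args = build_parser().parse_args(["--epoch", "10", "--l2", "1e-6"])
print(args.epoch, args.l2)  # 10 1e-06
```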

Main Results

The table below reports the results of the DATE model and its baselines on the Nigerian import declarations.

Other Experiments & Codes

Code for the auxiliary experiments is available in the experiments/ directory.

Customs Selection in Batch

If you want to use DATE and other baselines for a pilot test, please refer to this directory.

Citation

If you mention DATE in your publication, please cite the original paper:

@inproceedings{kimtsai2020date,
  title={DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection},
  author={Kim, Sundong and Tsai, Yu-Che and Singh, Karandeep and Choi, Yeonsoo and Ibok, Etim and Li, Cheng-Te and Cha, Meeyoung},
  booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  year={2020}
}