DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection

Implementation of our KDD paper DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection.

[Slides] [Presentation (20 min)] [Presentation (5 min)] [Promotional video]

DATE is a model to classify and rank illegal trade flows that contribute the most to the overall customs revenue when caught.

DATE combines a tree-based model for interpretability and transaction-level embeddings with dual attention mechanisms.
DATE learns simultaneously from illicitness and surtax of each transaction.
DATE shows 92.7% precision on illegal cases and a recall of 49.3% on revenue after inspecting only 1% of all trade flows in Nigeria.
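
The two figures above can be read as precision on illicit cases and recall on revenue among the highest-scored transactions. The following is a minimal sketch of that reading in plain Python. The function name `evaluate_top_n`, its arguments, and the way the inspection budget is rounded are assumptions made for illustration and are not taken from this repository's evaluation code.

```python
def evaluate_top_n(scores, illicit, revenue, inspect_rate=0.01):
    """Precision on illicit cases and recall on revenue in the top-scored share.

    Hypothetical helper: scores are model outputs (higher = more suspicious),
    illicit holds 0/1 labels, revenue holds the surtax recovered per transaction.
    """
    if not (len(scores) == len(illicit) == len(revenue)):
        raise ValueError("scores, illicit and revenue must have the same length")
    if not scores:
        raise ValueError("at least one transaction is required")

    # Number of transactions that can be inspected (at least one).
    n_inspect = max(1, round(len(scores) * inspect_rate))

    # Indices of transactions sorted by descending score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = ranked[:n_inspect]

    precision = sum(illicit[i] for i in selected) / n_inspect
    total_revenue = sum(revenue)
    revenue_recall = (
        sum(revenue[i] for i in selected) / total_revenue if total_revenue else 0.0
    )
    return precision, revenue_recall


# Toy example: inspect half of four transactions.
precision, revenue_recall = evaluate_top_n(
    scores=[0.9, 0.1, 0.8, 0.2],
    illicit=[1, 0, 0, 1],
    revenue=[100, 0, 0, 300],
    inspect_rate=0.5,
)
print(precision, revenue_recall)  # 0.5 0.25
```

In the toy example, one of the two inspected transactions is illicit, and the inspection recovers 100 of the 400 units of revenue at stake.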

News

We have released a new repository for simulating a customs targeting system. It provides dozens of selection strategies alongside DATE. Please find our new code below.

Customs-fraud-detection

Preliminaries

For preliminary understanding, we suggest that readers look at the repository below. It provides stepping stones toward the DATE model for Customs administrations and officials who want to develop their capacity to use machine learning in their daily work. The repository covers prerequisite knowledge and practice for machine learning, so that the Customs community can better understand the cutting-edge algorithms in the DATE model.

Machine Learning for Customs Fraud Detection

Overview of the Transaction-level Import Data

An Import Declaration is a statement made by the importer (owner of the goods), or their agent (licensed customs broker), to provide information about the goods being imported. The Import Declaration collects details on the importer, how the goods are being transported, the tariff classification and customs value.

Synthetic Data

To illustrate the expected format, we provide synthetic import declarations in the data/ directory. Users are expected to preprocess their own import declarations into a similar format.

sgd.id	sgd.date	importer.id	tariff.code	...	cif.value	total.taxes	illicit	revenue
SGD1	13-01-02	IMP826164	8703241128	...	2809	647	0	0
SGD2	13-01-02	IMP837219	8703232926	...	266140	3262	0	0
SGD3	13-01-02	IMP117406	8517180000	...	302275	5612	0	0
SGD4	13-01-02	IMP435108	8703222900	...	4160	514	0	0
SGD5	13-01-02	IMP717900	8545200000	...	239549	397	1	980
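
As a rough illustration of working with this format, the sketch below parses the five sample rows with the standard `csv` module and summarizes the labels. It assumes tab-separated values as in the excerpt above, keeps only the columns shown there (the elided `...` columns are omitted), and uses a hypothetical helper named `load_declarations`. The actual files in data/ may use a different delimiter or more columns.

```python
import csv
import io

# The five sample rows from above, restricted to the columns that are shown.
SAMPLE_TSV = (
    "sgd.id\tsgd.date\timporter.id\ttariff.code\tcif.value\ttotal.taxes\tillicit\trevenue\n"
    "SGD1\t13-01-02\tIMP826164\t8703241128\t2809\t647\t0\t0\n"
    "SGD2\t13-01-02\tIMP837219\t8703232926\t266140\t3262\t0\t0\n"
    "SGD3\t13-01-02\tIMP117406\t8517180000\t302275\t5612\t0\t0\n"
    "SGD4\t13-01-02\tIMP435108\t8703222900\t4160\t514\t0\t0\n"
    "SGD5\t13-01-02\tIMP717900\t8545200000\t239549\t397\t1\t980\n"
)


def load_declarations(handle):
    """Read tab-separated declarations and convert the numeric columns."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Identifiers, dates and tariff codes stay as strings so that no digits are lost.
        row["cif.value"] = float(row["cif.value"])
        row["total.taxes"] = float(row["total.taxes"])
        row["illicit"] = int(row["illicit"])
        row["revenue"] = float(row["revenue"])
        rows.append(row)
    return rows


rows = load_declarations(io.StringIO(SAMPLE_TSV))
illicit_rate = sum(r["illicit"] for r in rows) / len(rows)
total_revenue = sum(r["revenue"] for r in rows)
print(len(rows), illicit_rate, total_revenue)  # 5 0.2 980.0
```

The `illicit` column is the classification target and `revenue` is the regression target, which is why both are converted to numbers here.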

Model Architecture

DATE consists of three stages. The first stage pre-trains a tree-based classifier to generate cross features for each transaction. The second stage is a dual attentive mechanism that learns both the interactions among cross features and the interactions among importers, HS codes, and cross features. The third stage performs dual-task learning by jointly optimizing illicitness classification and revenue prediction. The overall architecture is depicted in the figure below.
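
The joint objective of the third stage can be sketched in plain Python. The form below, binary cross-entropy for illicitness plus a weight `alpha` times the mean squared error for revenue, is an assumption based on the description of `--alpha` in this README. The function names are hypothetical, and the repository's actual loss (implemented in PyTorch and possibly including regularization terms) may differ.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def mean_squared_error(preds, targets):
    """Mean squared error between predicted and actual revenue."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def dual_task_loss(illicit_probs, illicit_labels, revenue_preds, revenue_targets, alpha):
    """Assumed joint objective: classification loss + alpha * regression loss."""
    cls_loss = binary_cross_entropy(illicit_probs, illicit_labels)
    rev_loss = mean_squared_error(revenue_preds, revenue_targets)
    return cls_loss + alpha * rev_loss


loss = dual_task_loss(
    illicit_probs=[0.5, 0.5],
    illicit_labels=[1, 0],
    revenue_preds=[1.0, 3.0],
    revenue_targets=[0.0, 1.0],
    alpha=0.1,
)
print(loss)
```

Setting `alpha` to 0 in this sketch reduces the objective to pure classification, while larger values give the revenue term more influence.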

Requirements

To run this code fully, you will need the following repositories and packages. We run our code on Python 3.7.

Ranger optimizer
torch_multi_head_attention
pytorch>=1.0.0
scikit-learn>=0.21.0
numpy>=1.16.4
pandas>=0.25.3
Others: scipy, matplotlib

If you face a CUDA version mismatch, please refer to the issue.

How to Install

Set up your Python environment: e.g., Anaconda Python 3.7 Guide

$ source activate py37

Clone the repository

$ git clone https://github.com/Roytsai27/Dual-Attentive-Tree-aware-Embedding.git

Install requirements

$ pip install -r requirements.txt
# Please install the Ranger optimizer by following its instructions.

Run the code

$ python preprocess_data.py; python generate_loader.py; python train.py

Check the DATE_manual to grasp how the model works. The manual provides a step-by-step execution of the DATE model and a detailed explanation of its sub-modules.

How to Train the Model

1. Run preprocess_data.py. This script preprocesses the raw customs data and dumps a preprocessed file for training the XGB model in step 2.
2. Run generate_loader.py. This script trains and evaluates the XGB model and the XGB+LR model. It also dumps a pickle file for training the DATE model in step 3.
3. Run train.py. This script trains and evaluates the DATE model. You can tune the hyperparameters by adding args after train.py, e.g. python3 train.py --epoch 10 --l2 1e-6.

Important: With default settings, the model will run on given synthetic data.

Hyperparameters:

Parameters of preprocess_data.py and generate_loader.py: Check this document.
Parameters of train.py:

--epoch: number of epochs
--l2: l2 regularization 
--dim: dimension for hidden layers
--use_self: whether to use leaf-wise self-attention
--alpha: adaptive weight that balances the scale and importance of the regression loss
--lr: learning rate
--head_num: number of heads for self-attention
--act: activation function (Relu or Mish)
--device: device name for training; to train on CPU, use "cpu"
--output: save the performance output in a csv file
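
For readers unfamiliar with the two activation choices offered by `--act`, the sketch below writes both out for a single scalar in plain Python. It uses the standard definitions, ReLU(x) = max(0, x) and Mish(x) = x * tanh(softplus(x)). The function names are illustrative, and the repository applies these activations to tensors in PyTorch rather than to scalars.

```python
import math


def relu(x):
    """ReLU: passes positive values through and zeroes out negatives."""
    return max(0.0, x)


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish: x * tanh(softplus(x)), a smooth alternative to ReLU."""
    return x * math.tanh(softplus(x))


for value in (-1.0, 0.0, 1.0):
    print(value, relu(value), mish(value))
```

Unlike ReLU, Mish is smooth and lets small negative values pass through slightly attenuated instead of clipping them to zero.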

Main Results

The table below shows the results of the DATE model and its baselines on the Nigerian import declarations.

Other Experiments & Codes

Code for auxiliary experiments is available in the experiments/ directory.

revcls: Section 5.1, date_cls and date_rev results
ablation-studies: Section 5.3, includes the w/o attention network and w/o fusion variants. Modify model/AttTreeEmbedding.py with the provided code. The w/o dual-task learning and w/o multi-head self-attention variants can be run by setting args in train.py.
training-length: Section 5.4, effects of training length
corrupted-data: Section 6, way to leverage existing data
hyperparameter-analysis: Section 7.1-2, hyperparameter analysis
loss-weight: Section 7.3, date_cls and date_rev by controlling alpha
interpreting-results: Section 5.6, interpreting DATE results by finding effective cross-features with high attention weight

Customs Selection in Batch

If you want to use DATE and other baselines for a pilot test, please refer to this directory.

weekly-customs-selection: Using DATE model prediction results for customs selection in batch, which can be done daily or weekly.

Citation

If you mention DATE in your publication, please cite the original paper:

@inproceedings{kimtsai2020date,
  title={DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection},
  author={Kim, Sundong and Tsai, Yu-Che and Singh, Karandeep and Choi, Yeonsoo and Ibok, Etim and Li, Cheng-Te and Cha, Meeyoung},
  booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  year={2020}
}