Home

Awesome

Pang v1.0.0

Pattern-Based Anomaly Detection in Graphs

Pang is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. For source availability and license information see licence.txt


Description

Pang is an algorithm which represents and classifies a collection of graphs according to their frequent patterns (subgraphs). The detail of this algorithm are described in an article [P'23]. This work was conducted in the framework of the DeCoMaP ANR project (Detection of corruption in public procurement markets -- ANR-19-CE38-0004).

If you use this source code, please cite article [P'23]:

@InProceedings{Potin2023c,
  author    = {Potin, Lucas and Figueiredo, Rosa and Labatut, Vincent and Largeron, Christine},
  title     = {Pattern Mining for Anomaly Detection in Graphs: Application to Fraud in Public Procurement},
  booktitle = {Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
  year      = {2023},
  volume    = {14174},
  series    = {Lecture Notes in Computer Science},
  publisher = {Springer},
  pages     = {69-87},
  doi       = {10.1007/978-3-031-43427-3_5},
  url       = {https://link.springer.com/chapter/10.1007/978-3-031-43427-3_5},
}

Content

Organization

This repository is composed of the following elements:

Installation

Python and Packages

First, you need to install the Python language and the required packages:

  1. Install the Python language
  2. Download this project from GitHub and unzip.
  3. Execute pip install -r requirements.txt to install the required packages (see also Section Dependencies).

Non-Python Dependencies

Second, one of the dependencies, SPMF, is not a Python package, but rather a Java program, and therefore requires a specific installation process:

Note that SPMF is available both as a JAR and as source code archive. However, the former does not contain all the features required by Pang, so one should use only the latter.

In order to run the script that reproduces our ECML PKDD experiments, you also need to install CORK. This is done by unzipping the archive CORKcpp.zip in the src folder. File Readme in this archive contains the instruction for compiling the C++ source code.

Data

Third, you need to set up the data to which you want to apply Pang. This can be the dataset from our paper, in which case you will need to unzip several archives, or your own data, in which case they need to be respect the appropriate format. In both cases, see cf. Section Usage.

Use

We provide two scripts to use Pang:

To Replicate the Paper Experiments

To replicate the experiments in our Paper[P'23], first unzip the provided datasets, and run Pang on them.

Data Preparation

To unzip the datasets used in our experiments:

  1. Go to the data folder.
  2. In each subfolder, you will find an archive that you need to unzip.

We retrieved the benchmark datasets from the SPMF website; they include:

The public procurement dataset contains graphs extracted from the FOPPA database:

Processing

Then, run the appropriate script:

  1. Open the Python console.
  2. Run EMCL.py

The script will compute the results of the experiments and save the results associated with Table 2, 4, 5 and 6 of the paper, in the results folder.

To Apply Pang to Other Data

If you want to use Pang with your own data, you need to set up the data, then identify the patterns, and finally perform the classification.

Data Preparation

Create an XXX folder in the data folder (where XXX is the name of your dataset), in order to host your data. This folder must contain the following files:

We use the same format as SPMF for the graph input files, i.e.:

  1. t # N N: graph id
  2. v M L M: node id, L: node label
  3. e P Q L P: source node id, Q: destination node id, L: edge label

For information, the files produced by our scripts to list the identified patterns are similar, except they contain an extra line:

  1. x A B C A,B,C : graphs containing the pattern

The format of the file containing the graph labels is as follows: each line contains an unique integer, corresponding to the label of the graph in the same line in the graph file.

Processing

Once the data are ready, you need to run a script to identify the patterns, and produce the files required by Pang:

  1. Open the Python console.
  2. Run the script Patterns.sh in order to create the files XXX_patterns.txt.
  3. Run ProcessingPattern.pywith the option -d XXX in order to create the files XXX_mono.txt and XXX_iso.txt.
  4. Run PANG.py, specifying both following parameters:
    • -d XXX: name of the dataset
    • -k k: number of patterns to consider in the PANG representations. User can provide a single value, or a list of values separated by commas.

For each value of the parameter k, Pang will create a file KResults.txt containing the results of the classification and a file KPatterns.txt containing the patterns.

Dependencies

Tested with python version 3.6.13 and the following packages:

The VF2 [C'04] and ISMAGS [H'14] algorithms are included in the Networkx library

Tested with SPMF version 2.54, which implements gSpan [Y'02] (to mine frequent patterns) and cgSpan [S'21] (closed frequent patterns).

For the ECML PKDD assessment, we use the following algorithms for the sake of comparison:

References