ComplexCi

This project (ComplexCi) provides a C++ implementation of the Collective Influence (CI) algorithm, which finds the most important nodes (influencers) in a complex network via optimal percolation, as developed in:

Morone F, Makse H A. Influence maximization in complex networks through optimal percolation[J]. Nature, 2015, 524(7563): 65-68.

Morone F, Min B, Bo L, et al. Collective Influence Algorithm to find influencers via optimal percolation in massively large social media[J]. Scientific Reports, 2016, 6.

This project also covers the new algorithm CI_DR proposed in https://www.nature.com/articles/s41598-018-32874-5 .

Fengkuangtian Zhu. Improved collective influence of finding most influential nodes based on disjoint-set reinsertion[J]. Scientific Reports, 2018, 9.

Users can run the newReinsertCollectiveInfluence script in this project to try CI_DR.

Algorithm

Overall, the goal of the CI algorithm is to produce a ranking of nodes by importance, with the most important nodes at the top. We can remove nodes from the top of the ranking one by one and measure the size of the giant component after each removal; this gradually breaks the network into many disconnected pieces, and the giant component ratio eventually reaches zero. Therefore, the better the algorithm, the sooner the network collapses to a zero giant component, i.e., the fewer nodes need to be removed.
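For reference, the CI score at radius \ell (the ballRadius parameter used throughout this README) is defined in the Morone & Makse paper as

    CI_\ell(i) = (k_i - 1) \sum_{j \in \partial \mathrm{Ball}(i,\ell)} (k_j - 1)

where k_i is the degree of node i and \partial \mathrm{Ball}(i,\ell) is the set of nodes at shortest-path distance exactly \ell from i. At \ell = 0 this reduces to (k_i - 1)^2, which ranks nodes exactly as degree does; that is the HDA degeneration mentioned for ballRadius 0 below.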

Users can obtain the minimal set of influencers of a complex network with the C++ program in this repository. Since CI algorithms are applicable in many scientific fields, I implemented the algorithm in modern C++ style. Compared with the original C code http://www-levich.engr.ccny.cuny.edu/~hernanlab/uploads/CI_HEAP.c published on http://www-levich.engr.ccny.cuny.edu/webpage/hmakse/software-and-data/ along with the papers above, ComplexCi offers several improvements; in particular, as the benchmark below shows, it is noticeably faster on several datasets.

Usage

This section describes how to use ComplexCi and its accompanying scripts.

Get Repository

Download Release

A binary compiled for Windows 7 x64 is provided at https://github.com/zhfkt/ComplexCi/releases

Or Fetch from Github Source and Compile

Users can clone the repository directly with git or download the zip archive from the webpage:

git clone https://github.com/zhfkt/ComplexCi.git

The code uses many C++11 features, so the compiler needs to support the C++11 standard. In fact, only one cpp file, ComplexCi/ComplexCi.cpp, needs to be compiled, and it does not rely on any extra libraries.

Users can enter the bin/ folder under the project root and execute

g++ -pthread  ../ComplexCi/ComplexCi.cpp  -o ComplexCi -O3 -std=c++11 

or just execute

./make.sh

to generate the binary bin/ComplexCi. Please note that the g++ version needs to support C++11; otherwise compilation will fail with syntax errors. For my own development, compilation passes with g++ 5.4.0 on Ubuntu 16.04.

A Windows binary, x64/Release/ComplexCi.exe, compiled on Windows 7 x64, is already included in the repository and can be used directly if users do not need to recompile or change the code. Users who want to compile it themselves and have Visual Studio 2013 or higher can open ComplexCi.sln and build the code in the IDE; the binary will be generated at x64/Release/ComplexCi.exe. Any other IDE or compiler supporting C++11 should also be able to compile ComplexCi/ComplexCi.cpp. Note that after compiling, users need to replace x64/Release/ComplexCi.exe with their own generated file.

Run

Network Input File Format

The network data is written in a single txt/csv file, and the network is treated as undirected. Each row contains 2 node IDs separated by a comma, meaning there is an edge between these 2 nodes. Node IDs should be integers starting from 0. For example:

0,1
1,2
2,3
3,4
3,5
...

A complete example is at data/test/karate.csv
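As an illustration of this format (not the loader used inside ComplexCi itself), a minimal C++ sketch that reads such a comma-separated edge list into an adjacency list might look like this:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Read an undirected edge list in "u,v" format into an adjacency list.
// Node IDs are assumed to be 0-based integers, as described above.
std::vector<std::vector<int>> readEdgeList(const std::string &path)
{
    std::vector<std::vector<int>> adj;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line))
    {
        if (line.empty())
            continue;
        std::istringstream ss(line);
        std::string a, b;
        std::getline(ss, a, ',');
        std::getline(ss, b, ',');
        int u = std::stoi(a), v = std::stoi(b);
        int maxId = std::max(u, v);
        if ((int)adj.size() <= maxId)
            adj.resize(maxId + 1);
        adj[u].push_back(v);   // undirected: store both directions
        adj[v].push_back(u);
    }
    return adj;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
    {
        std::cerr << "usage: readEdgeList <networkPath>\n";
        return 1;
    }
    std::vector<std::vector<int>> adj = readEdgeList(argv[1]);
    std::cout << "nodes: " << adj.size() << "\n";
    return 0;
}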

Easy Start with the Scripts

Using the scripts in the project is the quickest way to try the Collective Influence (CI) algorithm. There are 3 scripts users can execute. The project contains both bash scripts for Linux and cmd scripts for Windows: the bash scripts are in dailyUse/linux and the cmd scripts are in dailyUse/windows.

The traditionalCollectiveInfluence script runs the traditional C-style code from http://www-levich.engr.ccny.cuny.edu/~hernanlab/uploads/CI_HEAP.c . I merged it into this repository and added a compatibility layer so the C code can be called from C++.

Script “traditionalCollectiveInfluence” accepts 3 parameters:

./traditionalCollectiveInfluence.sh <networkPath> <ballRadius> <isPrintMinPointCausingMinComponent>


# "networkPath" is the file path. The file format is described in the section Network Input File Format
# "ballRadius" is the Radius parameter defined in the Collective Influence Algorithm. When the ballRadius is zero, pls notice that Collective Influence Algorithm will degenerate into HDA (high degree adaptive) algorithm. a.k.a. CI value of each node will be equal with degree of the node.
# "isPrintMinPointCausingMinComponent" whether the output file contains limited point leading to 0.01 of giant component ratio or all points. If it is set to 0, the program will output all nodes. Otherwise the program will output partial points, which will make the giant component ratio to 0.01 in the deleting nodes process

e.g.

./traditionalCollectiveInfluence.sh /home/network/model1.csv 3 1

This runs the traditionalCollectiveInfluence strategy on the input file model1.csv with ballRadius 3. The output file contains only the nodes whose removal drives the giant component ratio down to 0.01.

The cppCollectiveInfluence script runs the new C++ implementation of the Collective Influence (CI) algorithm.

Script “cppCollectiveInfluence” accepts 4 parameters:

./cppCollectiveInfluence.sh <networkPath> <ballRadius> <updateBatch> <isPrintMinPointCausingMinComponent>


# "networkPath" is the file path.  The file format is described in the section Network Input File Format
# "ballRadius" is the Radius parameter defined in the Collective Influence Algorithm. When the ballRadius is zero, pls notice that Collective Influence Algorithm will degenerate into HDA (high degree adaptive) algorithm. a.k.a. CI value of each node will be equal with degree of the node.
# "updateBatch" batch size of deleting nodes in Network per updating CI values when the Complex Network collapses
# "isPrintMinPointCausingMinComponent" whether the output file contains limited point leading to 0 of giant component ratio or all points. If it is set to 0, the program will output all nodes. Otherwise the program will output partial points, which will make the giant component ratio to 0 in the deleting nodes process.  i.e. There is no edge but still left point in the network. Pls notice the different behavior compared with traditionalCollectiveInfluence 

e.g.

./cppCollectiveInfluence.sh /home/network/model1.csv 2 500 0

This runs the cppCollectiveInfluence strategy on the input file model1.csv with ballRadius 2. The output file contains all nodes, and 500 nodes are deleted per batch before CI values are recalculated while the network collapses.
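To make updateBatch concrete, below is a hedged, self-contained C++ sketch of the batched adaptive-removal loop. It uses plain degree as the ranking score (the ballRadius 0 / HDA case) and a simple BFS for the giant component, so it only illustrates the control flow of batch deletion and re-scoring, not the actual ComplexCi implementation. Larger batches mean fewer score recomputations, trading some ranking accuracy for speed.

#include <algorithm>
#include <iostream>
#include <queue>
#include <vector>

// Size of the largest connected component among nodes not yet removed.
int giantComponent(const std::vector<std::vector<int>> &adj,
                   const std::vector<char> &removed)
{
    int n = adj.size(), best = 0;
    std::vector<char> seen(n, 0);
    for (int s = 0; s < n; ++s)
    {
        if (removed[s] || seen[s]) continue;
        int size = 0;
        std::queue<int> q;
        q.push(s);
        seen[s] = 1;
        while (!q.empty())
        {
            int u = q.front(); q.pop();
            ++size;
            for (int v : adj[u])
                if (!removed[v] && !seen[v]) { seen[v] = 1; q.push(v); }
        }
        best = std::max(best, size);
    }
    return best;
}

// Remove nodes in batches of updateBatch (assumed >= 1), re-ranking after every batch.
// Degree stands in for the CI score here (the ballRadius 0 / HDA case).
std::vector<int> batchedRemoval(const std::vector<std::vector<int>> &adj, int updateBatch)
{
    int n = adj.size();
    std::vector<char> removed(n, 0);
    std::vector<int> degree(n), order;
    for (int u = 0; u < n; ++u) degree[u] = adj[u].size();

    while (giantComponent(adj, removed) > 1)
    {
        // Rank the remaining nodes by their current score.
        std::vector<int> cand;
        for (int u = 0; u < n; ++u) if (!removed[u]) cand.push_back(u);
        std::sort(cand.begin(), cand.end(),
                  [&](int a, int b) { return degree[a] > degree[b]; });

        // Delete one batch of nodes before the scores are recomputed.
        int batch = std::min(updateBatch, (int)cand.size());
        for (int i = 0; i < batch; ++i)
        {
            int u = cand[i];
            removed[u] = 1;
            order.push_back(u);
            for (int v : adj[u]) if (!removed[v]) --degree[v];
        }
    }
    return order;  // the removal order, analogous to the node list in the output file
}

int main()
{
    // Tiny example: a path 0-1-2-3-4 plus an extra edge 1-3.
    std::vector<std::vector<int>> adj = {{1}, {0, 2, 3}, {1, 3}, {1, 2, 4}, {3}};
    for (int u : batchedRemoval(adj, 1))
        std::cout << u << " ";
    std::cout << "\n";
    return 0;
}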

The newReinsertCollectiveInfluence script uses the new re-insert algorithm for Collective Influence (CI_DR). Verified on the 8 datasets below, this algorithm achieves better Robustness than the original re-insert algorithm of Collective Influence; see the benchmark section.

The CI_DR is proposed in https://www.nature.com/articles/s41598-018-32874-5 .

Fengkuangtian Zhu. Improved collective influence of finding most influential nodes based on disjoint-set reinsertion[J]. Scientific Reports, 2018, 9.

Script “newReinsertCollectiveInfluence” accepts 4 parameters:

./newReinsertCollectiveInfluence.sh <networkPath> <ballRadius> <updateBatch> <isPrintMinPointCausingMinComponent>

# "networkPath" is the file path.  The file format is described in the section Network Input File Format
# "ballRadius" is the Radius parameter defined in the Collective Influence Algorithm. When the ballRadius is zero, pls notice that Collective Influence Algorithm will degenerate into HDA (high degree adaptive) algorithm. a.k.a. CI value of each node will be equal with degree of the node.
# "updateBatch" batch size of deleting nodes in Network per updating CI values when the Complex Network collapses
# "isPrintMinPointCausingMinComponent" whether the output file contains limited point leading to 0 of giant component ratio or all points. If it is set to 0, the program will output all nodes. Otherwise the program will output partial points, which will make the giant component ratio to 0 in the deleting nodes process.  i.e. There is no edge but still left point in the network. Pls notice the different behavior compared with traditionalCollectiveInfluence 

e.g.

./newReinsertCollectiveInfluence.sh /home/network/model1.csv 2 500 0

This runs the newReinsertCollectiveInfluence (CI_DR) strategy on the input file model1.csv with ballRadius 2. The output file contains all nodes, and 500 nodes are deleted per batch before CI values are recalculated while the network collapses.

Output file

When the algorithm finishes, the output file is generated in the same folder as the input file with the suffix "_out". Each row in the output file contains a group of "outputNumBatch" nodes, where "outputNumBatch" is defined when running the ComplexCi binary. If users run the scripts, "outputNumBatch" is fixed to 500. For instance:

model1,4,5,2,33,…,2214
model1,2432,4554,3222,1123,…,2233
...

In the example above, the nodes of network model1 are removed in the following order: 4, 5, 2, 33, …, 2214, 2432, 4554, 3222, 1123, …, 2233, … . If the node count is not a multiple of outputNumBatch, the last line of the output will end with extra commas.
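As a hedged illustration of consuming this output (not part of ComplexCi), the following C++ snippet flattens a "_out" file back into a single removal order, skipping the leading network name in each row and the empty fields produced by trailing commas on the last line:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Flatten a "*_out" file into the ordered list of removed node IDs.
std::vector<int> readRemovalOrder(const std::string &outPath)
{
    std::vector<int> order;
    std::ifstream in(outPath);
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream ss(line);
        std::string field;
        bool first = true;                 // the first field is the network name
        while (std::getline(ss, field, ','))
        {
            if (first) { first = false; continue; }
            if (!field.empty())            // trailing commas yield empty fields
                order.push_back(std::stoi(field));
        }
    }
    return order;
}

int main(int argc, char *argv[])
{
    if (argc < 2) { std::cerr << "usage: readRemovalOrder <outFile>\n"; return 1; }
    std::vector<int> order = readRemovalOrder(argv[1]);
    std::cout << "removed nodes: " << order.size() << "\n";
    return 0;
}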

Benchmark

In order to demonstrate the performance of the 3 main algorithms behind the scripts traditionalCollectiveInfluence, cppCollectiveInfluence and newReinsertCollectiveInfluence (CI_DR), I provide 8 test datasets downloaded from the DataCastle Competition at https://github.com/zhfkt/ComplexCi/releases/download/v0.1/networks.zip . The Robustness metric used here quantifies the performance of ranking methods and was introduced in the following paper:

Schneider C M, Moreira A A, Andrade J S, et al. Mitigation of malicious attacks on networks[J]. Proceedings of the National Academy of Sciences, 2011, 108(10): 3838-3841.

The smaller the value, the better the algorithm.
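For reference, the Robustness R of Schneider et al. is the normalized area under the giant-component curve during the removal process:

    R = \frac{1}{N} \sum_{Q=1}^{N} s(Q)

where N is the number of nodes and s(Q) is the fraction of nodes in the giant component after removing Q nodes; an algorithm that dismantles the network with fewer removals leaves less area under the curve and therefore a smaller R.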

| traditionalCollectiveInfluence | model1 | model2 | model3 | model4 | real1 | real2 | real3 | real4 | total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ballRadius 0, Robustness score | 0.2121 | 0.1770 | 0.3484 | 0.1285 | 0.0450 | 0.0902 | 0.1022 | 0.0755 | 1.1793 |
| ballRadius 0, time | 188s | 155s | 304s | 117s | 705s | 93s | 504s | 170s | 705s |
| ballRadius 1, Robustness score | 0.2080 | 0.1732 | 0.3460 | 0.1249 | 0.0409 | 0.0490 | 0.0964 | 0.0658 | 1.1045 |
| ballRadius 1, time | 200s | 157s | 341s | 121s | 597s | 69s | 727s | 173s | 727s |
| ballRadius 2, Robustness score | 0.2078 | 0.1731 | 0.3446 | 0.1188 | 0.0388 | 0.0417 | 0.0955 | 0.0488 | 1.0695 |
| ballRadius 2, time | 1558s | 539s | 1435s | 306s | 30738s | 57s | 74935s | 930s | 74935s |

| cppCollectiveInfluence | model1 | model2 | model3 | model4 | real1 | real2 | real3 | real4 | total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ballRadius 0, Robustness score | 0.2129 | 0.1782 | 0.3488 | 0.1304 | 0.0458 | 0.0917 | 0.1029 | 0.0751 | 1.1863 |
| ballRadius 0, time | 150s | 121s | 219s | 91s | 203s | 90s | 173s | 144s | 219s |
| ballRadius 1, Robustness score | 0.2078 | 0.1728 | 0.3459 | 0.1257 | 0.0407 | 0.0544 | 0.0969 | 0.0653 | 1.1099 |
| ballRadius 1, time | 233s | 175s | 321s | 128s | 500s | 62s | 637s | 167s | 637s |
| ballRadius 2, Robustness score | 0.2083 | 0.1732 | 0.3446 | 0.1188 | 0.0386 | 0.0416 | 0.0954 | 0.0492 | 1.0700 |
| ballRadius 2, time | 1833s | 709s | 1674s | 450s | 27263s | 57s | 43663s | 1012s | 43663s |

| newReinsertCollectiveInfluence (CI_DR) | model1 | model2 | model3 | model4 | real1 | real2 | real3 | real4 | total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ballRadius 0, Robustness score | 0.2100 | 0.1744 | 0.3603 | 0.1174 | 0.0315 | 0.0069 | 0.0978 | 0.0417 | 1.0402 |
| ballRadius 0, time | 149s | 127s | 204s | 92s | 163s | 102s | 162s | 133s | 204s |
| ballRadius 1, Robustness score | 0.2104 | 0.1743 | 0.3655 | 0.1152 | 0.0317 | 0.0046 | 0.0968 | 0.0421 | 1.0409 |
| ballRadius 1, time | 228s | 171s | 321s | 123s | 461s | 65s | 609s | 159s | 609s |
| ballRadius 2, Robustness score | 0.2106 | 0.1739 | 0.3585 | 0.1147 | 0.0303 | 0.0039 | 0.0954 | 0.0369 | 1.0246 |
| ballRadius 2, time | 1841s | 769s | 1679s | 392s | 27375s | 58s | 43405s | 1009s | 43405s |

The reported times do not include disk I/O, and the 8 datasets run in parallel on a 4-core Linux machine (Intel Xeon E5-2667 v4, Broadwell, 3.2 GHz); the total time column is therefore the maximum over the datasets.

From the benchmark, we can see that the traditional C implementation (traditionalCollectiveInfluence) and the new C++ implementation (cppCollectiveInfluence) achieve nearly the same Robustness, while the new C++ implementation is more efficient and spends much less time on some datasets. The new C++ implementation (ComplexCi) uses a disjoint-set data structure in the reinsertion step, which speeds it up considerably; traditionalCollectiveInfluence does not use this data structure, which I believe is why the traditional C program is slower.
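The sketch below shows the idea behind disjoint-set reinsertion (the idea only, not the exact ComplexCi code): starting from the dismantled network, nodes are reinserted one at a time, and a union-find structure with size tracking merges the components of a reinserted node's already-present neighbours in near-constant amortized time, so the size of the resulting component is known immediately without re-running a BFS.

#include <iostream>
#include <utility>
#include <vector>

// Union-find with union by size and path compression.
struct DisjointSet
{
    std::vector<int> parent, size;
    explicit DisjointSet(int n) : parent(n), size(n, 1)
    {
        for (int i = 0; i < n; ++i) parent[i] = i;
    }
    int find(int x)
    {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }
    int unite(int a, int b)                 // returns the size of the merged component
    {
        a = find(a); b = find(b);
        if (a == b) return size[a];
        if (size[a] < size[b]) std::swap(a, b);
        parent[b] = a;
        size[a] += size[b];
        return size[a];
    }
};

// Reinsert 'node': merge it with all neighbours that are already present
// and return the size of the component it ends up in.
int reinsert(int node, const std::vector<std::vector<int>> &adj,
             std::vector<char> &present, DisjointSet &ds)
{
    present[node] = 1;
    int componentSize = 1;
    for (int v : adj[node])
        if (present[v])
            componentSize = ds.unite(node, v);
    return componentSize;
}

int main()
{
    // Toy example: reinsert the nodes of a square 0-1-2-3-0 one by one.
    std::vector<std::vector<int>> adj = {{1, 3}, {0, 2}, {1, 3}, {2, 0}};
    DisjointSet ds(4);
    std::vector<char> present(4, 0);
    for (int u = 0; u < 4; ++u)
        std::cout << "reinsert " << u << " -> component size "
                  << reinsert(u, adj, present, ds) << "\n";
    return 0;
}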

We can also see that newReinsertCollectiveInfluence (CI_DR) achieves better Robustness values. Notably, CI_DR performs well even when ballRadius is set to 0, i.e., when the simple HDA (high degree adaptive) strategy is used instead of the full Collective Influence algorithm.

DataCastle Competition

http://www.pkbigdata.com/common/cmpt/%E5%A4%A7%E5%B8%88%E8%B5%9B_%E7%AB%9E%E8%B5%9B%E4%BF%A1%E6%81%AF.html?lang=en_US

Competition background

Disparate networks, including social networks, communication networks and biological networks, are playing an increasingly important role in natural and socio-economic systems. A core problem therein is to measure the significance of individual nodes. For instance, a super spreader in Hong Kong triggered SARS transmissions to significantly more people than 100 ordinary infected persons did; a rumor re-tweeted by a celebrity may spread much more widely than one re-tweeted by an obscure person.

Therefore it is necessary to develop methods to identify the virulence genes in large-scale gene regulatory networks, to find the super-spreaders in large-scale social networks, and to detect the key enterprises carrying serious systemic financial risk in large-scale financial networks.

These tasks can be formalized as one generic challenge: identifying the vital nodes that are important for sustaining a network's connectivity. This challenge, a.k.a. optimal percolation, is a well-documented problem in network science. Hoping to make substantial progress on this problem, we invited several experts and hope the participants will create novel and effective solutions.

How to get Quick/Best Result

I provide 2 scripts, for both Windows and Linux, to help users get a quick result and the best result with ComplexCi for the DataCastle Competition.

The quickResult script is a fast way to generate output. Compared with the best result, it takes far less time, and the result is still surprisingly competitive on these million-scale networks. The quickResult script achieves 1.0287 in the Robustness metric.

The bestResult script achieves a score of 0.9928 in Robustness at its best, but it takes much longer, finishing in several hours.

If users want to try the quick/best result of the proposed algorithm for the DataCastle Competition, they should first make sure the 8 datasets are placed at the following paths:

data/networks/model1.csv
data/networks/model2.csv
data/networks/model3.csv
data/networks/model4.csv
data/networks/real1.csv
data/networks/real2.csv
data/networks/real3.csv
data/networks/real4.csv

On Linux, users can execute

./quickResult.sh

under the bin/ folder to get a quick result and wait until it finishes. The task completes in about 5 minutes for these several million-scale networks on a 4-core machine.

On Windows, users can execute

quickResult.cmd

under the x64/Release/ folder; the cmd script will pop up 8 separate consoles. Once all of these consoles have completed the algorithm, users can use

mergeResult.cmd 

to merge the 8 files into one result file.

On Linux, users can also execute

./bestResult.sh 

under the bin/ folder to get the best result, but it will take nearly 3 hours to complete the whole algorithm for all datasets.

On Windows, users can execute

bestResult.cmd

to get the best result, but it will take longer than the Linux version because the 8 datasets are not processed in parallel.

Monitor Quick/Best Result on the Linux environment

Because the quick/best result scripts run in the background on Linux, users can monitor the background jobs with the following command in the bin folder:

tail -f "serID".log

"serID" will be shown on the script running screen

Quick/Best Result Benchmark

| | model1 | model2 | model3 | model4 | real1 | real2 | real3 | real4 | total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| quickResult, score | 0.2100 | 0.1744 | 0.3488 | 0.1174 | 0.0315 | 0.0069 | 0.0978 | 0.0417 | 1.0287 |
| quickResult, time | 150s | 128s | 215s | 100s | 165s | 103s | 162s | 135s | 215s |
| bestResult, score | 0.2060 | 0.1698 | 0.3438 | 0.1117 | 0.0306 | 0.0023 | 0.0939 | 0.0356 | 0.9928 |
| bestResult, time | 6057s | 5114s | 10145s | 3278s | 9621s | 4013s | 4812s | 5008s | 10145s |

The reported times do not include disk I/O, and the 8 datasets run in parallel on a 4-core Linux machine (Intel Xeon E5-2667 v4, Broadwell, 3.2 GHz).

Slides presented at CCCN2017 (www.cccn2017.top:2017): https://github.com/zhfkt/ComplexCi/blob/master/Fengkuangtian%20Zhu%20-%20teamID%20zhfkt%20-%20presentation.pptx

Folder/Files

If you have any questions about this project, feel free to file an issue on GitHub.

Cite

To cite ComplexCi please use the following publication:

Fengkuangtian Zhu. Improved collective influence of finding most influential nodes based on disjoint-set reinsertion[J]. Scientific Reports, 2018, 9.