Wikipedia Recommender System
Welcome to our project repository for the Network Tour of Data Science course at EPFL!
We implemented a query-based search engine for Wikipedia articles related to various Machine Learning topics.
In other words, given a query, our system retrieves and suggests articles with similar semantic content. Moreover, we provide a graph visualisation tool to interact with the query engine.
More details about this ML system can be found in the project [report](Team%2002%20-%20Project%20report.pdf).
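As a rough illustration of the underlying idea (a minimal sketch, not the project's actual code; the function and variable names below are made up), the query and the articles are embedded as vectors and the articles are ranked by cosine similarity to the query:

```python
import numpy as np

def rank_articles(query_vec, article_vecs, titles, top_k=5):
    """Rank articles by cosine similarity between their embeddings and the query embedding."""
    # Normalise so that dot products equal cosine similarities
    q = query_vec / np.linalg.norm(query_vec)
    A = article_vecs / np.linalg.norm(article_vecs, axis=1, keepdims=True)
    scores = A @ q
    best = np.argsort(scores)[::-1][:top_k]
    return [(titles[i], float(scores[i])) for i in best]
```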
How to reproduce results:
Note that `wd` denotes the directory containing the `run.sh` script (the project folder); replace it with the actual path in the commands below.
- Run the command:

  ```sh
  export PYTHONPATH=wd
  ```
NOTE: if you want to use a virtual environment, run the following:

  ```sh
  python3 -m venv ntds
  echo 'export PYTHONPATH=wd' >> ntds/bin/activate
  ```
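The environment can then be activated with `source ntds/bin/activate`, which also sets `PYTHONPATH` thanks to the line appended above.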
From `wd`, run the following:
- Run the commands:

  ```sh
  sudo apt install build-essential python-dev libxml2 libxml2-dev zlib1g-dev bison flex
  pip3 install -r requirements.txt
  pip3 install pymagnitude==0.1.120 --no-binary :all:
  ```
- Specify `INITIAL_FILENAME` in `config.py`. This is the name of the file produced by Seealsology (to put in the `data` folder). The seeds used to scrape the graph are given in the `seeds_seealsology.txt` file (we used a distance of 2).
- Download the `wiki-news-300d-1M-subword.magnitude` file and put it into the `data` folder (a short loading sketch is given after these steps).
- Execute the `run.sh` script (it takes a few minutes to run).
- Run `exploration.ipynb` and/or `exploitation.ipynb` for the respective analyses.
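For reference, here is a minimal sketch of how the downloaded embeddings can be loaded with pymagnitude (the project's actual loading code lives in the scripts; the words queried below are just examples):

```python
from pymagnitude import Magnitude

# Load the fastText subword embeddings placed in the data folder
vectors = Magnitude("data/wiki-news-300d-1M-subword.magnitude")

print(vectors.dim)                                # 300
print(vectors.similarity("machine", "learning"))  # cosine similarity between two words
print(vectors.query("wikipedia").shape)           # (300,) embedding vector for a single word
```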
Interactive Visualisation:
After completing the previous part, run the command `python3 visualization/app.py 8888`.
NOTE: if you want to put the app online, you have to do all the above installs in "sudo" mode and run the following command instead: `sudo PYTHONPATH=wd python3 visualization/app.py 80`. Another option is to allow the current user to bind to port 80.
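One common way to do the latter (an assumption about your setup, not something the project scripts do for you) is to grant the Python interpreter the capability to bind to privileged ports, e.g. `sudo setcap 'cap_net_bind_service=+ep' $(which python3)`; you may need to point it at the real interpreter binary rather than a symlink.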
You can choose any of the three methods to perform a query.
For multiple concepts, separate them with a comma, e.g. `machine learning,text processing`. If you use a remote server, port 80 must be open for external access.
- Clicking on a node makes the 'Chosen node' link redirect you to the corresponding web page.
- Only the page titles of the nodes that best fit the query, as well as their neighbours, are shown.
- A red edge means that one page appears in the other's 'See also' section on the Wikipedia website.
- The color of the nodes represents the cosine similarity score.
This web app has only been tested on Chrome for Linux (78.0.3904.70).
Files breakdown:
- run.sh : shell script executing the acquisition, exploitation and visualisation tasks.
Acquisition:
- acquisition_helpers.py : various helpers for the acquisition.py script.
- acquisition.py : loads the dataset and augments it with URLs and extracted keywords. Creates the df_node dataframe, which contains the node information, and df_edge, which contains the edge relations (an illustrative sketch of this structure follows below).
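As a rough illustration only (the actual column names and contents are defined in acquisition.py; the ones below are hypothetical), the two dataframes can be thought of as a node table and an edge list from which a graph is built:

```python
import networkx as nx
import pandas as pd

# Hypothetical schema: the real columns are defined in acquisition.py
df_node = pd.DataFrame({
    "title": ["Machine learning", "Deep learning"],
    "url": ["https://en.wikipedia.org/wiki/Machine_learning",
            "https://en.wikipedia.org/wiki/Deep_learning"],
})
df_edge = pd.DataFrame({"source": ["Machine learning"], "target": ["Deep learning"]})

# Build a graph from the edge list and attach node attributes
G = nx.from_pandas_edgelist(df_edge, source="source", target="target")
nx.set_node_attributes(G, df_node.set_index("title")["url"].to_dict(), name="url")
print(G.nodes(data=True))
```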
Exploration:
- exploration.ipynb: exploratory data analysis.
Exploitation:
- exploitation.py: fits and saves the 3 models we used
- exploitation.ipynb: loads the models and performs a qualitative evaluation on a set of queries and topics
Visualization:
- app.py: runs the visualisation app on a dedicated server
- create_visu.py: creates and saves the graph visualisation
- utils.py: various helpers
Helpers:
- predict.py: helpers for the exploitation part
- spectral_clustering.py: specific helpers for the spectral clustering model
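The actual helpers live in spectral_clustering.py; as a generic sketch of the technique only (not the project's implementation, and with arbitrary parameter choices), spectral clustering embeds the graph nodes with the eigenvectors of the normalized Laplacian and then clusters that embedding:

```python
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(G, k=5):
    """Cluster the nodes of G using the k smallest eigenvectors of the normalized Laplacian."""
    L = nx.normalized_laplacian_matrix(G).toarray()
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvectors sorted by ascending eigenvalue
    embedding = eigvecs[:, :k]             # spectral embedding of the nodes
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
    return dict(zip(G.nodes(), labels))
```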
Data:
- Data: contains every file loaded and generated by the different modules.
Authors
- EL Amrani Ayyoub
- Micheli Vincent
- Myotte Frédéric
- Sinnathamby Karthigan
License
Wikipedia Recommender System - Network Tour of Data Science EE-558 - EPFL - Fall 2019 - Team 2
Copyright (c) 2019 EPFL
This program is licensed under the terms of the GPL.