# Portuguese Winograd Schema Challenge

*Currently under development.*
Solver for the Winograd Schema Challenge in Portuguese. Portuguese translations of the original Winograd Schema Challenge are also proposed here.
Preliminary results were presented in a conference paper: Melo, Gabriela Souza de; Imaizumi, Vinicius A.; Cozman, Fabio Gagliardi. Winograd Schemas in Portuguese. In: Encontro Nacional de Inteligência Artificial e Computacional, 2019.
## Project Setup
- This project has not been tested on machines without CUDA GPUs (see the quick check after this list).
- A Dockerfile is available and may be used with `docker build -t wsc_port .`, followed by `nvidia-docker run -it -v $PWD/models:/code/models wsc_port <desired_command>` (i.e. `nvidia-docker run -it -v $PWD/models:/code/models wsc_port python -m src.main`).
- The docker-compose file contains a few different options for running the code, which can be run with commands such as `docker-compose run <service_name>` (i.e. `docker-compose run train`). For the Jupyter server, run `docker-compose run --service-ports jupyter-server` (the password for accessing its web page is `root`).
- For running outside of the Docker container, Conda is required.
  - To create the conda environment: `conda env create -f environment.yml`
- The Makefile contains some of the commands used to run the code. These commands must be run from inside the environment.
  - To set up the environment for running the project: `make dev-init`. This command also makes sure `make processed-data` is run, which prepares the data needed to train the model.
    - The data corresponding to the corpus being used is organized as follows:
      - Raw data: files used to generate the final Winograd Schema Challenge schema collection JSONs
      - External data: the compressed XML file, as downloaded from Wikipedia's dump archive
      - Interim data: TXT files extracted from the above; may or may not be split into smaller files
      - Processed data: TXT files containing the text split into train, test, and validation sets. Also contains the generated Winograd Schema Challenge schema collection JSONs.
    - Additionally, `make reduced-processed-data` reduces the size of each of these splits.
  - Running `make corpus` will speed up the first run of the code (but is not necessary).
  - `make train` trains a model.
  - `make winograd-test` runs the evaluation on the Winograd Schema Challenge.
  - `make generate` runs the language model for text generation.
- The code runs for both the English and Portuguese cases; this setting is controlled by the variable `PORTUGUESE` in `src.consts`.
- Run tests with `make tests`, which is equivalent to `pytest --cov=src tests/`. Use `pytest --cov=src --cov-report=html tests/` to generate an HTML coverage report. This requires the pytest and pytest-cov packages. If there are import errors, run `pip install -e .` to locally install the package from the source code.
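A quick way to confirm that PyTorch sees a CUDA GPU before training, using the standard `torch.cuda` API (this snippet is a convenience sketch, not part of the project's code):

```python
# Sanity check: confirm PyTorch can see a CUDA GPU before running training.
import torch

if torch.cuda.is_available():
    print(f"CUDA GPU found: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU found; this project has not been tested on CPU-only machines.")
```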
## Winograd Collection Generation
There is also code in this repository for generating the Winograd Schema Collection JSON from the original HTML file, ready to be used by the solver. This generation happens by executing `python -m src.winograd_collection_manipulation.wsc_subsets_generation`. To generate the version with translated names, after that first command, simply run `python -m src.winograd_collection_manipulation.name_replacer`. These commands do not need to be called in order to run the solver, since the JSON file is already present in this repository. However, this code is made available in case it can help with translating the Challenge to other languages.
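For reference, a generated schema collection JSON can be consumed along these lines. The file path and the field names (`sentence`, `candidates`, `answer`) are illustrative assumptions; inspect the generated files for the actual structure:

```python
# Minimal sketch of reading a generated Winograd schema collection JSON.
# NOTE: the path and keys below are assumptions for illustration only.
import json

with open("data/processed/wsc.json", encoding="utf-8") as f:  # hypothetical path
    schemas = json.load(f)

for schema in schemas:
    print(schema["sentence"])    # schema text containing the ambiguous pronoun
    print(schema["candidates"])  # the candidate referents
    print(schema["answer"])      # the correct referent
```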
## Project Organization
```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`.
├── README.md          <- The top-level README for developers using this project.
├── environment.yml    <- Contains the project's requirements, generated from the Anaconda environment.
├── setup.py           <- Makes the project pip installable (`pip install -e .`) so src can be imported.
│
├── data
│   ├── external       <- Data from third-party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── githooks           <- Git hook scripts used for development. The repo's Git hook directory needs to be set to this folder.
│
├── models             <- Trained and serialized models, model predictions, or model summaries. Gitignored due to their size.
│
├── notebooks          <- Jupyter notebooks, used during experimentation and testing.
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module.
│
└── tests              <- Tests module, using Pytest.
```
<p><small>Project based on the <a target="_blank" href="https://drivendata.github.io/cookiecutter-data-science/">cookiecutter data science project template</a>. #cookiecutterdatascience</small></p>
## References
- Code for the language model is based on PyTorch's word-level language modeling RNN example.
- Code for parallelization of the PyTorch model is based on the PyTorch-Encoding package, with help from this Medium post.
- The idea of using a language model to solve the Winograd Schema Challenge is based on the paper "A Simple Method for Commonsense Reasoning" by Trieu H. Trinh and Quoc V. Le, 2018 (see the sketch below).
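For illustration, the core of that method can be sketched as follows: substitute each candidate referent for the ambiguous pronoun and keep the candidate whose sentence the language model assigns the higher probability. The word-level model and vocabulary interface below are assumptions for the sketch, not this repository's actual API:

```python
# Sketch of LM-based Winograd resolution, after Trinh & Le (2018).
# The model/vocabulary interface here is hypothetical, not this repo's API.
import torch
import torch.nn.functional as F

def sentence_log_prob(model, word2idx, sentence):
    """Sum of per-token log-probabilities of `sentence` under a word-level LM."""
    ids = torch.tensor([[word2idx[w] for w in sentence.lower().split()]])
    with torch.no_grad():
        logits = model(ids)  # assumed output shape: (1, seq_len, vocab_size)
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    # Log-probability of each actual next token given its prefix.
    return log_probs.gather(-1, ids[:, 1:, None]).sum().item()

def resolve(model, word2idx, sentence, pronoun, candidates):
    """Return the candidate whose substitution the LM scores as more probable."""
    scores = {c: sentence_log_prob(model, word2idx, sentence.replace(pronoun, c, 1))
              for c in candidates}
    return max(scores, key=scores.get)
```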