Recruitment Dataset Preprocessing and Recommender System

Project Overview

This project preprocesses raw data from the Djinni service and develops a recommender system that matches candidates with potential jobs based on anonymized candidate profiles and job descriptions. The preprocessing involves cleaning and organizing the data, while the recommender system uses natural language processing techniques to match candidates with suitable job descriptions.
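The matching idea can be illustrated with a minimal sketch. The actual NLP technique used by the pipeline is not specified here, so this stand-in ranks jobs for each candidate by bag-of-words cosine similarity; all IDs and texts are toy placeholders.

```python
import math
from collections import Counter

# Toy stand-ins for anonymized candidate profiles and job descriptions.
candidates = {
    "cand_1": "python developer with machine learning experience",
    "cand_2": "frontend engineer skilled in react and typescript",
}
jobs = {
    "job_1": "machine learning engineer needed, python required",
    "job_2": "react developer for a typescript frontend team",
}

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def match_candidates(candidates, jobs, top_k=1):
    """Map each candidate ID to its top_k most similar job IDs."""
    job_bags = {jid: Counter(text.lower().split()) for jid, text in jobs.items()}
    matches = {}
    for cid, text in candidates.items():
        bag = Counter(text.lower().split())
        ranked = sorted(job_bags, key=lambda jid: cosine(bag, job_bags[jid]),
                        reverse=True)
        matches[cid] = ranked[:top_k]
    return matches

matches = match_candidates(candidates, jobs)
```

A real system would replace the bag-of-words vectors with embeddings or TF-IDF features, but the ranking step stays the same shape.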

Dataset Information

The Djinni Recruitment Dataset contains 150,000 job descriptions and 230,000 anonymized candidate CVs, posted between 2020 and 2023 on the Djinni IT job platform. The dataset includes samples in English and Ukrainian.

Exploratory Data Analysis

The exploratory data analysis (EDA) is provided in the notebook/EDA folder. These analyses offer insights into the characteristics of job descriptions and candidate profiles, aiding in understanding the data distribution and potential patterns.

Dataset Split and Loading

The preprocessed dataset has been split by language and uploaded to the Hugging Face Dataset Hub for easy access.

Intended Use

The Djinni dataset is designed with versatility in mind and supports a wide range of applications.

Pipeline Management with DVC

The preprocessing and recommender-system pipeline is managed with Data Version Control (DVC), which ensures reproducibility and tracks the dependencies and outputs of each step. The final outputs are JSON files with candidate IDs as keys and lists of matched job description IDs as values.
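The output format described above can be sketched as follows; the file name matches.json and all IDs are hypothetical, used only to show the key/value shape.

```python
import json

# Illustrative structure of the pipeline's final output:
# candidate IDs as keys, lists of matched job description IDs as values.
matches = {
    "cand_1": ["job_17", "job_42", "job_108"],
    "cand_2": ["job_9", "job_63"],
}

# Write the matches to disk, then read them back.
with open("matches.json", "w", encoding="utf-8") as f:
    json.dump(matches, f, indent=2)

with open("matches.json", encoding="utf-8") as f:
    loaded = json.load(f)
```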

Installation Instructions

Follow these steps to install and set up the project:

Prerequisites

The steps below assume the following tools are already installed:

  * Git (to clone the repository)
  * Conda (to create the virtual environment)

Steps

  1. Clone the repository:

    git clone https://github.com/Stereotypes-in-LLMs/recruitment-dataset
    
  2. Create a virtual environment using Conda:

    conda create --name py311 python=3.11
    
  3. Activate the virtual environment:

    conda activate py311
    
  4. Install Poetry for dependency management:

    pip install poetry
    
  5. Install dependencies using Poetry:

    poetry install
    
  6. Pull the necessary data using DVC (this may take some time):

    dvc pull -v
    
  7. Reproduce the training pipeline (all steps will be skipped if the data is already up to date locally):

    dvc repro -v
    

Running the Pipeline

The full pipeline is reproduced with the dvc repro command shown in the installation steps above. For more information on DVC, refer to the documentation at https://dvc.org/doc.

BibTeX entry and citation info

When publishing results based on this dataset, please cite:

@inproceedings{drushchak-romanyshyn-2024-introducing,
    title = "Introducing the Djinni Recruitment Dataset: A Corpus of Anonymized {CV}s and Job Postings",
    author = "Drushchak, Nazarii  and
      Romanyshyn, Mariana",
    editor = "Romanyshyn, Mariana  and
      Romanyshyn, Nataliia  and
      Hlybovets, Andrii  and
      Ignatenko, Oleksii",
    booktitle = "Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.unlp-1.2",
    pages = "8--13",
}

Contributors

License

This project is licensed under the Apache License 2.0.