# Recruitment Dataset Preprocessing and Recommender System
## Project Overview

This project preprocesses raw data from the Djinni service and develops a recommender system for matching candidates with potential jobs, based on anonymized candidate profiles and job descriptions. Preprocessing involves cleaning and organizing the data, while the recommender system uses natural language processing techniques to match candidates with suitable job descriptions.
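This repository's actual matching pipeline is managed via DVC (see below), but the core idea of text-based matching can be illustrated with a minimal, self-contained sketch. The bag-of-words cosine similarity, the `match_jobs` helper, and the toy data below are all hypothetical simplifications, not the project's implementation:

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def match_jobs(candidate: str, jobs: dict[str, str], top_k: int = 1) -> list[str]:
    """Rank job descriptions by textual similarity to a candidate profile."""
    cand_vec = Counter(candidate.lower().split())
    scores = {
        job_id: cosine_similarity(cand_vec, Counter(text.lower().split()))
        for job_id, text in jobs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data for illustration only
jobs = {
    "job-1": "senior python developer machine learning nlp",
    "job-2": "frontend react developer javascript css",
}
print(match_jobs("python nlp engineer with machine learning experience", jobs))
# → ['job-1']
```

A production system would replace the word-count vectors with dense sentence embeddings, but the ranking-by-similarity structure stays the same.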
## Dataset Information
The Djinni Recruitment Dataset contains 150,000 job descriptions and 230,000 anonymized candidate CVs, posted between 2020 and 2023 on the Djinni IT job platform. The dataset includes samples in English and Ukrainian.
## Exploratory Data Analysis

The exploratory data analysis (EDA) is provided in the `notebook/EDA` folder. These analyses offer insights into the characteristics of job descriptions and candidate profiles, aiding in understanding the data distribution and potential patterns.
## Dataset Split and Loading

The preprocessed dataset has been split by language and uploaded to the HuggingFace Dataset Hub for easy access. The following datasets are available:
- Job Descriptions English
- Job Descriptions Ukrainian
- Candidates Profiles English
- Candidates Profiles Ukrainian
## Intended Use

The Djinni dataset is designed with versatility in mind, supporting a wide range of applications:
- Recommender Systems and Semantic Search: It serves as a key resource for enhancing job recommendation engines and semantic search functionalities, making the job search process more intuitive and tailored to individual preferences.
- Advancement of Large Language Models (LLMs): The dataset provides invaluable training data for both English and Ukrainian domain-specific LLMs. It is instrumental in improving the models' understanding and generation capabilities, particularly in specialized recruitment contexts.
- Fairness in AI-assisted Hiring: By serving as a benchmark for AI fairness, the Djinni dataset helps mitigate biases in AI-assisted recruitment processes, promoting more equitable hiring practices.
- Recruitment Automation: The dataset enables the development of tools for automated creation of resumes and job descriptions, streamlining the recruitment process.
- Market Analysis: It offers insights into the dynamics of Ukraine's tech sector, including the impacts of conflicts, aiding in comprehensive market analysis.
- Trend Analysis and Topic Discovery: The dataset facilitates modeling and classification for trend analysis and topic discovery within the tech industry.
- Strategic Planning: By enabling the automatic identification of company domains, the dataset assists in strategic market planning.
## Pipeline Management with DVC
The pipeline for preprocessing and creating the recommender system has been managed using Data Version Control (DVC). DVC ensures reproducibility and tracks the dependencies and outputs of each step in the pipeline. Final outputs are JSON files with candidate IDs as keys and a list of matched job description IDs as values.
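The exact output file names are defined by the DVC pipeline, but the JSON format described above (candidate IDs as keys, lists of matched job description IDs as values) can be sketched as follows. The file name `matches.json` and all IDs here are hypothetical examples:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical illustration of the final output format:
# each candidate ID maps to a ranked list of matched job description IDs.
matches = {
    "candidate-001": ["job-1", "job-7", "job-42"],
    "candidate-002": ["job-3"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "matches.json"
    path.write_text(json.dumps(matches, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded["candidate-001"])  # → ['job-1', 'job-7', 'job-42']
```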
## Installation Instructions
Follow these steps to install and set up the project:
### Prerequisites
- Git installed on your system
- Conda installed (for creating and managing virtual environments)
- Python 3.11 installed
### Steps
1. Clone the repository:

   ```shell
   git clone https://github.com/Stereotypes-in-LLMs/recruitment-dataset
   ```

2. Create a virtual environment using Conda:

   ```shell
   conda create --name py311 python=3.11
   ```

3. Activate the virtual environment:

   ```shell
   conda activate py311
   ```

4. Install Poetry for dependency management:

   ```shell
   pip install poetry
   ```

5. Install dependencies using Poetry:

   ```shell
   poetry install
   ```

6. Pull the necessary data using DVC (this may take some time):

   ```shell
   dvc pull -v
   ```

7. Reproduce the training pipeline (all steps should be skipped if the local data is already up to date):

   ```shell
   dvc repro -v
   ```
## Running the Pipeline

- To run a single step of the pipeline:

  ```shell
  dvc repro -v -sf STEPNAME
  ```

- To run all steps of the pipeline after a certain step:

  ```shell
  dvc repro -v -f STEPNAME --downstream
  ```

- To simulate running all steps without actually running them:

  ```shell
  dvc repro -v -f STEPNAME --downstream --dry
  ```
For more information on DVC, refer to the [DVC documentation](https://dvc.org/doc).
## BibTeX Entry and Citation Info

When publishing results based on this dataset, please cite:
```bibtex
@inproceedings{drushchak-romanyshyn-2024-introducing,
    title = "Introducing the Djinni Recruitment Dataset: A Corpus of Anonymized {CV}s and Job Postings",
    author = "Drushchak, Nazarii and
      Romanyshyn, Mariana",
    editor = "Romanyshyn, Mariana and
      Romanyshyn, Nataliia and
      Hlybovets, Andrii and
      Ignatenko, Oleksii",
    booktitle = "Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.unlp-1.2",
    pages = "8--13",
}
```
## Contributors
## License

This project is licensed under the Apache License 2.0.