# Recruitment Dataset Preprocessing and Recommender System
## Project Overview

This project preprocesses raw data from the Djinni service and develops a recommender system for matching candidates with potential jobs, based on anonymized candidate profiles and job descriptions. Preprocessing involves cleaning and organizing the data, while the recommender system uses natural language processing techniques to match candidates with suitable job descriptions.
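This repository's actual matching pipeline is managed via DVC (see below), but the core idea of text-based matching can be illustrated with a minimal, self-contained sketch. The bag-of-words cosine similarity, the `match_jobs` helper, and the toy data below are all hypothetical simplifications, not the project's implementation:

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def match_jobs(candidate: str, jobs: dict[str, str], top_k: int = 1) -> list[str]:
    """Rank job descriptions by textual similarity to a candidate profile."""
    cand_vec = Counter(candidate.lower().split())
    scores = {
        job_id: cosine_similarity(cand_vec, Counter(text.lower().split()))
        for job_id, text in jobs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data for illustration only
jobs = {
    "job-1": "senior python developer machine learning nlp",
    "job-2": "frontend react developer javascript css",
}
print(match_jobs("python nlp engineer with machine learning experience", jobs))
# → ['job-1']
```

A production system would replace the word-count vectors with dense sentence embeddings, but the ranking-by-similarity structure stays the same.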
## Dataset Information
The Djinni Recruitment Dataset contains 150,000 job descriptions and 230,000 anonymized candidate CVs, posted between 2020 and 2023 on the Djinni IT job platform. The dataset includes samples in English and Ukrainian.
## Exploratory Data Analysis

The exploratory data analysis (EDA) is provided in the `notebook/EDA` folder. These analyses offer insights into the characteristics of job descriptions and candidate profiles, aiding in understanding the data distribution and potential patterns.
## Dataset Split and Loading

The preprocessed dataset has been split by language and uploaded to the HuggingFace Dataset Hub for easy access. The following datasets are available:
- Job Descriptions English
- Job Descriptions Ukrainian
- Candidates Profiles English
- Candidates Profiles Ukrainian
## Intended Use

The Djinni dataset is designed with versatility in mind, supporting a wide range of applications:
- Recommender Systems and Semantic Search: It serves as a key resource for enhancing job recommendation engines and semantic search functionalities, making the job search process more intuitive and tailored to individual preferences.
- Advancement of Large Language Models (LLMs): The dataset provides invaluable training data for both English and Ukrainian domain-specific LLMs. It is instrumental in improving the models' understanding and generation capabilities, particularly in specialized recruitment contexts.
- Fairness in AI-assisted Hiring: By serving as a benchmark for AI fairness, the Djinni dataset helps mitigate biases in AI-assisted recruitment processes, promoting more equitable hiring practices.
- Recruitment Automation: The dataset enables the development of tools for automated creation of resumes and job descriptions, streamlining the recruitment process.
- Market Analysis: It offers insights into the dynamics of Ukraine's tech sector, including the impacts of conflicts, aiding in comprehensive market analysis.
- Trend Analysis and Topic Discovery: The dataset facilitates modeling and classification for trend analysis and topic discovery within the tech industry.
- Strategic Planning: By enabling the automatic identification of company domains, the dataset assists in strategic market planning.
## Pipeline Management with DVC
The pipeline for preprocessing and creating the recommender system has been managed using Data Version Control (DVC). DVC ensures reproducibility and tracks the dependencies and outputs of each step in the pipeline. Final outputs are JSON files with candidate IDs as keys and a list of matched job description IDs as values.
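The exact output file names are defined by the DVC pipeline, but the JSON format described above (candidate IDs as keys, lists of matched job description IDs as values) can be sketched as follows. The file name `matches.json` and all IDs here are hypothetical examples:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical illustration of the final output format:
# each candidate ID maps to a ranked list of matched job description IDs.
matches = {
    "candidate-001": ["job-1", "job-7", "job-42"],
    "candidate-002": ["job-3"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "matches.json"
    path.write_text(json.dumps(matches, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded["candidate-001"])  # → ['job-1', 'job-7', 'job-42']
```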
## Installation Instructions
Follow these steps to install and set up the project:
### Prerequisites
- Git installed on your system
- Conda installed (for creating and managing virtual environments)
- Python 3.11 installed
### Steps
1. Clone the repository:

   ```shell
   git clone https://github.com/Stereotypes-in-LLMs/recruitment-dataset
   ```

2. Create a virtual environment using Conda:

   ```shell
   conda create --name py311 python=3.11
   ```

3. Activate the virtual environment:

   ```shell
   conda activate py311
   ```

4. Install Poetry for dependency management:

   ```shell
   pip install poetry
   ```

5. Install dependencies using Poetry:

   ```shell
   poetry install
   ```

6. Pull the necessary data using DVC (this may take some time):

   ```shell
   dvc pull -v
   ```

7. Reproduce the training pipeline (all steps should be skipped if the local data is already up to date):

   ```shell
   dvc repro -v
   ```
## Running the Pipeline

- To run a single step of the pipeline:

  ```shell
  dvc repro -v -sf STEPNAME
  ```

- To run all steps of the pipeline after a certain step:

  ```shell
  dvc repro -v -f STEPNAME --downstream
  ```

- To simulate running all steps without actually running them:

  ```shell
  dvc repro -v -f STEPNAME --downstream --dry
  ```
For more information on DVC, refer to the [DVC documentation](https://dvc.org/doc).
## BibTeX Entry and Citation Info

When publishing results based on this dataset, please cite:
```bibtex
@inproceedings{drushchak-romanyshyn-2024-introducing,
    title = "Introducing the Djinni Recruitment Dataset: A Corpus of Anonymized {CV}s and Job Postings",
    author = "Drushchak, Nazarii and
      Romanyshyn, Mariana",
    editor = "Romanyshyn, Mariana and
      Romanyshyn, Nataliia and
      Hlybovets, Andrii and
      Ignatenko, Oleksii",
    booktitle = "Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.unlp-1.2",
    pages = "8--13",
}
```
## Contributors
## License

This project is licensed under the Apache License 2.0.