Awesome
Introduction
This project finds lexical variations of words within a dataset and evaluates the results against a gold standard lexicon.
Preprocessing and Normalization
Preprocess the raw data using the gold standard file.
mkdir -p UrduPhone/Input\ Files/
mkdir -p UrduPhone/Output\ Files/
cp gold_standard UrduPhone/Input\ Files/Gold\ Standard.txt
cp raw_data UrduPhone/Input\ Files/Dataset.txt
cd UrduPhone
python main.py
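The staging commands above can also be scripted, which avoids escaping the spaces in the directory names. The following is a minimal sketch, not part of the project: the function name `stage_inputs` and its parameters are hypothetical, and it only assumes the directory and file names used in the commands above.

```python
import shutil
from pathlib import Path


def stage_inputs(gold_standard, raw_data, root="UrduPhone"):
    """Copy the two input files into <root>/Input Files/ and create <root>/Output Files/.

    Hypothetical helper: mirrors the mkdir/cp commands in this README.
    """
    # Fail early if either source file is missing.
    for src in (gold_standard, raw_data):
        if not Path(src).is_file():
            raise FileNotFoundError(f"missing input file: {src}")

    root = Path(root)
    input_dir = root / "Input Files"
    output_dir = root / "Output Files"
    input_dir.mkdir(parents=True, exist_ok=True)   # same as mkdir -p
    output_dir.mkdir(parents=True, exist_ok=True)

    # File names expected by the commands above.
    shutil.copy(gold_standard, input_dir / "Gold Standard.txt")
    shutil.copy(raw_data, input_dir / "Dataset.txt")
    return input_dir, output_dir
```

After calling `stage_inputs("gold_standard", "raw_data")`, run `python main.py` from inside `UrduPhone` as shown above.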
Copy these preprocessed files and the gold standard file to the Lexical Normalization project.
cd ../
mkdir -p Experiments\ \&\ Evaluation/Input\ Files/Default/
cp UrduPhone/Output\ Files/* Experiments\ \&\ Evaluation/Input\ Files/Default/
cp UrduPhone/Input\ Files/Gold\ Standard.txt Experiments\ \&\ Evaluation/Input\ Files/Default/
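The `&` in the `Experiments & Evaluation` directory name is special to the shell and has to be escaped or quoted there. As an alternative, the copy step can be done from Python, where the path is an ordinary string. This is a minimal sketch under that assumption; `copy_to_lexnorm` is a hypothetical name, not a function provided by the project.

```python
import shutil
from pathlib import Path


def copy_to_lexnorm(base="."):
    """Copy the UrduPhone outputs and the gold standard into the
    Lexical Normalization input directory. Hypothetical helper.
    """
    base = Path(base)
    src_out = base / "UrduPhone" / "Output Files"
    gold = base / "UrduPhone" / "Input Files" / "Gold Standard.txt"
    dest = base / "Experiments & Evaluation" / "Input Files" / "Default"
    dest.mkdir(parents=True, exist_ok=True)  # same as mkdir -p

    copied = []
    # Copy regular files only; subdirectories are skipped.
    for f in sorted(src_out.iterdir()):
        if f.is_file():
            shutil.copy(f, dest / f.name)
            copied.append(f.name)

    shutil.copy(gold, dest / gold.name)
    copied.append(gold.name)
    return copied
```

Call `copy_to_lexnorm()` from the directory that contains both `UrduPhone` and `Experiments & Evaluation`; the returned list names the files that were copied.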
Import the Lexical Normalization project into a Java IDE (e.g. Eclipse).
Build and run src/frontend/LexNorm.java.
To try different clustering scenarios, update lines 35-37 in LexNorm.java.
Paper
Abdul et al., "A Clustering Framework for Lexical Normalization of Roman Urdu", Journal of Natural Language Engineering, 2020