Urdu Word Segmentation
This repository contains the code and dataset for Urdu word segmentation, as described in the paper Urdu Word Segmentation using Conditional Random Fields (CRFs).
Requirement(s)
The tool is implemented in Python and requires scikit-learn and python-crfsuite.
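To illustrate how word segmentation can be cast as a sequence-labelling problem for a CRF, here is a minimal sketch in plain Python. The B/I label scheme, the feature names, and the helper functions (`words_to_labels`, `char_features`, `labels_to_words`) are assumptions made for illustration; they are not necessarily the labels or features used in the paper or in this repository.

```python
# Illustrative sketch only: the B/I label scheme and the features below are
# assumptions, not the ones used in the paper or in this repository.

def words_to_labels(words):
    """Flatten words into characters with labels: B = word start, I = inside."""
    chars, labels = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def char_features(chars, i):
    """Hypothetical per-character features: the character and its neighbours."""
    return {
        "char": chars[i],
        "prev": chars[i - 1] if i > 0 else "<s>",
        "next": chars[i + 1] if i < len(chars) - 1 else "</s>",
    }


def labels_to_words(chars, labels):
    """Rebuild words from characters and (predicted) B/I labels."""
    words = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


words = ["اردو", "زبان"]
chars, labels = words_to_labels(words)
features = [char_features(chars, i) for i in range(len(chars))]
restored = labels_to_words(chars, labels)
```

A list of feature dictionaries paired with a list of label strings, as built here, is the kind of input a CRF trainer such as python-crfsuite's `Trainer` can consume; at prediction time the tagger's labels would be passed to `labels_to_words` to recover the segmented text.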
Dataset
A manually annotated corpus of approximately 111,000 tokens is available for download.
Reference(s)
If you use this tool in your work, please cite the paper below.
Urdu Word Segmentation using Conditional Random Fields (CRFs)
@InProceedings{C18-1217,
author = "Bin Zia, Haris
and Raza, Agha Ali
and Athar, Awais",
title = "Urdu Word Segmentation using Conditional Random Fields (CRFs)",
booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "2562--2569",
location = "Santa Fe, New Mexico, USA",
url = "http://aclweb.org/anthology/C18-1217"
}
License(s)
Copyright (c) 2018 CSaLT, ITU
Code licensed under the MIT License: http://opensource.org/licenses/MIT
Data licensed under CC-BY 4.0: https://creativecommons.org/licenses/by/4.0/