Georgian (ka_GE) word list
Data sources
- Kevin Scannell (http://crubadan.org/languages/ka, CC-BY 4.0)
- National Parliamentary Library of Georgia (http://www.nplg.gov.ge/gwdict/index.php)
- Other Georgian eBooks/websites (Crawler)
Crawler
The crawler is written in PHP and uses MySQL as its database. The code is located in the crawler folder.
Before running the script, configure the database and run the migrations.
First, rename the file .env.example to .env and specify your database credentials.
Install composer dependencies:
composer install
Then run the migrations:
composer migrate
Usage
Crawl links with the internal profile
This command crawls only URLs inside the specified domain and ignores external URLs:
php cmd crawl --project-name="My Project" --profile=internal "http://www.nplg.gov.ge/gwdict/index.php"
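The filtering rule behind the internal profile can be illustrated with a short sketch. This is Python written for illustration only, not the crawler's PHP code; the function name `is_internal` is hypothetical, and it assumes that "inside the specified domain" means an exact host match with the start URL (the real crawler may treat subdomains differently).

```python
from urllib.parse import urlsplit


def is_internal(start_url: str, link: str) -> bool:
    """Return True when `link` points to the same host as `start_url`.

    Assumption: "internal" means an exact host match.
    """
    # urlsplit().hostname is lowercased, or None when the URL has no host.
    start_host = urlsplit(start_url).hostname or ""
    link_host = urlsplit(link).hostname or ""
    return link_host != "" and link_host == start_host


start = "http://www.nplg.gov.ge/gwdict/index.php"
links = [
    "http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1",
    "http://crubadan.org/languages/ka",
]
# Keep only the links that stay on the start URL's host.
internal_links = [link for link in links if is_internal(start, link)]
```

Under this assumption only the first link would be kept, because the second one points to a different host.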
Crawl links with the all profile
This command crawls all links, both internal and external:
php cmd crawl --project-name="My Project" --profile=all "http://www.nplg.gov.ge/gwdict/index.php"
Crawl links with the domain profile
This command crawls links on any domain that ends with the value passed in --domain:
php cmd crawl --project-name="My Project" --profile=domain --domain=.ge "http://www.nplg.gov.ge/gwdict/index.php"
In this example, only links whose domain ends with the .ge suffix are crawled.
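As a rough illustration of the suffix rule, the check can be sketched as follows. This is Python for illustration, not the crawler's PHP code; the function name `matches_domain_suffix` is hypothetical, and it assumes the suffix is compared against the URL's host only, never against the path.

```python
from urllib.parse import urlsplit


def matches_domain_suffix(link: str, suffix: str) -> bool:
    """Return True when the host of `link` ends with `suffix`.

    Assumption: only the host is compared, so a path such as
    /page.ge does not count as a match.
    """
    # hostname is already lowercased; it is None for URLs without a host.
    host = urlsplit(link).hostname or ""
    return host.endswith(suffix.lower())


links = [
    "http://www.nplg.gov.ge/gwdict/index.php",
    "http://crubadan.org/languages/ka",
    "http://example.org/page.ge",
]
# Keep only the links hosted on a domain ending with ".ge".
ge_links = [link for link in links if matches_domain_suffix(link, ".ge")]
```

Comparing against the host rather than the whole URL keeps a link like the third one out, even though its path ends with ".ge".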
Crawl links with the subset profile
This command crawls only URLs that start with the value passed in --subset:
php cmd crawl --project-name="My Project" --profile=subset --subset="http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1" "http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1"
In this example, only links whose URL starts with the prefix http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1 are crawled.
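The prefix rule can be sketched in a few lines. This is illustrative Python, not the crawler's PHP code; the function name `in_subset` is hypothetical, and it assumes a plain string prefix comparison with no URL normalization.

```python
def in_subset(link: str, prefix: str) -> bool:
    """Return True when `link` starts with `prefix`.

    Assumption: a plain, case-sensitive string comparison.
    """
    return link.startswith(prefix)


prefix = "http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1"
links = [
    "http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1&p=2",
    "http://www.nplg.gov.ge/gwdict/index.php?a=list&d=10",
    "http://www.nplg.gov.ge/gwdict/index.php?a=term&d=1",
]
# Keep only the links that begin with the given prefix.
subset_links = [link for link in links if in_subset(link, prefix)]
```

If the comparison is a plain prefix match as assumed here, a URL ending in d=10 also matches a prefix ending in d=1, so it can be worth choosing the prefix with that in mind.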
Continue a project
You can resume a stopped project with the following command, where {id} is the project ID:
php cmd crawl --project-id={id}
Show all possible options: php cmd help crawl
TODO
- Fix wrong entries and add more words
- Add tests
- Send a notification when a crawl completes
License
This project is licensed under the MIT license. Please see the LICENSE file included in this repository for the full text.