arekit-ss 0.25.0

📜 List of supported sources

arekit-ss [AREkit double "s"] is an object-pair context sampler for data sources, powered by AREkit.

NOTE: For custom text sampling, please follow the ARElight project.

Installation

Install dependencies:

pip install git+https://github.com/nicolay-r/arekit-ss.git@0.25.0

Download resources:

python -m arekit_ss.download_data

Usage

Example of composing prompts:

python -m arekit_ss.sample --writer csv --source rusentrel --sampler prompt \
  --prompt "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'" \
  --dest_lang en --docs_limit 1
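The placeholders in the prompt above can be illustrated with plain Python string formatting. This is only a sketch of the substitution idea: the sample text, the entity values, the `negative` label and the helper name `fill_prompt` are hypothetical, and arekit-ss itself may fill the template differently.

```python
# Template copied from the command above.
TEMPLATE = (
    "For text: '{text}', the attitude between "
    "'{s_val}' and '{t_val}' is: '{label_val}'"
)


def fill_prompt(template, text, s_val, t_val, label_val):
    """Substitute the four documented placeholders into the template.

    Hypothetical helper, not part of arekit-ss.
    """
    return template.format(
        text=text, s_val=s_val, t_val=t_val, label_val=label_val
    )


# Hypothetical sample values, made up for illustration only.
example = fill_prompt(
    TEMPLATE,
    text="Alice criticized Bob.",
    s_val="Alice",
    t_val="Bob",
    label_val="negative",
)
print(example)
```

Each sampled object pair would produce one such line, so the template decides what the downstream LLM system sees.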

Mind the case (issue #18): switching to another language may affect the amount of extracted data, because the terms_per_context parameter crops each context to a fixed, predefined number of words.
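A toy sketch of why a fixed word limit interacts with translation. It assumes that terms are whitespace-separated words and that a pair is kept only when the number of terms between the two objects does not exceed the limit. The actual cropping logic in AREkit may differ, and the function names and sentences below are hypothetical.

```python
def terms_between(terms, source, target):
    """Count the terms strictly between the first occurrences of both objects."""
    i, j = terms.index(source), terms.index(target)
    return abs(i - j) - 1


def fits_context(text, source, target, terms_per_context):
    """Return True if the pair fits within the assumed term limit."""
    return terms_between(text.split(), source, target) <= terms_per_context


# Two phrasings of the same statement, standing in for an original
# text and a wordier translation of it.
short_text = "Alice sharply criticized Bob"
long_text = "Alice has been very sharply criticizing Bob"

limit = 3  # hypothetical value chosen for this example
kept_short = fits_context(short_text, "Alice", "Bob", limit)
kept_long = fits_context(long_text, "Alice", "Bob", limit)
print(kept_short, kept_long)
```

With the same limit, the shorter phrasing keeps the pair while the longer one loses it, which is the effect the note above warns about.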

Parameters

</summary>

source -- source name from the list of the supported sources.
- terms_per_context -- number of words (terms) between the SOURCE and TARGET objects.
- object-source-types -- filter specific source object types.
- object-target-types -- filter specific target object types.
- relation_types -- list of relation types, separated by the | character; all types by default.
- splits -- manual selection of the data splits to use for sampling; split names should be separated by the ':' sign, for example: 'train:test'.
sampler -- one of the supported samplers:
- nn -- for CNN/LSTM architectures, including frame annotations from RuSentiFrames.
  - no-vectorize -- flag applicable only to nn; indicates that embeddings for features do not need to be generated.
- bert -- BERT-based, single-input sequence.
- prompt -- prompt-based sampler for LLM systems [prompt engineering guide]
  - prompt -- text of the prompt, which may include the following parameters:
    - {text} is the original text of the sample
    - {s_val} and {t_val} are the values of the source and target of the pair, respectively
    - {label_val} is the value of the label
writer -- the output format of samples:
- csv -- for AREnets framework.
- jsonl -- for OpenNRE framework.
- sqlite -- SQLite-3.0 database.
mask_entities -- mask entity mode.
Text translation parameters:
- src_lang -- original language of the text.
- dest_lang -- target language of the text.
output_dir -- target directory for storing samples.
Limiting the number of documents from the source:
- docs_limit -- number of documents to be considered for sampling from the whole source.
- doc_ids -- list of the document IDs.
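The parameters above can also be assembled programmatically. The sketch below builds an argument list for `python -m arekit_ss.sample` from Python values. It assumes that `splits` is passed as `--splits`, following the pattern of the flags shown in the usage example; the helper name `build_sample_command` is hypothetical, and the chosen combination of source, sampler and writer is for illustration only.

```python
import sys


def build_sample_command(source, sampler, writer, splits=None, docs_limit=None):
    """Compose the argument list for the arekit_ss.sample module.

    Hypothetical helper; it only formats arguments and runs nothing.
    """
    cmd = [
        sys.executable, "-m", "arekit_ss.sample",
        "--writer", writer,
        "--source", source,
        "--sampler", sampler,
    ]
    if splits:
        # Split names are joined with ':' as described above, e.g. 'train:test'.
        cmd += ["--splits", ":".join(splits)]
    if docs_limit is not None:
        cmd += ["--docs_limit", str(docs_limit)]
    return cmd


cmd = build_sample_command(
    source="rusentrel",
    sampler="bert",
    writer="jsonl",
    splits=["train", "test"],
    docs_limit=1,
)
print(" ".join(cmd[1:]))
```

With arekit-ss installed and its resources downloaded, the list could be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single shell string avoids quoting problems with prompts that contain quotes.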

</details>

Powered by

AREkit framework