
ner_dataset_recognition

Dataset information

The datasets are provided in the `datasets` folder of the repository.

Datasets explained

Only the datasets without the `_sentences` suffix are described here; the `_sentences` files contain the same sentences as their counterparts.

Code information

The code is provided in the `code` folder.

Getting the right data

The code shows the standard setup: training on the train set and testing on the test set.

For the more specialised cases, code snippets are provided below.

Domain adaptability

```python
import pandas as pd

# Sentence-level metadata and token-level BIO annotations
DATAset = pd.read_csv('Dataset_sentences.csv')
BIOset = pd.read_csv('Dataset.csv')

# Hold out every sentence from the 'VISION' conference
ids = DATAset[DATAset.conference == 'VISION'].id.to_list()
sentences = ['Sentence: ' + str(id) for id in ids]

data = BIOset[~BIOset['Sentence #'].isin(sentences)]  # training data
test = BIOset[BIOset['Sentence #'].isin(sentences)]   # held-out domain
```
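The split above depends on the repository's CSV files; a self-contained toy version (with hypothetical column values) shows the same filtering:

```python
import pandas as pd

# Hypothetical miniature versions of Dataset_sentences.csv and Dataset.csv
DATAset = pd.DataFrame({'id': [1, 2, 3],
                        'conference': ['VISION', 'ACL', 'VISION']})
BIOset = pd.DataFrame({'Sentence #': ['Sentence: 1', 'Sentence: 2', 'Sentence: 3'],
                       'Word': ['cat', 'dog', 'bird']})

# Hold out all 'VISION' sentences, keep the rest for training
ids = DATAset[DATAset.conference == 'VISION'].id.to_list()
sentences = ['Sentence: ' + str(i) for i in ids]
data = BIOset[~BIOset['Sentence #'].isin(sentences)]
test = BIOset[BIOset['Sentence #'].isin(sentences)]
```

Here `data` keeps only sentence 2 (the non-VISION sentence) and `test` holds sentences 1 and 3.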

Amount of training data

This is done via slicing: slices of a stratified 20-fold split are used, and each new slice is added to the previous ones.

Settings:

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(20, shuffle=True, random_state=42)
```
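To illustrate the slicing, a sketch on toy data (40 samples with balanced binary labels, both hypothetical) that accumulates the 20 stratified folds one slice at a time:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 40 samples, balanced binary labels
X = np.arange(40).reshape(-1, 1)
y = np.array([0, 1] * 20)

skf = StratifiedKFold(20, shuffle=True, random_state=42)

# Each fold's held-out indices form one stratified slice (~1/20 of the data);
# add each new slice to the previous ones to grow the training set
train_idx = np.array([], dtype=int)
sizes = []
for _, fold_idx in skf.split(X, y):
    train_idx = np.concatenate([train_idx, fold_idx])
    sizes.append(len(train_idx))
```

With 40 samples, `sizes` grows by one 2-sample slice per step, so training-set size experiments run from 1/20 of the data up to all of it.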

Positive/negative ratio

The right data is selected using slicing, for example np.array(X_tr)[:2168]. To make the slice control the ratio, the positive sentences are placed before the negative ones:

```python
import pandas as pd

TRAINset = pd.read_csv('Train_set_sentences.csv')

# Positive sentences are those whose labels do not contain 'Geen' ('none')
ids = TRAINset[~TRAINset.labels.str.contains('Geen')].id.to_list()
sentences = ['Sentence: ' + str(id) for id in ids]

ds = data[data['Sentence #'].isin(sentences)]    # positive sentences
nds = data[~data['Sentence #'].isin(sentences)]  # negative sentences

# DataFrame.append is deprecated; concatenate positives first, then negatives
data = pd.concat([ds, nds])
```
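With the positives placed first, a plain row slice then fixes the positive/negative ratio. A toy sketch with hypothetical sentence labels:

```python
import pandas as pd

# Hypothetical token rows for four sentences; 'Geen' marks a negative sentence
data = pd.DataFrame({'Sentence #': ['Sentence: 1', 'Sentence: 2',
                                    'Sentence: 3', 'Sentence: 4'],
                     'labels': ['Loc', 'Per', 'Geen', 'Geen']})

positives = ['Sentence: 1', 'Sentence: 2']
ds = data[data['Sentence #'].isin(positives)]    # positive sentences first
nds = data[~data['Sentence #'].isin(positives)]  # then negatives
ordered = pd.concat([ds, nds])

# Slice the first three rows: all positives plus one negative (a 2:1 ratio)
subset = ordered.iloc[:3]
```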

Other

This is done by appending one dataset to another, for example:

```python
import pandas as pd

data = pd.read_csv('Train_set.csv')
data2 = pd.read_csv('SSC.csv')

# DataFrame.append is deprecated; use pd.concat instead
data = pd.concat([data, data2])
```