
ner_dataset_recognition

Dataset information

The datasets are provided in the `datasets` folder of the repository.

Datasets explained

Only the datasets without the `_sentences` suffix are described here; the `_sentences` files contain the same sentences as their counterparts.

Code information

The code is provided in the `code` folder.

Getting the right data

The code shows the standard setup: training on the train set and testing on the test set.

For the more specialised cases, code snippets are provided below.

Domain adaptability

```python
import pandas as pd

# Sentence-level metadata and token-level BIO annotations
DATAset = pd.read_csv('Dataset_sentences.csv')
BIOset = pd.read_csv('Dataset.csv')

# Hold out every sentence from the 'VISION' conference
ids = DATAset[DATAset.conference == 'VISION'].id.to_list()
sentences = ['Sentence: ' + str(id) for id in ids]

data = BIOset[~BIOset['Sentence #'].isin(sentences)]  # training data
test = BIOset[BIOset['Sentence #'].isin(sentences)]   # held-out domain
```
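The split above depends on the repository's CSV files; a self-contained toy version (with hypothetical column values) shows the same filtering:

```python
import pandas as pd

# Hypothetical miniature versions of Dataset_sentences.csv and Dataset.csv
DATAset = pd.DataFrame({'id': [1, 2, 3],
                        'conference': ['VISION', 'ACL', 'VISION']})
BIOset = pd.DataFrame({'Sentence #': ['Sentence: 1', 'Sentence: 2', 'Sentence: 3'],
                       'Word': ['cat', 'dog', 'bird']})

# Hold out all 'VISION' sentences, keep the rest for training
ids = DATAset[DATAset.conference == 'VISION'].id.to_list()
sentences = ['Sentence: ' + str(i) for i in ids]
data = BIOset[~BIOset['Sentence #'].isin(sentences)]
test = BIOset[BIOset['Sentence #'].isin(sentences)]
```

Here `data` keeps only sentence 2 (the non-VISION sentence) and `test` holds sentences 1 and 3.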

Amount of training data

This is done via slicing: slices of a stratified 20-fold split are used, and each new slice is added to the previous ones.

Settings:

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(20, shuffle=True, random_state=42)
```
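To illustrate the slicing, a sketch on toy data (40 samples with balanced binary labels, both hypothetical) that accumulates the 20 stratified folds one slice at a time:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 40 samples, balanced binary labels
X = np.arange(40).reshape(-1, 1)
y = np.array([0, 1] * 20)

skf = StratifiedKFold(20, shuffle=True, random_state=42)

# Each fold's held-out indices form one stratified slice (~1/20 of the data);
# add each new slice to the previous ones to grow the training set
train_idx = np.array([], dtype=int)
sizes = []
for _, fold_idx in skf.split(X, y):
    train_idx = np.concatenate([train_idx, fold_idx])
    sizes.append(len(train_idx))
```

With 40 samples, `sizes` grows by one 2-sample slice per step, so training-set size experiments run from 1/20 of the data up to all of it.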

Positive/negative ratio

The right data is selected using slicing, for example np.array(X_tr)[:2168]. To make the slice control the ratio, the positive sentences are placed before the negative ones:

```python
import pandas as pd

TRAINset = pd.read_csv('Train_set_sentences.csv')

# Positive sentences are those whose labels do not contain 'Geen' ('none')
ids = TRAINset[~TRAINset.labels.str.contains('Geen')].id.to_list()
sentences = ['Sentence: ' + str(id) for id in ids]

ds = data[data['Sentence #'].isin(sentences)]    # positive sentences
nds = data[~data['Sentence #'].isin(sentences)]  # negative sentences

# DataFrame.append is deprecated; concatenate positives first, then negatives
data = pd.concat([ds, nds])
```
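With the positives placed first, a plain row slice then fixes the positive/negative ratio. A toy sketch with hypothetical sentence labels:

```python
import pandas as pd

# Hypothetical token rows for four sentences; 'Geen' marks a negative sentence
data = pd.DataFrame({'Sentence #': ['Sentence: 1', 'Sentence: 2',
                                    'Sentence: 3', 'Sentence: 4'],
                     'labels': ['Loc', 'Per', 'Geen', 'Geen']})

positives = ['Sentence: 1', 'Sentence: 2']
ds = data[data['Sentence #'].isin(positives)]    # positive sentences first
nds = data[~data['Sentence #'].isin(positives)]  # then negatives
ordered = pd.concat([ds, nds])

# Slice the first three rows: all positives plus one negative (a 2:1 ratio)
subset = ordered.iloc[:3]
```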

Other

This is done by appending one dataset to another, for example:

```python
import pandas as pd

data = pd.read_csv('Train_set.csv')
data2 = pd.read_csv('SSC.csv')

# DataFrame.append is deprecated; use pd.concat instead
data = pd.concat([data, data2])
```