Scripts – Data curation and processing logic for the Swedish Parliament Corpus
General setup and use
Setting up an environment
Set up a conda environment: follow the steps here.
With the environment active, install the pyriksdagen module, either from PyPI
pip install pyriksdagen
or from a local copy in the pyriksdagen repo
pip install .
The LazyArchive
The LazyArchive() class attempts to connect to the KB labs in the laziest way possible. If you use the scripts often, it is worthwhile to set three environment variables:
KBLMYLAB=https://betalab.kb.se
KBLUSER=
KBLPASS=
They can be added to the environment's variables file, e.g. ~/miniconda3/envs/tf/etc/conda/activate.d/env_vars.sh. If these are not present, you will be prompted for the username and password.
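The lookup described above can be sketched roughly as follows. This is not the actual LazyArchive code: the helper name kb_credentials is hypothetical, and falling back to https://betalab.kb.se when KBLMYLAB is unset is an assumption based on the value shown above.

```python
import os
from getpass import getpass


def kb_credentials(env=None, ask=input, ask_secret=getpass):
    """Return (url, user, password) for the KB lab.

    Hypothetical helper, not part of pyriksdagen. Values come from the
    environment when set and non-empty; otherwise the user is prompted.
    """
    env = os.environ if env is None else env
    # Assumption: use the betalab URL from the README when KBLMYLAB is unset.
    url = env.get("KBLMYLAB") or "https://betalab.kb.se"
    # An empty value (e.g. a bare "KBLUSER=") counts as missing and triggers a prompt.
    user = env.get("KBLUSER") or ask("KB lab username: ")
    password = env.get("KBLPASS") or ask_secret("KB lab password: ")
    return url, user, password
```

Passing env, ask and ask_secret as parameters keeps the sketch easy to try out without touching the real environment or a terminal.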
Curating data
Most scripts take --start YEAR and --end YEAR arguments to define a span of time to operate on. Other options are noted with each script below.
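For orientation, a year-span interface of this kind could look like the sketch below. It is an illustration, not the scripts' actual argument parser: the function names are hypothetical, the short forms -s and -e are taken from the reclassify.py call further down, and treating the end year as inclusive is an assumption suggested by that call using the same year for both.

```python
import argparse


def parse_span(argv=None):
    """Parse --start/--end the way most scripts are described as taking them."""
    parser = argparse.ArgumentParser(description="Operate on a span of protocol years.")
    parser.add_argument("-s", "--start", type=int, required=True, help="first year to process")
    parser.add_argument("-e", "--end", type=int, required=True, help="last year to process")
    args = parser.parse_args(argv)
    if args.end < args.start:
        parser.error("--end must not be earlier than --start")
    return args


def years_in_span(start, end):
    """List the years to operate on; the end year is included (assumption)."""
    return list(range(start, end + 1))
```

Called as parse_span(["--start", "1920", "--end", "1929"]), the sketch yields the ten years of the 1920s from years_in_span(args.start, args.end).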
-1. Create a new curation branch from dev.
git checkout -b curation-<decade_start_year>s dev
1. Generate an input csv by querying protocol packages using scripts/query2csv.py
    - this creates input/protocols/scanned.csv or input/protocols/digital_originals.csv, to be read by scripts/pipeline.py
    - with the -m option the script will create year directories in corpus/protocols/ if they don't already exist
    - obs.: unlike the other scripts, the use of --start and --end to define a range of dates used to be exclusive of the end year; this has been updated to behave like the other scripts
    - obs. 2: a potential problem is that this doesn't handle the two-year formats, e.g. 199495
2. Compile parlaclarin for years queried in (1) with scripts/pipeline.py (make sure input/raw/ exists).
3. Look for introductions with scripts/classify_intros.py
    - this creates input/segmentation/intros.csv
    - had to add /home/bob/miniconda3/envs/tf/lib/python3.9/site-packages/nvidia/cublas/lib/ to $LD_LIBRARY_PATH
4. Run scripts/resegment.py to segment and label introductions in corpus/protocols/<year>/*.xml files.
5. Run scripts/add_uuid.py to make sure any new segments have a uuid.
6. Run scripts/find_dates.py to find marginal notes with dates and add dates to metadata.
7. Run scripts/build_classifier.py (the classifier doesn't need to be built every time). Note that it takes different arguments from the other scripts:
    - --datapath: needs a file, currently at input/curation/classifier_data.csv (how this file is generated is not documented; it just exists)
    - --epochs (can use the default)
    - writes to segment-classifier/
    - the classifier is not tied to particular years of protocols: it is apparently trained generally, and scripts/reclassify.py lets you specify which years are operated on
8. Run scripts/reclassify.py to reclassify utterances and notes
    - nb. build_classifier writes to segment-classifier/, but this reads from input/segment-classifier/, so the output needs to be moved, or the discrepancy fixed
    - do this one year at a time:
      for year in {START..END}; do python scripts/reclassify.py -s $year -e $year; done
9. Run add_uuid.py again.
10. Run scripts/dollar_sign_replace.py to replace dollar signs.
11. Run scripts/fix_capitalized_dashes.py.
12. Run scripts/wikidata_process.py (makes metadata available for redetect.py).
13. Run scripts/redetect.py.
14. Run scripts/split_into_sections.py.
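Two of the steps above need manual glue: moving the classifier from segment-classifier/ to input/segment-classifier/, and running scripts/reclassify.py one year at a time. The sketch below shows one way to script both. The function names are hypothetical, the directory names and the -s/-e flags are taken from the notes above, and refusing to overwrite an existing destination is a design assumption.

```python
import shutil
import sys
from pathlib import Path


def move_classifier(src="segment-classifier", dst="input/segment-classifier"):
    """Move build_classifier's output to where reclassify reads it.

    Returns False when there is nothing to move. Raises FileExistsError
    rather than overwriting an existing classifier (an assumption).
    """
    src, dst = Path(src), Path(dst)
    if not src.exists():
        return False
    if dst.exists():
        raise FileExistsError(f"{dst} already exists; remove it first")
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    return True


def reclassify_commands(start, end, python=sys.executable):
    """Build one reclassify.py invocation per year, end year included."""
    return [
        [python, "scripts/reclassify.py", "-s", str(year), "-e", str(year)]
        for year in range(start, end + 1)
    ]
```

Each command list can then be passed to subprocess.run(cmd, check=True) from the repository root, which stops at the first year that fails.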
Quality Control
- Generate a sample by decade with sample_pages_new.py.
    - This generates a csv file in input/quality_control/sample_<decade-start-year>.csv and a list of the protocols in the sample in input/quality_control/sample_<decade-start-year>.txt
- Add (git-add_QC-sample.sh for the lazy) and commit the sample to the working branch.
- Populate the quality control csv file with populate-QC-sample-test.py
    - sample protocols need to be on the local machine where the script is run. Since it opens protocols on GitHub and originals in betalab in a browser, this script doesn't play nicely with working over ssh
    - QC should distinguish between the same segment classes that scripts/reclassify.py produces, <u> and <note>. Other classes may become relevant later.
- Does the data pass the QC test? If yes, add and push the rest of the protocols.
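The sample file names above follow a sample_<decade-start-year> pattern. As a small illustration, the paths for a given year could be derived as below. The helper names are hypothetical, and it is an assumption that the decade start is the year rounded down to a multiple of ten.

```python
from pathlib import Path


def decade_start(year):
    """Round a year down to the first year of its decade (assumption)."""
    return year - year % 10


def sample_paths(year, root="input/quality_control"):
    """Return the (csv, txt) sample paths for the decade containing year."""
    base = Path(root) / f"sample_{decade_start(year)}"
    return base.with_suffix(".csv"), base.with_suffix(".txt")
```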