Awesome
Employee Count Extraction
This repo contains code to extract and classify employee counts from unstructured text in html SEC filings (financial reports, such as Form 10-K, that many companies must file); the final output is a table. A set of "golden," labeled data was provided for evaluation purposes.
Currently, the code correctly extracts and labels 87% of the facts that exist in the golden dataset (for the validation subest - the test subset has not been evaluated yet).
Motivation
There are enormous amounts of data available in text, but it’s very hard to get information from most of it. Getting actual numbers from documents (web pages, SEC filings, EMRs, etc.) is very resource intensive. I hope to use this project to build a fact extraction framework (specifically for extracting quantities) that I can use in other domains.
SEC filings are notoriously inconsistent in sentence structure, word usage, etc. For example, the first sentence below is easy to parse, but the next two require additional handling.
Example inputs and outputs:
(Note that this project actually starts from html verions of 50-200+ page documents)
Input:
-
filing
"(Including our full and part-time personnel , we estimate that we have the equivalent of 12 full time employees."
-
filing link
"The number of regular employees was 71.1 thousand, 73.5 thousand, and 75.3 thousand at years ended 2016, 2015 and 2014, respectively. Regular employees are defined as active executive, management, professional, technical and wage employees who work full time or part time for the Corporation and are covered by the Corporation’s benefit plans and programs. Regular employees do not include employees of the company‑operated retail sites (CORS). The number of CORS employees was 1.6 thousand, 2.1 thousand, and 8.4 thousand at years ended 2016, 2015 and 2014, respectively. The decrease in CORS employees reflects the multi‑year transition of the company‑operated retail network to a more capital‑efficient Branded Wholesaler model."
-
filing link
"Total workforce level at December 31, 2016 was approximately 150,500."
Output:
doc_id | data values | quantity_type | subject | verb | quantity | type_token | word |
---|---|---|---|---|---|---|---|
1 | 12 | Full-Time Employees | we | have | 12 | full time | employees |
2 | 71100 | Other Employees | The number of regular employees | was | 71.1 thousand | regular | employees |
2 | 1600 | Other Employees | The number of CORS employees | was | 1.6 thousand | CORS | employees |
3 | 150500 | Other Employees | level | was | 150,500 | workforce |
General process
The html files are first parsed into potentially-relevant chunks. The chunks (usually paragraphs) are then passed to a natural-language processing pipeline. The pipeline inspects each sentence for "cues" that the sentence contains information about employeee counts. Finally, the code extracts and cleans "facts" about employee counts, and produces a table that is ready for database ingestion.