<div align="center"> <img src="https://github.com/caufieldjh/awesome-bioie/blob/main/images/abie_head.png" alt="Awesome BioIE Logo"/> <br> <a href="https://awesome.re"> <img src="https://awesome.re/badge-flat2.svg" alt="Awesome"> </a> <br> How to extract information from unstructured biomedical data and text. <br> </div>

What is BioIE? It includes any effort to extract structured information from unstructured (or at least inconsistently structured) biological, clinical, or other biomedical data. The data source is often a collection of text documents written in technical language. If the resulting information is verifiable and consistent across sources, we may then consider it knowledge. Extracting information and producing knowledge from biomedical data requires adapting methods developed for other types of unstructured data.

BioIE has undergone massive changes since the introduction of language models like BERT and, more recently, Large Language Models (LLMs) such as GPT-3/4, Llama 2/3, and Gemini.

Resources included here are preferentially those available at no monetary cost and with limited license requirements. Methods and datasets should be publicly accessible and actively maintained.

See also awesome-nlp, awesome-biology and Awesome-Bioinformatics.

Please read the contribution guidelines before contributing. Please add your favourite resource by raising a pull request.


Research Overviews

LLMs in Biomedical IE

Pre-LLM Overviews

Back to Top

Groups Active in the Field

Back to Top



Journals and Events

The interdisciplinary nature of BioIE means researchers in this space may share their findings and tools in a variety of ways. They may publish papers in journals, as is common in the biomedical and life sciences. They may publish conference papers and, upon acceptance, give a poster or oral presentation (or both) at an event; this is common practice in computer science and engineering fields. Conference papers are often published in collections of proceedings. Preprint publication is also an increasingly popular and institutionally accepted way to publish findings. Surrounding these formal, written products are the ideas of open science, open data, and open source: the code, data, and software BioIE researchers develop are valuable resources to the community.


For preprints, try arXiv, especially the subjects Computation and Language (cs.CL) and Information Retrieval (cs.IR); bioRxiv; or medRxiv, especially the Health Informatics subject area.
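As a rough illustration of how one might watch those arXiv subjects programmatically, the sketch below builds a query URL for the public arXiv API. The helper name `build_arxiv_query` and the search term are hypothetical choices for this example; the endpoint and parameter names (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`) follow the arXiv API user manual as we understand it, so check that documentation before relying on them.

```python
from urllib.parse import urlencode

# Public arXiv API endpoint (see the arXiv API user manual).
ARXIV_API = "https://export.arxiv.org/api/query"


def build_arxiv_query(term, categories=("cs.CL", "cs.IR"), max_results=10):
    """Build a URL for recent arXiv entries in the given categories.

    `term` is matched against abstracts; this helper is illustrative only.
    """
    # Combine the categories with OR, then require the term in the abstract.
    cats = " OR ".join(f"cat:{c}" for c in categories)
    search = f"({cats}) AND abs:{term}"
    params = {
        "search_query": search,
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_arxiv_query("biomedical")
print(url)
```

The URL returns an Atom feed when fetched, which any feed or XML parser can read. The example only constructs the URL so that it runs without network access.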

Conferences and Other Events


Some events in BioIE are organized around formal tasks and challenges in which groups develop their own computational solutions, given a dataset.

Back to Top


The field changes rapidly enough that tutorials more than a few years old are missing crucial details. A few more recent educational resources are listed below. A good foundational understanding of text mining techniques is very helpful, as is some basic experience with Python and/or R. The best option may be to learn by doing.

LLM Guides

TBD - watch this space!

Pre-LLM Guides, Lectures, and Courses

Back to Top

Code Libraries

Repos for Specific Datasets

Back to Top

Tools, Platforms, and Services

Annotation Tools

Back to Top

Techniques and Models

Large Language Models

TBD - watch this space!

BERT models

GPT-2 models

Other models

Text Embeddings

Back to Top


Some of the datasets listed below require a UMLS Terminology Services (UTS) account to access. Please note that the license granted with the UTS account requires users to submit an annual report about their use of UMLS resources. This is less challenging than it sounds.

Biomedical Text Sources

The following resources contain indexed text documents in the biomedical sciences.

Annotated Text Data

Protein-protein Interaction Annotated Corpora

Protein-protein interactions are abbreviated as PPI. The following sets are available in BioC format. The older sets (AIMed, BioInfer, HPRD50, IEPA, and LLL) are available courtesy of the WBI corpora repository and were derived from the original sets by a group at the University of Turku.
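For readers new to BioC, here is a minimal sketch of reading PPI annotations from a BioC XML document with the Python standard library. The sample document, the `PPI` relation type, the role names, and the helper `extract_ppi_pairs` are made up for illustration. Real corpora may use different infon keys and may attach relations at the document or sentence level instead of the passage level.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written BioC-style document (hypothetical, not from any corpus).
BIOC_SAMPLE = """<collection>
  <source>hypothetical-example</source>
  <date>2024-01-01</date>
  <key>example.key</key>
  <document>
    <id>doc1</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 interacts with BARD1.</text>
      <annotation id="T1">
        <infon key="type">protein</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
      <annotation id="T2">
        <infon key="type">protein</infon>
        <location offset="21" length="5"/>
        <text>BARD1</text>
      </annotation>
      <relation id="R1">
        <infon key="type">PPI</infon>
        <node refid="T1" role="Arg1"/>
        <node refid="T2" role="Arg2"/>
      </relation>
    </passage>
  </document>
</collection>
"""


def extract_ppi_pairs(xml_text):
    """Return (mention, mention) tuples for each passage-level relation."""
    root = ET.fromstring(xml_text)
    pairs = []
    for passage in root.iter("passage"):
        # Map annotation ids to their surface text within this passage.
        mentions = {
            ann.get("id"): ann.findtext("text")
            for ann in passage.findall("annotation")
        }
        # Resolve each relation's nodes back to the annotated mentions.
        for rel in passage.findall("relation"):
            args = tuple(mentions[n.get("refid")] for n in rel.findall("node"))
            pairs.append(args)
    return pairs


pairs = extract_ppi_pairs(BIOC_SAMPLE)
print(pairs)  # [('BRCA1', 'BARD1')]
```

Resolving relation nodes through annotation ids, rather than matching strings, keeps the pairs tied to specific mentions and their offsets in the passage text.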

Other Datasets

Back to Top

Ontologies and Controlled Vocabularies

Back to Top

Data Models

Do you need a data model? If you are working with biomedical data, then the answer is probably "Yes".

Back to Top


Credits for curators and sources.