High-Dimensional Gene Expression and Morphology Profiles of Cells across 28,000 Genetic and Chemical Perturbations

Populations of cells can be perturbed by various chemical and genetic treatments, and the impact on the cells’ gene expression (transcription, i.e. mRNA levels) and morphology (in an image-based assay) can be measured in high dimensions. The patterns observed in this data can be used for more than a dozen applications in drug discovery and basic biology research. We provide a collection of four datasets where both gene expression and morphological data are available; roughly a thousand features are measured for each data type, across more than 28,000 chemical and genetic perturbations. We have defined a set of biological problems that can be investigated using these two data modalities and provided baseline analyses and evaluation metrics for addressing each.

Link to Paper

Data Modalities

<details> <summary>Click to expand</summary>

Gene expression (GE) profiles

Each cell has DNA in the nucleus, which is transcribed into various mRNA molecules, which are then translated into proteins that carry out functions in the cell. The levels of mRNA in the cell are often biologically meaningful: collectively, mRNA levels for a cell are known as its transcriptional state, and each individual mRNA level is referred to as the corresponding gene's "expression". The L1000 assay was used to measure the transcriptional state of cells in the datasets here. The assay reports a sample's mRNA levels for 978 genes in high throughput, from the bulk population of cells treated with a given perturbation. These 978 "landmark" genes capture approximately 80% of the transcriptional variance for the entire genome. The data processing tools and workflows to produce these profiles are available at https://clue.io/.

Cell Painting morphological (CP) profiles

We used the Cell Painting assay to measure the morphological state of cells treated with a given perturbation. The assay captures fluorescence images of cells stained with six well-characterized fluorescent dyes that label the nucleus, nucleoli, cytoplasmic RNA, endoplasmic reticulum, actin cytoskeleton, Golgi apparatus, plasma membrane, and mitochondria. These eight labeled cell compartments are captured through five channels of high-resolution microscopy images (DNA, RNA, ER, AGP, and Mito). Images are then processed using CellProfiler software to extract thousands of features of each cell’s morphology, forming a high-dimensional profile for each single cell. These features are based on various shape, intensity, and texture statistics. Single-cell profiles are aggregated across all the cells in a "well" (a miniature test tube) to give replicate-level profiles of perturbations; aggregating replicate-level profiles across all the wells (replicates) of a perturbation gives a treatment-level profile. In our study, we used treatment-level profiles in all experiments but have provided replicate-level profiles for researchers interested in further data exploration.
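
The two aggregation steps described above can be sketched in a few lines of plain Python. This is only an illustration, not the pipeline used to produce the released profiles: it assumes a simple mean at both levels (the statistic used for treatment-level profiles is not stated here), and the feature name, well IDs, and perturbation labels are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def aggregate(rows, key, features):
    """Average each feature over all rows that share the same `key` value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {
        group: {f: mean(r[f] for r in members) for f in features}
        for group, members in groups.items()
    }


# Hypothetical single-cell measurements: two wells for perturbation "A", one for "B".
cells = [
    {"well": "A01", "pert": "A", "Cells_AreaShape_Area": 100.0},
    {"well": "A01", "pert": "A", "Cells_AreaShape_Area": 120.0},
    {"well": "A02", "pert": "A", "Cells_AreaShape_Area": 90.0},
    {"well": "B01", "pert": "B", "Cells_AreaShape_Area": 200.0},
]
features = ["Cells_AreaShape_Area"]
well_to_pert = {c["well"]: c["pert"] for c in cells}

# Step 1: single cells -> one replicate-level profile per well.
replicate_level = aggregate(cells, "well", features)

# Step 2: replicate-level profiles -> one treatment-level profile per perturbation.
replicate_rows = [
    {"pert": well_to_pert[well], **profile}
    for well, profile in replicate_level.items()
]
treatment_level = aggregate(replicate_rows, "pert", features)
```

Note that averaging well means (rather than pooling all cells) gives each replicate equal weight regardless of how many cells it contains.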

</details>

Datasets

References to raw profiles and images

<details> <summary>Click to expand</summary> </details>

Preprocessed publicly available profiles

Preprocessed profiles (~9.5 GB) are available in an S3 bucket. They can be downloaded at no cost and without registration of any sort, using the command:

aws s3 sync \
  --no-sign-request \
  s3://cellpainting-gallery/cpg0003-rosetta/broad/workspace/preprocessed_data .

See this wiki for sample Cell Painting images and the meaning of (CellProfiler-derived) Cell Painting features.

Data version

The ETags of these files are listed here.

They were generated using:

aws s3api list-objects --bucket cellpainting-gallery --prefix rosetta/broad/workspace/preprocessed_data/
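
To check a downloaded file against the listed ETags, one option is to recompute an S3-style ETag locally. The sketch below is not part of this repository and rests on assumptions: it presumes the objects were uploaded without KMS encryption, that single-part objects carry the plain MD5 of their bytes, and that multipart objects were uploaded in 8 MiB parts (the `part_size` default is a guess and may need adjusting). The function name `local_etag` is hypothetical.

```python
import hashlib


def local_etag(path, part_size=8 * 1024 * 1024):
    """Compute an S3-style ETag for a local file.

    Single part: hex MD5 of the file contents.
    Multiple parts: hex MD5 of the concatenated binary part digests,
    followed by "-<number of parts>". `part_size` is an assumption.
    """
    digests = []
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(part_size)
            if not chunk:
                break
            digests.append(hashlib.md5(chunk).digest())
    if not digests:
        # Empty file: MD5 of zero bytes.
        return hashlib.md5(b"").hexdigest()
    if len(digests) == 1:
        return digests[0].hex()
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"
```

The `ETag` values returned by `list-objects` are wrapped in double quotes, so strip those before comparing. A mismatch does not necessarily mean corruption; it may only mean the part size assumed here differs from the one used at upload time.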

CP-L1000 Profile descriptions

We gathered four available datasets that had both Cell Painting morphological (CP) and L1000 gene expression (GE) profiles, preprocessed the data from different sources and in different formats into a unified .csv format, and made the data publicly available. Single-cell morphological (CP) profiles were created using CellProfiler software and processed to form aggregated replicate-level and treatment-level profiles using the R package cytominer. We made the following types of profiles available for each of the four datasets (three for the Cell Painting modality and one for L1000):

| Folder | File name | Description |
| --- | --- | --- |
| CellPainting | replicate_level_cp_augmented.csv | Aggregated and metadata-annotated profiles, which are the average of single-cell profiles in each well. |
| CellPainting | replicate_level_cp_normalized.csv.gz | Normalized profiles, which are the z-scored aggregated profiles, where the scores are computed using the distribution of negative controls as the reference. |
| CellPainting | replicate_level_cp_normalized_variable_selected.csv.gz | Normalized, variable-selected profiles, which are normalized profiles with feature selection applied. |
| L1000 | replicate_level_l1k.csv | Aggregated and metadata-annotated profiles, which are the average of single-cell profiles in each well. |
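
The normalization described for `replicate_level_cp_normalized.csv.gz` can be illustrated as follows. This is a minimal sketch of z-scoring against negative controls, not the code that produced the files: it assumes a plain mean and sample standard deviation computed over all control wells at once (the real pipeline may work per plate or use robust statistics), and the feature name and values are invented. The `negcon` label and the `Metadata_Sample_Dose` column are taken from the matching table in this README.

```python
from statistics import mean, stdev


def normalize_to_controls(profiles, features, is_control):
    """Z-score each feature using the mean and std of the control profiles."""
    controls = [p for p in profiles if is_control(p)]
    # Per-feature reference statistics from negative controls only.
    stats = {
        f: (mean(p[f] for p in controls), stdev(p[f] for p in controls))
        for f in features
    }
    normalized = []
    for p in profiles:
        row = dict(p)  # keep metadata columns untouched
        for f in features:
            mu, sigma = stats[f]
            # A feature that is constant across controls (sigma == 0)
            # would need special handling; this sketch does not cover it.
            row[f] = (p[f] - mu) / sigma
        normalized.append(row)
    return normalized


# Hypothetical replicate-level profiles: three control wells and one treated well.
wells = [
    {"Metadata_Sample_Dose": "negcon", "Cells_AreaShape_Area": 10.0},
    {"Metadata_Sample_Dose": "negcon", "Cells_AreaShape_Area": 12.0},
    {"Metadata_Sample_Dose": "negcon", "Cells_AreaShape_Area": 14.0},
    {"Metadata_Sample_Dose": "cpd_1_10uM", "Cells_AreaShape_Area": 18.0},
]
normalized = normalize_to_controls(
    wells,
    features=["Cells_AreaShape_Area"],
    is_control=lambda p: p["Metadata_Sample_Dose"] == "negcon",
)
```

With this reference, a normalized value reads as "how many control standard deviations away from the typical control well", which makes features with very different raw scales comparable.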

Metadata information

This spreadsheet contains a description of all the metadata fields across all 8 datasets.

Keywords to match tables across modalities for each dataset

| Dataset | Perturbation match column<br/>CP | Perturbation match column<br/>GE | Control perturbation value in each of columns<br/>CP and GE |
| --- | --- | --- | --- |
| CDRP-BBBC047-Bray | Metadata_Sample_Dose | pert_sample_dose | negcon |
| CDRPBIO-BBBC036-Bray | Metadata_Sample_Dose | pert_sample_dose | negcon |
| TA-ORF-BBBC037-Rohban | Metadata_broad_sample | pert_id | negcon |
| LUAD-BBBC041-Caicedo | x_mutation_status | allele | negcon |
| LINCS-Pilot1 | Metadata_pert_id_dose | pert_id_dose | negcon |
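
The match columns in the table above can be used to pair profiles across the two modalities. The sketch below is a hedged illustration: it assumes the rows are already treatment-level (one row per perturbation in each modality), the helper name `pair_profiles` is hypothetical, and the feature names and perturbation values in the toy data are invented.

```python
# Match columns (CP, GE) per dataset, copied from the table above.
MATCH_COLUMNS = {
    "CDRP-BBBC047-Bray": ("Metadata_Sample_Dose", "pert_sample_dose"),
    "CDRPBIO-BBBC036-Bray": ("Metadata_Sample_Dose", "pert_sample_dose"),
    "TA-ORF-BBBC037-Rohban": ("Metadata_broad_sample", "pert_id"),
    "LUAD-BBBC041-Caicedo": ("x_mutation_status", "allele"),
    "LINCS-Pilot1": ("Metadata_pert_id_dose", "pert_id_dose"),
}


def pair_profiles(cp_rows, ge_rows, dataset):
    """Return (perturbation, cp_row, ge_row) for perturbations present in both tables."""
    cp_col, ge_col = MATCH_COLUMNS[dataset]
    cp_by_pert = {row[cp_col]: row for row in cp_rows}
    ge_by_pert = {row[ge_col]: row for row in ge_rows}
    shared = sorted(set(cp_by_pert) & set(ge_by_pert))
    return [(pert, cp_by_pert[pert], ge_by_pert[pert]) for pert in shared]


# Toy treatment-level rows; values and feature names are made up.
cp_rows = [
    {"x_mutation_status": "negcon", "cp_feature": 0.0},
    {"x_mutation_status": "GENE1_WT", "cp_feature": 1.5},
    {"x_mutation_status": "GENE2_p.X1Y", "cp_feature": -0.7},
]
ge_rows = [
    {"allele": "negcon", "ge_feature": 0.1},
    {"allele": "GENE1_WT", "ge_feature": 2.0},
]
pairs = pair_profiles(cp_rows, ge_rows, "LUAD-BBBC041-Caicedo")
```

Perturbations measured in only one modality (here `GENE2_p.X1Y`) are dropped, which is what an inner join on the match columns would do.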

Number of features for each dataset

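| Dataset | GE | CP<br/>normalized | CP<br/>normalized_variable_selected |
| --- | --- | --- | --- |
| CDRP | 977 | 1565 | 727 |
| CDRP-BIO | 977 | 1570 | 601 |
| LUAD | 978 | 1569 | 291 |
| TA-ORF | 978 | 1677 | 63 |
| LINCS | 978 | 1670 | 119 |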

Lookup table for L1000 genes predictability

Table

License

We license the data, results, and figures as CC0 1.0 and the source code as BSD 3-Clause.