Iteratively train an ML model on a dataset

In the previous tutorial, we loaded an entire dataset into memory to perform a simple analysis.

Here, we’ll iterate over the files within the dataset to train an ML model.

import lamindb as ln
import anndata as ad
import numpy as np

💡 loaded instance: testuser1/test-scrna (lamindb 0.55.2)

ln.track()

💡 notebook imports: anndata==0.9.2 lamindb==0.55.2 numpy==1.25.2 scvi-tools==1.0.3

💡 Transform(id='Qr1kIHvK506rz8', name='Iteratively train an ML model on a dataset', short_name='scrna5', version='0', type=notebook, updated_at=2023-10-10 15:44:08, created_by_id='DzTjkKse')

💡 Run(id='BQF9EdTZBeF1PaqhL7x9', run_at=2023-10-10 15:44:08, transform_id='Qr1kIHvK506rz8', created_by_id='DzTjkKse')

Setup

dataset_v2 = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()

dataset_v2

Dataset(id='oYS8NR45oBcEPgCISfHj', name='My versioned scRNA-seq dataset', version='2', hash='JNjc88f22TLVPJHdgo7X', updated_at=2023-10-10 15:43:32, transform_id='ManDYgmftZ8Cz8', run_id='JRscXJ0ZxufwqGUoGmIJ', initial_version_id='oYS8NR45oBcEPgCISfCm', created_by_id='DzTjkKse')

We import scvi-tools.

import scvi

Similar to what we did in the previous tutorial, we could load the entire dataset into memory and train a model in 4 lines of code.

Let us instead load all file records:

file1, file2 = dataset_v2.files.list()

We’d like some context on what the first file contains and where it’s from:

file1.describe()
file1.view_flow()


File(id='oYS8NR45oBcEPgCISfCm', suffix='.h5ad', accessor='AnnData', description='Conde22', size=57615999, hash='6Hu1BywwK6bfIU2Dpku2xZ', hash_type='sha1-fl', updated_at=2023-10-10 15:42:41)

Provenance:
  🗃️ storage: Storage(id='YpFBBtr4', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-10-10 15:41:26, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type='notebook', updated_at=2023-10-10 15:41:34, created_by_id='DzTjkKse')
  👣 run: Run(id='RNgU2xx7TeUrL3d83b4F', run_at=2023-10-10 15:41:34, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-10 15:41:26)
  ⬇️ input_of (core.Run): ['2023-10-10 15:43:41', '2023-10-10 15:42:50']
Features:
  var: FeatureSet(id='XGD5bALjzlxe3dYm2tcM', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-10-10 15:42:29, modality_id='nG6MZ3aj', created_by_id='DzTjkKse')
    'TAF11L2', 'PGAP6', 'None', 'PTBP2', 'C5orf34-AS1', 'B4GALNT4', 'LINC02958', 'DMD', 'LINC00706', 'EEF1AKMT1', 'None', 'None', 'METRNL', 'MPND', 'NOBOX', 'LINC02706', 'TRIM50', 'IGKV6D-21', 'ZFHX4', 'AHCYL1', ...
  obs: FeatureSet(id='KrYPEOnuTBTRi4WqoelO', n=4, registry='core.Feature', hash='Z0BvFRBSIr9xpTLjV1nb', updated_at=2023-10-10 15:42:35, modality_id='NIjDnou1', created_by_id='DzTjkKse')
    🔗 donor (12, core.ULabel): '582C', 'A36', 'D503', 'A37', 'A29', '640C', 'D496', 'A52', 'A35', 'A31', ...
    🔗 tissue (17, bionty.Tissue): 'jejunal epithelium', 'caecum', 'ileum', 'lamina propria', 'thymus', 'duodenum', 'thoracic lymph node', 'skeletal muscle tissue', 'omentum', 'mesenteric lymph node', ...
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
    🔗 cell_type (32, bionty.CellType): 'naive thymus-derived CD8-positive, alpha-beta T cell', 'dendritic cell, human', 'non-classical monocyte', 'effector memory CD4-positive, alpha-beta T cell', 'megakaryocyte', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'germinal center B cell', 'mast cell', 'alveolar macrophage', 'T follicular helper cell', ...
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'jejunal epithelium', 'caecum', 'ileum', 'lamina propria', 'thymus', 'duodenum', 'thoracic lymph node', 'skeletal muscle tissue', 'omentum', 'mesenteric lymph node', ...
  🏷️ cell_types (32, bionty.CellType): 'naive thymus-derived CD8-positive, alpha-beta T cell', 'dendritic cell, human', 'non-classical monocyte', 'effector memory CD4-positive, alpha-beta T cell', 'megakaryocyte', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'germinal center B cell', 'mast cell', 'alveolar macrophage', 'T follicular helper cell', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
  🏷️ ulabels (12, core.ULabel): '582C', 'A36', 'D503', 'A37', 'A29', '640C', 'D496', 'A52', 'A35', 'A31', ...

https://d33wubrfki0l68.cloudfront.net/fe9401ffa781298b9e0ad1a27b28778418fc1d6c/a6c5f/_images/08a66434a64ca1c4c7e6fff33a77ee38bc3b9514d602e5e447c68d012e926b74.svg

We need to decide which features to use for training the model.

Because every file is validated, all of them are indexed by ensembl_gene_id in the var slot of AnnData.

shared_genes = file1.features["var"] & file2.features["var"]
shared_genes_ensembl = shared_genes.list("ensembl_gene_id")
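
Conceptually, this step keeps only the genes that occur in both files. The following plain-Python sketch illustrates that idea without lamindb. The Ensembl IDs are example values chosen for illustration (not taken from the dataset), and the helper `shared_in_order` is hypothetical.

```python
# Example Ensembl gene IDs standing in for the `var` index of two files.
genes_file1 = ["ENSG00000139618", "ENSG00000141510", "ENSG00000157764"]
genes_file2 = ["ENSG00000141510", "ENSG00000157764", "ENSG00000171862"]


def shared_in_order(first, second):
    """Return the identifiers present in both lists, in the order of `first`."""
    second_set = set(second)  # set lookup keeps the membership test cheap
    return [gene for gene in first if gene in second_set]


shared = shared_in_order(genes_file1, genes_file2)
print(shared)
```

Keeping the order of the first list (rather than using an unordered set) means that every file subset with this list ends up with the same column order.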

Train the model

Let us load the first file into memory:

data_train1 = file1.load().raw[:, shared_genes_ensembl].to_adata()
data_train1

AnnData object with n_obs × n_vars = 1648 × 749
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'cell_type_ontology_term_id_colors', 'default_embedding', 'schema_version', 'title'
    obsm: 'X_umap'

Train the model on this first file:

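# register the AnnData object with scvi-tools before creating the model
scvi.model.SCVI.setup_anndata(data_train1)
vae = scvi.model.SCVI(data_train1)
vae.train(max_epochs=1)  # max_epochs=1 keeps the run short on CI; use more epochs for real training
vae.save("saved_models/scvi1")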

Load the second file and resume training the model:

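data_train2 = file2.load().raw[:, shared_genes_ensembl].to_adata()
# no setup_anndata call here: load() checks data_train2 against the saved setup
vae = scvi.model.SCVI.load("saved_models/scvi1", data_train2)
vae.train(max_epochs=1)
vae.save("saved_models/scvi1", overwrite=True)  # replace the checkpoint from the first file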
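
The pattern used above (train on one file, save a checkpoint, reload it, continue on the next file) extends to any number of files. Below is a minimal, self-contained sketch of that loop. It deliberately swaps scvi-tools for a toy linear model fitted by gradient descent so that it runs with the standard library only. The data, the helper `train_on_chunk`, and the JSON checkpoint format are all hypothetical.

```python
import json
import os
import random
import tempfile


def train_on_chunk(params, chunk, lr=0.2, epochs=100):
    """Run gradient descent for y = w * x + b on one chunk of (x, y) pairs."""
    w, b = params["w"], params["b"]
    n = len(chunk)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in chunk) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in chunk) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return {"w": w, "b": b}


# Two toy "files", each holding a chunk of noise-free data from y = 3x + 1.
rng = random.Random(0)
chunks = []
for _ in range(2):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]
    chunks.append([(x, 3.0 * x + 1.0) for x in xs])

checkpoint = os.path.join(tempfile.mkdtemp(), "model.json")
with open(checkpoint, "w") as f:
    json.dump({"w": 0.0, "b": 0.0}, f)  # untrained initial state

for chunk in chunks:
    with open(checkpoint) as f:
        params = json.load(f)  # resume from the last saved state
    params = train_on_chunk(params, chunk)
    with open(checkpoint, "w") as f:
        json.dump(params, f)  # overwrite the checkpoint

with open(checkpoint) as f:
    final_params = json.load(f)
print(final_params)
```

Only one chunk is held in memory at a time, while the checkpoint carries the model state from one chunk to the next.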

Save the model

weights = ln.File("saved_models/scvi1/model.pt", description="My trained model")
weights.save()

Save latent representation as a new dataset

latent1 = vae.get_latent_representation(data_train1)
latent2 = vae.get_latent_representation(data_train2)

adata_latent1 = ad.AnnData(X=latent1, obs=data_train1.obs)
adata_latent2 = ad.AnnData(X=latent2, obs=data_train2.obs)

INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup

Because the latent representation is low-dimensional, we can typically fit a very large number of observations into memory.

Hence, let’s store it as a single concatenated AnnData object.

adata_latent = ad.concat([adata_latent1, adata_latent2])
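
To make the memory argument concrete, here is a rough back-of-the-envelope calculation. It uses the shape of data_train1 shown earlier (1648 × 749) and rests on two assumptions: the model uses the scvi-tools default of 10 latent dimensions, and both matrices are stored as dense float32. The helper `dense_size_bytes` is hypothetical.

```python
n_obs, n_genes = 1648, 749  # shape of data_train1 as printed earlier
n_latent = 10               # assumption: scvi-tools default latent dimension
bytes_per_value = 4         # assumption: dense float32 storage


def dense_size_bytes(n_rows, n_cols, itemsize=bytes_per_value):
    """Size of a dense matrix in bytes, ignoring any container overhead."""
    return n_rows * n_cols * itemsize


expression_bytes = dense_size_bytes(n_obs, n_genes)
latent_bytes = dense_size_bytes(n_obs, n_latent)
reduction = expression_bytes / latent_bytes

print(f"expression matrix: {expression_bytes / 1e6:.2f} MB")
print(f"latent matrix:     {latent_bytes / 1e6:.2f} MB")
print(f"reduction factor:  {reduction:.1f}x")
```

Under these assumptions the reduction factor is simply n_genes / n_latent, so the gain grows with the number of genes kept for training.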

dataset_v2_latent = ln.Dataset(
    adata_latent,
    name="Latent representation of scRNA-seq dataset v2",
    description="For the original data, see dataset T5x0SkRJNviE0jYGbJKt",
)
dataset_v2_latent.save()

Let us look at the data flow:

dataset_v2_latent.view_flow()

https://d33wubrfki0l68.cloudfront.net/3814e7bce9e115318304de5fe26bc18b48db5f0d/475ed/_images/b674aa94a00774bfa073f6f9150bfc71dcaa07092678ddfde4798f9247230f56.svg

Compare this with the model:

weights.view_flow()

https://d33wubrfki0l68.cloudfront.net/e89ebb9ee933fe973371b9d70570ab57f5d9f30c/89926/_images/059144309b78fda2ba7949703ed6b1927422f69a43517474c58cf93486e4bd2c.svg

Annotate with labels:

dataset_v2_latent.labels.add_from(dataset_v2)

dataset_v2_latent.describe()

Dataset(id='DKompnkhURYfbOQonfMK', name='Latent representation of scRNA-seq dataset v2', description='For the original data, see dataset T5x0SkRJNviE0jYGbJKt', hash='E3vUjPQxbq0c5RJvbW1URA', updated_at=2023-10-10 15:44:16)

Provenance:
  💫 transform: Transform(id='Qr1kIHvK506rz8', name='Iteratively train an ML model on a dataset', short_name='scrna5', version='0', type=notebook, updated_at=2023-10-10 15:44:08, created_by_id='DzTjkKse')
  👣 run: Run(id='BQF9EdTZBeF1PaqhL7x9', run_at=2023-10-10 15:44:08, transform_id='Qr1kIHvK506rz8', created_by_id='DzTjkKse')
  📄 file: File(id='DKompnkhURYfbOQonfMK', suffix='.h5ad', accessor='AnnData', description='See dataset DKompnkhURYfbOQonfMK', size=220226, hash='E3vUjPQxbq0c5RJvbW1URA', hash_type='md5', updated_at=2023-10-10 15:44:16, storage_id='YpFBBtr4', transform_id='Qr1kIHvK506rz8', run_id='BQF9EdTZBeF1PaqhL7x9', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-10 15:41:26)
Features:
  external: FeatureSet(id='2wMUvgNO40eTHeB45gJn', n=5, registry='core.Feature', hash='LAqfNE-fOEP9Ai5h3Etp', updated_at=2023-10-10 15:44:16, modality_id='NIjDnou1', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v1', '10x 5' v2'
    🔗 tissue (17, bionty.Tissue): 'skeletal muscle tissue', 'bone marrow', 'omentum', 'jejunal epithelium', 'lamina propria', 'duodenum', 'sigmoid colon', 'thymus', 'caecum', 'lung', ...
    🔗 donor (12, core.ULabel): '582C', 'A36', 'D503', 'A37', 'A29', '640C', 'D496', 'A52', 'A35', 'A31', ...
    🔗 cell_type (39, bionty.CellType): 'germinal center B cell', 'macrophage', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'conventional dendritic cell', 'dendritic cell, human', 'CD16-negative, CD56-bright natural killer cell, human', 'group 3 innate lymphoid cell', 'monocyte', ...
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'skeletal muscle tissue', 'bone marrow', 'omentum', 'jejunal epithelium', 'lamina propria', 'duodenum', 'sigmoid colon', 'thymus', 'caecum', 'lung', ...
  🏷️ cell_types (39, bionty.CellType): 'germinal center B cell', 'macrophage', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'conventional dendritic cell', 'dendritic cell, human', 'CD16-negative, CD56-bright natural killer cell, human', 'group 3 innate lymphoid cell', 'monocyte', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v1', '10x 5' v2'
  🏷️ ulabels (12, core.ULabel): '582C', 'A36', 'D503', 'A37', 'A29', '640C', 'D496', 'A52', 'A35', 'A31', ...

# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna

💡 deleting instance testuser1/test-scrna
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env

✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna