Append a new batch of data#

We have one file in storage and are about to receive a new batch of data.

In this notebook, we’ll see how to manage the situation.

import lamindb as ln
import lnschema_bionty as lb
import readfcs

lb.settings.species = "human"

💡 loaded instance: testuser1/test-facs (lamindb 0.55.2)

ln.track()

💡 notebook imports: anndata==0.9.2 lamindb==0.55.2 lnschema_bionty==0.31.2 pytometry==0.1.4 readfcs==1.1.7 scanpy==1.9.5

💡 Transform(id='SmQmhrhigFPLz8', name='Append a new batch of data', short_name='facs2', version='0', type=notebook, updated_at=2023-10-10 15:45:37, created_by_id='DzTjkKse')

💡 Run(id='q7FG1FkyFAdnRdyp17Hk', run_at=2023-10-10 15:45:37, transform_id='SmQmhrhigFPLz8', created_by_id='DzTjkKse')

Ingest a new file#

Access #

Let us validate and register another .fcs file from Oetjen18:

filepath = readfcs.datasets.Oetjen18_t1()

adata = readfcs.read(filepath)

adata

AnnData object with n_obs × n_vars = 241552 × 20
    var: 'n', 'channel', 'marker', '$PnR', '$PnB', '$PnE', '$PnV', '$PnG'
    uns: 'meta'

Transform: normalize #

import anndata as ad
import pytometry as pm

pm.pp.split_signal(adata, var_key="channel")

pm.pp.compensate(adata)

pm.tl.normalize_biExp(adata)

adata = adata[  # subset to rows that do not have nan values
    adata.to_df().isna().sum(axis=1) == 0
]

adata.to_df().describe()

	CD95	CD8	CD27	CXCR4	CCR7	LIVE/DEAD	CD4	CD45RA	CD3	CD49B	CD14/19	CD69	CD103
count	241552.000000	241552.000000	241552.000000	241552.000000	241552.000000	241552.000000	241552.000000	241552.000000	241552.000000	241552.000000	241552.000000	241552.000000	241552.000000
mean	887.579860	1302.985717	1221.257257	877.533482	977.505533	1883.358298	556.687953	929.493316	941.166747	966.012244	1210.769935	741.523184	1003.064857
std	573.549695	827.850302	672.851319	411.966073	584.217139	932.113729	480.875917	795.550133	658.984751	456.437094	694.622980	473.287558	642.728024
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	462.757715	493.413744	605.463427	588.047798	495.437303	1063.670965	240.623098	404.087640	477.932659	592.294399	575.401173	380.247262	475.108131
50%	774.350833	1207.624048	1110.367681	782.939692	782.981430	1951.855099	484.355203	557.904360	655.909639	800.280049	1124.574275	705.802991	775.101973
75%	1327.792103	2036.849496	1721.730010	1070.479036	1453.929567	2623.975657	729.754419	1345.771633	1218.445208	1347.042403	1742.288464	1069.175380	1420.744291
max	4053.903716	4065.495666	4095.351322	4025.827267	3999.075551	4096.000000	4088.719985	3961.255364	3940.061146	4089.445928	3982.769373	3810.774988	4023.968008

Validate cell markers #

Let’s see how many markers validate:

validated = lb.CellMarker.validate(adata.var.index)

❗ 9 terms (69.20%) are not validated for name: CD95, CXCR4, CCR7, LIVE/DEAD, CD4, CD49B, CD14/19, CD69, CD103

Let’s standardize and re-validate:

adata.var.index = lb.CellMarker.standardize(adata.var.index)
validated = lb.CellMarker.validate(adata.var.index)

❗ 7 terms (53.80%) are not validated for name: CD95, CXCR4, LIVE/DEAD, CD49B, CD14/19, CD69, CD103

Next, register non-validated markers from Bionty:

records = lb.CellMarker.from_values(adata.var.index[~validated])
ln.save(records)

❗ did not create CellMarker records for 2 non-validated names: 'CD14/19', 'LIVE/DEAD'

Manually create 1 marker:

lb.CellMarker(name="CD14/19").save()

❗ record with similar name exist! did you mean to load it?

	id	synonyms	__ratio__
name
Cd14	roEbL8zuLC5k		90.0

Move metadata to obs:

validated = lb.CellMarker.validate(adata.var.index)
adata.obs = adata[:, ~validated].to_df()
adata = adata[:, validated].copy()

❗ 1 term (7.70%) is not validated for name: LIVE/DEAD

Now all markers pass validation:

validated = lb.CellMarker.validate(adata.var.index)
assert all(validated)

Register #

modalities = ln.Modality.lookup()
features = ln.Feature.lookup()
efs = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
markers = lb.CellMarker.lookup()

file = ln.File.from_anndata(
    adata,
    description="Oetjen18_t1",
    field=lb.CellMarker.name,
    modality=modalities.protein,
)

file.save()

file.labels.add(efs.fluorescence_activated_cell_sorting, features.assay)
file.labels.add(species.human, features.species)

file.features

Features:
  var: FeatureSet(id='QiBxVMXGEKDjPeu6mND4', n=12, type='number', registry='bionty.CellMarker', hash='fUG1EYvGEKSvt_GuvNaA', updated_at=2023-10-10 15:45:44, modality_id='3TDhLk86', created_by_id='DzTjkKse')
    'CD103', 'CD14/19', 'CD3', 'CD69', 'CD95', 'CD8', 'CXCR4', 'CD27', 'Cd4', 'CD49B', 'Ccr7', 'CD45RA'
  external: FeatureSet(id='7JULwhnrxdEE6EhiTQ6p', n=2, registry='core.Feature', hash='GMD7vVUijgIvAGay7AZf', updated_at=2023-10-10 15:45:44, modality_id='UxkvT8BD', created_by_id='DzTjkKse')
    🔗 assay (1, bionty.ExperimentalFactor): 'fluorescence-activated cell sorting'
    🔗 species (1, bionty.Species): 'human'

View data flow:

file.view_flow()

https://d33wubrfki0l68.cloudfront.net/c54df664b6aeed35e84299b8120374fff35e1239/2ba19/_images/1ae1f21fb7be0f2a07ad52de998b72ce55ff9a50b20c8d550509af0811afe720.svg

Inspect a PCA fo QC - this dataset looks much like noise:

import scanpy as sc

sc.pp.pca(adata)
sc.pl.pca(adata, color=markers.cd8.name)

https://d33wubrfki0l68.cloudfront.net/b63f88c63d3193f6829986a504482f52cfedd7b0/c8b11/_images/c6cadcbb225291f93527cc432bd80436402abc9d44540dea35e7ab9166941a40.png

Create a new version of the dataset by appending a file#

Query the old version:

dataset_v1 = ln.Dataset.filter(name="My versioned cytometry dataset").one()

dataset_v2 = ln.Dataset(
    [file, dataset_v1.file], is_new_version_of=dataset_v1, version="2"
)

dataset_v2

Dataset(id='QerZGf4taLaLa0UCUjCk', name='My versioned cytometry dataset', version='2', hash='ZKQxIw0uAvtMtdZk8SAj', transform_id='SmQmhrhigFPLz8', run_id='q7FG1FkyFAdnRdyp17Hk', initial_version_id='QerZGf4taLaLa0UCUjEO', created_by_id='DzTjkKse')

dataset_v2.features

Features:
  var: FeatureSet(id='OyXpU6niI86lQokhQFvn', n=47, type='number', registry='bionty.CellMarker', hash='KwNe7q-_6MK4WdcHsRuJ', created_by_id='DzTjkKse')
    'CD20', 'Cd14', 'CD27', 'CD94', 'CD38', 'Cd19', 'CD161', 'Ccr6', 'CD33', 'ICOS', 'CD24', 'TCRgd', 'CD25', 'Cd4', 'CXCR5', 'CD57', 'PD1', 'CD11B', 'CD85j', 'CD86', ...
  external: FeatureSet(id='7JULwhnrxdEE6EhiTQ6p', n=2, registry='core.Feature', hash='GMD7vVUijgIvAGay7AZf', updated_at=2023-10-10 15:45:44, modality_id='UxkvT8BD', created_by_id='DzTjkKse')
    🔗 assay (0, bionty.ExperimentalFactor): 
    🔗 species (0, bionty.Species): 
  obs: FeatureSet(id='P9uMy6qwTGgqSC57SXSB', n=5, registry='core.Feature', hash='HZvsq45MXE5c1h84OUmQ', updated_at=2023-10-10 15:45:28, modality_id='UxkvT8BD', created_by_id='DzTjkKse')
    Cell_length (number)
    Dead (number)
    (Ba138)Dd (number)
    Bead (number)
    Time (number)

dataset_v2

Dataset(id='QerZGf4taLaLa0UCUjCk', name='My versioned cytometry dataset', version='2', hash='ZKQxIw0uAvtMtdZk8SAj', transform_id='SmQmhrhigFPLz8', run_id='q7FG1FkyFAdnRdyp17Hk', initial_version_id='QerZGf4taLaLa0UCUjEO', created_by_id='DzTjkKse')

dataset_v2.save()

dataset_v2.labels.add(efs.fluorescence_activated_cell_sorting, features.assay)
dataset_v2.labels.add(species.human, features.species)

dataset_v2.view_flow()

https://d33wubrfki0l68.cloudfront.net/2ebcd69807eaf3fd037036130fa5358437ccacdd/518cd/_images/df93c90c6eb9d21a7a2b48f071555a310c6f2f6c0d865119fd2dba9c5a8c20eb.svg