Look up records of Bionty entities#

Entities and ontologies can be complex, each with many different identifiers.

Here we show Bionty’s lookup model, which applies to entities such as species, genes, proteins and cell markers, using genes and cell types as examples. You’ll see how to

  • access the reference table via .df()

  • look up an entity term via .lookup()

  • search for an entity term via .search()

import bionty as bt

.fields: fields of an ontology reference#

gene_bionty = bt.Gene()

gene_bionty


Gene
Species: human
Source: ensembl, release-109

📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of terms
🎯 Gene.search(): free text search of terms
🧐 Gene.inspect(): check if identifiers are mappable
👽 Gene.map_synonyms(): map synonyms to standardized names
🔗 Gene.ontology: Pronto.Ontology object
gene_bionty.fields
{'biotype',
 'description',
 'ensembl_gene_id',
 'hgnc_id',
 'ncbi_gene_id',
 'symbol',
 'synonyms'}

Fields can be accessed as attributes, which enables autocompletion. You can pass them to the field parameter of any Bionty function instead of strings:

gene_bionty.ncbi_gene_id
ncbi_gene_id

.df(): reference table#

Data scientists love DataFrames, and every entity has a reference table containing all the fields.

df = gene_bionty.df()
df.head()
   ensembl_gene_id  symbol    ncbi_gene_id  hgnc_id     biotype         description                                        synonyms
0  ENSG00000000003  TSPAN6    7105          HGNC:11858  protein_coding  tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]  TSPAN-6|T245|TM4SF6
1  ENSG00000000005  TNMD      64102         HGNC:17757  protein_coding  tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]    tendin|ChM1L|TEM|myodulin|BRICD4
2  ENSG00000000419  DPM1      8813          HGNC:3005   protein_coding  dolichyl-phosphate mannosyltransferase subunit...  CDGIE|MPDS
3  ENSG00000000457  SCYL3     57147         HGNC:19285  protein_coding  SCY1 like pseudokinase 3 [Source:HGNC Symbol;A...  PACE1|PACE-1
4  ENSG00000000460  C1orf112  55732         HGNC:25565  protein_coding  chromosome 1 open reading frame 112 [Source:HG...  FLJ10706

To access the records for several gene symbols at once, select the corresponding rows with pandas:

df.set_index("symbol").loc[["LMNA", "TCF7", "BRCA1"]]
        ensembl_gene_id  ncbi_gene_id  hgnc_id     biotype         description                                        synonyms
symbol
LMNA    ENSG00000160789  4000          HGNC:6636   protein_coding  lamin A/C [Source:HGNC Symbol;Acc:HGNC:6636]       PRO1|LMNL1|MADA|CMD1A|HGPS|LMN1|LGMD1B
TCF7    ENSG00000081059  6932          HGNC:11639  protein_coding  transcription factor 7 [Source:HGNC Symbol;Acc...  TCF-1
BRCA1   ENSG00000012048  672           HGNC:1100   protein_coding  BRCA1 DNA repair associated [Source:HGNC Symbo...  PPP1R53|RNF53|FANCS|BRCC1
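
The synonyms column stores alternative names as a single pipe-delimited string, so finding the standardized symbol for a synonym means splitting that string first. The following is a minimal sketch in plain Python, using three rows copied from the table above. `build_synonym_map` and `standardize` are hypothetical helpers written for illustration; they are not part of Bionty, and Bionty's own `Gene.map_synonyms()` may behave differently.

```python
# Rows copied from the reference table above: (symbol, pipe-delimited synonyms).
ROWS = [
    ("LMNA", "PRO1|LMNL1|MADA|CMD1A|HGPS|LMN1|LGMD1B"),
    ("TCF7", "TCF-1"),
    ("BRCA1", "PPP1R53|RNF53|FANCS|BRCC1"),
]


def build_synonym_map(rows):
    """Map every synonym to its standardized symbol (hypothetical helper)."""
    mapping = {}
    for symbol, synonyms in rows:
        for synonym in synonyms.split("|"):
            if synonym:  # skip empty entries
                mapping[synonym] = symbol
    return mapping


def standardize(names, mapping):
    """Replace known synonyms with their symbol; leave other names unchanged."""
    return [mapping.get(name, name) for name in names]


synonym_map = build_synonym_map(ROWS)
print(standardize(["TCF-1", "FANCS", "LMNA", "UNKNOWN"], synonym_map))
# prints ['TCF7', 'BRCA1', 'LMNA', 'UNKNOWN']
```

Names that are neither a known synonym nor a symbol pass through unchanged, which keeps the sketch non-destructive.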

.lookup(): look up terms and records with autocompletion#

Terms can be searched with auto-complete using a lookup object.

lookup = gene_bionty.lookup()

The lookup object provides a dot accessor for normalized terms (lower case, containing only alphanumeric characters and underscores):

lookup.tcf7
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', hgnc_id='HGNC:11639', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')
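
The normalization described above (lower case, only alphanumeric characters and underscores) can be sketched as follows. This is an assumption about the rule rather than Bionty's actual implementation: `normalize_key` is a hypothetical name, and details such as how leading digits or repeated separators are handled may differ.

```python
import re


def normalize_key(term: str) -> str:
    """Lower-case a term and replace every other character with an underscore."""
    return re.sub(r"[^0-9a-z_]", "_", term.lower())


print(normalize_key("TCF7"))        # prints tcf7
print(normalize_key("HGNC:10478"))  # prints hgnc_10478
```

Because different original strings can collapse to the same normalized key, the exact-string dictionary access remains useful alongside the dot accessor.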

To look up the exact original strings, convert the lookup object to a dict and use the bracket [] accessor, which also supports autocompletion:

lookup_dict = lookup.dict()
lookup_dict["TCF7"]
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', hgnc_id='HGNC:11639', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')

By default, lookup keys are generated from the entity's name field (for genes this is the symbol, as in the examples above).

You can specify another field to look up:

lookup = gene_bionty.lookup(gene_bionty.hgnc_id)

If multiple entries are matched, they are returned as a list:

lookup.hgnc_10478
[Gene(ensembl_gene_id='ENSG00000231321', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000206289', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000227322', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000204231', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000235712', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000228333', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP')]
lookup_dict = lookup.dict()
lookup_dict["HGNC:10478"]
[Gene(ensembl_gene_id='ENSG00000231321', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000206289', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000227322', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000204231', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000235712', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
 Gene(ensembl_gene_id='ENSG00000228333', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP')]
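
Several records can share one key, as with the six Ensembl IDs listed for HGNC:10478 above. Below is a minimal sketch of how such a lookup could be built: group records by the chosen field, then return a single record for unique keys and a list otherwise. `Record` and `build_lookup` are hypothetical names, and the records are trimmed to two fields; Bionty's internals may differ.

```python
from collections import defaultdict, namedtuple

# Trimmed-down stand-in for a Gene record (hypothetical; real records have more fields).
Record = namedtuple("Record", ["ensembl_gene_id", "hgnc_id"])

RECORDS = [
    Record("ENSG00000081059", "HGNC:11639"),  # TCF7
    Record("ENSG00000231321", "HGNC:10478"),  # RXRB
    Record("ENSG00000206289", "HGNC:10478"),  # RXRB
]


def build_lookup(records, field):
    """Group records by `field`; unique keys map to a record, shared keys to a list."""
    grouped = defaultdict(list)
    for record in records:
        grouped[getattr(record, field)].append(record)
    return {key: recs[0] if len(recs) == 1 else recs for key, recs in grouped.items()}


lookup_by_hgnc = build_lookup(RECORDS, "hgnc_id")
print(lookup_by_hgnc["HGNC:11639"])       # a single Record
print(len(lookup_by_hgnc["HGNC:10478"]))  # prints 2
```

Code that consumes such a lookup has to handle both return shapes, for example by checking `isinstance(result, list)`.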

.search(): search a term against a field#

celltype_bionty = bt.CellType()


celltype_bionty.search("cytotoxic T cells")
SearchResult(name='cytotoxic T cell', ontology_id='CL:0000910', definition='A Mature T Cell That Differentiated And Acquired Cytotoxic Function With The Phenotype Perforin-Positive And Granzyme-B Positive.', synonyms='cytotoxic T lymphocyte|cytotoxic T-lymphocyte|cytotoxic T-cell', children=array([], dtype=object))

By default, search also matches against each of the synonyms:

celltype_bionty.search("P cell")
SearchResult(name='nodal myocyte', ontology_id='CL:0002072', definition='A Specialized Cardiac Myocyte In The Sinoatrial And Atrioventricular Nodes. The Cell Is Slender And Fusiform Confined To The Nodal Center, Circumferentially Arranged Around The Nodal Artery.', synonyms='cardiac pacemaker cell|myocytus nodalis|P cell', children=array(['CL:1000409', 'CL:1000410'], dtype=object))

You can turn off synonym matching with synonyms_field=None:

celltype_bionty.search("P cell", synonyms_field=None)
SearchResult(name='PP cell', ontology_id='CL:0000696', definition='A Cell That Stores And Secretes Pancreatic Polypeptide Hormone.', synonyms='type F enteroendocrine cell', children=array(['CL:0002680'], dtype=object))
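
The effect of synonym matching can be illustrated with a small stand-in for the search. The sketch below scores similarity with `difflib.SequenceMatcher` from the Python standard library; that choice is an assumption made for illustration and is not necessarily the matcher Bionty uses. `similarity` and `best_match` are hypothetical helpers, and the names and synonyms are copied from the outputs above.

```python
from difflib import SequenceMatcher

# name -> pipe-delimited synonyms, copied from the search results above
CELL_TYPES = {
    "nodal myocyte": "cardiac pacemaker cell|myocytus nodalis|P cell",
    "PP cell": "type F enteroendocrine cell",
    "cytotoxic T cell": "cytotoxic T lymphocyte|cytotoxic T-lymphocyte|cytotoxic T-cell",
}


def similarity(a: str, b: str) -> float:
    """Similarity between two strings on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def best_match(query: str, use_synonyms: bool = True) -> str:
    """Return the entry whose name (or, optionally, any synonym) best matches the query."""

    def score(name: str) -> float:
        candidates = [name]
        if use_synonyms:
            candidates += CELL_TYPES[name].split("|")
        return max(similarity(query, candidate) for candidate in candidates)

    return max(CELL_TYPES, key=score)


print(best_match("P cell"))                      # prints nodal myocyte
print(best_match("P cell", use_synonyms=False))  # prints PP cell
```

With synonyms enabled, the exact synonym "P cell" wins; with names only, the closest name "PP cell" wins, mirroring the two results above.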

Match against another field (default is “name”):

celltype_bionty.search("CD8+ alpha beta T cells", field=celltype_bionty.definition)
SearchResult(definition='A T Cell That Expresses An Alpha-Beta T Cell Receptor Complex.', ontology_id='CL:0000789', name='alpha-beta T cell', synonyms='alpha-beta T-cell|alpha-beta T-lymphocyte|alpha-beta T lymphocyte', children=array(['CL:0000790', 'CL:0000791'], dtype=object))

Return all results as a DataFrame ranked by matching ratios:

celltype_bionty.search("P cell", return_ranked_results=True).head()
                                       ontology_id  definition                                         synonyms                                           children                                           __ratio__
name
nodal myocyte                          CL:0002072   A Specialized Cardiac Myocyte In The Sinoatria...  cardiac pacemaker cell|myocytus nodalis|P cell     [CL:1000409, CL:1000410]                           100.000000
double-positive, alpha-beta thymocyte  CL:0000809   A Thymocyte Expressing The Alpha-Beta T Cell R...  DP cell|DP thymocyte|double-positive, alpha-be...  [CL:0002430, CL:0002427, CL:0002431, CL:000242...  92.307692
PP cell                                CL:0000696   A Cell That Stores And Secretes Pancreatic Pol...  type F enteroendocrine cell                        [CL:0002680]                                       92.307692
pigmented ciliary epithelial cell      CL:0002303   A Cell That Is Part Of Pigmented Ciliary Epith...  PE cell                                            []                                                 92.307692
GIP cell                               CL:0002278   An Enteroendocrine Cell Of Duodenum And Jejunu...  type K enteroendocrine cell                        []                                                 85.714286

Tied results are all returned:

celltype_bionty.search("A cell", synonyms_field=None)
[SearchResult(name='T cell', ontology_id='CL:0000084', definition='A Type Of Lymphocyte Whose Defining Characteristic Is The Expression Of A T Cell Receptor Complex.', synonyms='T-cell|T lymphocyte|T-lymphocyte', children=array(['CL:0000798', 'CL:0002420', 'CL:0002419', 'CL:0000789'],
       dtype=object)),
 SearchResult(name='B cell', ontology_id='CL:0000236', definition='A Lymphocyte Of B Lineage That Is Capable Of B Cell Mediated Immunity.', synonyms='B lymphocyte|B-lymphocyte|B-cell', children=array(['CL:0009114', 'CL:0001201'], dtype=object))]
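
Ranking by ratio and returning ties can be sketched in a few lines: score every name, sort by ratio, and keep every entry that shares the top score. `difflib.SequenceMatcher` is a stand-in chosen for illustration, and `ranked` and `top_matches` are hypothetical helpers, so the scores are not guaranteed to equal Bionty's.

```python
from difflib import SequenceMatcher

NAMES = ["T cell", "B cell", "PP cell", "nodal myocyte"]


def ranked(query, names):
    """Return (name, ratio) pairs sorted from best to worst match."""
    scored = [(name, 100 * SequenceMatcher(None, query, name).ratio()) for name in names]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top_matches(query, names):
    """Return every name tied for the highest ratio."""
    results = ranked(query, names)
    best = results[0][1]
    return [name for name, ratio in results if ratio == best]


print(top_matches("A cell", NAMES))  # prints ['T cell', 'B cell']
```

Both "T cell" and "B cell" differ from the query by a single character, so they receive the same ratio and are returned together.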