Considerations on Gene symbol and Ensembl ID mapping

#576
by lorenzocampini - opened

Hello there,

I was wondering if you have ever come across the problem of mapping gene symbols to Ensembl IDs before the tokenization process. Since some symbols map to more than one Ensembl ID, I was wondering if you have a recommended way to proceed. Do you think one should keep only the first Ensembl ID mapped to a symbol, or keep all the Ensembl IDs mapped to that gene symbol and treat them as different genes (with the risk of that influencing the embedding of the cell)?

My goal is indeed to embed the single cells and then do in silico perturbation. I am not focusing on gene embedding ATM.

Moreover, how do you think the embedding is going to be influenced if I choose one option or the other?

Thanks for the amazing work with this tool, btw.

Thank you for your question. Yes, gene symbols are problematic due to nonstandardized naming schemes and multiple different versions that are not tracked in datasets. We recommend conversion to Ensembl IDs to ensure genes are mapped consistently. If you are using the pretrained model we provide here, the important thing is for the genes to be mapped to the same Ensembl ID as what the model saw during pretraining; otherwise their values may be misinterpreted as expression levels for the wrong gene. We use the mapping dict and name id dict provided in the geneformer directory in this repository to help map gene names to a consistent Ensembl ID. If possible, it would be best to map the genes using this system, and/or to check genes that are not mappable to determine whether they have an Ensembl ID that is most commonly used, or which position the original authors aligned to when they referred to the given gene name.
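
As a rough illustration of the idea in the reply above (not the repository's actual code), here is a minimal Python sketch. It assumes you have already loaded two things yourself: a dict from gene symbol to a list of candidate Ensembl IDs, and the set of Ensembl IDs the pretrained model saw. The function name `resolve_symbols`, both inputs, and the toy IDs are hypothetical placeholders. A symbol is accepted only when exactly one of its candidates is known to the model; everything else is set aside for manual checking rather than guessed.

```python
def resolve_symbols(symbols, symbol_to_candidates, model_ensembl_ids):
    """Map each gene symbol to one Ensembl ID known to the model.

    symbols: iterable of gene symbols from the dataset.
    symbol_to_candidates: dict of symbol -> list of candidate Ensembl IDs
        (hypothetical input, built from whatever annotation you use).
    model_ensembl_ids: set of Ensembl IDs seen during pretraining
        (hypothetical input, loaded from the model's own dictionaries).

    Returns (resolved, needs_review):
        resolved: symbol -> single Ensembl ID
        needs_review: symbol -> candidate list, for symbols that are
            unmapped or still ambiguous after filtering.
    """
    resolved = {}
    needs_review = {}
    for symbol in symbols:
        candidates = symbol_to_candidates.get(symbol, [])
        # Keep only candidates the pretrained model actually knows.
        known = [c for c in candidates if c in model_ensembl_ids]
        if len(known) == 1:
            resolved[symbol] = known[0]
        else:
            # Zero or several known candidates: do not guess.
            needs_review[symbol] = candidates
    return resolved, needs_review


# Toy data with placeholder IDs (not real Ensembl IDs).
symbol_to_candidates = {
    "GENE_A": ["ENSG_TOY_1"],
    "GENE_B": ["ENSG_TOY_2", "ENSG_TOY_3"],
    "GENE_C": ["ENSG_TOY_4", "ENSG_TOY_5"],
}
model_ensembl_ids = {"ENSG_TOY_1", "ENSG_TOY_2", "ENSG_TOY_4", "ENSG_TOY_5"}

resolved, needs_review = resolve_symbols(
    ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    symbol_to_candidates,
    model_ensembl_ids,
)
```

In this toy run, `GENE_B` is disambiguated because only one of its candidates is known to the model, while `GENE_C` (two known candidates) and `GENE_D` (no mapping) land in `needs_review`, which is where checking the original authors' annotation would come in.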

ctheodoris changed discussion status to closed
