scLinaX_preprocessing_example
scLinaX_preprocessing_example.Rmd
Preprocessing for scLinaX
scLinaX takes single-cell-level allele-specific expression (or chromatin accessibility) data with gene annotation as an input. In this vignette, we will show an example workflow for making an input with cellsnp-lite and Annovar as described in the scLinaX paper.
1. Data preparation
First, prepare the scRNA-seq (or scATAC-seq) dataset for scLinaX analysis. In this vignette, we use a multiome dataset provided on the 10X website. (https://www.10xgenomics.com)
Since we only use X choromosome data, we made bam files which include only reads derived from X chromosomes. Below, we show an example python script.
# Python script
# Extract the X chromosome
# Replace the suffix of the barcode (-1) with the data modality (RNA or ATAC)
# Renaming the barcode is actually unnecessary in this case, but it will be necessary if you want to merge multiple runs of the same sample.
# Therefore, we are changing the barcode name here as an example.
import pysam
import pandas as pd
import subprocess
bam_files=[]
DIR='/path/to/DL_data/'
BAM_LIST=['pbmc_granulocyte_sorted_10k_gex_possorted_bam','pbmc_granulocyte_sorted_10k_atac_possorted_bam']
for BAM in BAM_LIST:
input_bam=DIR+BAM+'.bam'
output_bam=BAM+'.barcode.renamed.X.bam'
bam_files.append(output_bam)
in_bam = pysam.AlignmentFile(input_bam, 'rb')
out_bam = pysam.AlignmentFile(output_bam, 'wb', template=in_bam)
if(BAM == 'pbmc_granulocyte_sorted_10k_gex_possorted_bam'):
for aln in in_bam.fetch("chrX"):
if aln.has_tag("CB"):
old_bc = aln.get_tag('CB')
new_bc = old_bc.replace("-1", "-RNA")
aln.set_tag('CB', new_bc)
out_bam.write(aln)
elif(BAM == 'pbmc_granulocyte_sorted_10k_atac_possorted_bam'):
for aln in in_bam.fetch("chrX"):
if aln.has_tag("CB"):
old_bc = aln.get_tag('CB')
new_bc = old_bc.replace("-1", "-ATAC")
aln.set_tag('CB', new_bc)
out_bam.write(aln)
in_bam.close()
out_bam.close()
subprocess.run(["samtools", "merge", "pbmc_granulocyte_sorted_10k_gex_plus_atac_possorted.barcode.renamed.X.bam"] + bam_files, check=True)
subprocess.run(["samtools", "sort", "-o", "Multiome_10x.RNA.ATAC.sorted.bam", "pbmc_granulocyte_sorted_10k_gex_plus_atac_possorted.barcode.renamed.X.bam"], check=True)
subprocess.run(["samtools", "index", "Multiome_10x.RNA.ATAC.sorted.bam"], check=True)
subprocess.run(["samtools", "sort", "-o", "Multiome_10x.RNA.sorted.bam", "pbmc_granulocyte_sorted_10k_gex_possorted_bam.barcode.renamed.X.bam"], check=True)
subprocess.run(["samtools", "index", "Multiome_10x.RNA.sorted.bam"], check=True)
subprocess.run(["samtools", "sort", "-o", "Multiome_10x.ATAC.sorted.bam", "pbmc_granulocyte_sorted_10k_atac_possorted_bam.barcode.renamed.X.bam"], check=True)
subprocess.run(["samtools", "index", "Multiome_10x.ATAC.sorted.bam"], check=True)
We also need lists of cell barcodes to use with cellsnp-lite. In this example, we used the barcode list generated by the Cell Ranger Arc software.
zcat /path/to/DL_data/filtered_feature_bc_matrix/barcodes.tsv.gz | \
sed 's/-1/-RNA/g' > Multiome_10x.RNA.filtered.barcode.list.tsv
zcat /path/to/DL_data/filtered_feature_bc_matrix/barcodes.tsv.gz | \
sed 's/-1/-ATAC/g' > Multiome_10x.ATAC.filtered.barcode.list.tsv
In the case of scATAC-seq analysis, it can be beneficial to remove duplicate sequences. For example, you can use a flag reported by the Cell Ranger software.
2. Run cellsnp-lite
Now, we are ready to run cellsnp-lite. In this example, we use it in variant calling mode [2a]. However, if you have genotype data, you can try the reference-based mode [1a]. You can also specify common variants defined in other projects (e.g., the 1000 Genome Project) and run it in reference-based mode, which is faster.
#scRNA-seq data info
barcode_rna="Multiome_10x.RNA.filtered.barcode.list.tsv"
bam_rna="Multiome_10x.RNA.sorted.bam"
UMITAG_rna="UB"
#run cellsnp for RNA
cellsnp-lite -s ${bam_rna} \
--minCOUNT 20 -b ${barcode_rna} \
--UMItag ${UMITAG_rna} -p 4 \
--minMAF 0.05 \
--minMAPQ 20 \
--refseq /path/to/refdata-cellranger-GRCh38-3.0.0/fasta/genome.fa \ #reference used in cell ranger arc
--chrom=chrX -O RNA.chrX.snp.call
#process output for RNA
Rscript /path/to/RCODE_process_cellsnp.r \
RNA.chrX.snp.call/cellSNP.samples.tsv \
RNA.chrX.snp.call/cellSNP.tag.AD.mtx \
RNA.chrX.snp.call/cellSNP.tag.DP.mtx \
RNA.chrX.snp.call/cellSNP.tag.OTH.mtx \
RNA.chrX.snp.call/cellSNP.base.vcf \
RNA.chrX.snp.call.summary.tsv.gz
We use custom R script RCODE_process_cellsnp.r
to
process the output.
#RCODE_process_cellsnp.r
library(dplyr)
library(readr)
library(stringr)
library(tidyr)
argv<-commandArgs(T)
SAMPLE_NAME=argv[1]
ADmtx=argv[2]
DPmtx=argv[3]
OTHmtx=argv[4]
VCF=argv[5]
OUT=argv[6]
sample_name<-read_tsv(SAMPLE_NAME,col_names=FALSE)
AD<-Matrix::readMM(ADmtx)
DP<-Matrix::readMM(DPmtx)
OTH<-Matrix::readMM(OTHmtx)
RD<-DP-AD
vcf<-read_tsv(VCF,comment="#",col_names=FALSE)
SNP_INFO<-mutate(vcf,SNP_ID=str_c(X1,":",X2,":",X4,":",X5))%>%
dplyr::select(SNP_ID,X1,X2,X4,X5)
colnames(SNP_INFO)<-c("SNP_ID","CHR","POS","REF","ALT")
alt_res<-AD%>%as.matrix()%>%as_tibble()
colnames(alt_res)<-sample_name$X1
alt_res$SNP<-SNP_INFO$SNP_ID
ref_res<-RD%>%as.matrix()%>%as_tibble()
colnames(ref_res)<-sample_name$X1
ref_res$SNP<-SNP_INFO$SNP_ID
oth_res<-OTH%>%as.matrix()%>%as_tibble()
colnames(oth_res)<-sample_name$X1
oth_res$SNP<-SNP_INFO$SNP_ID
alt_res<-alt_res%>%
pivot_longer(names_to="cell_barcode",values_to="ALTcount",-SNP)%>%
filter(ALTcount>0)
ref_res<-ref_res%>%
pivot_longer(names_to="cell_barcode",values_to="REFcount",-SNP)%>%
filter(REFcount>0)
oth_res<-oth_res%>%
pivot_longer(names_to="cell_barcode",values_to="OTHcount",-SNP)%>%
filter(OTHcount>0)
result<-full_join(ref_res,alt_res,by=c("SNP","cell_barcode"))%>%
full_join(oth_res,by=c("SNP","cell_barcode"))
result<-left_join(SNP_INFO,result,alt_res,by=c("SNP_ID"="SNP"))
result[is.na(result)]<-0
result<-dplyr::distinct(result)
write_tsv(result,OUT)
In the case of scATAC-seq data, we generally do not have UMI. Therefore, we need to make slight adjustments to the configuration.
#scATAC-seq data info
barcode_atac="Multiome_10x.ATAC.filtered.barcode.list.tsv"
bam_atac="Multiome_10x.ATAC.sorted.rmdup.bam"
UMITAG_atac="None" # Different from the scRNA-seq
#run cellsnp for ATAC
cellsnp-lite -s ${bam_atac} \
--minCOUNT 20 -b ${barcode_atac} \
--UMItag ${UMITAG_atac} -p 4 \
--minMAF 0.05 \
--minMAPQ 20 \
--refseq /path/to/refdata-cellranger-GRCh38-3.0.0/fasta/genome.fa \ #reference used in cell ranger arc
--chrom=chrX -O ATAC.chrX.snp.call
#process output for ATAC
Rscript /path/to/RCODE_process_cellsnp.r \
ATAC.chrX.snp.call/cellSNP.samples.tsv \
ATAC.chrX.snp.call/cellSNP.tag.AD.mtx \
ATAC.chrX.snp.call/cellSNP.tag.DP.mtx \
ATAC.chrX.snp.call/cellSNP.tag.OTH.mtx \
ATAC.chrX.snp.call/cellSNP.base.vcf \
ATAC.chrX.snp.call.summary.tsv.gz
3. Run ANNOVAR
ANNOVAR is a software to annotate various information on genetic variants. You can download and find the detailed documentation on their website (https://annovar.openbioinformatics.org/en/latest/). In addition to the gene-based annotation, we also added allele frequency information to the variants for subsequent quality control. Please download the annotations necessary for your analysis following the ANNOVAR User Guide (https://annovar.openbioinformatics.org/en/latest/user-guide/download/).
Let’s run ANNOVAR for genotype data derived from the scRNA-seq and scATAC-seq data.
# scRNA-seq
zcat RNA.chrX.snp.call.summary.tsv.gz | \
awk -F"\t" 'NR>1 {print $2"\t"$3"\t"$3"\t"$4"\t"$5"\t"$1}' | \
LANG=C sort | LANG=C uniq \
> RNA.chrX.snp.call.summary.annov.input
/path/to/annovar/table_annovar.pl \
RNA.chrX.snp.call.summary.annov.input \
/path/to/annovar/humandb/ -buildver hg38 \
-out RNA.chrX.snp.call \
-remove -protocol refGene,ensGene,avsnp150,ALL.sites.2015_08 \
-operation g,g,f,f -nastring NA -polish
# scATAC-seq
zcat ATAC.chrX.snp.call.summary.tsv.gz | \
awk -F"\t" 'NR>1 {print $2"\t"$3"\t"$3"\t"$4"\t"$5"\t"$1}' | \
LANG=C sort | LANG=C uniq \
> ATAC.chrX.snp.call.summary.annov.input
/path/to/annovar/table_annovar.pl \
ATAC.chrX.snp.call.summary.annov.input \
/path/to/annovar/humandb/ -buildver hg38 \
-out ATAC.chrX.snp.call \
-remove -protocol refGene,ensGene,avsnp150,ALL.sites.2015_08 \
-operation g,g,f,f -nastring NA -polish
4. Quality control
As a final step, we apply quality control (QC) to the allele-specific
expression (chromatin accessibility) data based on the Annovar
annotation. Below, we provide an example R script for QC, which filters
based on gene annotation and allele information. We can use
Multiome_RNA_QC_passed_SNP_df.tsv.gz
for scLinaX analysis.
Additionally, we can use
Multiome_ATAC_QC_passed_SNP_df.tsv.gz
for scLinaX-multi
analysis.
#RCODE_QC.r
library(tidyverse)
library(viridis)
library(ggpointdensity)
library(patchwork)
argv<-commandArgs(T)
RNA_SNPCELL=argv[1]
RNA_ANNOVAR=argv[2]
ATAC_SNPCELL=argv[3]
ATAC_ANNOVAR=argv[4]
if(FALSE){
#Example of input
RNA_SNPCELL="RNA.chrX.snp.call.summary.tsv.gz"
RNA_ANNOVAR="RNA.chrX.snp.call.hg38_multianno.txt"
ATAC_SNPCELL="ATAC.chrX.snp.call.summary.tsv.gz"
ATAC_ANNOVAR="ATAC.chrX.snp.call.hg38_multianno.txt"
}
#load raw data
rna<-read_tsv(RNA_SNPCELL)
atac<-read_tsv(ATAC_SNPCELL)
#load annotation data
load_anno<-function(ANNOVAR){
anno<-read_tsv(ANNOVAR)%>%
mutate(SNP_ID=str_c(Chr,Start,Ref,Alt,sep=":"))%>%
dplyr::select(SNP_ID,Func.refGene,Gene.refGene,ALL.sites.2015_08)
colnames(anno)<-c("SNP_ID","Region","Gene","ALL_Freq")
anno$ALL_Freq[is.na(anno$ALL_Freq)]<-0
return(anno)
}
rna_anno<-load_anno(RNA_ANNOVAR)
atac_anno<-load_anno(ATAC_ANNOVAR)
#merge raw data + annotation data
merge_anno<-function(data,anno){
df<-left_join(data,anno,by="SNP_ID")%>%
filter(!str_detect(Gene,pattern=";"))%>%
filter(!is.na(Gene))%>%filter(Region %in% c("intronic","UTR5","UTR3","exonic","ncRNA_exonic","ncRNA_intronic","splicing"))
return(df)
}
rna_df<-merge_anno(rna,rna_anno)
atac_df<-left_join(atac,atac_anno,by="SNP_ID")
rna_df_QC_pass<-filter(rna_df,ALL_Freq>0.01)
atac_df_QC_pass<-filter(atac_df,ALL_Freq>0.01)
write_tsv(rna_df_QC_pass,"Multiome_RNA_QC_passed_SNP_df.tsv.gz") #This QC-passed data should be used for the subsequent analyses
write_tsv(atac_df_QC_pass,"Multiome_ATAC_QC_passed_SNP_df.tsv.gz") #This QC-passed data should be used for the subsequent analyses