Variant Simulator


Variant Simulator is a tool for generating all possible single-base substitutions (SNPs) in protein-coding genes. It can be run for a whole species, a single chromosome or a single gene.

SNP generation can be restricted to introns only, exons only or coding exons only, optionally including a given number of bases on either side of each feature.

For each generated variant, Variant Simulator reports the gene symbol (or the gene stable_id) and the Ensembl ID of the feature.


Download and install

Variant Simulator is part of variation tools.

Note

Variant Simulator depends on database access for identifier and sequence retrieval and cannot be used in offline mode.


Usage

The output is written in VCF format; the INFO field contains the gene symbol (GENE) and the feature ID (FEATURE).
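
As an illustration of how this output can be consumed downstream, the following Python sketch splits one data line into its columns and expands the INFO field into key/value pairs. The function name parse_record is hypothetical and not part of Variant Simulator; the sample line is taken from the BRCA2 example in this document.

```python
# Column names as they appear in the output header line.
COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_record(line):
    """Split one tab-separated data line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(COLUMNS, fields))
    record["POS"] = int(record["POS"])
    # INFO is a semicolon-separated list of KEY=VALUE pairs.
    record["INFO"] = dict(item.split("=", 1) for item in record["INFO"].split(";"))
    return record

line = "13\t32315474\t13-32315474-G-A\tG\tA\t.\t.\tGENE=BRCA2;FEATURE=ENSG00000139618"
rec = parse_record(line)
print(rec["INFO"]["GENE"], rec["INFO"]["FEATURE"])  # prints "BRCA2 ENSG00000139618"
```

Header lines (those starting with #) are not handled here and would need to be skipped before calling the helper.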

Generate SNPs for a chromosome

# Running on one chromosome, default species is Homo sapiens:
./simulate_variation -chrom 2

# Running on one chromosome of another species (pig):
./simulate_variation -species pig -chrom 2

Output

# First rows of the output:
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
2	38814	2-38814-T-A	T	A	.	.	GENE=FAM110C;FEATURE=ENSG00000184731
2	38814	2-38814-T-C	T	C	.	.	GENE=FAM110C;FEATURE=ENSG00000184731
2	38814	2-38814-T-G	T	G	.	.	GENE=FAM110C;FEATURE=ENSG00000184731

Generate SNPs for a gene

# Running on one gene, default species is Homo sapiens:
./simulate_variation -gene ENSG00000139618

./simulate_variation -gene BRCA2

Output

# First rows of the output:
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
13	32315474	13-32315474-G-A	G	A	.	.	GENE=BRCA2;FEATURE=ENSG00000139618
13	32315474	13-32315474-G-C	G	C	.	.	GENE=BRCA2;FEATURE=ENSG00000139618
13	32315474	13-32315474-G-T	G	T	.	.	GENE=BRCA2;FEATURE=ENSG00000139618

Generate SNPs for a gene using exonsOnly

# Running on one gene using only the exons, default species is Homo sapiens:
./simulate_variation -gene BRCA2 -exonsOnly

Output

# First rows of the output:
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
13	32357742	13-32357742-C-A	C	A	.	.	GENE=BRCA2;FEATURE=ENSE00003719469
13	32357742	13-32357742-C-T	C	T	.	.	GENE=BRCA2;FEATURE=ENSE00003719469
13	32357742	13-32357742-C-G	C	G	.	.	GENE=BRCA2;FEATURE=ENSE00003719469

Generate SNPs for a gene using codingOnly exons

# Running on one gene using only the coding exons, default species is Homo sapiens:
./simulate_variation -gene BRCA2 -codingOnly

Output

# First rows of the output:
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
13	32325076	13-32325076-G-A	G	A	.	.	GENE=BRCA2;FEATURE=ENSE00003659301
13	32325076	13-32325076-G-C	G	C	.	.	GENE=BRCA2;FEATURE=ENSE00003659301
13	32325076	13-32325076-G-T	G	T	.	.	GENE=BRCA2;FEATURE=ENSE00003659301

Generate SNPs for a gene using codingOnly exons with 5bp upstream/downstream of each exon

# Running on one gene using only the coding exons with 5bp flanks, default species is Homo sapiens:
./simulate_variation -gene BRCA2 -codingOnly -edge 5

Output

# First rows of the output:
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
13	32325071	13-32325071-T-A	T	A	.	.	GENE=BRCA2;FEATURE=ENSE00003659301
13	32325071	13-32325071-T-C	T	C	.	.	GENE=BRCA2;FEATURE=ENSE00003659301
13	32325071	13-32325071-T-G	T	G	.	.	GENE=BRCA2;FEATURE=ENSE00003659301
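
The first position reported with -edge 5 (32325071) lies 5 bp before the first position reported with -codingOnly alone (32325076), which is consistent with each feature interval being widened by the edge value on both sides. The Python sketch below shows that arithmetic only; it is an assumption about the behaviour, not the tool's code. The helper name expand and the exon end coordinate are made up for illustration.

```python
def expand(start, end, edge=0):
    """Widen a 1-based inclusive interval by `edge` bases on each side.

    Hypothetical helper: it clamps the start at 1 and does not model
    chromosome ends or overlapping features.
    """
    return max(1, start - edge), end + edge

# The start comes from the -codingOnly output in this document;
# the end is a placeholder, not the real exon end.
exon_start, exon_end = 32325076, 32325100
new_start, new_end = expand(exon_start, exon_end, edge=5)
print(new_start, new_end)  # prints "32325071 32325105"
```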

Output

Output is in VCF format. For each position, three lines are created (one per alternate allele), with the following columns:
  • CHROM: chromosome name
  • POS: variant position
  • ID: string concatenation of chrom-pos-ref-alt
  • REF: reference allele
  • ALT: alternate allele
  • QUAL: empty (.)
  • FILTER: empty (.)
  • INFO: GENE= holds the gene symbol if one exists, otherwise the Ensembl gene stable_id; FEATURE= holds the gene or exon stable_id, or the intron display_id
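
The column layout above can be sketched in Python as follows. This is a minimal illustration of the format under stated assumptions, not the tool's implementation: the function name simulate_position is hypothetical, and the alternate alleles are emitted here in alphabetical order, whereas the order in real output may differ (the exonsOnly example lists A, T, G).

```python
BASES = "ACGT"

def simulate_position(chrom, pos, ref, gene, feature):
    """Return the three VCF data lines for one reference base (hypothetical helper)."""
    lines = []
    for alt in BASES:
        if alt == ref:
            continue  # a substitution must change the base
        variant_id = f"{chrom}-{pos}-{ref}-{alt}"  # ID: chrom-pos-ref-alt
        info = f"GENE={gene};FEATURE={feature}"
        # QUAL and FILTER are always empty (".").
        lines.append("\t".join([str(chrom), str(pos), variant_id, ref, alt, ".", ".", info]))
    return lines

rows = simulate_position("2", 38814, "T", "FAM110C", "ENSG00000184731")
for row in rows:
    print(row)
```

With these inputs the first line matches the first data row of the chromosome 2 example in this document.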

Options

Flag           Alternate  Description
--chrom        -chr       Chromosome name to restrict script to.
--gene         -g         Gene symbol or gene Ensembl stable_id to restrict script to.
--species      -s         Species to use. Default value: homo_sapiens
--assembly     -a         Assembly to use if species is homo_sapiens. Default value: grch38
--refseq                  Use RefSeq genes/transcripts if species is human.
--registry                File containing database connections in Ensembl registry format (see Ensembl Registry). Default value: connect to latest public Ensembl database
--exonsOnly               Generate all possible SNPs for exons only.
--intronsOnly             Generate all possible SNPs for introns only.
--codingOnly              Generate all possible SNPs for coding exons only.
--edge                    Number of bases upstream and downstream of each feature to include. Default value: 0
--output_file  -o         Output file. Default value: simulated.vcf
--help                    Help usage message