2.3 User Input Examples
$ tfbs_footprinter PATH_TO/sample_ids.txt
$ tfbs_footprinter PATH_TO/sample_ids.txt -s homo_sapiens -g mammals -pb 900 -pa 100 -l 5 -c 2 -tx 10 -o PATH_TO/Results/
2.4 Arguments
Executing tfbs_footprinter version 1.0.0b23.
usage: tfbs_footprinter [-h] [--tf_ids_file] [--target_species]
[--species_group] [--coverage] [--promoter_before_tss]
[--promoter_after_tss] [--locality_threshold]
[--conservation_min] [--top_x_tfs] [--output_dir]
TFBS Footprinting - Identification of conserved vertebrate transcription factor binding sites (TFBSs).
See https://github.com/thirtysix/TFBS_footprinting for additional usage instructions.
------------------------------------------------------------------------------------------------------
Example Usage:
simplest:
tfbs_footprinter PATH_TO/sample_ids.txt
all arguments:
tfbs_footprinter PATH_TO/sample_ids.txt -tfs PATH_TO/tf_ids.txt -s homo_sapiens -g mammals -e low -pb 900 -pa 100 -l 5 -c 2 -tx 10 -o PATH_TO/Results/
------------------------------------------------------------------------------------------------------
positional arguments:
Required: Location of a file containing Ensembl
target_species transcript ids (see sample file
sample_ids.txt at
https://github.com/thirtysix/TFBS_footprinting)
optional arguments:
-h, --help show this help message and exit
--tf_ids_file , -tfs
Optional: Location of a file containing a limited list
of Jaspar TFs to use in scoring alignment (see sample
file tf_ids.txt at
https://github.com/thirtysix/TFBS_footprinting)
[default: all Jaspar TFs]
--target_species , -s
[default: "homo_sapiens"] - Target species (string),
options are located at
(https://github.com/thirtysix/TFBS_footprinting/blob/master/README.md#species).
Conservation of TFs across other species will be based
on identifying them in this species first.
--species_group , -g
("mammals", "primates", "sauropsids", or "fish")
[default: "mammals"] - Group of species (string) to
identify conservation of TFs within. Your target
species should be a member of this species group (e.g.
"homo_sapiens" and "mammals" or "primates"). The
"primates" group does not have a low-coverage version.
Groups and members are listed at
(https://github.com/thirtysix/TFBS_footprinting/blob/master/README.md#species)
--coverage , -e ("low" or "high") [default: "low"] - Which Ensembl EPO
alignment of species to use. The low coverage contains
significantly more species and is recommended. The
primate group does not have a low-coverage version.
--promoter_before_tss , -pb
(0-100,000) [default: 900] - Number (integer) of
nucleotides upstream of TSS to include in analysis.
--promoter_after_tss , -pa
(0-100,000) [default: 100] - Number (integer) of
nucleotides downstream of TSS to include in analysis.
--locality_threshold , -l
(0-100) [default: 5] - Nucleotide distance (integer)
upstream/downstream within which TF predictions in
other species will be included to support a hit in the
target species.
--conservation_min , -c
(1-20) [default: 2] - Minimum number (integer) of
species a predicted TF is found in, in alignment, to
be considered conserved.
--top_x_tfs , -tx (1-20) [default: 10] - Number (integer) of unique TFs
to include in output .svg figure.
--output_dir , -o [default:
/home/harlan/Dropbox/github/TFBS_footprinting/tfbs_results]
- Full path of directory where result
directories will be output.
3 Process
Iterate through each user-provided Ensembl transcript id:
- Retrieve EPO-aligned orthologous sequences from the Ensembl database for the user-defined species group (mammals, primates, fish, sauropsids) for the promoter of the transcript id.
- Edit retrieved alignment:
- Remove species sequences that are less than 75% of the length of the target_species sequence.
- Replace characters not corresponding to nucleotides (ACGT) with gap characters “-”.
- Remove gap-only columns from alignment.
- Generate position weight matrices (PWMs) from Jaspar position frequency matrices (PFMs).
- Score each species sequence in the alignment using all PWMs.
- Keep predictions with a score greater than the score threshold corresponding to a p-value of 0.001.
- Identify predicted TFBSs in target_species which are conserved in the non-target species of the species_group within the locality_threshold and totaling at least conservation_min.
- For each conserved TFBS, compute ‘combined affinity score’ as a sum of position weight scores of species possessing a prediction.
- Sort target_species predictions by combined affinity score, generate a vector graphics figure showing top_x_tfs unique TFs mapped onto the promoter of the target transcript, and additional output as described below.
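The steps above can be sketched in a few short Python functions. This is an illustrative sketch only, not the tfbs_footprinter source: the function names, the data layout (plain dictionaries), the pseudocount of 0.8, and the uniform background of 0.25 are all assumptions made for the example. It also assumes that sequence length is counted in non-gap characters, that the score threshold for p = 0.001 is supplied by the caller, that each supporting species contributes its best nearby hit, and that conservation_min counts supporting species other than the target.

```python
import math

NUCLEOTIDES = "ACGT"


def clean_alignment(alignment, target, min_fraction=0.75):
    """Clean a {species: aligned_sequence} dict of equal-length sequences."""
    def ungapped_len(seq):
        return len(seq.replace("-", ""))

    # 1. Drop species shorter than 75% of the target (assumption: length is
    #    counted in non-gap characters).
    min_len = min_fraction * ungapped_len(alignment[target])
    kept = {sp: seq.upper() for sp, seq in alignment.items()
            if ungapped_len(seq) >= min_len}
    # 2. Replace every non-ACGT character with a gap.
    kept = {sp: "".join(c if c in NUCLEOTIDES else "-" for c in seq)
            for sp, seq in kept.items()}
    # 3. Remove columns that are gaps in all remaining species.
    columns = [i for i in range(len(kept[target]))
               if any(seq[i] != "-" for seq in kept.values())]
    return {sp: "".join(seq[i] for i in columns) for sp, seq in kept.items()}


def pfm_to_pwm(pfm, pseudocount=0.8, background=0.25):
    """Convert {nucleotide: [counts]} to log2-odds weights (assumed formula)."""
    width = len(pfm["A"])
    pwm = {nt: [] for nt in NUCLEOTIDES}
    for pos in range(width):
        total = sum(pfm[nt][pos] for nt in NUCLEOTIDES)
        for nt in NUCLEOTIDES:
            freq = (pfm[nt][pos] + pseudocount * background) / (total + pseudocount)
            pwm[nt].append(math.log2(freq / background))
    return pwm


def score_sequence(aligned_seq, pwm, threshold):
    """Return (alignment_column, score) for windows scoring above threshold."""
    columns = [i for i, c in enumerate(aligned_seq) if c != "-"]
    bases = [aligned_seq[i] for i in columns]
    width = len(pwm["A"])
    hits = []
    for start in range(len(bases) - width + 1):
        score = sum(pwm[nt][i] for i, nt in enumerate(bases[start:start + width]))
        if score > threshold:
            hits.append((columns[start], score))
    return hits


def conserved_hits(hits_by_species, target, locality_threshold=5, conservation_min=2):
    """Keep target hits supported by nearby hits in enough other species."""
    results = []
    for column, score in hits_by_species[target]:
        support = {}
        for species, hits in hits_by_species.items():
            if species == target:
                continue
            nearby = [s for c, s in hits if abs(c - column) <= locality_threshold]
            if nearby:
                support[species] = max(nearby)  # best nearby hit per species
        if len(support) >= conservation_min:
            combined = score + sum(support.values())  # combined affinity score
            results.append((column, combined, sorted(support)))
    # Highest combined affinity score first.
    return sorted(results, key=lambda r: r[1], reverse=True)
```

In this sketch, positions are reported as alignment columns so that hits in different species can be compared directly against locality_threshold. A typical call sequence would be clean_alignment, then pfm_to_pwm for each matrix, then score_sequence for each species, and finally conserved_hits on the collected hits.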
4 Output
- Original alignment as retrieved from Ensembl (alignment_uncleaned.fasta).
- Cleaned alignment (alignment_cleaned.fasta).
- Regulatory information for the target transcript's user-defined promoter region (regulatory_decoded.json).
- Transcript properties for target transcript (transcript_dict.json).
- All predicted TFBSs for all species which satisfy p-value threshold (TFBSs_found.all.json).
- All predicted TFBSs for target species which are supported by at least conservation_min predictions in other species, and those supporting species, grouped into clusters (TFBSs_found.clusters.csv).
- All predicted TFBSs for target species which are supported by at least conservation_min predictions in other species, sorted by combined affinity score (TFBSs_found.sortedclusters.csv).
- Figure showing top_x_tfs highest scoring (combined affinity score) TFBSs mapped onto target_species promoter (ENSxxxxxxxxxxxx_mammals.Promoterhisto.svg).
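As a quick sanity check after a run, the file names listed above can be looked up in a result directory. The sketch below is a hypothetical helper, not part of tfbs_footprinter; it assumes that all outputs for one transcript are written into a single directory, which the list above does not state explicitly.

```python
from pathlib import Path

# File names taken from the output list above.
EXPECTED_FILES = [
    "alignment_uncleaned.fasta",
    "alignment_cleaned.fasta",
    "regulatory_decoded.json",
    "transcript_dict.json",
    "TFBSs_found.all.json",
    "TFBSs_found.clusters.csv",
    "TFBSs_found.sortedclusters.csv",
]
FIGURE_PATTERN = "*.Promoterhisto.svg"  # figure name is prefixed by the transcript id


def check_outputs(result_dir):
    """Return {name: bool} saying which expected outputs exist in result_dir."""
    result_dir = Path(result_dir)
    status = {name: (result_dir / name).is_file() for name in EXPECTED_FILES}
    status[FIGURE_PATTERN] = any(result_dir.glob(FIGURE_PATTERN))
    return status
```

Calling check_outputs on one transcript's result directory returns a dictionary that can be scanned for False entries to spot a run that stopped early.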
5 Species
The promoter region of any Ensembl transcript of any species within a column can be compared against the other members of the same column to identify conserved binding sites for the 519 transcription factors described in the Jaspar database. The Enredo-Pecan-Ortheus (EPO) pipeline was used to create the whole-genome alignments between the species in each column. ‘EPO_LOW’ indicates that the column also contains genomes whose current sequencing is still considered low-coverage. The TFBS footprinting pipeline partially accounts for this by removing sequences that appear to be missing segments from the alignments. Because the low-coverage versions include significantly more species, we recommend using them, except for primate comparisons, which do not have a low-coverage version.