Documentation for the Code¶
bam¶
-
mirtop.bam.filter.
clean_hits
(reads)¶ Select only best matches from a list of hits from the same read.
- Args:
reads: dictionary as:
>>> {'read_id': mirtop.realign.hits, ...}
Returns:
reads: same as the input, but with the best hits only.
-
mirtop.bam.filter.
tune
(seq, precursor, start, cigar)¶ The function that realigns the sequence to the precursor to find the nucleotide changes at the 5’ and 3’ ends, plus the nucleotide variations within the sequence.
- Args:
seq (str): sequence of the read.
precursor (str): sequence of the precursor.
start (int): start position of sequence on the precursor, +1.
cigar (str): similar to SAM CIGAR attribute.
Returns:
list with:
subs (list): substitutions
add (list): nt added to the end
cigar (str): updated cigar
exporter¶
Read GFF files and output an isomiRs-compatible format
-
mirtop.exporter.isomirs.
convert
(args)¶ Main function to convert from GFF3 to the format of the isomiRs Bioconductor package.
Reads a GFF file and produces an output file containing expression counts.
- Args:
- args(namedtuple): arguments parsed from command line with
- mirtop.libs.parse.add_subparser_counts().
- Returns:
- file (file): with columns like:
- UID miRNA Variant Sample1 Sample2 … Sample N
Read GFF files and output FASTA format
-
mirtop.exporter.fasta.
convert
(args)¶ Main function to convert from GFF3 to FASTA format.
- Args:
- args: supported options for this sub-command.
- See mirtop.libs.parse.add_subparser_export().
-
mirtop.exporter.vcf.
cigar_2_key
(cigar, readseq, refseq, pos, var5p, var3p, parent_ini_pos, parent_end_pos, hairpin)¶ - Args:
cigar(str): standard CIGAR, a compressed representation of the alignment; this CIGAR omits the ‘1’ integer.
readseq(str): the read sequence.
refseq(str): the reference sequence.
pos(str): the current start position.
var5p(int): extra nucleotides not in the reference miRNA (5p strand).
var3p(int): extra nucleotides not in the reference miRNA (3p strand).
parent_ini_pos(int): the start position of the parent miRNA.
parent_end_pos(int): the last position of the parent miRNA.
hairpin(str): the string of the hairpin for all the miRNA.
- Returns:
key_pos(str list): a list with the positions of the variants.
key_var(str list): a list with the variant keys found.
ref(str): reference base(s).
alt(str): altered base(s).
-
mirtop.exporter.vcf.
convert
(args)¶ Main function to convert from GFF3 to VCF.
- Args:
- args: supported options for this sub-command.
- See mirtop.libs.parse.add_subparser_export().
-
mirtop.exporter.vcf.
create_vcf
(mirgff3, precursor, gtf, vcffile)¶ - Args:
mirgff3(str): file in mirGFF3 format that will be converted.
precursor(str): FASTA-format sequences of all miRNA hairpins.
gtf(str): genome coordinates.
vcffile: name of the file to be saved.
- Returns:
- Nothing is returned, instead, a VCF file is generated
gff¶
GFF reader and creator helpers
-
mirtop.gff.body.
create
(reads, database, sample, args, quiet=False)¶
-
mirtop.gff.body.
lift_to_genome
(line, mapper)¶ Get an object of class feature from classgff.py and map the precursor coordinates to the genomic coordinates.
- Args:
line(str): string with a GFF line.
mapper(dict): dict with miRNA-precursor-genomic coordinates from the mirna.mapper.read_gtf_to_mirna function.
- Returns:
- (line): string with the GFF line with updated chr, start, end, strand.
-
mirtop.gff.body.
paste_columns
(line, sep=' ')¶ Create GFF/GTF line from read_gff_line
-
mirtop.gff.body.
read
(fn, args)¶ Read GTF/GFF file and load into annotate, chrom counts, sample, line
-
mirtop.gff.body.
read_gff_line
(line)¶ Read GFF/GTF line and return dictionary with fields
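As a rough illustration of what reading a GFF/GTF line into a dictionary involves, the sketch below splits the nine tab-separated GFF3 columns and then the attribute field. The function name `parse_gff_line_sketch`, the dictionary keys, and the example line are assumptions made for this illustration; they are not necessarily what mirtop.gff.body.read_gff_line returns.

```python
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase"]


def parse_gff_line_sketch(line):
    """Split one GFF3 line into a dict (hypothetical helper, not mirtop's API)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    fields = dict(zip(GFF_COLUMNS, cols[:8]))
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    # Attributes are assumed to be 'key=value' or 'key value' pairs split by ';'.
    attrb = {}
    for item in cols[8].strip().strip(";").split(";"):
        item = item.strip()
        if not item:
            continue
        sep = "=" if "=" in item else " "
        key, _, value = item.partition(sep)
        attrb[key.strip()] = value.strip()
    fields["attrb"] = attrb
    return fields


# Made-up example line, for illustration only.
example = ("hsa-let-7a-1\tmiRBasev21\tisomiR\t4\t25\t0\t+\t.\t"
           "Read=TGAGG; UID=iso-1; Variant=iso_5p:-2;")
parsed = parse_gff_line_sketch(example)
print(parsed["start"], parsed["attrb"]["Variant"])
```

Keeping the attributes in a nested dictionary makes it easy to pass the Variant string on to a separate parser.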
-
mirtop.gff.body.
read_variant
(attrb, sep=' ')¶ Read string in variants attribute.
- Args:
- attrb(str): string in Variant attribute.
- Returns:
- (gff_dict): dictionary with:
>>> {'iso_3p': -3, ...}
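The dictionary shown above can be pictured with a small stand-alone parser. This is a minimal sketch, assuming the Variant attribute is a comma-separated list of `name:value` items; `read_variant_sketch` is a hypothetical name, and storing value-less variants as `True` is an assumption rather than documented behaviour.

```python
def read_variant_sketch(attrb):
    """Parse a Variant string such as 'iso_5p:+1,iso_3p:-3' into a dict."""
    variants = {}
    for item in attrb.strip().split(","):
        name, _, value = item.strip().partition(":")
        if not name:
            continue
        # Variants without a numeric part are stored as True (assumption).
        variants[name] = int(value) if value else True
    return variants


print(read_variant_sketch("iso_5p:+1,iso_3p:-3"))
```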
-
mirtop.gff.body.
variant_with_nt
(line, precursors, matures)¶ Return nucleotides changes for each variant type using Variant attribute, precursor sequences and mature position.
Compare multiple GFF files to a reference
-
mirtop.gff.compare.
compare
(args)¶ From a list of GFF files produce comparison with a reference set.
- Args:
- args(namedtuple): arguments parsed from command line with
- mirtop.libs.parse.add_subparser_compare(). First file will be considered the reference set.
- Returns:
- (out_file): comparison of the GFF files with the reference.
-
mirtop.gff.compare.
read_reference
(fn)¶ Read GFF into UID:Variant
- Args:
- fn (str): GFF file.
- Returns:
- srna (dict): dict with >>> {'UID': 'iso_snp:-2,…'}
Helpers to define the header for the GFF file
-
mirtop.gff.header.
create
(samples, database, custom, filter=None)¶ Create header for GFF file.
- Args:
samples (list): character list with names for samples
database (str): name of the database.
custom (str): extra lines.
filter (list): character list with filter definition.
- Returns:
- header (str): header string.
-
mirtop.gff.header.
read_samples
(fn)¶ Read samples from the header of a GFF file.
- Args:
- fn(str): GFF file to read.
- Returns:
- (list): character list with sample names.
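Reading sample names from a header can be sketched as follows. The sketch assumes the samples are listed on a header line starting with `## COLDATA:` and separated by commas. Unlike read_samples(fn), the hypothetical `read_samples_sketch` takes an open file handle so that it can be tried without a file on disk, and the header text is illustrative only.

```python
import io


def read_samples_sketch(handle):
    """Return sample names from the first '## COLDATA:' header line."""
    for line in handle:
        if not line.startswith("#"):
            break  # header lines are assumed to precede all records
        if line.startswith("## COLDATA:"):
            names = line.split(":", 1)[1]
            return [s.strip() for s in names.split(",") if s.strip()]
    return []


# Illustrative header text, not taken from a real file.
header = io.StringIO("## mirGFF3. VERSION 1.1\n## COLDATA: sample1,sample2\n")
print(read_samples_sketch(header))
```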
-
mirtop.gff.header.
read_version
(fn)¶ Extract mirGFF3 version
-
mirtop.gff.merge.
merge
(dts, samples)¶ For dictionary with sample as keys and values as lines merge them into one GFF file.
- Args:
dts(dict): dictionary as >>> {'file': {'mirna': {start: gff_list}}}. gff_list has the format as defined in mirtop.gff.body.read().
samples(list): character list with sample names.
- Returns:
- merged_lines (nested dicts): gff_list has the format as defined in mirtop.gff.body.read().
Produce stats from GFF3 format
-
mirtop.gff.stats.
stats
(args)¶ From a list of GFF files produce general isomiRs stats.
- Args:
- args (namedtuple): arguments parsed from command line with
- mirtop.libs.parse.add_subparser_stats().
- Returns:
- (stdout) or (out_file): GFF general stats.
Update gff3 files to newest version
-
mirtop.gff.update.
convert
(args)¶ Update previous GFF3 versions.
- Args:
- args (namedtuple): arguments parsed from command line with
- mirtop.libs.parse.add_subparser_update().
- Returns:
- (out_file): most updated GFF3 file.
-
mirtop.gff.update.
update_file
(gff_file, new_gff_file)¶ Update file from file version to current version
-
mirtop.gff.validator.
check_multiple
(args)¶ Check GFF3 format.
- Args:
- args (namedtuple): arguments parsed from command line with
- mirtop.libs.parse.add_subparser_validator().
- Returns:
- (std_out): warnings or errors of the files showing issues with the format.
importer¶
Read isomiR GFF files
-
mirtop.importer.isomirsea.
cigar2variants
(cigar, sequence, tag)¶ From cigar to Variants in GFF format
-
mirtop.importer.isomirsea.
header
(fn)¶ Custom header for isomiR-SEA importer.
- Args:
- fn (str): file name with isomiR-SEA GFF output
- Returns:
- (str): isomiR-SEA header string.
-
mirtop.importer.isomirsea.
read_file
(fn, args)¶ Read isomiR-SEA file and convert to mirtop GFF format.
- Args:
fn(str): file name with isomiR-SEA output information.
database(str): database name.
- args(namedtuple): arguments from command line.
- See mirtop.libs.parse.add_subparser_gff().
- Returns:
- reads (nested dicts): gff_list has the format as
- defined in mirtop.gff.body.read().
Read prost! files
-
mirtop.importer.prost.
header
()¶ Custom header for PROST! importer.
- Returns:
- (str): PROST! header string.
-
mirtop.importer.prost.
read_file
(fn, hairpins, database, mirna_gtf)¶ Read PROST! file and convert to mirtop GFF format.
- Args:
fn(str): file name with PROST output information.
database(str): database name.
- args(namedtuple): arguments from command line.
- See mirtop.libs.parse.add_subparser_gff().
- Returns:
- reads: dictionary where keys are read_id and values are mirtop.realign.hits
Read seqbuster files
-
mirtop.importer.seqbuster.
header
()¶ Custom header for seqbuster importer.
- Returns:
- (str): seqbuster header string.
-
mirtop.importer.seqbuster.
read_file
(fn, args)¶ Read seqbuster file and convert to mirtop GFF format.
- Args:
fn(str): file name with seqbuster output information.
database(str): database name.
- args(namedtuple): arguments from command line.
- See mirtop.libs.parse.add_subparser_gff().
- Returns:
- reads: dictionary where keys are read_id and values are mirtop.realign.hits
Read sRNAbench files
-
mirtop.importer.srnabench.
read_file
(folder, args)¶ Read sRNAbench file and convert to mirtop GFF format.
- Args:
fn(str): file name with sRNAbench output information.
database(str): database name.
- args(namedtuple): arguments from command line.
- See mirtop.libs.parse.add_subparser_gff().
- Returns:
- reads (nested dicts): gff_list has the format as
- defined in mirtop.gff.body.read().
Read isomiR GFF files from optimir tool
-
mirtop.importer.optimir.
read_file
(fn, args)¶ Read OptimiR file and convert to mirtop GFF format.
- Args:
fn(str): file name with OptimiR output information.
database(str): database name.
- args(namedtuple): arguments from command line.
- See mirtop.libs.parse.add_subparser_gff().
- Returns:
- reads (nested dicts): gff_list has the format as
- defined in mirtop.gff.body.read().
Read Manatee files
-
mirtop.importer.manatee.
read_file
(fn, database, args)¶ Read Manatee file and convert to mirtop GFF format.
- Args:
fn(str): file name with Manatee output information.
database(str): database name.
- args(namedtuple): arguments from command line.
- See mirtop.libs.parse.add_subparser_gff().
- Returns:
- reads (nested dicts): gff_list has the format as
- defined in mirtop.gff.body.read().
libs¶
Centralize running of external commands, providing logging and tracking. Integrated from bcbio package with some changes.
-
mirtop.libs.do.
find_bash
()¶ Find bash full path
-
mirtop.libs.do.
find_cmd
(cmd)¶ Find command in session
-
mirtop.libs.do.
run
(cmd, data=None, checks=None, region=None, log_error=True, log_stdout=False)¶ Run the provided command, logging details and checking for errors.
Helpers to work with fastq files
-
mirtop.libs.fastq.
is_fastq
(in_file)¶ - Check whether a file is FASTQ, accepting the
- txt, fq and fastq extensions and understanding gzip compression: .gzip and .gz (copied from bcbio)
- Args:
- in_file(str): file name.
- Returns:
- (boolean): True or False.
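The extension check described above can be sketched with the standard library alone. This is a minimal illustration that only inspects the file name; `is_fastq_sketch` and the two extension sets are assumptions based on the description, not the bcbio or mirtop source.

```python
import os

FASTQ_EXTS = {".txt", ".fq", ".fastq"}
GZIP_EXTS = {".gz", ".gzip"}


def is_fastq_sketch(in_file):
    """Guess from the file name whether in_file is a (possibly gzipped) FASTQ."""
    base, ext = os.path.splitext(in_file)
    if ext in GZIP_EXTS:
        # Drop the compression suffix and inspect the extension underneath.
        base, ext = os.path.splitext(base)
    return ext in FASTQ_EXTS


print(is_fastq_sketch("reads.fastq.gz"), is_fastq_sketch("reads.bam"))
```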
-
mirtop.libs.fastq.
open_fastq
(in_file)¶ - Open a FASTQ file, using gzip if it is gzipped
- (from bcbio package)
- Args:
- in_file(str): file name.
- Returns:
- (File): file handler.
-
mirtop.libs.fastq.
splitext_plus
(fn)¶ - Split on file extensions, allowing for zipped extensions.
- (copy from bcbio)
- Args:
- fn(str): file name.
- Returns:
- base, ext(str, str): basename and extension.
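Splitting on extensions while keeping a compression suffix attached can be sketched like this. The set of compressed extensions handled by the hypothetical `splitext_plus_sketch` is an assumption for the example.

```python
import os


def splitext_plus_sketch(fn):
    """Split fn into (base, ext), treating e.g. '.fastq.gz' as one extension."""
    base, ext = os.path.splitext(fn)
    if ext in (".gz", ".bz2", ".zip"):
        # Peel off one more extension and join it with the compression suffix.
        base, inner = os.path.splitext(base)
        ext = inner + ext
    return base, ext


print(splitext_plus_sketch("reads.fastq.gz"))
print(splitext_plus_sketch("annotation.gff"))
```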
-
mirtop.libs.parse.
parse_cl
(in_args)¶ Function to parse the subcommands arguments.
utils from http://www.github.com/chapmanb/bcbio-nextgen.git
-
mirtop.libs.utils.
chdir
(*args, **kwds)¶ Change directory temporarily using with:
>>> with chdir(temporal): do_something()
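A context manager with this behaviour can be sketched with contextlib. The name `chdir_sketch` is hypothetical; the point is that the original working directory is restored even if the body raises.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def chdir_sketch(new_dir):
    """Temporarily switch the working directory, restoring it on exit."""
    old_dir = os.getcwd()
    os.chdir(new_dir)
    try:
        yield
    finally:
        os.chdir(old_dir)


before = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with chdir_sketch(tmp):
        inside = os.getcwd()
after = os.getcwd()
print(before == after)
```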
-
mirtop.libs.utils.
file_exists
(fname)¶ Check if a file exists and is non-empty.
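The "exists and is non-empty" test can be sketched as below. `file_exists_sketch` is a hypothetical stand-in and the temporary file is only there to exercise both outcomes.

```python
import os
import tempfile


def file_exists_sketch(fname):
    """True only if fname points to an existing, non-empty file."""
    try:
        return bool(fname) and os.path.exists(fname) and os.path.getsize(fname) > 0
    except OSError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.gff")
    open(path, "w").close()  # create an empty file
    empty_result = file_exists_sketch(path)
    with open(path, "w") as handle:
        handle.write("## header\n")
    filled_result = file_exists_sketch(path)
print(empty_result, filled_result)
```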
-
mirtop.libs.utils.
safe_dirs
(dirs)¶ Create folder if it does not exist
-
mirtop.libs.utils.
safe_remove
(fn)¶ Remove file skipping
mirna¶
Read bam files
-
mirtop.mirna.annotate.
annotate
(reads, mature_ref, precursors, quiet=False)¶ Using coordinates, mismatches and realign to annotate isomiRs
- Args:
- reads(dicts of hits):
- dict object that comes from mirtop.bam.bam.read_bam()
- mature_ref (dict of mirna positions):
- dict object that comes from mirtop.mirna.read_mature()
- precursors dict object (key : fasta):
- that comes from mirtop.mirna.fasta.read_precursor()
- quiet(boolean):
- verbosity state
- Return:
- reads (dict):
- dictionary where keys are read_id and values are mirtop.realign.hits
Read precursor fasta file
-
mirtop.mirna.fasta.
read_precursor
(precursor, sps=None)¶ Load precursor file for that species
- Args:
precursor(str): file name with fasta sequences
- sps(str): if any, select species to keep.
- It’ll do a header_sequence.find(sps).
- Returns:
- hairpin(dict): keys are precursor names and
- values are precursor sequences.
Read database information
-
mirtop.mirna.mapper.
get_primary_transcript
(database)¶ - Get the ID to identify the primary transcript in the
- GTF file with the miRNA and precursor coordinates to be able to parse BAM files with genomic coordinates.
-
mirtop.mirna.mapper.
guess_database
(args)¶ Guess database name from GFF file.
- Args:
- gtf(str): file name with GFF miRNA genomic positions and
- header lines.
- Returns:
- database(str): name of the database
TODO: this needs to be generic to other databases.
-
mirtop.mirna.mapper.
read_gtf_chr2mirna
(gtf)¶ Load GTF file with precursor positions on genome.
- Args:
- gtf(str): file name with GFF miRNA genomic positions and
- header lines.
- Returns:
- db_mir(dict): dictionary with keys being chr and values
- mirna and genomic positions.
-
mirtop.mirna.mapper.
read_gtf_to_mirna
(gtf)¶ Load GTF file with precursor positions on genome.
- Args:
- gtf(str): file name with GFF miRNA genomic positions and
- header lines.
- Returns:
- db_mir(dict): dictionary with keys being mirnas and values
- genomic positions.
-
mirtop.mirna.mapper.
read_gtf_to_precursor
(gtf)¶ Load GTF file with precursor positions on genome. Return dict with key being precursor name and value a dict of mature miRNA with relative position to precursor.
- Args:
- gtf(str): file name with GFF miRNA genomic positions and
- header lines.
- Returns:
map_dict(dict):
>>> {'parent': {mirna: [start, end]}}
-
mirtop.mirna.mapper.
read_gtf_to_precursor_mirbase
(gtf, format='precursor')¶ Load GTF file with precursor positions on genome. Return dict with key being precursor name and value a dict of mature miRNA with relative position to precursor. For miRBase and similar GFF3 files.
- Args:
- gtf(str): file name with GFF miRNA genomic positions and
- header lines.
- Returns:
map_dict(dict):
>>> {'parent': {mirna: [start, end]}}
-
mirtop.mirna.mapper.
read_gtf_to_precursor_mirgenedb
(gtf, format='precursor')¶ Load GTF file with precursor positions on genome. Return dict with key being precursor name and value a dict of mature miRNA with relative position to precursor. For MirGeneDB and similar GFF3 files.
- Args:
- gtf(str): file name with GFF miRNA genomic positions and
- header lines.
- Returns:
map_dict(dict):
>>> {'parent': {mirna: [start, end]}}
-
mirtop.mirna.realign.
align
(x, y, local=False)¶ Pairwise alignment between two sequences. https://medium.com/towards-data-science/pairwise-sequence-alignment-using-biopython-d1a9d0ba861f
- Args:
x(str): short sequence.
y(str): long sequence.
local(boolean): local or global alignment.
- Returns:
- aligned_x(hit): alignment information, score and positions.
-
mirtop.mirna.realign.
align_from_variants
(sequence, mature, variants)¶ - Given the read sequence,
- the mature sequence from get_mature_sequence, and the variant GFF annotation, get a list of substitutions.
- Args:
sequence(str): read sequence.
- mature(str): mature sequence from
- mirtop.mirna.realign.get_mature_sequence().
variants(str): string from Variant attribute in GFF file.
- Returns:
- snp(list): [[pos, target, reference]]
-
mirtop.mirna.realign.
cigar2snp
(cigar, reference)¶ From a CIGAR string and a reference sequence, detect the positions of mismatches and their reference and target nucleotides.
- Args:
cigar(str): CIGAR string.
reference(str): reference sequence.
- Returns:
snp(list): position of mismatches (indels included) as:
>>> [pos, seq_nt, ref_nt]
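For substitutions only, the idea can be sketched as a walk along the CIGAR. This is a simplified illustration under three stated assumptions: a nucleotide letter in the CIGAR marks the read base at a mismatch, positions are 0-based, and indel operations are out of scope. `cigar2snp_sketch` is a hypothetical name and the reference sequence is made up.

```python
import re


def cigar2snp_sketch(cigar, reference):
    """List substitutions encoded in a CIGAR like '10MA3M' (no indel handling)."""
    snps = []
    pos = 0
    # Each token is an optional count followed by one operation character.
    for count, op in re.findall(r"(\d*)([A-Za-z])", cigar):
        n = int(count) if count else 1
        if op == "M":
            pos += n  # matching stretch, nothing to report
        elif op in "ACGTN":
            for _ in range(n):
                snps.append([pos, op, reference[pos]])
                pos += 1
        else:
            raise ValueError("operation %r is outside this sketch" % op)
    return snps


print(cigar2snp_sketch("10MA3M", "TGAGGTAGTAGGTT"))
```

With this made-up reference the single mismatch is reported at 0-based position 10, where the read carries A and the reference carries G.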
-
mirtop.mirna.realign.
cigar_correction
(cigarLine, query, target)¶ Read from CIGAR in BAM file to define mismatches.
- Args:
cigarLine(str): CIGAR string from BAM file.
query(str): read sequence.
target(str): target sequence.
- Returns:
- (list): [query_nts, target_nts]
-
mirtop.mirna.realign.
expand_cigar
(cigar)¶ From the short CIGAR version to the long CIGAR version, where each character corresponds to one nucleotide in the sequence.
- Args:
cigar(str): CIGAR string.
>>> 10MA3M
- Returns:
cigar_long(str): CIGAR long.
>>> MMMMMMMMMMAMMM
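The expansion in the example above can be reproduced with a short regular expression. This is a minimal sketch, assuming an operation without a leading count stands for a single position; `expand_cigar_sketch` is a hypothetical name.

```python
import re


def expand_cigar_sketch(cigar):
    """Expand '10MA3M' so that every position gets its own character."""
    return "".join(
        op * (int(count) if count else 1)
        for count, op in re.findall(r"(\d*)([A-Za-z])", cigar)
    )


print(expand_cigar_sketch("10MA3M"))  # MMMMMMMMMMAMMM
```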
-
mirtop.mirna.realign.
get_mature_sequence
(precursor, mature, exact=False, nt=5)¶ - From precursor and mature positions
- get mature sequence with +/- 4 flanking nts.
- Args:
precursor(str): long sequence.
mature(list): [start, end].
exact(boolean): not add 4+/- flanking nts.
nt(int): number of nts to get.
- Returns:
- (str): mature sequence.
-
class
mirtop.mirna.realign.
hits
¶ Class with alignment information.
-
mirtop.mirna.realign.
is_sequence
(seq)¶ This function checks whether the sequence is valid or not.
- Args:
- seq(str): string acting as a sequence.
- Returns:
- boolean: whether it is a valid nucleotide sequence or not.
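A validity check of this kind can be sketched as a membership test against an allowed alphabet. The alphabet `ACGTN` and the rejection of empty strings are assumptions for this example; `is_sequence_sketch` is a hypothetical name.

```python
def is_sequence_sketch(seq, alphabet="ACGTN"):
    """True if seq is non-empty and uses only characters from alphabet."""
    allowed = set(alphabet)
    return len(seq) > 0 and all(nt in allowed for nt in seq)


print(is_sequence_sketch("TGAGGTAGT"), is_sequence_sketch("TGAGGUAGU"))
```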
-
class
mirtop.mirna.realign.
isomir
¶ Class to represent isomiRs information.
-
format
(sep='\t')¶ Create tabular line from variant fields.
-
formatGFF
()¶ Create Variant attribute.
-
format_id
(sep='\t')¶ Create simple identifier from variant fields.
-
get_score
(sc)¶ Get score from variant fields.
-
is_iso
()¶ Define whether element is isomiR or not.
-
set_pos
(start, l, strand='+')¶ Set end position
-
-
mirtop.mirna.realign.
make_cigar
(seq, mature)¶ Function that creates a CIGAR string from the alignment between the read and the reference sequence.
- Args:
seq(str): read sequence.
mature(str): short sequence.
- Return:
- short(str): CIGAR string.
-
mirtop.mirna.realign.
make_id
(seq)¶ Create a unique identifier for the sequence from the nucleotides, replacing 5 nts for a unique sequence.
It uses the code from mirtop.mirna.keys().
Inspired by MINTplate: https://cm.jefferson.edu/MINTbase https://github.com/TJU-CMC-Org/MINTmap/tree/master/MINTplates
- Args:
- seq(str): nucleotides sequences.
- Returns:
- idName(str): unique identifier for the sequence.
-
mirtop.mirna.realign.
read_id
(idu)¶ Read a unique identifier for the sequence and convert it to the nucleotides, replacing a unique code for 5 nts.
It uses the code from mirtop.mirna.keys().
Inspired by MINTplate: https://cm.jefferson.edu/MINTbase https://github.com/TJU-CMC-Org/MINTmap/tree/master/MINTplates
- Args:
- idu(str): unique identifier for the sequence.
- Returns:
- seq(str): nucleotides sequences.
-
mirtop.mirna.realign.
reverse_complement
(seq)¶ Get the reverse complement of a sequence
- Args:
seq(str): sequence.
>>> GCAT
- Returns:
(str): reverse complement sequence:
>>> ATGC
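The GCAT to ATGC example can be reproduced with a translation table. This is a minimal sketch; `reverse_complement_sketch` is a hypothetical name, and handling of N and lowercase letters is an assumption.

```python
# Map each nucleotide to its complement; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_sketch(seq):
    """Complement every nucleotide, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement_sketch("GCAT"))  # ATGC
```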
-
mirtop.mirna.realign.
variant_to_3p
(hairpin, pos, variant)¶ - From a sequence and a start position get the nts
- +/- indicated by iso_3p. Pos option is 0-base-index
- Args:
- hairpin(str): long sequence:
>>> AAATTTT
position(int): >>> 3
- variant(int): number of nts involved in the variant:
>>> -1
- Returns:
- (str): nucleotide involved in the variant:
>>> A
-
mirtop.mirna.realign.
variant_to_5p
(hairpin, pos, variant)¶ - From a sequence and a start position get the nts
- +/- indicated by iso_5p. Pos option is 0-base-index
- Args:
- hairpin(str): long sequence:
>>> AAATTTT
position(int): >>> 3
- variant(int): number of nts involved in the variant:
>>> -1
- Returns:
- (str): nucleotide involved in the variant:
>>> T
-
mirtop.mirna.realign.
variant_to_add
(read, variant)¶ - From a sequence and a start position get the nts
- +/- indicated by iso_3p. Pos option is 0-base-index
- Args:
- hairpin(str): long sequence:
>>> AAATTTT
position(int): >>> 3
- variant(int): number of nts involved in the variant:
>>> 2
- Returns:
- (str): nucleotide involved in the variant:
>>> TT
-
mirtop.mirna.snps.
create_vcf
(isomirs, matures, gtf, vcf_file=None)¶ Create vcf file of changes for all samples. PASS will be ones with > 3 isomiRs supporting the position and > 30% of reads, otherwise LOW
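The PASS/LOW rule stated above can be written as a small predicate. This is a sketch of the rule as described, not the library code; `vcf_filter_sketch` and its parameter names are hypothetical, and the read support is assumed to be given as a fraction between 0 and 1.

```python
def vcf_filter_sketch(n_isomirs, read_fraction):
    """Return 'PASS' when > 3 isomiRs and > 30% of reads support the position."""
    if n_isomirs > 3 and read_fraction > 0.30:
        return "PASS"
    return "LOW"


print(vcf_filter_sketch(4, 0.45), vcf_filter_sketch(3, 0.45))
```

Both thresholds are strict, so a position supported by exactly 3 isomiRs or exactly 30% of reads is labelled LOW.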
-
mirtop.mirna.snps.
liftover
(pass_pos, matures)¶ Make position at precursor scale
-
mirtop.mirna.snps.
liftover_to_genome
(pass_pos, gtf)¶ Liftover from precursor to genome
-
mirtop.mirna.snps.
print_vcf
(data)¶ Print vcf line following rules.