Documentation for the Code

bam

mirtop.bam.filter.clean_hits(reads)

Select only best matches from a list of hits from the same read.

Args:

reads: dictionary as:

>>> {'read_id': mirtop.realign.hits, ...}

Returns:

reads: same as the input, but with the best hits only.
mirtop.bam.filter.tune(seq, precursor, start, cigar)

The function that realigns the sequence to find the nt changes at the 5’ and 3’ ends and the nt variations within the sequence.

Args:

seq (str): sequence of the read.

precursor (str): sequence of the precursor.

start (int): start position of sequence on the precursor, +1.

cigar (str): similar to SAM CIGAR attribute.

Returns:

list with:

subs (list): substitutions

add (list): nt added to the end

cigar (str): updated cigar

exporter

Read GFF files and output isomiRs compatible format

mirtop.exporter.isomirs.convert(args)

Main function to convert from GFF3 to isomiRs Bioc Package.

Reads a GFF file and produces an output file containing expression counts.

Args:
args(namedtuple): arguments parsed from command line with
mirtop.libs.parse.add_subparser_counts().
Returns:
file (file): with columns like:
UID miRNA Variant Sample1 Sample2 … Sample N

Read GFF files and output FASTA format

mirtop.exporter.fasta.convert(args)

Main function to convert from GFF3 to FASTA format.

Args:
args: supported options for this sub-command.
See mirtop.libs.parse.add_subparser_export().
mirtop.exporter.vcf.cigar_2_key(cigar, readseq, refseq, pos, var5p, var3p, parent_ini_pos, parent_end_pos, hairpin)
Args:
cigar(str): standard CIGAR, a compressed representation of an alignment; this CIGAR omits the ‘1’ integer.

readseq(str): the read sequence.

refseq(str): the reference sequence.

pos(str): the current start position.

var5p(int): extra nucleotides not in the reference miRNA (5p strand).

var3p(int): extra nucleotides not in the reference miRNA (3p strand).

parent_ini_pos(int): the start position of the parent miRNA.

parent_end_pos(int): the last position of the parent miRNA.

hairpin(str): the string of the hairpin for all the miRNA.
Returns:
key_pos(str list): a list with the positions of the variants.

key_var(str list): a list with the variant keys found.

ref(str): reference base(s).

alt(str): altered base(s).
mirtop.exporter.vcf.convert(args)

Main function to convert from GFF3 to VCF.

Args:
args: supported options for this sub-command.
See mirtop.libs.parse.add_subparser_export().
mirtop.exporter.vcf.create_vcf(mirgff3, precursor, gtf, vcffile)
Args:
mirgff3(str): file with mirGFF3 format that will be converted.

precursor(str): FASTA format sequences of all miRNA hairpins.

gtf(str): genome coordinates.

vcffile: name of the file to be saved.
Returns:
Nothing is returned; instead, a VCF file is generated.

gff

GFF reader and creator helpers

mirtop.gff.body.create(reads, database, sample, args, quiet=False)

Read https://github.com/miRTop/mirtop/issues/9

mirtop.gff.body.lift_to_genome(line, mapper)
Function to get a class of type feature from classgff.py
and map the precursor coordinates to the genomic coordinates.
Args:

line(str): string GFF line.

mapper(dict): dict with miRNA-precursor-genomic coordinates from the
mirna.mapper.read_gtf_to_mirna function.
Returns:
(line): string with GFF line with updated chr, start, end, strand.
mirtop.gff.body.paste_columns(line, sep=' ')

Create GFF/GTF line from read_gff_line

mirtop.gff.body.read(fn, args)

Read GTF/GFF file and load into annotate, chrom counts, sample, line

mirtop.gff.body.read_gff_line(line)

Read GFF/GTF line and return dictionary with fields
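
To make the idea of turning a GFF line into a dictionary concrete, here is a minimal sketch. It is not mirtop's implementation: the function name parse_gff_line, the key names (taken from the nine GFF3 columns) and the attribute separator default of "=" are assumptions, and the example line is made up for illustration.

```python
GFF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes")


def parse_gff_line(line, sep="="):
    """Split one GFF line into a dict of columns, with attributes as a sub-dict.

    Hypothetical helper, not part of mirtop; key names follow the GFF3 columns.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GFF_COLUMNS):
        raise ValueError("expected 9 tab-separated columns, got %d" % len(values))
    fields = dict(zip(GFF_COLUMNS, values))
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    attributes = {}
    for item in fields["attributes"].split(";"):
        item = item.strip()
        if not item:
            continue  # a trailing ';' leaves an empty fragment
        key, _, value = item.partition(sep)
        attributes[key.strip()] = value.strip()
    fields["attributes"] = attributes
    return fields


# Made-up example line, for illustration only.
example = "\t".join(["hsa-let-7a-1", "miRBasev21", "isomiR", "4", "25", "0", "+", ".",
                     "UID=id1; Name=hsa-let-7a-5p; Variant=iso_5p:+1,iso_3p:-1;"])
parsed = parse_gff_line(example)
print(parsed["start"], parsed["attributes"]["Variant"])
```

Passing sep=" " would handle attributes written as "key value" pairs instead of "key=value".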

mirtop.gff.body.read_variant(attrb, sep=' ')

Read string in variants attribute.

Args:
attrb(str): string in Variant attribute.
Returns:
(gff_dict): dictionary with:
>>> {'iso_3p': -3, ...}
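
As a rough illustration of the parsing described above, the sketch below turns a Variant attribute string into a dictionary. It is not mirtop's implementation: the function name parse_variant_attribute is hypothetical, and it assumes that variants are comma-separated, that sized variants are written as name:value, and that variants without a value (for example iso_snp) can be stored as True.

```python
def parse_variant_attribute(attrb):
    """Parse a Variant attribute such as 'iso_5p:+1,iso_3p:-3' into a dict.

    Hypothetical helper, not part of mirtop.
    """
    variants = {}
    for item in attrb.strip().split(","):
        item = item.strip()
        if not item:
            continue  # skip empty fragments such as a trailing comma
        if ":" in item:
            name, value = item.split(":", 1)
            variants[name] = int(value)  # int() accepts '+1' and '-3'
        else:
            variants[item] = True  # assumption: flag-like variant without a size
    return variants


print(parse_variant_attribute("iso_5p:+1,iso_3p:-3,iso_snp"))
```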
mirtop.gff.body.variant_with_nt(line, precursors, matures)

Return nucleotides changes for each variant type using Variant attribute, precursor sequences and mature position.

Compare multiple GFF files to a reference

mirtop.gff.compare.compare(args)

From a list of GFF files produce comparison with a reference set.

Args:
args(namedtuple): arguments parsed from command line with
mirtop.libs.parse.add_subparser_compare(). First file will be considered the reference set.
Returns:
(out_file): comparison of the GFF files with the reference.
mirtop.gff.compare.read_reference(fn)

Read GFF into UID:Variant

Args:
fn (str): GFF file.
Returns:
srna (dict): dict with >>> {‘UID’: ‘iso_snp:-2,…’}

Helpers to define the header of the GFF file

mirtop.gff.header.create(samples, database, custom, filter=None)

Create header for GFF file.

Args:

samples (list): character list with names for samples

database (str): name of the database.

custom (str): extra lines.

filter (list): character list with filter definition.

Returns:
header (str): header string.
mirtop.gff.header.read_samples(fn)

Read samples from the header of a GFF file.

Args:
fn(str): GFF file to read.
Returns:
(list): character list with sample names.
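
Reading sample names from a header can be sketched as follows. This is only an illustration, not mirtop's code: it assumes the sample names sit in a header line of the form "## COLDATA: sample1,sample2", the function name samples_from_header is hypothetical, it takes an open handle rather than a file name, and the header text in the example is made up.

```python
import io


def samples_from_header(handle):
    """Return sample names from the first '## COLDATA:' header line, if any.

    Hypothetical helper, not part of mirtop; it takes an open handle, not a file name.
    """
    for line in handle:
        if not line.startswith("#"):
            break  # header lines come first; stop at the first feature line
        if line.startswith("## COLDATA:"):
            names = line.split(":", 1)[1]
            return [name.strip() for name in names.split(",") if name.strip()]
    return []


# Made-up header text, for illustration only.
gff_text = "## example header line\n## COLDATA: sample1,sample2\nchr1\t.\tisomiR\n"
print(samples_from_header(io.StringIO(gff_text)))
```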
mirtop.gff.header.read_version(fn)

Extract mirGFF3 version

mirtop.gff.merge.merge(dts, samples)

Given a dictionary with samples as keys and lines as values, merge them into one GFF file.

Args:

dts(dict): dictionary as >>> {‘file’: {‘mirna’: {start: gff_list}}}. gff_list has the format as defined in mirtop.gff.body.read().

samples(list): character list with sample names.

Returns:
merged_lines (nested dicts):gff_list has the format as defined in mirtop.gff.body.read().

Produce stats from GFF3 format

mirtop.gff.stats.stats(args)

From a list of GFF files produce general isomiRs stats.

Args:
args (namedtuple): arguments parsed from command line with
mirtop.libs.parse.add_subparser_stats().
Returns:
(stdout) or (out_file): GFF general stats.

Update gff3 files to newest version

mirtop.gff.update.convert(args)

Update previous GFF3 versions.

Args:
args (namedtuple): arguments parsed from command line with
mirtop.libs.parse.add_subparser_update().
Returns:
(out_file): most updated GFF3 file.
mirtop.gff.update.update_file(gff_file, new_gff_file)

Update file from file version to current version

mirtop.gff.validator.check_multiple(args)

Check GFF3 format.

Args:
args (namedtuple): arguments parsed from command line with
mirtop.libs.parse.add_subparser_validator().
Returns:
(std_out): warnings or errors of the files showing issues with the format.

importer

Read isomiR GFF files

mirtop.importer.isomirsea.cigar2variants(cigar, sequence, tag)

From cigar to Variants in GFF format

mirtop.importer.isomirsea.header(fn)

Custom header for isomiR-SEA importer.

Args:
fn (str): file name with isomiR-SEA GFF output
Returns:
(str): isomiR-SEA header string.
mirtop.importer.isomirsea.read_file(fn, args)

Read isomiR-SEA file and convert to mirtop GFF format.

Args:

fn(str): file name with isomiR-SEA output information.

database(str): database name.

args(namedtuple): arguments from command line.
See mirtop.libs.parse.add_subparser_gff().
Returns:
reads (nested dicts):gff_list has the format as
defined in mirtop.gff.body.read().

Read prost! files

mirtop.importer.prost.header()

Custom header for PROST! importer.

Returns:
(str): PROST! header string.
mirtop.importer.prost.read_file(fn, hairpins, database, mirna_gtf)

Read PROST! file and convert to mirtop GFF format.

Args:

fn(str): file name with PROST output information.

database(str): database name.

args(namedtuple): arguments from command line.
See mirtop.libs.parse.add_subparser_gff().
Returns:
reads: dictionary where keys are read_id and values are mirtop.realign.hits

Read seqbuster files

mirtop.importer.seqbuster.header()

Custom header for seqbuster importer.

Returns:
(str): seqbuster header string.
mirtop.importer.seqbuster.read_file(fn, args)

Read seqbuster file and convert to mirtop GFF format.

Args:

fn(str): file name with seqbuster output information.

database(str): database name.

args(namedtuple): arguments from command line.
See mirtop.libs.parse.add_subparser_gff().
Returns:
reads: dictionary where keys are read_id and values are mirtop.realign.hits

Read sRNAbench files

mirtop.importer.srnabench.read_file(folder, args)

Read sRNAbench file and convert to mirtop GFF format.

Args:

fn(str): file name with sRNAbench output information.

database(str): database name.

args(namedtuple): arguments from command line.
See mirtop.libs.parse.add_subparser_gff().
Returns:
reads (nested dicts):gff_list has the format as
defined in mirtop.gff.body.read().

Read isomiR GFF files from optimir tool

mirtop.importer.optimir.read_file(fn, args)

Read OptimiR file and convert to mirtop GFF format.

Args:

fn(str): file name with OptimiR output information.

database(str): database name.

args(namedtuple): arguments from command line.
See mirtop.libs.parse.add_subparser_gff().
Returns:
reads (nested dicts):gff_list has the format as
defined in mirtop.gff.body.read().

Read Manatee files

mirtop.importer.manatee.read_file(fn, database, args)

Read Manatee file and convert to mirtop GFF format.

Args:

fn(str): file name with Manatee output information.

database(str): database name.

args(namedtuple): arguments from command line.
See mirtop.libs.parse.add_subparser_gff().
Returns:
reads (nested dicts):gff_list has the format as
defined in mirtop.gff.body.read().

libs

Centralize running of external commands, providing logging and tracking. Integrated from bcbio package with some changes.

mirtop.libs.do.find_bash()

Find bash full path

mirtop.libs.do.find_cmd(cmd)

Find command in session

mirtop.libs.do.run(cmd, data=None, checks=None, region=None, log_error=True, log_stdout=False)

Run the provided command, logging details and checking for errors.

Helpers to work with fastq files

mirtop.libs.fastq.is_fastq(in_file)
Check whether a file is fastq. Accepts the txt, fq and fastq extensions
and understands gzip compression: .gzip and .gz (copied from bcbio)
Args:
in_file(str): file name.
Returns:
(boolean): whether the file is fastq or not.
mirtop.libs.fastq.open_fastq(in_file)
open a fastq file, using gzip if it is gzipped
(from bcbio package)
Args:
in_file(str): file name.
Returns:
(File): file handler.
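
A minimal sketch of opening a file through gzip only when needed is shown below. It is an illustration under stated assumptions, not the bcbio or mirtop code: the name open_maybe_gzipped is hypothetical, and the decision is made from the file extension alone.

```python
import gzip
import os
import tempfile


def open_maybe_gzipped(in_file):
    """Open a text file for reading, through gzip when the name ends in .gz/.gzip.

    Hypothetical helper, not part of mirtop; it decides by extension only.
    """
    if in_file.endswith((".gz", ".gzip")):
        return gzip.open(in_file, "rt")  # "rt" yields str lines, like open()
    return open(in_file, "r")


# Write a tiny gzipped fastq record to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.fastq.gz")
    with gzip.open(path, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    with open_maybe_gzipped(path) as handle:
        first_line = handle.readline().strip()
print(first_line)
```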
mirtop.libs.fastq.splitext_plus(fn)
Split on file extensions, allowing for zipped extensions.
(copy from bcbio)
Args:
fn(str): file name.
Returns:
base, ext(str, str): basename and extension.
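
Splitting on extensions while keeping a zipped suffix attached can be sketched as below. This is an illustrative version, not the bcbio or mirtop code: the name split_zipped_ext and the list of zipped extensions are assumptions.

```python
import os


def split_zipped_ext(fn, zipped=(".gz", ".bz2", ".zip")):
    """Split a file name into (base, ext), keeping e.g. '.fastq.gz' together.

    Hypothetical helper, not part of mirtop; the `zipped` tuple is an assumption.
    """
    base, ext = os.path.splitext(fn)
    if ext in zipped:
        # Peel off one more extension and glue it in front of the zipped one.
        base, inner = os.path.splitext(base)
        ext = inner + ext
    return base, ext


print(split_zipped_ext("sample.fastq.gz"))  # ('sample', '.fastq.gz')
print(split_zipped_ext("sample.fastq"))     # ('sample', '.fastq')
```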
mirtop.libs.parse.parse_cl(in_args)

Function to parse the subcommands arguments.

utils from http://www.github.com/chapmanb/bcbio-nextgen.git

mirtop.libs.utils.chdir(*args, **kwds)

Change directory temporarily using with:

>>> with chdir(temporal):
        do_something()
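
For readers unfamiliar with this pattern, a context manager of this kind can be written with contextlib. The sketch below is a generic illustration, not the bcbio or mirtop code; the name temporary_chdir is hypothetical.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_chdir(path):
    """Change into `path` for the duration of a with-block, then change back.

    Hypothetical helper, not part of mirtop.
    """
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)  # runs even if the with-block raises


start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with temporary_chdir(tmp):
        inside_ok = os.path.samefile(os.getcwd(), tmp)
back_ok = os.getcwd() == start
print(inside_ok, back_ok)
```

The try/finally is the important design choice: the original working directory is restored even when the body of the with-block fails.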
mirtop.libs.utils.file_exists(fname)

Check if a file exists and is non-empty.

mirtop.libs.utils.safe_dirs(dirs)

Create folder if it does not exist

mirtop.libs.utils.safe_remove(fn)

Remove file skipping

mirna

Read bam files

mirtop.mirna.annotate.annotate(reads, mature_ref, precursors, quiet=False)

Using coordinates, mismatches and realign to annotate isomiRs

Args:
reads(dicts of hits):
dict object that comes from mirtop.bam.bam.read_bam()
mirbase_ref (dict of mirna positions):
dict object that comes from mirtop.mirna.read_mature()
precursors dict object (key : fasta):
that comes from mirtop.mirna.fasta.read_precursor()
quiet(boolean):
verbosity state
Return:
reads (dict):
dictionary where keys are read_id and values are mirtop.realign.hits

Read precursor fasta file

mirtop.mirna.fasta.read_precursor(precursor, sps=None)

Load precursor file for that species

Args:

precursor(str): file name with fasta sequences

sps(str): if any, select species to keep.
It’ll do a header_sequence.find(sps).
Returns:
hairpin(dict): keys are precursor names and
values are precursor sequences.

Read database information

mirtop.mirna.mapper.get_primary_transcript(database)
Get the ID to identify the primary transcript in the
GTF file with the miRNA and precursor coordinates to be able to parse BAM files with genomic coordinates.
mirtop.mirna.mapper.guess_database(args)

Guess database name from GFF file.

Args:
gtf(str): file name with GFF miRNA genomic positions and
header lines.
Returns:
database(str): name of the database

TODO: this needs to be generic to other databases.

mirtop.mirna.mapper.read_gtf_chr2mirna(gtf)

Load GTF file with precursor positions on genome.

Args:
gtf(str): file name with GFF miRNA genomic positions and
header lines.
Returns:
db_mir(dict): dictionary with keys being chr and values
mirna and genomic positions.
mirtop.mirna.mapper.read_gtf_to_mirna(gtf)

Load GTF file with precursor positions on genome.

Args:
gtf(str): file name with GFF miRNA genomic positions and
header lines.
Returns:
db_mir(dict): dictionary with keys being mirnas and values
genomic positions.
mirtop.mirna.mapper.read_gtf_to_precursor(gtf)

Load GTF file with precursor positions on genome. Return dict with key being precursor name and value a dict of mature miRNA with relative position to precursor.

Args:
gtf(str): file name with GFF miRNA genomic positions and
header lines.
Returns:

map_dict(dict):

>>> {'parent': {mirna: [start, end]}}
mirtop.mirna.mapper.read_gtf_to_precursor_mirbase(gtf, format='precursor')

Load GTF file with precursor positions on genome. Return dict with key being precursor name and value a dict of mature miRNA with relative position to precursor. For miRBase and similar GFF3 files.

Args:
gtf(str): file name with GFF miRNA genomic positions and
header lines.
Returns:

map_dict(dict):

>>> {'parent': {mirna: [start, end]}}
mirtop.mirna.mapper.read_gtf_to_precursor_mirgenedb(gtf, format='precursor')

Load GTF file with precursor positions on genome. Return dict with key being precursor name and value a dict of mature miRNA with relative position to precursor. For MirGeneDB and similar GFF3 files.

Args:
gtf(str): file name with GFF miRNA genomic positions and
header lines.
Returns:

map_dict(dict):

>>> {'parent': {mirna: [start, end]}}
mirtop.mirna.realign.align(x, y, local=False)

Pairwise alignment between two sequences. https://medium.com/towards-data-science/pairwise-sequence-alignment-using-biopython-d1a9d0ba861f

Args:

x(str): short sequence.

y(str): long sequence.

local(boolean): local or global alignment.

Returns:
aligned_x(hit): alignment information, score and positions.
mirtop.mirna.realign.align_from_variants(sequence, mature, variants)
Given the read sequence,
the mature sequence from get_mature_sequence, and the variant GFF annotation, get a list of substitutions.
Args:

sequence(str): read sequence.

mature(str): mature sequence from
mirtop.mirna.realign.get_mature_sequence().

variants(str): string from Variant attribute in GFF file.

Returns:
snp(list): [[pos, target, reference]]
mirtop.mirna.realign.cigar2snp(cigar, reference)

From a CIGAR string and a reference sequence, detect the mismatch positions and the reference and target nucleotides.

Args:

cigar(str): CIGAR string.

reference(str): reference sequence.

Returns:

snp(list): position of mismatches (indels included) as:

>>> [pos, seq_nt, ref_nt]
mirtop.mirna.realign.cigar_correction(cigarLine, query, target)

Read from CIGAR in BAM file to define mismatches.

Args:

cigarLine(str): CIGAR string from BAM file.

query(str): read sequence.

target(str): target sequence.

Returns:
(list): [query_nts, target_nts]
mirtop.mirna.realign.expand_cigar(cigar)

Convert the short CIGAR version to the long CIGAR version, where each character corresponds to one nt in the sequence.

Args:

cigar(str): CIGAR string.

>>> 10MA3M
Returns:

cigar_long(str): CIGAR long.

>>> MMMMMMMMMMAMMM
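
The expansion shown in the example above can be reproduced with a short regular expression. This is a sketch, not mirtop's implementation: the name expand_short_cigar is hypothetical, and it assumes that a letter with no count in front of it stands for a single position.

```python
import re


def expand_short_cigar(cigar):
    """Expand e.g. '10MA3M' to 'MMMMMMMMMMAMMM'.

    Hypothetical helper, not part of mirtop. A letter with no count in front
    of it is taken to occur once.
    """
    long_form = []
    for count, op in re.findall(r"(\d*)([A-Za-z])", cigar):
        long_form.append(op * (int(count) if count else 1))
    return "".join(long_form)


print(expand_short_cigar("10MA3M"))  # MMMMMMMMMMAMMM
```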
mirtop.mirna.realign.get_mature_sequence(precursor, mature, exact=False, nt=5)
From precursor and mature positions
get mature sequence with +/- 4 flanking nts.
Args:

precursor(str): long sequence.

mature(list): [start, end].

exact(boolean): do not add the +/- 4 flanking nts.

nt(int): number of nts to get.

Returns:
(str): mature sequence.
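
Extracting a mature sequence with flanking nts amounts to slicing the precursor with clipped boundaries. The sketch below is only an illustration: the name mature_with_flanks is hypothetical, and it assumes that the mature positions are 0-based and inclusive, which may differ from the convention mirtop uses.

```python
def mature_with_flanks(precursor, mature, flank=4):
    """Return the mature sequence plus up to `flank` nts on each side.

    Hypothetical helper, not part of mirtop. Assumes `mature` is a
    0-based, inclusive [start, end] pair; flanks are clipped at the ends.
    """
    start, end = mature
    left = max(start - flank, 0)
    right = min(end + 1 + flank, len(precursor))
    return precursor[left:right]


precursor = "AAAACCCCGGGGTTTT"
print(mature_with_flanks(precursor, [6, 9], flank=0))  # CCGG
print(mature_with_flanks(precursor, [6, 9], flank=2))  # CCCCGGGG
```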
class mirtop.mirna.realign.hits

Class with alignment information.

mirtop.mirna.realign.is_sequence(seq)

This function checks whether the sequence is valid or not.

Args:
seq(str): string acting as a sequence.
Returns:
boolean: whether it is a valid nucleotide sequence or not.
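
One simple way to run such a check is to compare the characters of the string against an allowed alphabet. The sketch below is illustrative and not mirtop's implementation: the name looks_like_nucleotides and the alphabet ACGTUN are assumptions.

```python
def looks_like_nucleotides(seq, alphabet="ACGTUN"):
    """Return True if `seq` is a non-empty string made only of `alphabet` letters.

    Hypothetical helper, not part of mirtop; the alphabet is an assumption.
    """
    return bool(seq) and set(seq.upper()) <= set(alphabet)


print(looks_like_nucleotides("TGAGGTAGTAGG"))  # True
print(looks_like_nucleotides("TGAXG"))         # False
```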
class mirtop.mirna.realign.isomir

Class to represent isomiRs information.

format(sep='\t')

Create tabular line from variant fields.

formatGFF()

Create Variant attribute.

format_id(sep='\t')

Create simple identifier from variant fields.

get_score(sc)

Get score from variant fields.

is_iso()

Define whether element is isomiR or not.

set_pos(start, l, strand='+')

Set end position

mirtop.mirna.realign.make_cigar(seq, mature)

Function that will create a CIGAR string from the alignment between read and reference sequence.

Args:

seq(str): read sequence.

mature(str): short sequence.

Return:
short(str): CIGAR string.
mirtop.mirna.realign.make_id(seq)

Create a unique identifier for the sequence from the nucleotides, replacing 5 nts with a unique code.

It uses the code from mirtop.mirna.keys().

Inspired by MINTplate: https://cm.jefferson.edu/MINTbase https://github.com/TJU-CMC-Org/MINTmap/tree/master/MINTplates

Args:
seq(str): nucleotide sequence.
Returns:
idName(str): unique identifier for the sequence.
mirtop.mirna.realign.read_id(idu)

Read a unique identifier for the sequence and convert it to the nucleotides, replacing a unique code with 5 nts.

It uses the code from mirtop.mirna.keys().

Inspired by MINTplate: https://cm.jefferson.edu/MINTbase https://github.com/TJU-CMC-Org/MINTmap/tree/master/MINTplates

Args:
idu(str): unique identifier for the sequence.
Returns:
seq(str): nucleotide sequence.
mirtop.mirna.realign.reverse_complement(seq)

Get the reverse complement of a sequence

Args:

seq(str): sequence.

>>> GCAT
Returns:

(str): reverse complement sequence:

>>> ATGC
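
The GCAT to ATGC example above can be reproduced with a translation table followed by a reversal. This is a generic sketch rather than mirtop's code; the name revcomp is hypothetical and only the DNA letters A, C, G, T and N are handled.

```python
# Map each base to its complement, preserving case.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    return seq.translate(COMPLEMENT)[::-1]


print(revcomp("GCAT"))  # ATGC
```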
mirtop.mirna.realign.variant_to_3p(hairpin, pos, variant)
From a sequence and a start position get the nts
+/- indicated by iso_3p. The pos option is a 0-based index.
Args:
hairpin(str): long sequence:
>>> AAATTTT

position(int): >>> 3

variant(int): number of nts involved in the variant:
>>> -1
Returns:
(str): nucleotide involved in the variant:
>>> A
mirtop.mirna.realign.variant_to_5p(hairpin, pos, variant)
From a sequence and a start position get the nts
+/- indicated by iso_5p. The pos option is a 0-based index.
Args:
hairpin(str): long sequence:
>>> AAATTTT

position(int): >>> 3

variant(int): number of nts involved in the variant:
>>> -1
Returns:
(str): nucleotide involved in the variant:
>>> T
mirtop.mirna.realign.variant_to_add(read, variant)
From a sequence and a start position get the nts
+/- indicated by iso_3p. The pos option is a 0-based index.
Args:
hairpin(str): long sequence:
>>> AAATTTT

position(int): >>> 3

variant(int): number of nts involved in the variant:
>>> 2
Returns:
(str): nucleotide involved in the variant:
>>> TT
mirtop.mirna.snps.create_vcf(isomirs, matures, gtf, vcf_file=None)

Create VCF file of changes for all samples. Positions supported by > 3 isomiRs and > 30% of reads are marked PASS; all others are marked LOW.
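
The PASS/LOW rule can be written as a small predicate. The sketch below is an illustration only: the name vcf_filter_label and its arguments are hypothetical, and it assumes the 30% is measured as reads supporting the position over a total read count chosen by the caller.

```python
def vcf_filter_label(n_isomirs, reads_at_position, total_reads):
    """Return 'PASS' or 'LOW' following the two thresholds described above.

    Hypothetical helper, not part of mirtop; argument names are assumptions.
    """
    if total_reads <= 0:
        return "LOW"  # no reads, so the 30% condition cannot hold
    fraction = reads_at_position / total_reads
    if n_isomirs > 3 and fraction > 0.30:
        return "PASS"
    return "LOW"


print(vcf_filter_label(4, 40, 100))  # PASS
print(vcf_filter_label(3, 40, 100))  # LOW
```

Both comparisons are strict, so exactly 3 isomiRs or exactly 30% of reads yields LOW.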

mirtop.mirna.snps.liftover(pass_pos, matures)

Make position at precursor scale

mirtop.mirna.snps.liftover_to_genome(pass_pos, gtf)

Liftover from precursor to genome

mirtop.mirna.snps.print_vcf(data)

Print vcf line following rules.
