Bibliothèque d'échantillons

Parcourir | Soumettre un nouvel échantillon | Créer un paquet

fastq correcter

Type:
Exemple de Code (HOWTO)
Category:
Autre
License:
GNU General Public License
Language:
Python
 
Description:
Unwrap lines of sequence and quality, remove bad formatted reads (listed in a log file) and fix coherence sequence length - quality length (truncate the longer one).
Change quality encoding, verifications and many other stuffs may be written in write_current() fonction

Versions de cet échantillon

Numéro d'identification de l'échantillon Téléchargement Date d'envoi Auteur Effacer
30.125/02/2011 12:55nicolas allias

Téléchargez une version en texte brut du code en cliquant sur "Download Version"

 


Dernière version d'échantillon : 0.1

import sys, os VERBOSE = False LENGTH=36 CHAR_VERIF_TITLE = 5 filename = sys.argv[1] flowcellName = sys.argv[2] outfile = filename+".cleaned" pos=1 title, seq, qual = "", "", "" if os.path.isfile(outfile): os.remove(outfile) def string_is_title(string, flowcellName): return string.startswith("@"+flowcellName) def write_current(title, seq, qual): output=open(outfile,"a") if not title or not seq or not qual: open(outfile+".log","a").write("# Incomplete sequence:\n"+"@"+str(title)+"\n"+str(seq)+"\n+\n"+str(qual)+"\n") return None offset_seq = len(seq) offset_qual = len(qual) acceptable_len = min (len(seq), len(qual)) if len(seq)>acceptable_len: if VERBOSE: sys.stderr.write("\nNot enought quality scores for the sequence:\n\t"+seq+"\nQual:\t"+qual+"\n") offset_seq = acceptable_len elif len(qual)>acceptable_len: if VERBOSE: sys.stderr.write("\nToo much quality scores for the sequence:\n\t"+seq+"\nQual:\t"+qual+"\n") offset_qual = acceptable_len output.write("@"+ title +"\n"+ seq[:offset_seq] +"\n+\n"+ qual[:offset_qual]+"\n") output.close() sys.stderr.flush() ################################MAIN############################ sys.stderr.write("Start\n") for line in open(filename, 'r', 256): if pos == 1: if string_is_title(line,flowcellName): #title write_current(title, seq, qual) title = line[1:].strip() pos = 2 seq, qual = "", "" continue else: if VERBOSE: sys.stderr.write("\nFirst line is not a fastq title !\n"+flowcellName+"\t !=~ \t"+line) qual += line.strip() continue # sequence part if pos == 2: seq = line.strip() pos = 3 continue # "+" part, maybe with the title again if pos == 3: # still sequence (multilined) if not line.startswith("+"): seq += line.strip() continue # "+" part else: pos = 4 continue # pos == 4: should be quality string # line is the quality string else: if string_is_title(line,flowcellName): #title write_current(title, seq, qual) title = line[1:].strip() pos = 2 seq, qual = "", "" continue else: # quality qual += line.strip() pos = 1 continue # Process the last sequence write_current(title, seq, qual)

Soummetre une nouvelle version

Vous pouvez fournir une nouvelle version de cet échantillon si vous l'avez modifié et que vous trouvez bien de le partager avec les autres..

Powered By FusionForge