Getting data from fastq by generator

I have a task in a training course where I have to read big FASTQ files and filter out the 'good' reads. Each record contains a header, a DNA string, a + sign, and a line of symbols (the quality of each base in the DNA string). Example:

@hhhhhhhh
ATGCGTAGGGG
+
IIIIIIIIIIIII
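As a rough illustration of that four-line layout, one record could be modelled as below. This is only a sketch: `FastqRecord`, `parse_record` and `phred_scores` are hypothetical names, and the Phred+33 quality encoding is an assumption about the files. In a valid record the quality line has one symbol per base, so the demo uses 11 symbols for the 11 bases.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical helper type, not from the original post)."""
    header: str
    sequence: str
    quality: str


def parse_record(lines):
    """Build a FastqRecord from the four text lines of one record."""
    header, sequence, plus, quality = (line.rstrip() for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a well-formed FASTQ record")
    # Drop the leading "@" so only the read identifier is stored.
    return FastqRecord(header[1:], sequence, quality)


def phred_scores(quality, offset=33):
    """Decode quality symbols to integer scores, assuming Phred+33 encoding."""
    return [ord(symbol) - offset for symbol in quality]


record = parse_record(["@hhhhhhhh\n", "ATGCGTAGGGG\n", "+\n", "IIIIIIIIIII\n"])
print(record.header, record.sequence, phred_scores(record.quality)[:3])
# prints: hhhhhhhh ATGCGTAGGGG [40, 40, 40]
```

Under the Phred+33 assumption the symbol `I` decodes to a score of 40.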

I down-sampled the data and got the code working, saving the results in a Python dictionary. But it turns out the original files are huge, so I rewrote the code as a generator. It worked for the down-sampled sample. But I was wondering whether it is a good idea to pull out all the data and filter it in a dictionary. Does anybody here have a better idea?
I am asking because I am doing this by myself. I started learning Python some months ago and I am still learning, but I am doing it alone. That is why I am asking for tips and help here, and sorry if I sometimes ask silly questions.
Thanks in advance.
Paulo
I got some ideas from code on Biostar:

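import sys
import gzip

filename = sys.argv[1]

def parsing_fastq_files(filename):
    # Stream the gzipped file line by line instead of loading it all at once.
    with gzip.open(filename, "rb") as infile:
        count_lines = 0
        for line in infile:
            line = line.decode()
            if count_lines % 4 == 0:
                # Header line: drop the leading "@".
                ids = line[1:].strip()
                yield ids
            # "% 4 == 1" (not "== 1") so every record's sequence is yielded,
            # not only the first one.
            if count_lines % 4 == 1:
                reads = line.rstrip()
                yield reads
            count_lines += 1

total_reads = parsing_fastq_files(filename)
print(next(total_reads))
print(next(total_reads))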

I now need to figure out how to filter the data using 'if value.endswith('expression'):'. I could use a dict, for example, but that is where my doubt lies, because of the number of keys and values.
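One possible way to apply an `endswith` test without building a dict is to chain a second generator onto the reader, so only one record is held in memory at a time. The following is a minimal sketch under assumptions: `iter_fastq` and `reads_ending_with` are hypothetical names, records are assumed to be exactly four lines each, and the in-memory `demo` text stands in for a real file.

```python
import io
from itertools import islice


def iter_fastq(handle):
    """Yield (read_id, sequence, quality) for each 4-line record in a text handle."""
    while True:
        chunk = list(islice(handle, 4))
        if len(chunk) < 4:
            return  # end of input (or a truncated final record)
        header, sequence, _plus, quality = (line.rstrip() for line in chunk)
        yield header[1:], sequence, quality


def reads_ending_with(records, suffix):
    """Lazily keep only the records whose sequence ends with `suffix`."""
    for read_id, sequence, quality in records:
        if sequence.endswith(suffix):
            yield read_id, sequence, quality


# Tiny made-up input so the sketch runs on its own.
demo = io.StringIO("@r1\nATGCGG\n+\nIIIIII\n@r2\nATGCAA\n+\nIIIIII\n")
kept = list(reads_ending_with(iter_fastq(demo), "GG"))
print(kept)
# prints: [('r1', 'ATGCGG', 'IIIIII')]
```

For a real file, the same two generators could be fed from `gzip.open(filename, "rt")`, which returns text lines directly; calling `list()` is only for the demo, and on a huge file the filtered records would be consumed in a `for` loop instead.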

edited Mar 6 at 13:44

asked Mar 6 at 13:25

Paulo Sergio Schlogl


Can you post your code? It's easier to give advice if you do.

– Felix Martinez
Mar 6 at 13:33

python bigdata

1 Answer
Since this training forces you to code this manually, and you have code that reads the FASTQ file as a generator, you can now use whatever metric you have (the Phred score, maybe?) for determining the quality of a read. You can append each "good" read to a new file, so you don't hold much in working memory even if almost all reads turn out to be good.
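A minimal sketch of that idea, assuming records arrive as `(read_id, sequence, quality)` tuples, that qualities are Phred+33 encoded, and that "good" means a mean score of at least 30. The threshold, `mean_phred` and `write_good_reads` are all illustrative choices, not something the training specifies:

```python
import os
import tempfile


def mean_phred(quality, offset=33):
    """Average Phred score of a quality string, assuming Phred+33 encoding."""
    return sum(ord(symbol) - offset for symbol in quality) / len(quality)


def write_good_reads(records, out_path, min_mean_quality=30):
    """Write records that pass the threshold to out_path; return how many passed."""
    kept = 0
    with open(out_path, "w") as out:
        for read_id, sequence, quality in records:
            # Skip empty quality strings to avoid dividing by zero.
            if quality and mean_phred(quality) >= min_mean_quality:
                out.write(f"@{read_id}\n{sequence}\n+\n{quality}\n")
                kept += 1
    return kept


# Two made-up records: "I" decodes to 40 and "#" to 2 under Phred+33.
records = [("r1", "ATGC", "IIII"), ("r2", "ATGC", "####")]
out_path = os.path.join(tempfile.mkdtemp(), "good_reads.fastq")
n_kept = write_good_reads(records, out_path)
print(n_kept)
# prints: 1
```

Because `records` can be any iterable, a generator over the original file can be passed in directly and each read is written as soon as it passes.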

Writing to a file is a slow operation, so you could wait until you have, say, 50,000 good sequences and then write them to the file in one go.
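The batching could look roughly like the sketch below. `write_in_batches` is a hypothetical helper, the tuple layout `(read_id, sequence, quality)` is an assumption, and the demo uses a batch size of 2 and an in-memory buffer only so it runs quickly:

```python
import io


def write_in_batches(records, out, batch_size=50000):
    """Collect formatted records and write them to `out` every batch_size records."""
    batch = []
    written = 0
    for read_id, sequence, quality in records:
        batch.append(f"@{read_id}\n{sequence}\n+\n{quality}\n")
        if len(batch) >= batch_size:
            out.write("".join(batch))
            written += len(batch)
            batch.clear()
    if batch:  # write the final, partly filled batch
        out.write("".join(batch))
        written += len(batch)
    return written


out = io.StringIO()
records = [(f"r{i}", "ATGC", "IIII") for i in range(5)]
total = write_in_batches(records, out, batch_size=2)
print(total)
# prints: 5
```

Python's file objects already buffer writes by default, so it may be worth timing this against plain per-record writes before settling on a batch size.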

Check out https://bioinformatics.stackexchange.com/ if you do a lot of bioinformatics programming.

answered Mar 6 at 14:01

Pallie

