Getting data from fastq by generator
For a training course I have to read big FASTQ files and keep only the 'good' reads. Each record contains a header, a DNA string, a + sign, and a string of symbols (the qualities for that DNA string). Example:
@hhhhhhhh
ATGCGTAGGGG
+
IIIIIIIIIIIII
I downsampled a file and got the code working, saving the reads in a Python dictionary. But it turns out the original files are huge, so I rewrote the code as a generator. It works on the downsampled file. But I am wondering whether it is a good idea to pull out all the data and filter it in a dictionary. Does anybody have a better idea?
I am asking because I am doing this on my own. I started learning Python a few months ago and I am still learning, with nobody to ask. That is why I ask for tips and help here, and sorry if I sometimes ask silly questions.
Thanks in advance.
Paulo
I got some ideas from code on Biostar:
import sys
import gzip

filename = sys.argv[1]

def parsing_fastq_files(filename):
    # Yield, for every record, the read ID and then the DNA sequence.
    with gzip.open(filename, "rb") as infile:
        count_lines = 0
        for line in infile:
            line = line.decode()
            if count_lines % 4 == 0:
                # Header line: drop the leading "@".
                ids = line[1:].strip()
                yield ids
            if count_lines % 4 == 1:
                # Sequence line; "% 4" so every record matches, not only the first.
                reads = line.rstrip()
                yield reads
            count_lines += 1

total_reads = parsing_fastq_files(filename)
print(next(total_reads))
print(next(total_reads))
Now I need to figure out how to filter the data with something like 'if value.endswith('expression'):'. I could use a dict, for example, but that is where my doubt is, because of the number of keys and values.
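For illustration, here is a minimal sketch of how an endswith filter might be applied lazily, without collecting everything in a dict first. The function names read_fastq and reads_ending_with, and the choice to yield (read_id, sequence, quality) tuples, are assumptions made for this example and are not part of the code above.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (read_id, sequence, quality) tuples from a gzipped FASTQ file."""
    # "rt" opens the file in text mode, so no manual decode() is needed.
    with gzip.open(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))  # one record = four lines
            if len(record) < 4:
                break  # end of file (or a truncated last record)
            header, sequence, _plus, quality = (
                line.rstrip("\n") for line in record
            )
            yield header[1:], sequence, quality  # drop the leading "@"


def reads_ending_with(records, suffix):
    """Lazily keep only the records whose sequence ends with `suffix`."""
    for read_id, sequence, quality in records:
        if sequence.endswith(suffix):
            yield read_id, sequence, quality
```

Because both functions are generators, something like `for read in reads_ending_with(read_fastq(filename), "GGG"):` holds only one record in memory at a time, however large the file is.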
python bigdata
Can you post your code? It's easier to give advice if you do.
– Felix Martinez
Mar 6 at 13:33
edited Mar 6 at 13:44
Paulo Sergio Schlogl
asked Mar 6 at 13:25
Paulo Sergio Schlogl
1 Answer
Since this training forces you to code this manually, and you already have code that reads the FASTQ file as a generator, you can now apply whatever metric you have for the quality of a read (the Phred score, maybe?). You can append each "good" read to a new file, so you don't hold much in working memory even if almost all reads turn out to be good.
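As a rough sketch of that idea, the code below filters on the mean Phred score and appends the surviving reads to a file. Everything specific here is an assumption for illustration: the Phred+33 encoding, the threshold of 30, the function names, and the (read_id, sequence, quality) tuple layout of each record.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def good_reads(records, min_mean_quality=30):
    """Yield only the records whose mean quality reaches the threshold."""
    for read_id, sequence, quality in records:
        # Skip empty quality strings to avoid dividing by zero.
        if quality and mean_phred(quality) >= min_mean_quality:
            yield read_id, sequence, quality


def append_fastq(records, out_path):
    """Append records to an uncompressed FASTQ file; return how many were written."""
    kept = 0
    with open(out_path, "a") as out:
        for read_id, sequence, quality in records:
            out.write(f"@{read_id}\n{sequence}\n+\n{quality}\n")
            kept += 1
    return kept
```

With Phred+33, the symbol I corresponds to a score of 40, so a read whose quality string is all I passes a threshold of 30. Swap in whatever metric the training actually specifies.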
Writing to file is a slow operation, so you could wait until you have, say, 50000 good sequences and then write them to file.
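The batching could look roughly like the sketch below. The function name write_in_batches and the (read_id, sequence, quality) tuple layout are assumptions for this example; the batch size of 50000 is the figure mentioned above. Python file objects already buffer writes internally, so it is worth measuring whether explicit batching helps in your case.

```python
def write_in_batches(records, out_path, batch_size=50000):
    """Write records to `out_path`, one write call per `batch_size` records."""
    batch = []
    written = 0
    with open(out_path, "w") as out:
        for read_id, sequence, quality in records:
            batch.append(f"@{read_id}\n{sequence}\n+\n{quality}\n")
            if len(batch) >= batch_size:
                out.write("".join(batch))
                written += len(batch)
                batch.clear()
        if batch:  # write whatever is left after the loop ends
            out.write("".join(batch))
            written += len(batch)
    return written
```

Only one batch is held in memory at a time, so memory use stays bounded by batch_size no matter how many reads pass the filter.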
Check out https://bioinformatics.stackexchange.com/ if you do a lot of bioinformatics programming.
answered Mar 6 at 14:01
Pallie