Getting data from fastq by generator



I have a training task in which I have to read big FASTQ files and keep only the 'good' reads. Each record contains a header, a DNA string, a + sign, and a string of symbols (the quality of each base in the DNA string). For example:



@hhhhhhhh
ATGCGTAGGGG
+
IIIIIIIIIII



I down-sampled the data and got the code working, saving everything in a Python dictionary. But it turns out the original files are huge, so I rewrote the code as a generator. It worked for the down-sampled sample. But I was wondering whether it is a good idea to pull out all the data and filter it in a dictionary. Does anybody here have a better idea?
I am asking because I am doing this by myself. I started learning Python some months ago and I am still learning, but I am doing it alone. That is why I am asking for tips and help here, and sorry if I sometimes ask silly questions.
Thanks in advance.
Paulo
I got some ideas from code on Biostar:



import sys
import gzip

filename = sys.argv[1]

def parsing_fastq_files(filename):
    # Stream the gzipped FASTQ line by line; each record spans 4 lines.
    with gzip.open(filename, "rb") as infile:
        count_lines = 0
        for line in infile:
            line = line.decode()
            if count_lines % 4 == 0:
                # Header line: drop the leading '@'.
                ids = line[1:].strip()
                yield ids
            if count_lines % 4 == 1:
                # Sequence line: the second line of every 4-line record.
                reads = line.rstrip()
                yield reads
            count_lines += 1

total_reads = parsing_fastq_files(filename)
print(next(total_reads))
print(next(total_reads))


I now need to figure out how to filter the data using 'if value.endswith('expression'):', but I have doubts about using a dict, for example, because of the number of keys and values.










  • Can you post your code? It's easier to give advice if you do.

    – Felix Martinez
    Mar 6 at 13:33















python bigdata






edited Mar 6 at 13:44 by Paulo Sergio Schlogl
asked Mar 6 at 13:25 by Paulo Sergio Schlogl












1 Answer

Since this training forces you to code this manually, and you already have code that reads the FASTQ file as a generator, you can now apply whatever metric you have (Phred score, maybe?) to determine the quality of each read. You can append each "good" read to a new file, so that you don't hold much in working memory even if almost all reads turn out to be good.
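
As a rough sketch of that idea, assuming Phred+33 quality encoding and a mean-quality cutoff of 30 (both are assumptions on my part, not something taken from your post), you could yield whole four-line records and write the good ones straight to a new gzipped file, with no dictionary involved. The names `read_fastq`, `mean_phred`, and `filter_fastq` are made up for illustration, and the optional `suffix` argument stands in for your `endswith` check:

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (header, sequence, quality) tuples from a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            # Each FASTQ record is four lines: header, sequence, '+', quality.
            record = [line.rstrip("\n") for line in islice(handle, 4)]
            if len(record) < 4:
                break  # end of file (or a truncated trailing record)
            header, sequence, _plus, quality = record
            yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Average Phred score of a quality string (assumes Phred+33 encoding)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_fastq(in_path, out_path, min_mean_quality=30, suffix=None):
    """Stream reads from in_path and write the 'good' ones to out_path.

    A read is kept if its mean quality is at least min_mean_quality and,
    when suffix is given, its sequence ends with that suffix.
    Returns the number of reads written.
    """
    kept = 0
    with gzip.open(out_path, "wt") as out:
        for header, sequence, quality in read_fastq(in_path):
            if mean_phred(quality) < min_mean_quality:
                continue
            if suffix is not None and not sequence.endswith(suffix):
                continue
            out.write(f"@{header}\n{sequence}\n+\n{quality}\n")
            kept += 1
    return kept
```

Only one record is held in memory at a time, so the size of the input file should not matter much; what counts as a "good" read is up to your training's definition.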



Writing to a file is a slow operation, so you could wait until you have, say, 50000 good sequences and then write them to the file in one go.
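
A minimal sketch of that batching, assuming the good reads arrive as (header, sequence, quality) tuples from some iterable. The function name `write_in_batches` and its parameters are only illustrative. Keep in mind that files opened with `open()` are already buffered by Python, so it is worth timing this against plain per-read writes before relying on it:

```python
def write_in_batches(records, out_path, batch_size=50000):
    """Write FASTQ records to out_path, flushing every batch_size records.

    records is any iterable of (header, sequence, quality) tuples.
    Returns the number of records written.
    """
    written = 0
    batch = []
    with open(out_path, "a") as out:  # append mode, so existing reads are kept
        for header, sequence, quality in records:
            batch.append(f"@{header}\n{sequence}\n+\n{quality}\n")
            if len(batch) >= batch_size:
                out.writelines(batch)
                written += len(batch)
                batch.clear()
        # Write whatever is left over after the last full batch.
        if batch:
            out.writelines(batch)
            written += len(batch)
    return written
```

Because `records` can itself be a generator, at most `batch_size` formatted reads sit in memory at any moment.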



Check out https://bioinformatics.stackexchange.com/ if you do a lot of bioinformatics programming.






answered Mar 6 at 14:01 by Pallie




























