
Parquet with Athena vs Redshift


I hope someone out there can help me with this issue. I am currently working on a data pipeline project, and my current dilemma is whether to use Parquet with Athena or to store the data in Redshift.



Two scenarios:

First:



EVENTS --> STORE IT IN S3 AS JSON.GZ --> USE SPARK(EMR) TO CONVERT TO PARQUET --> STORE PARQUET BACK INTO S3 --> ATHENA FOR QUERY --> VIZ
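
For concreteness, the Spark conversion step in this first scenario might look like the following minimal PySpark sketch; the bucket, paths, and partition column are hypothetical placeholders, not part of the original pipeline.

    # Minimal PySpark sketch of the JSON.GZ -> Parquet step (scenario one).
    # Bucket names, paths, and the partition column are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("events-to-parquet").getOrCreate()

    # Spark decompresses .json.gz files transparently on read.
    events = spark.read.json("s3://my-events-bucket/raw/2019/03/08/*.json.gz")

    # Write back to S3 as Parquet (Snappy-compressed by default),
    # partitioned so Athena can prune on the partition column.
    (events.write
        .mode("append")
        .partitionBy("event_date")  # assumes an event_date column exists
        .parquet("s3://my-events-bucket/parquet/events/"))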


Second:



EVENTS --> STORE IT IN S3 --> USE SPARK(EMR) TO STORE DATA INTO REDSHIFT


Issues with this second scenario:



  1. Writing to Redshift from Spark over JDBC is slow. (A sketch of what this path looks like follows below.)

  2. The spark-redshift repo by Databricks has a failing build and was last updated two years ago.
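
For reference, the slow JDBC path from issue 1 is just Spark's generic JDBC writer pointed at the cluster endpoint. A minimal sketch, with the endpoint, credentials, and table name as hypothetical placeholders; it also assumes the Redshift JDBC driver jar is on the Spark classpath:

    # Hypothetical sketch of the plain-JDBC write that issue 1 refers to.
    # Endpoint, credentials, and table name are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    events = spark.read.json("s3://my-events-bucket/raw/2019/03/08/*.json.gz")

    (events.write
        .format("jdbc")
        .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1"
                       ".redshift.amazonaws.com:5439/events")
        .option("dbtable", "public.events")
        .option("user", "admin")
        .option("password", "...")
        .option("driver", "com.amazon.redshift.jdbc42.Driver")
        .mode("append")
        .save())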

I am unable to find useful information on which method is better. Should I even use Redshift, or is Parquet good enough?



It would also be great if someone could tell me whether there are other ways to connect Spark with Redshift, because I have only found two solutions online: JDBC and spark-redshift (Databricks).



P.S. The pricing model is not a concern for me, and I am dealing with millions of event records.
Tags: apache-spark, amazon-s3, amazon-redshift, parquet






edited Mar 8 at 7:34 by Red Boy










asked Mar 8 at 4:15 by Louis Wong






















2 Answers
































Here are some ideas and recommendations:



• Don't use JDBC.

• Spark-Redshift works fine, but it is a complex solution.

• You don't have to use Spark to convert to Parquet; there is also the option of using Hive. See https://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html

• Athena is great when used against Parquet, so you may not need to use Redshift at all.

• If you want to use Redshift, use Redshift Spectrum to set up a view against your Parquet tables, then, if necessary, a CTAS within Redshift to bring the data in (a sketch follows this list).

• An AWS Glue crawler can be a great way to create the metadata needed to map the Parquet into Athena and Redshift Spectrum.
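
As a sketch of the Spectrum route above: once a Glue crawler (or manual DDL) has catalogued the Parquet files, an external schema exposes them to Redshift, and an optional CTAS materializes hot data locally. The endpoint, schema, database, table, and IAM role names below are hypothetical placeholders, executed with psycopg2 as any Postgres client would:

    # Hypothetical sketch: expose S3 Parquet to Redshift via Spectrum,
    # then optionally materialize a subset with CTAS. All names and the
    # role ARN are placeholders.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="events", user="admin", password="...",
    )
    conn.autocommit = True  # DDL takes effect immediately

    with conn.cursor() as cur:
        # External schema backed by the Glue Data Catalog database
        # that the crawler populated.
        cur.execute("""
            CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
            FROM DATA CATALOG DATABASE 'events_db'
            IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
        """)
        # Optional CTAS to pull frequently queried data into Redshift.
        cur.execute("""
            CREATE TABLE events_hot AS
            SELECT * FROM spectrum.events
            WHERE event_date >= '2019-01-01'
        """)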

          My proposed architecture:



          EVENTS --> STORE IT IN S3 --> HIVE to convert to parquet --> Use directly in Athena



          and/or



          EVENTS --> STORE IT IN S3 --> HIVE to convert to parquet --> Use directly in Redshift using Redshift Spectrum



You may not need to convert to Parquet at all: if you use the right partitioning structure for the S3 folders (e.g. s3://bucket/events/year=2019/month=03/day=08/) and gzip the data, then Athena/Spectrum performance can be good enough without the complexity of a conversion to Parquet. This depends on your use case (the volume of data and the types of query that you need to run).






answered Mar 8 at 8:06 by Jon Scott












































Which one to use depends on your data and access patterns. Athena uses the S3 key structure directly to limit the amount of data scanned. Let's assume your events carry an event type and a time. The S3 keys could be e.g. yyyy/MM/dd/type/* or type/yyyy/MM/dd/*. The former key structure lets you limit the scan by date, or by date and type, but not by type alone: if you want to search only by type x without knowing the date, it requires a full bucket scan. The latter key schema works the other way around. If you mostly need to access the data one way (e.g. by time), Athena might be a good choice.
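
To make the pruning point concrete: with keys laid out as yyyy/MM/dd/type/* and registered as partitions (e.g. via ALTER TABLE ... ADD PARTITION or a Glue crawler), a date-filtered query scans only the matching prefixes. A minimal boto3 sketch, where the database, table, partition columns, and result bucket are hypothetical:

    # Hypothetical sketch: a partition-pruned Athena query via boto3.
    # Database, table, column, and bucket names are placeholders; it
    # assumes year/month/day are registered partition columns.
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")
    resp = athena.start_query_execution(
        QueryString=(
            "SELECT type, count(*) AS n "
            "FROM events "
            "WHERE year = '2019' AND month = '03' AND day = '08' "
            "GROUP BY type"
        ),
        QueryExecutionContext={"Database": "events_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(resp["QueryExecutionId"])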



On the other hand, Redshift is a PostgreSQL-based data warehouse, much more complicated and flexible than Athena. Data partitioning (distribution and sort keys) plays a big role in performance, but the schema can be designed in many ways to suit your use case. In my experience, the best way to load data into Redshift is to store it in S3 first and then use the COPY command (https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html). It is orders of magnitude faster than JDBC, which I found good only for testing with small amounts of data. This is also how Kinesis Firehose loads data into Redshift; if you don't want to implement the S3 copying yourself, Firehose provides an alternative.
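
A minimal sketch of that S3-then-COPY load path, again with a hypothetical endpoint, table, bucket, and IAM role; COPY accepts FORMAT AS PARQUET when the staged files are Parquet rather than delimited text:

    # Hypothetical sketch of the S3 -> COPY bulk load the answer describes.
    # Endpoint, credentials, table, bucket, and role ARN are placeholders.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="events", user="admin", password="...",
    )
    conn.autocommit = True

    with conn.cursor() as cur:
        cur.execute("""
            COPY public.events
            FROM 's3://my-events-bucket/staged/2019/03/08/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
            FORMAT AS PARQUET
        """)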






answered Mar 8 at 8:07 by ollik1






















