serialize RDD with Avro

We need to provide functionality that takes an RDD of any type, RDD[T] in generic notation, serializes it, and saves it to HDFS as an Avro data file.

Note that the RDD can hold anything, so the functionality must be generic in the RDD's element type: for example, RDD[(String, AnyBusinessObject)] or RDD[(String, Date, OtherBusinessObject)].

The question is: how can we infer the Avro schema and provide Avro serialization for an arbitrary class, so that we can save it as an Avro data file?

The functionality is already built, but it uses Java serialization, which carries a space and time penalty, so we would like to refactor it. We can't use DataFrames.










      apache-spark hadoop serialization avro
















      asked Mar 8 at 13:16









      Giorgio






















          1 Answer














You can write Avro files using the GenericRecord API (see the "Serializing and deserializing without code generation" section in the Avro documentation). However, you still need an Avro schema.
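As a minimal sketch of the GenericRecord route, assuming a hand-written schema: the `Customer` record, its two fields, and the `GenericRecordExample` object are hypothetical names used only for illustration. It writes to an in-memory stream; any `OutputStream` should work the same way.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

object GenericRecordExample {
  // Hypothetical schema, declared by hand because no class was generated for it.
  val schema: Schema = SchemaBuilder.record("Customer").fields()
    .requiredString("name")
    .requiredInt("age")
    .endRecord()

  /** Serializes (name, age) pairs into the bytes of an Avro data file. */
  def write(rows: Seq[(String, Int)]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, out) // writes the file header, which embeds the schema
    try {
      rows.foreach { case (name, age) =>
        val record = new GenericData.Record(schema)
        record.put("name", name)
        record.put("age", Int.box(age))
        writer.append(record)
      }
    } finally {
      writer.close() // flushes the last block
    }
    out.toByteArray
  }
}
```

Because the schema is stored in the file header, a reader can decode the file with a plain GenericDatumReader and no knowledge of the original classes.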



If you have a DataFrame, Spark handles all of this for you, because it knows how to convert Spark SQL types to Avro types.



Since you say you can't use DataFrames, you'll have to generate the schema yourself. One option is Avro's ReflectData API.
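A rough sketch of schema inference with ReflectData, assuming the business objects are ordinary classes with concrete field types. `Customer` and `InferSchema` are hypothetical names. Generic containers such as Scala tuples will probably need to be mapped to a concrete class first, since reflection cannot see their erased type parameters.

```scala
import org.apache.avro.Schema
import org.apache.avro.reflect.ReflectData

// Hypothetical business class, used only for illustration.
class Customer(var name: String, var age: Int)

object InferSchema {
  /** Derives an Avro schema from the fields of the given class via reflection. */
  def schemaFor[T](clazz: Class[T]): Schema =
    ReflectData.get().getSchema(clazz)

  def main(args: Array[String]): Unit = {
    // Prints the inferred record schema as pretty-printed JSON.
    println(schemaFor(classOf[Customer]).toString(true))
  }
}
```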



Then, once you have the schema, map over the RDD to turn each element into a GenericRecord and use GenericDatumWriter to write the records to a file.
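One way the writing step could look for an arbitrary element type is sketched below. Note the swap: instead of building GenericRecords by hand, it uses Avro's ReflectDatumWriter, which reads the fields of each object directly. `Order` and `PartitionWriter` are hypothetical names, and the function takes a plain iterator and output stream so it carries no Spark dependency.

```scala
import java.io.OutputStream
import org.apache.avro.file.DataFileWriter
import org.apache.avro.reflect.{ReflectData, ReflectDatumWriter}

// Hypothetical element type, used only for illustration.
class Order(var id: String, var amount: Double)

object PartitionWriter {
  /** Writes all rows to `out` as one Avro data file and returns the row count. */
  def writeAll[T](clazz: Class[T], rows: Iterator[T], out: OutputStream): Long = {
    val schema = ReflectData.get().getSchema(clazz) // inferred by reflection
    val writer = new DataFileWriter[T](new ReflectDatumWriter[T](schema))
    writer.create(schema, out)
    var count = 0L
    try {
      rows.foreach { row =>
        writer.append(row)
        count += 1
      }
    } finally {
      writer.close() // also closes the underlying stream
    }
    count
  }
}
```

In Spark, a function like this could be called from `rdd.foreachPartition`, with one output stream opened per partition on HDFS; that wiring is left out here and would need the element class to be available on the executors.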



I'd seriously reconsider these requirements, though. IMO, a better design would be to convert the RDD to a DataFrame so that Spark does the heavy lifting of writing Avro. Or why bother with Avro at all? Just use a file format that allows a generic schema, like JSON.






                answered Mar 8 at 22:26









                Kit Menke




























