serialize RDD with Avro
I have this scenario: we have to provide functionality that takes an RDD of any type (in generics notation, RDD[T]), serializes it, and saves it to HDFS as an Avro Data File.
Note that the RDD could contain anything, so the functionality must be generic over the RDD's element type, for example RDD[(String, AnyBusinessObject)] or RDD[(String, Date, OtherBusinessObject)].
The question is: how can we infer the Avro schema and provide Avro serialization for an arbitrary class in order to save it as an Avro Data File?
The functionality is actually already built, but it uses Java serialization, which carries a space and time penalty, so we would like to refactor it. We can't use DataFrames.
apache-spark hadoop serialization avro
asked Mar 8 at 13:16
Giorgio
1 Answer
You can write Avro files using the GenericRecord API (see the "Serializing and deserializing without code generation" section of the Avro Getting Started guide). However, you still need to have the Avro schema.
If you have a DataFrame, Spark handles all of this for you because Spark knows how to do the conversion from Spark SQL types to Avro types.
Since you say you can't use DataFrames, you'll have to do this schema generation yourself. One option is to use Avro's ReflectData API.
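To make the idea of reflection-based schema generation concrete, here is a minimal sketch in plain Python. It is not Avro's API: on the JVM the real entry point is `ReflectData.get().getSchema(...)`. The function `infer_schema` and the classes `Customer` and `Order` are hypothetical names made up for illustration, and the type mapping covers only a few cases. It shows the kind of record schema that walking a class's fields produces.

```python
import dataclasses
import json
import typing

# Python primitives mapped to Avro primitive type names (a partial, assumed mapping).
_PRIMITIVES = {str: "string", int: "long", float: "double", bool: "boolean", bytes: "bytes"}


def infer_schema(tp):
    """Return an Avro-style schema (plain dicts and strings) for a Python type."""
    if tp in _PRIMITIVES:
        return _PRIMITIVES[tp]
    if dataclasses.is_dataclass(tp):
        # Walk the declared fields in order, as a reflection-based API would.
        # Assumes annotations are real types, not strings.
        return {
            "type": "record",
            "name": tp.__name__,
            "fields": [
                {"name": f.name, "type": infer_schema(f.type)}
                for f in dataclasses.fields(tp)
            ],
        }
    if typing.get_origin(tp) is list:
        (item,) = typing.get_args(tp)
        return {"type": "array", "items": infer_schema(item)}
    raise TypeError(f"no Avro mapping for {tp!r}")


@dataclasses.dataclass
class Customer:  # hypothetical business object
    name: str
    visits: int


@dataclasses.dataclass
class Order:  # hypothetical business object with a nested record
    id: str
    customer: Customer
    amounts: typing.List[float]


if __name__ == "__main__":
    print(json.dumps(infer_schema(Order), indent=2))
```

The sketch ignores details a real implementation has to handle, such as nullable fields (unions with `null`), namespaces, and a named record that appears more than once in the same schema.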
Then, once you have the schema, you'll do a map to transform all of the elements in the RDD to GenericRecords and use GenericDatumWriter to write them to a file.
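The map-then-write step can be sketched in plain Python as well. This is only an illustration of schema-driven encoding, the job a datum writer performs for each record. It is not the Avro library, and it does not produce a complete Avro Data File (no container header, sync markers, or codec; on the JVM `DataFileWriter` adds those). The names `encode`, `encode_long`, and `CUSTOMER_SCHEMA` are hypothetical, the schema is written by hand, and a plain list stands in for the RDD.

```python
import struct


def encode_long(n):
    """Avro int/long: zigzag encoding, then a variable-length base-128 integer."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode(schema, value):
    """Encode one value against its schema (a partial encoder for illustration)."""
    if schema == "string":
        data = value.encode("utf-8")
        return encode_long(len(data)) + data  # length prefix, then UTF-8 bytes
    if schema == "long":
        return encode_long(value)
    if schema == "double":
        return struct.pack("<d", value)  # 8 bytes, little-endian
    if schema == "boolean":
        return b"\x01" if value else b"\x00"
    if isinstance(schema, dict) and schema["type"] == "record":
        # A record is just its fields, encoded in schema order.
        return b"".join(encode(f["type"], value[f["name"]]) for f in schema["fields"])
    raise TypeError(f"unsupported schema: {schema!r}")


# Hand-written schema; in practice it would come from the inference step.
CUSTOMER_SCHEMA = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "visits", "type": "long"},
    ],
}

rows = [("foo", 1), ("bar", 64)]  # stand-in for an RDD[(String, Long)]
records = [{"name": n, "visits": v} for n, v in rows]  # the "map" step
payload = b"".join(encode(CUSTOMER_SCHEMA, r) for r in records)
```

Because field names and types live in the schema rather than in every record, the encoded rows carry only the values, which is where the space saving over Java serialization comes from.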
I'd seriously reconsider these requirements though. IMO, a better design would be to convert the RDD to a DataFrame so that you can let Spark do the heavy lifting of writing Avro. Or... why even bother with Avro? Just use a file format that allows a generic schema, such as JSON.
answered Mar 8 at 22:26
Kit Menke