Processing a null value with spark.read.csv & getting String type always as a consequence
I have a file like this:
1,ITEM_001,CAT_01,true,2,50,4,0,false,2019-01-01,2019-01-28,true
1,ITEM_001,CAT_01,true,2,60,4,0,false,2019-01-29,2019-12-32,true
1,ITEM_002,CAT_02,true,2,50,"","",false,2019-01-01,2019-11-22,true
I do not want to infer the schema, since the file may be big. I tried mapping to a case class record, but for some reason that did not work.
So, I am doing the following:
val dfPG = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "false")
.option("nullValue", "")
.load("/FileStore/tables/SO_QQQ.txt")
and setting the fields explicitly:
val dfPG2 =
  dfPG
    .map(r => (r.getString(0).toLong, r.getString(1), r.getString(2),
      r.getString(3).toBoolean, r.getString(4).toInt, r.getString(5).toInt,
      r.getString(6) // r.getString(6).toInt
    ))
I cannot process a null value and convert the field to Integer at the same time. Where there is a null I get a String, but I want an Int, and every approach I try raises an error.
See the commented-out line: the call below fails with a null exception, and I cannot work out how to formulate a null check here. Is there an easier way?
r.getString(6).toInt
I must be over-complicating and/or missing something.
Just to add: when loading from a Seq to a DataFrame with Option, it all works fine. The problem is only with file input.
apache-spark
asked Mar 6 at 18:40
thebluephantom
3,1073932
1 Answer
That is just not the correct way to do this. Instead of mapping fields by hand (which is both inefficient and extremely error-prone), you should define a schema for your data:
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField(...),
StructField(...),
StructField(...),
StructField(...),
StructField(...),
StructField(...),
StructField("your_integer_field", IntegerType, true),
...
))
and provide it to the reader:
val dfPG = spark.read.format("csv")
.schema(schema)
...
.load("/FileStore/tables/SO_QQQ.txt")
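For the sample rows in the question, a complete schema could look like the sketch below. The field names are illustrative assumptions (the original post never shows the header), and the two date columns are read as strings because the sample data contains an invalid date (`2019-12-32`) that `DateType` would turn into null:

```scala
import org.apache.spark.sql.types._

// Field names are guesses from the sample rows; adjust to the real header.
val schema = StructType(Seq(
  StructField("id", LongType, true),
  StructField("item", StringType, true),
  StructField("category", StringType, true),
  StructField("flag1", BooleanType, true),
  StructField("qty", IntegerType, true),
  StructField("price", IntegerType, true),
  StructField("count", IntegerType, true),   // "" becomes null, not a String
  StructField("extra", IntegerType, true),
  StructField("flag2", BooleanType, true),
  StructField("start_date", StringType, true),
  StructField("end_date", StringType, true),
  StructField("flag3", BooleanType, true)
))

val dfPG = spark.read.format("csv")
  .option("header", "true")
  .option("nullValue", "")
  .schema(schema)
  .load("/FileStore/tables/SO_QQQ.txt")
```

With `nullValue` set to `""` and an `IntegerType` field, the empty quoted values in the file parse directly to null Ints, which is exactly the behaviour the question was trying to achieve by hand.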
I am going to try this; I have seen this approach in the past and used it on RDDs. Will get back. I mean my approach was the issue; it's just the null thing. Excited.
– thebluephantom
Mar 6 at 19:22
Great, but I am wondering if I could have worked around the issue the other way. I think so, but I was not sure; maybe not. Anyway, that is great.
– thebluephantom
Mar 6 at 19:32
stackoverflow.com/questions/41705602/…. In Scala all nullable!
– thebluephantom
Mar 6 at 19:57
Seq(...).toDF does show false/true nullability in the schema, as I always knew BTW, so very interesting.
– thebluephantom
Mar 6 at 20:09
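The nullability difference mentioned in the comments can be seen directly in a small sketch (assuming a running Spark session with `spark.implicits._` imported): a primitive `Int` column comes out as `nullable = false`, while an `Option[Int]` column comes out as `nullable = true`.

```scala
import spark.implicits._

// Primitive Int => nullable = false; Option[Int] => nullable = true
Seq((1, Option(2)), (3, None: Option[Int]))
  .toDF("a", "b")
  .printSchema()
```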
answered Mar 6 at 18:50
user11161602
261