Processing a null value with spark.read.csv & getting String type always as a consequence




I have a file like this:



1,ITEM_001,CAT_01,true,2,50,4,0,false,2019-01-01,2019-01-28,true
1,ITEM_001,CAT_01,true,2,60,4,0,false,2019-01-29,2019-12-32,true
1,ITEM_002,CAT_02,true,2,50,"","",false,2019-01-01,2019-11-22,true


I do not want to infer the schema in case the file is big. I tried mapping to a case class, but for some reason that did not work.



So, I am doing the following:



val dfPG = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "false")
  .option("nullValue", "")
  .load("/FileStore/tables/SO_QQQ.txt")


and setting the fields explicitly:



val dfPG2 = dfPG.map { r =>
  (r.getString(0).toLong, r.getString(1), r.getString(2), r.getString(3).toBoolean,
   r.getString(4).toInt, r.getString(5).toInt,
   r.getString(6) // r.getString(6).toInt fails on null values
  )
}


I cannot seem to process a null value and also get an Integer type. Where there is a null value I get a String, but I want an Int, and I get an error with every approach I have tried.



See the commented-out code above: the conversion below fails with a null exception, and for some reason I cannot formulate check logic there. Is there an easier way?



r.getString(6).toInt


I must be over-complicating and/or missing something.
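For what it's worth, the null failure in the manual conversion can be sidestepped by wrapping the nullable column in `Option` before converting. A minimal sketch in plain Scala (no Spark required); `toIntOpt` is a hypothetical helper, not part of any API:

```scala
// Null-safe string-to-int conversion: null or empty strings become None.
def toIntOpt(s: String): Option[Int] =
  Option(s).filter(_.nonEmpty).map(_.toInt)

// Inside the map, column 6 could then be read as toIntOpt(r.getString(6)):
// "4" yields Some(4), while null or "" yields None.
```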



Just to add: when loading via a Seq to a DataFrame with Option fields, it all works fine. The problem is the file input.
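That working Seq case can be sketched like this (assuming a SparkSession `spark` is in scope; the column names are illustrative):

```scala
import spark.implicits._

// Option[Int] encodes as a nullable IntegerType column, so None rows load as null.
val df = Seq((1L, Some(4)), (2L, None: Option[Int])).toDF("id", "stock")
// df.printSchema shows: id: long, stock: integer (nullable = true)
```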










apache-spark · asked Mar 6 at 18:40 by thebluephantom
1 Answer














That's just not the correct way of doing things. Instead of mapping fields by hand (both inefficient and extremely error-prone), you should define a schema for your data:



import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField(...),
  StructField(...),
  StructField(...),
  StructField(...),
  StructField(...),
  StructField(...),
  StructField("your_integer_field", IntegerType, true),
  ...
))


          and provide it to the reader:



val dfPG = spark.read.format("csv")
  .schema(schema)
  ...
  .load("/FileStore/tables/SO_QQQ.txt")
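To make this concrete for the twelve-column file in the question, a full schema might look like the sketch below. The column names are invented for illustration (the real header names should be used), and the date columns are typed DateType on the assumption that values like 2019-01-01 should parse as dates:

```scala
import org.apache.spark.sql.types._

// Hypothetical names; replace with the actual header columns.
val schema = StructType(Seq(
  StructField("id",         LongType,    true),
  StructField("item",       StringType,  true),
  StructField("category",   StringType,  true),
  StructField("flag_a",     BooleanType, true),
  StructField("qty",        IntegerType, true),
  StructField("price",      IntegerType, true),
  StructField("stock",      IntegerType, true),  // "" becomes null via nullValue
  StructField("reserved",   IntegerType, true),
  StructField("flag_b",     BooleanType, true),
  StructField("start_date", DateType,    true),
  StructField("end_date",   DateType,    true),
  StructField("flag_c",     BooleanType, true)
))

val dfPG = spark.read.format("csv")
  .option("header", "true")
  .option("nullValue", "")
  .schema(schema)
  .load("/FileStore/tables/SO_QQQ.txt")
```

With the schema supplied up front, the empty quoted fields load as null Integers rather than Strings, and no per-row casting is needed.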





• I am going to try this. I have seen this approach in the past and used it on RDDs. Will get back. I mean my approach; it's just the null thing. Excited.
  – thebluephantom, Mar 6 at 19:22

• Great, but I am wondering if I could have worked around the issue the other way? I think so, but was not sure; maybe not. Anyway, that is great.
  – thebluephantom, Mar 6 at 19:32

• stackoverflow.com/questions/41705602/…. In Scala all nullable!
  – thebluephantom, Mar 6 at 19:57

• Seq(...).toDF does show false, true on schema, as I always knew BTW, so very interesting.
  – thebluephantom, Mar 6 at 20:09











answered Mar 6 at 18:50 by user11161602











