Processing a null value with spark.read.csv & getting String type always as a consequence

I have a file like this:



1,ITEM_001,CAT_01,true,2,50,4,0,false,2019-01-01,2019-01-28,true
1,ITEM_001,CAT_01,true,2,60,4,0,false,2019-01-29,2019-12-32,true
1,ITEM_002,CAT_02,true,2,50,"","",false,2019-01-01,2019-11-22,true


I do not want to infer the schema, since the file may be large. I tried mapping to a case class, but for some reason that did not work.



So, I am doing the following:



val dfPG = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "false")
.option("nullValue", "")
.load("/FileStore/tables/SO_QQQ.txt")


and setting the fields explicitly:



val dfPG2 = dfPG.map { r =>
  (r.getString(0).toLong, r.getString(1), r.getString(2),
   r.getString(3).toBoolean, r.getString(4).toInt, r.getString(5).toInt,
   r.getString(6)  // r.getString(6).toInt fails on null
  )
}


I cannot seem to process a null value and cast it to Integer at the same time. Where there is a null value I get a String, but I want an Int, and every approach I try raises an error.



See the commented-out code. The line below fails with a null exception, and for some reason I cannot formulate a null check here. Is there an easier way?



r.getString(6).toInt


I must be over-complicating and/or missing something.



Just to add: when loading via a Seq with Option fields and calling toDF, it all works fine. The problem only occurs with file input.
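For the record, one workaround along the manual-mapping route (a sketch only, and `dfPG3` is a hypothetical name, not code from the question) is to wrap the possibly-null column in `Option` so the cast is skipped for null cells:

```scala
// Sketch: Option(null) is None, so .map(_.toInt) only runs on non-null values.
// The tuple element becomes Option[Int]: Some(n) for real values, None for empty cells.
val dfPG3 = dfPG.map { r =>
  (r.getString(0).toLong,
   r.getString(1),
   Option(r.getString(6)).map(_.toInt))
}
```

This avoids the NullPointerException, at the cost of an `Option[Int]` column rather than a plain nullable `Int`.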










      apache-spark






      asked Mar 6 at 18:40









thebluephantom

          1 Answer

That's just not the correct way of doing things. Instead of mapping fields by hand (both inefficient and extremely error-prone), you should define a schema for your data:



          import org.apache.spark.sql.types._

          val schema = StructType(Seq(
          StructField(...),
          StructField(...),
          StructField(...),
          StructField(...),
          StructField(...),
          StructField(...),
          StructField("your_integer_field", IntegerType, true),
          ...
          ))


          and provide it to the reader:



          val dfPG = spark.read.format("csv")
          .schema(schema)
          ...
          .load("/FileStore/tables/SO_QQQ.txt")
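Filling in the placeholders for the twelve columns of the sample file (these field names are assumptions for illustration; the question never names the columns):

```scala
import org.apache.spark.sql.types._

// Hypothetical field names. nullable = true lets empty cells ("" with
// nullValue set) become null instead of failing the integer parse.
// The date columns are kept as StringType because one sample value
// (2019-12-32) is not a valid date.
val schema = StructType(Seq(
  StructField("id", LongType, true),
  StructField("item", StringType, true),
  StructField("category", StringType, true),
  StructField("active", BooleanType, true),
  StructField("qty", IntegerType, true),
  StructField("price", IntegerType, true),
  StructField("stock", IntegerType, true),     // the column that may be null
  StructField("reserved", IntegerType, true),
  StructField("flag", BooleanType, true),
  StructField("start_date", StringType, true),
  StructField("end_date", StringType, true),
  StructField("valid", BooleanType, true)
))
```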





          • I am going to try this; I have seen this approach in the past and used it on RDDs. Will get back. I mean my approach was the problem; it's just the null thing. Excited.

            – thebluephantom
            Mar 6 at 19:22












          • Great, but I am wondering if I could have worked around the issue the other way? I think so, but was not sure; maybe not. Anyway, that is great.

            – thebluephantom
            Mar 6 at 19:32











          • stackoverflow.com/questions/41705602/…. In Scala all nullable!

            – thebluephantom
            Mar 6 at 19:57











          • Seq(...).toDF does show false/true nullability on the schema, as I always knew BTW, so very interesting.

            – thebluephantom
            Mar 6 at 20:09
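A sketch of what that last comment refers to: when a DataFrame is built from a local Seq, primitive columns come out non-nullable while Option columns come out nullable (assuming a running SparkSession with its implicits imported):

```scala
import spark.implicits._

// Int column -> nullable = false; Option[Int] column -> nullable = true.
val df = Seq((1, Some(10)), (2, None: Option[Int])).toDF("a", "b")
df.printSchema()
// root
//  |-- a: integer (nullable = false)
//  |-- b: integer (nullable = true)
```

This is why the Seq/Option route "just works", while the CSV reader, which only ever sees strings, needs an explicit schema to produce typed nullable columns.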











          answered Mar 6 at 18:50









user11161602
