Processing a null value with spark.read.csv & getting String type always as a consequence




I have a file like this:



1,ITEM_001,CAT_01,true,2,50,4,0,false,2019-01-01,2019-01-28,true
1,ITEM_001,CAT_01,true,2,60,4,0,false,2019-01-29,2019-12-32,true
1,ITEM_002,CAT_02,true,2,50,"","",false,2019-01-01,2019-11-22,true


I do not want to infer the schema in case the file is big. I tried mapping to a case class, but for some reason that did not work.



So, I am doing the following:



val dfPG = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "false")
  .option("nullValue", "")
  .load("/FileStore/tables/SO_QQQ.txt")


and setting the fields explicitly:



val dfPG2 = dfPG.map { r =>
  (r.getString(0).toLong, r.getString(1), r.getString(2), r.getString(3).toBoolean,
   r.getString(4).toInt, r.getString(5).toInt,
   r.getString(6) // r.getString(6).toInt fails on null values
  )
}


I cannot seem to process a null value and also get an Integer type. Where there is a null value I get a String, but I want an Int, and I get an error with every approach I have tried.



See the commented-out code above: the conversion below fails with a null exception, and for some reason I cannot formulate check logic there. Is there an easier way?



r.getString(6).toInt


I must be over-complicating and/or missing something.
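For what it's worth, the null failure in the manual conversion can be sidestepped by wrapping the nullable column in `Option` before converting. A minimal sketch in plain Scala (no Spark required); `toIntOpt` is a hypothetical helper, not part of any API:

```scala
// Null-safe string-to-int conversion: null or empty strings become None.
def toIntOpt(s: String): Option[Int] =
  Option(s).filter(_.nonEmpty).map(_.toInt)

// Inside the map, column 6 could then be read as toIntOpt(r.getString(6)):
// "4" yields Some(4), while null or "" yields None.
```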



Just to add: when loading via a Seq to a DataFrame with Option fields, it all works fine. The problem is the file input.
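That working Seq case can be sketched like this (assuming a SparkSession `spark` is in scope; the column names are illustrative):

```scala
import spark.implicits._

// Option[Int] encodes as a nullable IntegerType column, so None rows load as null.
val df = Seq((1L, Some(4)), (2L, None: Option[Int])).toDF("id", "stock")
// df.printSchema shows: id: long, stock: integer (nullable = true)
```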










apache-spark · asked Mar 6 at 18:40 by thebluephantom
1 Answer














That's just not the correct way of doing things. Instead of mapping fields by hand (both inefficient and extremely error-prone), you should define a schema for your data:



import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField(...),
  StructField(...),
  StructField(...),
  StructField(...),
  StructField(...),
  StructField(...),
  StructField("your_integer_field", IntegerType, true),
  ...
))


          and provide it to the reader:



val dfPG = spark.read.format("csv")
  .schema(schema)
  ...
  .load("/FileStore/tables/SO_QQQ.txt")
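To make this concrete for the twelve-column file in the question, a full schema might look like the sketch below. The column names are invented for illustration (the real header names should be used), and the date columns are typed DateType on the assumption that values like 2019-01-01 should parse as dates:

```scala
import org.apache.spark.sql.types._

// Hypothetical names; replace with the actual header columns.
val schema = StructType(Seq(
  StructField("id",         LongType,    true),
  StructField("item",       StringType,  true),
  StructField("category",   StringType,  true),
  StructField("flag_a",     BooleanType, true),
  StructField("qty",        IntegerType, true),
  StructField("price",      IntegerType, true),
  StructField("stock",      IntegerType, true),  // "" becomes null via nullValue
  StructField("reserved",   IntegerType, true),
  StructField("flag_b",     BooleanType, true),
  StructField("start_date", DateType,    true),
  StructField("end_date",   DateType,    true),
  StructField("flag_c",     BooleanType, true)
))

val dfPG = spark.read.format("csv")
  .option("header", "true")
  .option("nullValue", "")
  .schema(schema)
  .load("/FileStore/tables/SO_QQQ.txt")
```

With the schema supplied up front, the empty quoted fields load as null Integers rather than Strings, and no per-row casting is needed.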





• I am going to try this. I have seen this approach in the past and used it on RDDs. Will get back. I mean my approach; it's just the null thing. Excited.
  – thebluephantom, Mar 6 at 19:22

• Great, but I am wondering if I could have worked around the issue the other way? I think so, but was not sure; maybe not. Anyway, that is great.
  – thebluephantom, Mar 6 at 19:32

• stackoverflow.com/questions/41705602/…. In Scala all nullable!
  – thebluephantom, Mar 6 at 19:57

• Seq(...).toDF does show false, true on schema, as I always knew BTW, so very interesting.
  – thebluephantom, Mar 6 at 20:09











answered Mar 6 at 18:50 by user11161602











