Spark structured streaming: read of HDFS files fails if data is read immediately after the write



I'd like to load a Hive table (target_table) as a DataFrame after writing a new batch out to HDFS (target_table_dir) using Spark Structured Streaming as follows:



# Chain wrapped in parentheses so the multi-line expression is valid Python;
# the lambda's first argument is renamed so it doesn't shadow the streaming df.
query = (df.writeStream
    .trigger(processingTime="5 seconds")
    .foreachBatch(lambda batch_df, epoch_id:
        batch_df.write
            .option("path", target_table_dir)
            .format("parquet")
            .mode("append")
            .saveAsTable(target_table))
    .start())


When we immediately read the same data back from the Hive table, we get a "partition not found" exception. If we read after a short delay, the data is correct.



It seems that Spark is still writing data to HDFS after the call has returned: the Hive Metastore has already been updated, but the partition's files are still being written out to HDFS, so a read that arrives in between sees partitions whose data is not yet there.
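One thing worth ruling out on the reading side (my assumption, not something established in the question): if the read-back runs in a long-lived SparkSession, that session may also be serving cached metadata or file listings for the table. A minimal sketch, assuming spark is the session doing the read:

# Drop cached metadata/file listings for the table, then re-read it.
spark.catalog.refreshTable(target_table)
df_back = spark.table(target_table)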



How can we know when the write of a batch to the Hive table (i.e., to HDFS) is complete?



Note:
we have found that if we call processAllAvailable() after writing out, the subsequent read works fine, but processAllAvailable() will block execution forever when we are dealing with continuous streams.
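A possible workaround that follows from the foreachBatch contract (a sketch under assumptions, not from the original post): the function passed to foreachBatch runs synchronously for each micro-batch, so by the time saveAsTable() returns inside it, that batch's files and metastore update are committed. Moving the read-back, or a completion signal to the next workflow step, to that point sidesteps both the race and the indefinite block of processAllAvailable(). The helper name write_then_signal and the read-back step are illustrative.

# Minimal sketch, assuming `df`, `target_table_dir`, and `target_table`
# from the question and an active SparkSession named `spark`.
def write_then_signal(batch_df, epoch_id):
    (batch_df.write
        .option("path", target_table_dir)
        .format("parquet")
        .mode("append")
        .saveAsTable(target_table))
    # saveAsTable() has returned: this micro-batch's append is committed,
    # so a read-back here (or a signal to the next step) is safe.
    row_count = spark.table(target_table).count()
    print(f"batch {epoch_id} committed; {target_table} now has {row_count} rows")

query = (df.writeStream
    .trigger(processingTime="5 seconds")
    .foreachBatch(write_then_signal)
    .start())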










Tags: apache-spark, hive, hdfs, spark-structured-streaming






asked Mar 8 at 16:08 by Vish; edited Mar 15 at 10:59

  • How do you "immediately try to read the same data back from the Hive table" and get the partition-not-found exception? How do you know when the table has been appended to (after a 5-second trigger has executed)?

    – Jacek Laskowski
    Mar 13 at 17:47

  • We are not able to identify when the table has been appended to. As part of the workflow, the next step tries to read the data back, and that is where we get the error. We need a way to identify that the table append has finished.

    – Vish
    Mar 14 at 17:49

  • @JacekLaskowski We have found that if we use processAllAvailable() after writing out, the subsequent read works fine, but processAllAvailable() will block execution forever when we are dealing with continuous streams.

    – Vish
    Mar 15 at 10:58
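Building on that last comment, a non-blocking alternative worth sketching (an assumption of mine, not something proposed in the thread): instead of blocking on processAllAvailable(), poll StreamingQuery.lastProgress and trigger the read-back only after a new micro-batch has reported completion, which happens once its foreachBatch function has finished. Assumes the `query` and `spark` session from the question.

import time

seen_batch = -1
while query.isActive:
    progress = query.lastProgress  # dict for the most recently completed batch, or None
    if progress and progress["batchId"] > seen_batch:
        seen_batch = progress["batchId"]
        if progress["numInputRows"] > 0:
            # The batch, including its foreachBatch write, has completed;
            # the corresponding table append is finished and safe to read.
            spark.table(target_table).count()
    time.sleep(1)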

















