How are Dataflow bundles created after GroupBy/Combine?
Setup:
read from pubsub -> window of 30s -> group by user -> combine -> write to cloud datastore
Problem:
I'm seeing DataStoreIO writer errors as objects with similar keys are present in the same transaction.
Question:
I want to understand how my pipeline groups results into bundles after a group-by/combine operation. I would expect a bundle to be created per window after the combine. But apparently a bundle can contain two or more occurrences of the same user?
Can re-execution (retries) of bundles cause this behavior?
Is this bundling dependent on the runner?
Is deduplication an option? If so, how would I best approach that?
Note that I'm not looking for a replacement for the Datastore writer at the end of the pipeline; I already know we can use a different strategy. I'm merely trying to understand how the bundling happens.
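To make the setup concrete, here is a minimal pure-Python simulation (no Beam; all user names, timestamps, and values are invented) of the window → group-by-user → combine stages. It shows why the grouped/combined output naturally contains one element per user+window pair, not one per user:

```python
# Pure-Python sketch of 30s fixed windowing + group-by-user + combine (sum).
# This is an illustration of the semantics, not the Beam API.
from collections import defaultdict

WINDOW_SIZE = 30  # seconds, matching the 30s window in the pipeline

def fixed_window(ts):
    """Return the [start, end) fixed window an event timestamp falls into."""
    start = (ts // WINDOW_SIZE) * WINDOW_SIZE
    return (start, start + WINDOW_SIZE)

def group_and_combine(events):
    """events: iterable of (user, timestamp, value).
    Returns {(user, window): combined_value} -- one entry per user+window."""
    acc = defaultdict(int)
    for user, ts, value in events:
        acc[(user, fixed_window(ts))] += value  # combine = sum here
    return dict(acc)

events = [
    ("alice", 5, 1),   # window (0, 30)
    ("alice", 12, 1),  # same window -> combined with the event above
    ("alice", 41, 1),  # window (30, 60) -> a separate output element
    ("bob", 7, 1),
]
result = group_and_combine(events)
# "alice" appears once per window she has events in.
```

If several of these per-window elements for the same user end up in one bundle, a keyed transactional write can see "duplicate" keys even though each element belongs to a different window.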
google-cloud-dataflow apache-beam
ah this is a very good question. TBH I don't know, but I'll do my best to get someone who does here.
– Pablo
Mar 7 at 0:09
Much appreciated @pablo ! :)
– Jonny5
Mar 7 at 7:16
sorry for the delay. Will try to get something tomorrow!
– Pablo
Mar 8 at 1:07
okay I went asking around. I hope the answer is helpful.
– Pablo
Mar 9 at 0:38
asked Mar 6 at 15:24 by Jonny5
1 Answer
There are two answers to your question. One is specific to your use case; the other is about bundling/windowing in streaming in general.
Specific to your pipeline
I am assuming that the 'key' for Datastore is the user ID? In that case, if you have events from the same user in more than one window, your GroupByKey or Combine operations will emit a separate element for every user+window pair.
So the question is: what are you trying to insert into Datastore?
- An individual user's aggregate over all time? In that case, you'd need to use a Global Window.
- A user's aggregate for every 30-second window? Then you need to make the window part of the key you use to insert into Datastore. Does that help / make sense?
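The second option can be sketched as follows. The "user#window_end" key format is a hypothetical choice for illustration, not a DatastoreIO convention; it only shows how folding the window into the key keeps per-window aggregates for the same user from colliding. (In Beam's Python SDK, a DoFn can receive the element's window via `beam.DoFn.WindowParam`.)

```python
# Hypothetical sketch: build a Datastore key name from user ID + window end,
# so per-window aggregates for the same user map to distinct entities.
def entity_key_name(user_id, window_end):
    """Key name that is unique per (user, window) pair."""
    return f"{user_id}#{window_end}"

entity_key_name("alice", 30)   # window [0, 30)
entity_key_name("alice", 60)   # window [30, 60) -> a different entity
```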
Happy to help you design your pipeline to do what you want. Chat with me in the comments or via SO chat.
The larger question about bundling of data
Bundling strategies will vary by runner. In Dataflow, you should consider the following two factors:
- Every worker is assigned a key range. Elements for the same key will be processed by the same worker.
- Each element belongs to a single window, but a bundle may contain elements from multiple windows. For example, if the data freshness metric makes a big jump*, several windows may be triggered, and elements with the same key in different windows would be processed in the same bundle.
*When can data freshness jump suddenly? A stream containing a single element with a very old timestamp that is very slow to process may hold the watermark back for a long time. Once that element is processed, the watermark may jump forward to the next-oldest element (check out this lecture on watermarks).
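Since a bundle can therefore carry elements for the same key from several windows, one defensive option before a transactional write is to deduplicate the mutations by key within the batch. This is a sketch under the assumption that the last mutation per key should win; the `(key, payload)` shape is illustrative, not the DatastoreIO mutation type:

```python
# Sketch: deduplicate a batch of (key, payload) mutations so each key
# appears at most once, keeping the last payload seen for that key.
# Relies on dict insertion order (Python 3.7+).
def dedupe_mutations(mutations):
    latest = {}
    for key, payload in mutations:
        latest[key] = payload  # later mutations overwrite earlier ones
    return list(latest.items())

batch = [("alice", 1), ("bob", 2), ("alice", 3)]
deduped = dedupe_mutations(batch)  # "alice" kept once, with payload 3
```

If the per-window aggregates must all survive (rather than last-wins), the window-in-the-key approach above is the right fix instead.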
answered Mar 9 at 0:38 by Pablo