How are Dataflow bundles created after GroupBy/Combine?
Setup:
read from pubsub -> window of 30s -> group by user -> combine -> write to cloud datastore
Problem:
I'm seeing DataStoreIO writer errors as objects with similar keys are present in the same transaction.
Question:
I want to understand how my pipeline groups results into bundles after a group-by/combine operation. I would expect a bundle to be created per window after the combine. But apparently a bundle can contain two or more occurrences of the same user?
Can re-execution (retries) of bundles cause this behavior?
Is this bundling dependent on the runner?
Is deduplication an option? If so, how would I best approach that?
Note that I'm not looking for a replacement for the Datastore writer at the end of the pipeline; I already know we can use a different strategy. I'm merely trying to understand how the bundling happens.
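To make the setup concrete, here is a minimal pure-Python simulation (no Beam; all user names, timestamps, and values are invented) of the window → group-by-user → combine stages. It shows why the grouped/combined output naturally contains one element per user+window pair, not one per user:

```python
# Pure-Python sketch of 30s fixed windowing + group-by-user + combine (sum).
# This is an illustration of the semantics, not the Beam API.
from collections import defaultdict

WINDOW_SIZE = 30  # seconds, matching the 30s window in the pipeline

def fixed_window(ts):
    """Return the [start, end) fixed window an event timestamp falls into."""
    start = (ts // WINDOW_SIZE) * WINDOW_SIZE
    return (start, start + WINDOW_SIZE)

def group_and_combine(events):
    """events: iterable of (user, timestamp, value).
    Returns {(user, window): combined_value} -- one entry per user+window."""
    acc = defaultdict(int)
    for user, ts, value in events:
        acc[(user, fixed_window(ts))] += value  # combine = sum here
    return dict(acc)

events = [
    ("alice", 5, 1),   # window (0, 30)
    ("alice", 12, 1),  # same window -> combined with the event above
    ("alice", 41, 1),  # window (30, 60) -> a separate output element
    ("bob", 7, 1),
]
result = group_and_combine(events)
# "alice" appears once per window she has events in.
```

If several of these per-window elements for the same user end up in one bundle, a keyed transactional write can see "duplicate" keys even though each element belongs to a different window.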
google-cloud-dataflow apache-beam
ah this is a very good question. TBH I don't know, but I'll do my best to get someone who does here.
– Pablo
Mar 7 at 0:09
Much appreciated @pablo ! :)
– Jonny5
Mar 7 at 7:16
sorry for the delay. Will try to get something tomorrow!
– Pablo
Mar 8 at 1:07
okay I went asking around. I hope the answer is helpful.
– Pablo
Mar 9 at 0:38
asked Mar 6 at 15:24 by Jonny5
1 Answer
There are two answers to your question. One is specific to your use case; the other is about bundling/windowing in streaming in general.
Specific to your pipeline
I am assuming that the 'key' for Datastore is the user ID? In that case, if you have events from the same user in more than one window, your GroupByKey or Combine operations will emit a separate element for every user+window pair.
So the question is: what are you trying to insert into Datastore?
- An individual user's aggregate over all time? In that case, you'd need to use a Global Window.
- A user's aggregate for every 30-second window? Then you need to make the window part of the key you use to insert into Datastore. Does that help / make sense?
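The second option can be sketched as follows. The "user#window_end" key format is a hypothetical choice for illustration, not a DatastoreIO convention; it only shows how folding the window into the key keeps per-window aggregates for the same user from colliding. (In Beam's Python SDK, a DoFn can receive the element's window via `beam.DoFn.WindowParam`.)

```python
# Hypothetical sketch: build a Datastore key name from user ID + window end,
# so per-window aggregates for the same user map to distinct entities.
def entity_key_name(user_id, window_end):
    """Key name that is unique per (user, window) pair."""
    return f"{user_id}#{window_end}"

entity_key_name("alice", 30)   # window [0, 30)
entity_key_name("alice", 60)   # window [30, 60) -> a different entity
```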
Happy to help you design your pipeline to do what you want. Chat with me in the comments or via SO chat.
The larger question about bundling of data
Bundling strategies will vary by runner. In Dataflow, you should consider the following two factors:
- Every worker is assigned a key range. Elements for the same key will be processed by the same worker.
- Each element belongs to a single window, but a bundle may contain elements from multiple windows. For example, if the data freshness metric makes a big jump*, several windows may be triggered, and elements with the same key in different windows would be processed in the same bundle.
*When can data freshness jump suddenly? A stream containing a single element with a very old timestamp that is very slow to process may hold the watermark back for a long time. Once that element is processed, the watermark may jump forward to the next-oldest element (check out this lecture on watermarks).
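Since a bundle can therefore carry elements for the same key from several windows, one defensive option before a transactional write is to deduplicate the mutations by key within the batch. This is a sketch under the assumption that the last mutation per key should win; the `(key, payload)` shape is illustrative, not the DatastoreIO mutation type:

```python
# Sketch: deduplicate a batch of (key, payload) mutations so each key
# appears at most once, keeping the last payload seen for that key.
# Relies on dict insertion order (Python 3.7+).
def dedupe_mutations(mutations):
    latest = {}
    for key, payload in mutations:
        latest[key] = payload  # later mutations overwrite earlier ones
    return list(latest.items())

batch = [("alice", 1), ("bob", 2), ("alice", 3)]
deduped = dedupe_mutations(batch)  # "alice" kept once, with payload 3
```

If the per-window aggregates must all survive (rather than last-wins), the window-in-the-key approach above is the right fix instead.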
answered Mar 9 at 0:38 by Pablo