



pyspark join two Dataframe and keep row by the recent date




















I have two DataFrames, A and B.



A



+---+------+-----+----------+
| id|player|score| date|
+---+------+-----+----------+
| 1| alpha| 5|2018-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+


B



+---+------+-----+----------+
| id|player|score| date|
+---+------+-----+----------+
| 1| alpha| 100|2019-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+


and I must create a new DataFrame where, for each player, the row with the most recent date (and its score) is kept.



result



+---+------+-----+----------+
|id |player|score|date |
+---+------+-----+----------+
| 1| alpha| 100|2019-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+









apache-spark pyspark apache-spark-sql






asked 2 days ago by Chemssii (new contributor), edited 2 days ago by pault












  • what have you tried so far?

    – Alexander Dmitriev
    2 days ago











  • I use A.join(B,'id','player',"outer") but it's not the right way (see the sketch just below)

    – Chemssii
    2 days ago
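The join call quoted in the comment above runs into PySpark's signature: DataFrame.join(other, on, how) takes the join keys as a single on argument (a column name or a list of names) and the join type as how, so the keys have to be passed as one list rather than as separate positional arguments. A minimal sketch of the intended call, assuming the DataFrames are named A and B as in the question:

# Sketch only: pass both keys as a list; "outer" keeps rows found in either DataFrame.
joined = A.join(B, on=["id", "player"], how="outer")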





























2 Answers






















You can join the two DataFrames and use pyspark.sql.functions.when() to pick the score and date values from whichever row has the more recent date.



from pyspark.sql.functions import col, when

# Join on the composite key (id, player); for each pair, keep the score and
# date from whichever side has the more recent date.
(
    df_A.alias("a").join(df_B.alias("b"), on=["id", "player"], how="inner")
    .select(
        "id",
        "player",
        when(col("b.date") > col("a.date"), col("b.score"))
            .otherwise(col("a.score")).alias("score"),
        when(col("b.date") > col("a.date"), col("b.date"))
            .otherwise(col("a.date")).alias("date"),
    )
    .show()
)
#+---+------+-----+----------+
#| id|player|score| date|
#+---+------+-----+----------+
#| 1| alpha| 100|2019-02-13|
#| 2| beta| 6|2018-02-13|
#+---+------+-----+----------+


Read more on when: Spark Equivalent of IF Then ELSE
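If some players appear in only one of the two DataFrames, one possible variation is a full outer join combined with coalesce so those rows survive as well. This is only a sketch, assuming the same df_A/df_B as above:

from pyspark.sql.functions import coalesce, col, when

# Sketch: a full outer join keeps players present in only one DataFrame;
# coalesce falls back to whichever side actually has a value.
result = (
    df_A.alias("a").join(df_B.alias("b"), on=["id", "player"], how="full_outer")
    .select(
        "id",
        "player",
        when(col("b.date") > col("a.date"), col("b.score"))
            .otherwise(coalesce(col("a.score"), col("b.score"))).alias("score"),
        when(col("b.date") > col("a.date"), col("b.date"))
            .otherwise(coalesce(col("a.date"), col("b.date"))).alias("date"),
    )
)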






answered 2 days ago by pault























  • This will be more efficient than union, distinct, and grouping.

    – pault
    2 days ago
































I am assuming that every player is allocated an id and that it doesn't change. The OP wants the resulting DataFrame to contain, for each player, the score from the most recent date.



# Creating both the DataFrames.
from pyspark.sql.functions import col, to_date

df_A = sqlContext.createDataFrame([(1,'alpha',5,'2018-02-13'),(2,'beta',6,'2018-02-13')],('id','player','score','date'))
df_A = df_A.withColumn('date', to_date(col('date'), 'yyyy-MM-dd'))

df_B = sqlContext.createDataFrame([(1,'alpha',100,'2019-02-13'),(2,'beta',6,'2018-02-13')],('id','player','score','date'))
df_B = df_B.withColumn('date', to_date(col('date'), 'yyyy-MM-dd'))


The idea is to union() these two DataFrames and then take the distinct rows. The reason for taking distinct rows afterwards is this: if a player received no update, its row in DataFrame B is identical to its row in DataFrame A, so we remove such duplicates.



# Importing the requisite packages.
from pyspark.sql.functions import col, max
from pyspark.sql import Window
df = df_A.union(df_B).distinct()
df.show()
+---+------+-----+----------+
| id|player|score| date|
+---+------+-----+----------+
| 1| alpha| 5|2018-02-13|
| 1| alpha| 100|2019-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+


Now, as a final step, use a Window function over the unioned DataFrame df to find the latestDate per (id, player) and keep only the rows whose date equals that latestDate. That way, the stale rows for players that received an update (manifested by a newer date in DataFrame B) are dropped.



w = Window.partitionBy('id', 'player')
df = (df.withColumn('latestDate', max('date').over(w))
        .where(col('date') == col('latestDate'))
        .drop('latestDate'))
df.show()
+---+------+-----+----------+
| id|player|score| date|
+---+------+-----+----------+
| 1| alpha| 100|2019-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+
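A closely related alternative, sketched here with the same df_A/df_B, replaces the max-over-window filter with row_number(), so exactly one row per (id, player) is kept even if two rows were to share the same latest date:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Sketch: rank rows per player by date descending and keep only the newest one.
w = Window.partitionBy('id', 'player').orderBy(col('date').desc())
latest = (df_A.union(df_B)
          .withColumn('rn', row_number().over(w))
          .where(col('rn') == 1)
          .drop('rn'))
latest.show()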





answered 2 days ago by cph_sto





















