Looking for an elegant solution that avoids merging two dataframes
I have a Dask dataframe df that looks as follows:
Main_Author PaperID
A X
B Y
C Z
I also have another Dask dataframe pa that looks as follows:
PaperID Co_Author
X D
X E
X F
Y A
Z B
Z D
I want a resulting dataframe that looks as follows:
Main_Author Co_Authors Num_Co_Authors
A (D,E,F) 3
B (A) 1
C (B,D) 2
This is what I did:
df = df.merge(pa, on="PaperID")
df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))
This works on small dataframes. However, since I am working with really large ones, the process keeps getting killed. I believe the merge is the culprit. Is there a more elegant way of getting the desired result?
python python-3.x dask
asked Mar 8 at 17:40 by BKS
1 Answer
If you are looking to work with two large DataFrames, then you could try to wrap this merge in dask.delayed. There is a terrific example of dask.delayed in the Dask docs and here on SO; see also the Dask use cases page.
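For orientation, here is a minimal sketch of the dask.delayed idea (illustrative only, not part of the original answer): a decorated function builds a lazy task instead of running immediately, and nothing executes until compute() is called.
import dask

@dask.delayed
def add(x, y):
    return x + y

total = add(1, 2)        # lazy: returns a Delayed object, no work done yet
print(total.compute())   # 3 - the addition only runs here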
Imports
from faker import Faker
import pandas as pd
import dask
from dask.diagnostics import ProgressBar
import random
fake = Faker()
Generate dummy data in order to get a large number of rows in each DataFrame
Specify the number of rows of dummy data to generate in each DataFrame:
number_of_rows_in_df = 3000
number_of_rows_in_pa = 8000
Generate a large dataset using the faker library (per this SO post):
def create_rows(auth_colname, num=1):
    # Build one record per row: a fake author name plus a random PaperID
    output = [{auth_colname: fake.name(),
               "PaperID": random.randint(1000, 2000)} for x in range(num)]
    return output
df = pd.DataFrame(create_rows("Main_Author", number_of_rows_in_df))
pa = pd.DataFrame(create_rows("Co_Author", number_of_rows_in_pa))
Print first 5 rows of dataframes
print(df.head())
Main_Author PaperID
0 Kyle Morton MD 1522
1 April Edwards 1992
2 Rachel Sullivan 1874
3 Kevin Johnson 1909
4 Julie Morton 1635
print(pa.head())
Co_Author PaperID
0 Deborah Cuevas 1911
1 Melissa Fox 1095
2 Sean Mcguire 1620
3 Cory Clarke 1424
4 David White 1569
Wrap the merge operation in a helper function
def merge_operations(df1, df2):
df = df1.merge(df2, on="PaperID")
df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))
return df
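As an aside on the question title ("avoid merging"): a hedged alternative sketch, my assumption and not part of this answer, is to group pa once and map the grouped tuples onto df, which sidesteps building the merged frame. The helper name map_operations is hypothetical, and this assumes every PaperID in df also appears in pa.
def map_operations(df1, df2):
    # One pass over df2: PaperID -> tuple of all its co-authors
    co_by_paper = df2.groupby('PaperID')['Co_Author'].apply(tuple)
    out = df1.copy()
    # Look up each paper's co-author tuple; PaperIDs absent from df2 become NaN
    out['Co_Author'] = out['PaperID'].map(co_by_paper)
    out['Num_Co_Authors'] = out['Co_Author'].apply(len)
    return out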
Dask approach - generate the final DataFrame using dask.delayed
ddf = dask.delayed(merge_operations)(df, pa)
with ProgressBar():
df_dask = dask.compute(ddf)
Output of Dask approach
[ ] | 0% Completed | 0.0s
[ ] | 0% Completed | 0.1s
[ ] | 0% Completed | 0.2s
[ ] | 0% Completed | 0.3s
[ ] | 0% Completed | 0.4s
[ ] | 0% Completed | 0.5s
[########################################] | 100% Completed | 0.6s
print(df_dask[0].head())
Main_Author Co_Author Num_Co_Authors
0 Aaron Anderson (Elizabeth Peterson, Harry Gregory, Catherine ... 15
1 Aaron Barron (Martha Neal, James Walton, Amanda Wright, Sus... 11
2 Aaron Bond (Theresa Lawson, John Riley, Daniel Moore, Mrs... 6
3 Aaron Campbell (Jim Martin, Nicholas Stanley, Douglas Berry, ... 11
4 Aaron Castillo (Kevin Young, Patricia Gallegos, Tricia May, M... 6
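Note that wrapping the whole pipeline in a single delayed call creates just one task, so on its own it will not split the work across cores. A hedged sketch of one way to get actual parallelism, again an assumption rather than part of the original answer, is to split df row-wise and run the helper per chunk (this assumes each Main_Author's rows all land in the same chunk, as they do when each Main_Author appears once):
import numpy as np

chunks = np.array_split(df, 4)  # four roughly equal row-wise pieces
parts = [dask.delayed(merge_operations)(chunk, pa) for chunk in chunks]
with ProgressBar():
    results = dask.compute(*parts)  # the four merges can run in parallel
df_parallel = pd.concat(results, ignore_index=True)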
Pandas approach - generate the final DataFrame using Pandas
df_pandas = merge_operations(df, pa)
print(df_pandas.head())
Main_Author Co_Author Num_Co_Authors
0 Aaron Anderson (Elizabeth Peterson, Harry Gregory, Catherine ... 15
1 Aaron Barron (Martha Neal, James Walton, Amanda Wright, Sus... 11
2 Aaron Bond (Theresa Lawson, John Riley, Daniel Moore, Mrs... 6
3 Aaron Campbell (Jim Martin, Nicholas Stanley, Douglas Berry, ... 11
4 Aaron Castillo (Kevin Young, Patricia Gallegos, Tricia May, M... 6
Compare the DataFrames obtained using the Pandas and Dask approaches
from pandas.util.testing import assert_frame_equal  # newer pandas: from pandas.testing import assert_frame_equal

try:
    assert_frame_equal(df_dask[0], df_pandas, check_dtype=True)
except AssertionError as e:
    message = "\n" + str(e)
else:
    message = 'DataFrames created using Dask and Pandas are equivalent.'
Result of comparing the two approaches
print(message)
DataFrames created using Dask and Pandas are equivalent.
answered Mar 8 at 21:26 by edesz