Looking for an elegant solution that avoids merging two dataframes

I have a dask dataframe df that looks as follows:

Main_Author PaperID
A X
B Y
C Z

I also have another dask dataframe pa that looks as follows:

PaperID Co_Author
X D
X E
X F
Y A
Z B
Z D

I want a resulting dataframe that looks as follows:

Main_Author Co_Authors Num_Co_Authors
A (D,E,F) 3
B (A) 1
C (B,D) 2

This is what I did:

df = df.merge(pa, on="PaperID")
df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))

This works on small dataframes. However, I am working with really large ones, and the process keeps getting killed. I believe the merge is the cause. Is there a more elegant way of getting the desired result?
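To pin down the expected result, here is a minimal pure-Python sketch of the same transformation applied to the sample rows above. It only illustrates the desired output and is not a solution for large dask dataframes; the names `main_rows`, `pa_rows`, and `summarize` are hypothetical and do not appear in the original code.

```python
from collections import defaultdict

# Sample rows copied from the two tables above.
main_rows = [("A", "X"), ("B", "Y"), ("C", "Z")]   # (Main_Author, PaperID)
pa_rows = [("X", "D"), ("X", "E"), ("X", "F"),
           ("Y", "A"), ("Z", "B"), ("Z", "D")]     # (PaperID, Co_Author)


def summarize(main_rows, pa_rows):
    """Return (Main_Author, Co_Authors, Num_Co_Authors) rows (hypothetical helper)."""
    # Group co-authors by paper.
    by_paper = defaultdict(list)
    for paper_id, co_author in pa_rows:
        by_paper[paper_id].append(co_author)

    # Collect the co-authors of every paper belonging to each main author.
    by_author = defaultdict(list)
    for main_author, paper_id in main_rows:
        by_author[main_author].extend(by_paper.get(paper_id, []))

    # Like an inner merge, skip authors that have no co-authors at all.
    return [(author, tuple(co), len(co))
            for author, co in sorted(by_author.items()) if co]


result = summarize(main_rows, pa_rows)
for row in result:
    print(row)
```

The printed rows correspond to the three rows of the desired dataframe shown above.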

asked Mar 8 at 17:40

BKS

7081927

python python-3.x dask

1 Answer

If you need to work with two large DataFrames, you could try wrapping this merge in dask.delayed.

There's a terrific example of dask.delayed here in the Dask docs or here on SO.

See Dask use cases here.
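For readers unfamiliar with dask.delayed, a minimal sketch of the pattern follows, assuming dask is installed. The `add` function is a hypothetical stand-in and is not part of this answer. Wrapping a function with dask.delayed makes a call build a lazy task instead of running immediately, and dask.compute then runs the tasks and returns a tuple of results.

```python
import dask


def add(a, b):
    # Hypothetical stand-in for any function you want to defer.
    return a + b


lazy_total = dask.delayed(add)(1, 2)   # builds a task; nothing runs yet
results = dask.compute(lazy_total)     # runs the task; returns a tuple
total = results[0]                     # first (and only) result
print(total)
```

Because dask.compute returns a tuple, the result of a single delayed call has to be indexed with `[0]`. Note also that one delayed call produces one task, so the wrapped function still runs as a single unit; parallelism only appears when many delayed tasks are built.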

Imports

from faker import Faker
import pandas as pd
import dask
from dask.diagnostics import ProgressBar
import random
fake = Faker()

Generate dummy data in order to get a large number of rows in each DataFrame

Specify number of rows of dummy data to generate in each DataFrame

number_of_rows_in_df = 3000
number_of_rows_in_pa = 8000

Generate some big dataset using the faker library (per this SO post)

def create_rows(auth_colname, num=1):
    # Build `num` rows, each with a fake author name and a random PaperID
    output = [{auth_colname: fake.name(),
               "PaperID": random.randint(1000, 2000)} for _ in range(num)]
    return output
df = pd.DataFrame(create_rows("Main_Author", number_of_rows_in_df))
pa = pd.DataFrame(create_rows("Co_Author", number_of_rows_in_pa))

Print first 5 rows of dataframes

print(df.head())
 Main_Author PaperID
0 Kyle Morton MD 1522
1 April Edwards 1992
2 Rachel Sullivan 1874
3 Kevin Johnson 1909
4 Julie Morton 1635

print(pa.head())
 Co_Author PaperID
0 Deborah Cuevas 1911
1 Melissa Fox 1095
2 Sean Mcguire 1620
3 Cory Clarke 1424
4 David White 1569

Wrap the merge operation in a helper function

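def merge_operations(df1, df2):
    # Join main authors to co-authors on the shared PaperID column
    df = df1.merge(df2, on="PaperID")
    # Collect each main author's co-authors into a tuple
    df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
    # Count the co-authors in each tuple
    df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))
    return df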
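As a quick sanity check, the helper can be run on the small tables from the question. This is a sketch assuming pandas is installed; the names `df_small`, `pa_small`, and `small_result` are hypothetical, and the helper is repeated so the snippet runs on its own.

```python
import pandas as pd


def merge_operations(df1, df2):
    # Same steps as the helper above: merge, collect tuples, count.
    merged = df1.merge(df2, on="PaperID")
    out = merged.groupby("Main_Author")["Co_Author"].apply(tuple).reset_index()
    out["Num_Co_Authors"] = out["Co_Author"].apply(len)
    return out


# Sample data copied from the question.
df_small = pd.DataFrame({"Main_Author": ["A", "B", "C"],
                         "PaperID": ["X", "Y", "Z"]})
pa_small = pd.DataFrame({"PaperID": ["X", "X", "X", "Y", "Z", "Z"],
                         "Co_Author": ["D", "E", "F", "A", "B", "D"]})

small_result = merge_operations(df_small, pa_small)
print(small_result)
```

The result has one row per main author, with the co-authors in a tuple and their count alongside, which matches the layout asked for in the question (the tuple column is named Co_Author here, as in the helper).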

Dask approach - Generate final DataFrame using dask.delayed

ddf = dask.delayed(merge_operations)(df, pa)
with ProgressBar():
    # dask.compute returns a tuple, so the DataFrame is df_dask[0]
    df_dask = dask.compute(ddf)

Output of Dask approach

[ ] | 0% Completed | 0.0s
[ ] | 0% Completed | 0.1s
[ ] | 0% Completed | 0.2s
[ ] | 0% Completed | 0.3s
[ ] | 0% Completed | 0.4s
[ ] | 0% Completed | 0.5s
[########################################] | 100% Completed | 0.6s

print(df_dask[0].head())
 Main_Author Co_Author Num_Co_Authors
0 Aaron Anderson (Elizabeth Peterson, Harry Gregory, Catherine ... 15
1 Aaron Barron (Martha Neal, James Walton, Amanda Wright, Sus... 11
2 Aaron Bond (Theresa Lawson, John Riley, Daniel Moore, Mrs... 6
3 Aaron Campbell (Jim Martin, Nicholas Stanley, Douglas Berry, ... 11
4 Aaron Castillo (Kevin Young, Patricia Gallegos, Tricia May, M... 6

Pandas approach - Generate final DataFrame using Pandas

df_pandas = merge_operations(df, pa)

print(df_pandas.head())
 Main_Author Co_Author Num_Co_Authors
0 Aaron Anderson (Elizabeth Peterson, Harry Gregory, Catherine ... 15
1 Aaron Barron (Martha Neal, James Walton, Amanda Wright, Sus... 11
2 Aaron Bond (Theresa Lawson, John Riley, Daniel Moore, Mrs... 6
3 Aaron Campbell (Jim Martin, Nicholas Stanley, Douglas Berry, ... 11
4 Aaron Castillo (Kevin Young, Patricia Gallegos, Tricia May, M... 6

Compare DataFrames obtained using Pandas and Dask approaches

from pandas.testing import assert_frame_equal
try:
    assert_frame_equal(df_dask[0], df_pandas, check_dtype=True)
except AssertionError as e:
    message = "\n" + str(e)
else:
    message = 'DataFrames created using Dask and Pandas are equivalent.'

Result of comparing two approaches

print(message)
DataFrames created using Dask and Pandas are equivalent.

answered Mar 8 at 21:26

edesz

2,97472672
