Pandas: Keep Column, Count, Drop Duplicates
I'm currently trying to drop duplicates based on two columns, but count the duplicates before they are dropped. I've managed to do this via:
df_interactions = (df_interactions.groupby(['user_id', 'item_tag_ids'])
                                  .size()
                                  .reset_index()
                                  .rename(columns={0: 'interactions'}))
but this leaves me with
user_id item_tag_ids interactions
0 170 71 1
1 170 325 1
2 170 387 1
3 170 474 1
4 170 526 2
It does what I want with respect to counting, adding the count as a column, and dropping the duplicates, but how would I do this while retaining the original structure (plus the new column)? Adding more columns to the groupby changes its behaviour.
Here is the original structure; I only want to group by the IDs:
user_id item_tag_ids item_timestamp
0 406225 7271 1483229353
1 406225 1183 1483229350
2 406225 5930 1483229350
3 406225 7162 1483229350
4 406225 7271 1483229350
I would like the new item_timestamp field in the smaller dataframe to contain the first occurring timestamp for that combination.
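For reproducibility, the sample frame above can be constructed as follows (a sketch; the integer dtypes are assumptions based on the displayed values):

import pandas as pd

# Sketch of the sample data shown above.
df_interactions = pd.DataFrame({
    'user_id': [406225] * 5,
    'item_tag_ids': [7271, 1183, 5930, 7162, 7271],
    'item_timestamp': [1483229353, 1483229350, 1483229350,
                       1483229350, 1483229350],
})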
python pandas
asked Mar 6 at 16:20, edited Mar 6 at 17:05 – kuomi
What was the original structure? – micric, Mar 6 at 16:27
@micric I'm trying to retain a column, item_timestamp, after duplicate removal. So basically: group by these IDs, count the interactions (duplicates before removal), and add the item_timestamps after the duplicates are removed. – kuomi, Mar 6 at 16:35
@kuomi Understand that we cannot help you if you don't include an example of the original data before the groupby. – Erfan, Mar 6 at 16:36
From your original structure, what is the expected output? – Scott Boston, Mar 6 at 16:53
1 Answer
You want to use transform like the following to keep your original data's shape. And to get a list of all the item_timestamp values, you can use groupby in combination with agg(list):
# First we create the count column with transform
df['count'] = df.groupby(['user_id', 'item_tag_ids']).user_id.transform('size')
# After that we merge the groupby result (timestamps aggregated into lists) back onto our original dataframe
df = df.merge(df.groupby(['user_id', 'item_tag_ids']).item_timestamp.agg(list).reset_index(),
on=['user_id', 'item_tag_ids'],
how='left',
suffixes=['_1', '']).drop('item_timestamp_1', axis=1)
print(df)
user_id item_tag_ids count item_timestamp
0 406225 7271 2 [1483229353, 1483229350]
1 406225 1183 1 [1483229350]
2 406225 5930 1 [1483229350]
3 406225 7162 1 [1483229350]
4 406225 7271 2 [1483229353, 1483229350]
Explanation of .agg(list): it aggregates the values of each group into a list, like the following:
df.groupby(['user_id', 'item_tag_ids']).item_timestamp.agg(list).reset_index()
Out[39]:
user_id item_tag_ids item_timestamp
0 406225 1183 [1483229350]
1 406225 5930 [1483229350]
2 406225 7162 [1483229350]
3 406225 7271 [1483229353, 1483229350]
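If the goal is instead one row per unique pair with the count and the first occurring timestamp (see the comments below), a minimal sketch using named aggregation (a pandas 0.25+ feature, so an assumption about the environment) could look like this:

# A sketch, not part of the original answer: one row per unique
# (user_id, item_tag_ids) pair, with the duplicate count and the
# first timestamp that occurs for that pair.
out = (df.groupby(['user_id', 'item_tag_ids'], as_index=False)
         .agg(interactions=('item_timestamp', 'size'),
              item_timestamp=('item_timestamp', 'first')))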
answered Mar 6 at 16:37, edited Mar 6 at 17:07 – Erfan
Apologies, I've attached the original structure to my question. – kuomi, Mar 6 at 16:39
I can transform the size, but this drops the rest of the columns. I want the item_timestamp duplicates to also be dropped, but if I group by all three columns I get a different-sized structure, as some timestamps repeat. – kuomi, Mar 6 at 16:44
Edited answer, is this what you want? @kuomi – Erfan, Mar 6 at 16:47
This seems to retain the original structure, but add a count. What I'm looking for is to group by the first two columns and then get the timestamps for what remains. The grouping trims my dataframe from 236268 to 31548 rows, so what I'm looking for is the associated timestamps for each index in the new dataframe. – kuomi, Mar 6 at 16:53
Sorry if I wasn't clear: I want a grouping of unique user_id, item_tag_ids combinations, but a counter of how many times duplicates appeared. I then want the first occurring timestamp for each of the unique combinations from the original DF. – kuomi, Mar 6 at 16:58
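Following that last clarification, the answer's transform step can also be combined with drop_duplicates to get there without a merge (a sketch, not from the thread; keep='first' is what preserves the first occurring timestamp per pair):

# Attach the count, then keep only the first row of each pair.
df['interactions'] = df.groupby(['user_id', 'item_tag_ids'])['user_id'].transform('size')
result = df.drop_duplicates(subset=['user_id', 'item_tag_ids'], keep='first')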