Pandas: Keep Column, Count, Drop Duplicates


I'm trying to drop duplicates based on two columns, but count the duplicates before they are dropped. I've managed to do this via:



df_interactions = (df_interactions.groupby(['user_id', 'item_tag_ids'])
                   .size()
                   .reset_index()
                   .rename(columns={0: 'interactions'}))



but this leaves me with



   user_id  item_tag_ids  interactions
0      170            71             1
1      170           325             1
2      170           387             1
3      170           474             1
4      170           526             2


It does what I want with respect to counting, adding the count as a column, and dropping the duplicates, but how would I do this while retaining the original structure (plus the new column)? Adding more columns to the groupby changes its behaviour.



Here is the original structure; I only want to group by the IDs:



   user_id  item_tag_ids  item_timestamp
0   406225          7271      1483229353
1   406225          1183      1483229350
2   406225          5930      1483229350
3   406225          7162      1483229350
4   406225          7271      1483229350


I would like the item_timestamp field in the smaller dataframe to contain the first occurring timestamp for that combination.
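For concreteness, the counting step can be reproduced on the five sample rows above (a runnable sketch built from the sample values shown; note that item_timestamp is lost by this step, which is exactly the problem):

```python
import pandas as pd

# The five rows of the original structure shown above
df_interactions = pd.DataFrame({
    'user_id': [406225] * 5,
    'item_tag_ids': [7271, 1183, 5930, 7162, 7271],
    'item_timestamp': [1483229353, 1483229350, 1483229350,
                       1483229350, 1483229350],
})

# Count rows per (user_id, item_tag_ids) pair; item_tag_ids 7271
# appears twice, so its interactions count is 2
counts = (df_interactions.groupby(['user_id', 'item_tag_ids'])
          .size()
          .reset_index()
          .rename(columns={0: 'interactions'}))
print(counts)
```

The result has four unique pairs and only the grouping columns plus the count, so the timestamp column has to be brought back separately.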










  • What was the original structure? – micric, Mar 6 at 16:27
  • @micric I'm trying to retain a column, item_timestamp, after duplicate removal. So basically: group by these IDs, count the interactions (duplicates before removal), and add item_timestamp after duplicates are removed. – kuomi, Mar 6 at 16:35
  • @kuomi Understand that we cannot help you if you don't include an example of the original data before the groupby. – Erfan, Mar 6 at 16:36
  • From your original structure, what is the expected output? – Scott Boston, Mar 6 at 16:53















python pandas








asked Mar 6 at 16:20 by kuomi, edited Mar 6 at 17:05



















1 Answer
































You want to use transform, as in the following, to keep your original data's shape.

And to get a list of all the item_timestamp values per group, you can use groupby in combination with agg(list):

# First, create the count column with transform
df['count'] = df.groupby(['user_id', 'item_tag_ids']).user_id.transform('size')

# Then merge the per-group timestamp lists back onto the original dataframe
df = df.merge(df.groupby(['user_id', 'item_tag_ids']).item_timestamp.agg(list).reset_index(),
              on=['user_id', 'item_tag_ids'],
              how='left',
              suffixes=['_1', '']).drop('item_timestamp_1', axis=1)

print(df)
   user_id  item_tag_ids  count            item_timestamp
0   406225          7271      2  [1483229353, 1483229350]
1   406225          1183      1              [1483229350]
2   406225          5930      1              [1483229350]
3   406225          7162      1              [1483229350]
4   406225          7271      2  [1483229353, 1483229350]

Explanation of .agg(list): it aggregates the values of each group into a list, like the following:

df.groupby(['user_id', 'item_tag_ids']).item_timestamp.agg(list).reset_index()
Out[39]:
   user_id  item_tag_ids            item_timestamp
0   406225          1183              [1483229350]
1   406225          5930              [1483229350]
2   406225          7162              [1483229350]
3   406225          7271  [1483229353, 1483229350]





























  • Apologies, I've attached the original structure to my question. – kuomi, Mar 6 at 16:39
  • I can transform the size, but this drops the rest of the columns. I want the item_timestamp duplicates dropped as well, but if I group by all three columns I get a different-sized structure, since some timestamps repeat. – kuomi, Mar 6 at 16:44
  • Edited the answer, is this what you want? @kuomi – Erfan, Mar 6 at 16:47
  • This seems to retain the original structure but add a count. What I'm looking for is to group by the first two columns and then get the timestamps for what remains. The grouping trims my dataframe from 236268 to 31548 rows, so I'm looking for the associated timestamps for each index in the new dataframe. – kuomi, Mar 6 at 16:53
  • Sorry if I wasn't clear: I want a grouping of unique user_id, item_tag_ids combinations with a counter of how many times duplicates appeared, and then the first occurring timestamp for each unique combination from the original DF. – kuomi, Mar 6 at 16:58



















answered Mar 6 at 16:37 by Erfan, edited Mar 6 at 17:07











