Pandas group by year, date producing spurious values

I have a pandas DataFrame of daily stats from 1 Jan 2015 to 3 Mar 2019. After reading it into df and grouping by year and month, I get spurious groups all the way up to Dec 2019. Here is the code, ending with the resulting MultiIndex:

import numpy as np
import pandas as pd

col_types = {'count': np.int64, 'value': np.float64}
df = pd.read_csv("myfile.csv", sep='\t', index_col=1, dtype=col_types, parse_dates=True)

df.dtypes # count int64, value float64
type(df.index) #pandas.core.indexes.datetimes.DatetimeIndex

group_by_list = [df.index.year, df.index.month]
grouped_df = df.groupby(group_by_list).sum()

index_rename_names_list = ['year', 'month']
index_rename_position_list = [0, 1]
grouped_df.index.rename(index_rename_names_list, level=index_rename_position_list, inplace=True)

grouped_df.index
MultiIndex(levels=[[2015, 2016, 2017, 2018, 2019], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
 codes=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,
 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], 
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 
 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 
 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
 11]],
 names=['year', 'month'])

It seems MultiIndex entries are being created even for dates outside the range of the data. Rather than filtering afterwards, is there a way to avoid this in the groupby() call itself? Thanks

asked Mar 9 at 9:16 by shanlodh

What are df.index.min() and df.index.max()?

– Chris A
Mar 9 at 9:23

df.index.min() == (2015, 1), df.index.max() == (2019, 12)

– shanlodh
Mar 9 at 9:24

The data runs only until Mar-19, shouldn't df.index.max() == (2019, 3)?

– shanlodh
Mar 9 at 9:26

@perl: the problem is that it's grouping non-existent data

– shanlodh
Mar 9 at 9:26

Oh, but you said your df.index.max() == (2019, 12). So the problem is with the index in the original DataFrame

– perl
Mar 9 at 9:34
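
A quick way to check the original index, as the comments suggest, is to look at its bounds and count the rows in each (year, month) group. This is only a sketch on a synthetic daily index built with pd.date_range (an assumption, since myfile.csv is not shown); on the real data, df would come from read_csv instead. Groups with unexpectedly few rows, or groups after the last expected month, would point to bad dates in the index.

```python
import pandas as pd

# Synthetic stand-in for the real data: one row per day over the stated range.
df = pd.DataFrame({"value": 1.0}, index=pd.date_range("2015-01-01", "2019-03-03"))

# Bounds of the original DatetimeIndex (not of the grouped result).
print(df.index.min(), df.index.max())

# Number of rows that fall into each (year, month) group.
rows_per_month = df.groupby([df.index.year, df.index.month]).size()
print(rows_per_month.tail())

# The last group; for this clean index it is year 2019, month 3.
print(rows_per_month.index.max())
```

On a clean daily index every group holds a full month of rows except the last one, which holds 3.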

1 Answer

The problem seems to be with the index of the original DataFrame df. For example, if we build df on a clean daily index, the same code works without any issue:

df = pd.DataFrame({'a': 1}, index=pd.date_range('2015-01-01', '2019-03-03'))

group_by_list = [df.index.year, df.index.month]
grouped_df = df.groupby(group_by_list).sum()

index_rename_names_list = ['year', 'month']
index_rename_position_list = [0, 1]
grouped_df.index.rename(index_rename_names_list, level=index_rename_position_list, inplace=True)

grouped_df.index.max()

Output:

(2019, 3)
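
The thread does not say what is actually wrong with the index. One possible cause, which is an assumption and not confirmed by the asker, is that the file stores dates day-first (DD/MM/YYYY) and the ambiguous ones are parsed month-first. The sketch below uses hypothetical date strings for 1 to 12 Jan 2019 to show how that alone would produce months up to December:

```python
import pandas as pd

# Hypothetical day-first strings (DD/MM/YYYY) for 1-12 Jan 2019.
raw = [f"{day:02d}/01/2019" for day in range(1, 13)]

# Default parsing reads these ambiguous strings as MM/DD/YYYY.
month_first = pd.to_datetime(raw)

# dayfirst=True reads them as DD/MM/YYYY.
day_first = pd.to_datetime(raw, dayfirst=True)

print(sorted(set(month_first.month)))  # every month from 1 to 12 appears
print(sorted(set(day_first.month)))    # only January appears
```

If this were the cause, passing dayfirst=True to read_csv, or converting the column with pd.to_datetime and an explicit format, would fix it at load time rather than in groupby().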

P.S. By the way, any reason for not using resample instead of groupby?
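
For reference, a minimal sketch of the resample alternative on the same synthetic index. The "MS" (month start) frequency alias is my choice for the example; the result is indexed by one timestamp per month instead of a (year, month) MultiIndex.

```python
import pandas as pd

# Same synthetic daily index as above.
df = pd.DataFrame({"a": 1}, index=pd.date_range("2015-01-01", "2019-03-03"))

# One row per calendar month, labelled by the first day of the month.
monthly = df.resample("MS").sum()

print(monthly.index.max())  # last month present in the data
print(monthly.tail(3))
```

Note one difference in behaviour: resample also emits a row for any empty month inside the data's range, whereas groupby only returns groups that have rows. Neither creates months beyond the last date in the index.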

answered Mar 9 at 9:27 by perl (edited Mar 9 at 9:37)
