Pandas group by year, date producing spurious values


I have a pandas DataFrame of daily stats from 1 Jan 2015 to 3 Mar 2019. After reading it into df and grouping by year and month, the result contains spurious values all the way to Dec 2019. Here is the code that produces the MultiIndex levels:



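import numpy as np
import pandas as pd

# Column dtypes for the CSV; the column at position 1 becomes the (parsed) date index.
col_types = {'count': np.int64, 'value': np.float64}
df = pd.read_csv("myfile.csv", sep='\t', index_col=1, dtype=col_types, parse_dates=True)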

df.dtypes # count int64, value float64
type(df.index) #pandas.core.indexes.datetimes.DatetimeIndex

group_by_list = [df.index.year, df.index.month]
grouped_df = df.groupby(group_by_list).sum()

index_rename_names_list = ['year', 'month']
index_rename_position_list = [0, 1]
grouped_df.index.rename(index_rename_names_list, level=index_rename_position_list, inplace=True)

grouped_df.index
MultiIndex(levels=[[2015, 2016, 2017, 2018, 2019], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
codes=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11]],
names=['year', 'month'])


It seems MultiIndex levels are being created even for dates outside the range of the data. Is there a way to avoid this in the groupby() call itself, rather than filtering afterwards? Thanks










  • What are df.index.min() and df.index.max()?

    – Chris A
    Mar 9 at 9:23











  • df.index.min() == (2015, 1), df.index.max() == (2019, 12)

    – shanlodh
    Mar 9 at 9:24











  • The data runs only until Mar-19, shouldn't df.index.max() == (2019, 3)?

    – shanlodh
    Mar 9 at 9:26











  • @perl: the problem is that it's grouping non-existent data

    – shanlodh
    Mar 9 at 9:26






  • Oh, but you said your df.index.max() == (2019, 12). So the problem is with the index in the original DataFrame.

    – perl
    Mar 9 at 9:34


















python pandas






asked Mar 9 at 9:16
shanlodh

1 Answer
The problem seems to be with the index of the original DataFrame df. For example, if we set df = pd.DataFrame({'a': 1}, index=pd.date_range('2015-01-01', '2019-03-03')), it works without any issue:



import pandas as pd

# Synthetic daily data covering exactly the stated date range.
df = pd.DataFrame({'a': 1}, index=pd.date_range('2015-01-01', '2019-03-03'))

group_by_list = [df.index.year, df.index.month]
grouped_df = df.groupby(group_by_list).sum()

index_rename_names_list = ['year', 'month']
index_rename_position_list = [0, 1]
grouped_df.index.rename(index_rename_names_list, level=index_rename_position_list, inplace=True)

grouped_df.index.max()


Output:



(2019, 3)
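
To narrow down where the extra months come from, one could look at how the date strings are parsed before any grouping happens. The sketch below shows only one possible explanation, not something confirmed in the question: it assumes the file stores dates day-first (dd/mm/yyyy), in which case default month-first parsing would read 12 January 2019 as 1 December 2019. The sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical day-first date strings (10, 11 and 12 January 2019);
# the real file contents are not shown in the question.
raw_dates = ['10/01/2019', '11/01/2019', '12/01/2019']

# Default parsing reads these ambiguous strings month-first.
month_first = pd.to_datetime(raw_dates)
# dayfirst=True reads the same strings as dd/mm/yyyy.
day_first = pd.to_datetime(raw_dates, dayfirst=True)

print(month_first.max())  # latest date under month-first parsing
print(day_first.max())    # latest date under day-first parsing
```

If the two readings differ for the real data, passing dayfirst=True to pd.read_csv may be worth trying. Checking df.index.min() and df.index.max() on the DataFrame before grouping, as asked in the comments, shows which date range was actually parsed.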


P.S. By the way, any reason for not using resample instead of groupby?
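
A minimal sketch of the resample alternative, using the same synthetic daily frame as the example in this answer rather than the asker's file. The 'MS' (month start) frequency is a choice made here for illustration.

```python
import pandas as pd

# Synthetic daily frame: one row per day, each with value 1.
df = pd.DataFrame({'a': 1}, index=pd.date_range('2015-01-01', '2019-03-03'))

# 'MS' labels each monthly bin by the first day of that month.
monthly = df.resample('MS').sum()

print(monthly.index.max())  # label of the last monthly bin
print(len(monthly))         # number of monthly bins
```

Unlike the groupby on year and month, this returns a DatetimeIndex rather than a (year, month) MultiIndex, and it also emits a row for any empty month that falls between the first and last dates.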






answered Mar 9 at 9:27 (edited Mar 9 at 9:37)
perl
