Pandas group by year, date producing spurious values


I have a pandas DataFrame of daily stats from 1 Jan 2015 to 3 Mar 2019. After reading it into df and grouping by year and month, I get spurious groups running all the way to Dec 2019. Here is the code that produces the MultiIndex:



import numpy as np
import pandas as pd

col_types = {'count': np.int64, 'value': np.float64}
# tab-separated file; the second column holds the dates and becomes the index
df = pd.read_csv("myfile.csv", sep='\t', index_col=1, dtype=col_types, parse_dates=True)

df.dtypes  # count int64, value float64
type(df.index)  # pandas.core.indexes.datetimes.DatetimeIndex

# group by the year and month of each row's date
group_by_list = [df.index.year, df.index.month]
grouped_df = df.groupby(group_by_list).sum()

# name the two index levels
index_rename_names_list = ['year', 'month']
index_rename_position_list = [0, 1]
grouped_df.index.set_names(index_rename_names_list, level=index_rename_position_list, inplace=True)

grouped_df.index
MultiIndex(levels=[[2015, 2016, 2017, 2018, 2019], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
codes=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11]],
names=['year', 'month'])


It seems MultiIndex levels are being created even for dates outside the range of the data. Instead of filtering afterwards, is there a way to avoid this in the groupby() call itself? Thanks










python pandas






asked Mar 9 at 9:16









shanlodh












  • What are df.index.min() and df.index.max()?

    – Chris A
    Mar 9 at 9:23











  • df.index.min() == (2015, 1), df.index.max() == (2019, 12)

    – shanlodh
    Mar 9 at 9:24











  • The data runs only until Mar-19, shouldn't df.index.max() == (2019, 3)?

    – shanlodh
    Mar 9 at 9:26











  • @perl: the problem is that it's grouping non-existent data

    – shanlodh
    Mar 9 at 9:26






  • Oh, but you said your df.index.max() == (2019, 12). So the problem is with the index in the original DataFrame

    – perl
    Mar 9 at 9:34
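
One way an index can end up with dates past the real end of the data is a day/month mix-up when the dates are parsed. This is only an assumption, since the thread never shows the contents of the file. The sketch below uses made-up day-first strings (the `raw` Series is hypothetical) to show how a month-first reading would turn 12 January 2019 into 1 December 2019:

```python
import pandas as pd

# Hypothetical day-first date strings (dd/mm/yyyy); the real file format is unknown.
raw = pd.Series(["01/01/2019", "12/01/2019", "13/01/2019", "03/03/2019"])

# Misreading them as month-first turns 12 January into 1 December;
# 13/01 has no valid month-first reading, so it becomes NaT here.
month_first = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Reading them with the format they were written in keeps every date in range.
day_first = pd.to_datetime(raw, format="%d/%m/%Y")

print(month_first.max())  # latest date under the wrong reading
print(day_first.max())    # latest date under the intended reading
```

If the file does turn out to hold day-first dates, passing `dayfirst=True` to `pd.read_csv` may be worth trying.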


















1 Answer














The problem seems to be with the index of the original DataFrame df. For example, if we build df = pd.DataFrame({'a': 1}, index=pd.date_range('2015-01-01', '2019-03-03')) instead, the same grouping works without any issue:



import pandas as pd

df = pd.DataFrame({'a': 1}, index=pd.date_range('2015-01-01', '2019-03-03'))

group_by_list = [df.index.year, df.index.month]
grouped_df = df.groupby(group_by_list).sum()

index_rename_names_list = ['year', 'month']
index_rename_position_list = [0, 1]
grouped_df.index.set_names(index_rename_names_list, level=index_rename_position_list, inplace=True)

grouped_df.index.max()


Output:



(2019, 3)
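
To check whether an index read from a file really stops where expected, it may help to look at the raw index before grouping. The following is a minimal sketch: the DataFrame is a hypothetical stand-in for the CSV data, with one stray date added on purpose, and `expected_end`, `stray` and `rows_per_month` are illustrative names rather than anything from the question.

```python
import pandas as pd

# Hypothetical stand-in for the data read from the CSV:
# the expected daily range plus one stray date beyond it.
expected_end = pd.Timestamp("2019-03-03")
idx = pd.date_range("2015-01-01", expected_end).append(pd.DatetimeIndex(["2019-12-01"]))
df = pd.DataFrame({"a": 1}, index=idx)

# Range of the raw index, before any grouping.
print(df.index.min(), df.index.max())

# Rows dated after the last day the data is supposed to cover.
stray = df[df.index > expected_end]
print(stray)

# Number of rows per (year, month) group.
rows_per_month = df.groupby([df.index.year, df.index.month]).size()
print(rows_per_month.tail())
```

Groups with very few rows, or rows dated after the expected end, point at the input dates rather than at groupby.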


P.S. By the way, any reason for not using resample instead of groupby?
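
For reference, the resample alternative mentioned in the P.S. might look roughly like this. It is a sketch on the same synthetic DataFrame as above, not on the asker's file. The `"MS"` (month start) frequency alias and the name `monthly` are choices made here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": 1}, index=pd.date_range("2015-01-01", "2019-03-03"))

# "MS" labels each monthly bin by its first day.
monthly = df.resample("MS").sum()

print(monthly.index.min(), monthly.index.max())
print(monthly.tail(3))
```

Unlike the groupby version, this gives a DatetimeIndex instead of a (year, month) MultiIndex, and any month between the first and last date that has no rows still appears, with a sum of 0.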






edited Mar 9 at 9:37

answered Mar 9 at 9:27

perl




























