pandas/python: Update subset of column A based on subset of one or more other columns


Edit: I've revised portions of the description below to clarify what I mean by "feature" and "group", fix a typo, and include additional code I've tried.



My pandas df has 4.5 million rows and 23 columns. The table below shows a few lines from df2, which is generated from df. It shows two groups (eeskin and hduquant) and three features (failed, exit_status, and job_number):



# report by group

     group     feature      #_cats  #_jobs  rank    top_value  freq
10   eeskin    failed            1       6    -1          100     6
21   eeskin    exit_status       1       6    -1            0     6
0    eeskin    job_number        1       6    -1  4.08219e+06     6
21   hduquant  exit_status       5      64    -1           37    58
11   hduquant  failed            2      64    -1            0    63
1    hduquant  job_number        2      64    -1  4.07192e+06    61


The "rank" column value of -1 is a placeholder.



I want to update each group's rank for each feature. "Feature" in this case means each unique value in the "feature" column: failed, exit_status, and job_number. Updating the rankings for job_number, for example, means modifying the values in column "rank" only on rows where column "feature" equals job_number. As it turns out, each of these rows also corresponds to a different group value in the "group" column.



So rather than update all values in column "rank" at once, I want to do them feature by feature, where each write updates the values for all groups on a single feature.



The rank for feature "job_number" is based on the value of the "#_jobs" column (the highest number of jobs gets rank 1). For feature "failed", rank is based on the "freq" of "top_value". exit_status can remain -1 for now.



The result should look like this:



     group     feature      #_cats  #_jobs  rank    top_value  freq
10   eeskin    failed            1       6     1          100     6
21   eeskin    exit_status       1       6    -1            0     6
0    eeskin    job_number        1       6     2  4.08219e+06     6
21   hduquant  exit_status       5      64    -1           37    58
11   hduquant  failed            2      64     2            0    63
1    hduquant  job_number        2      64     1  4.07192e+06    61


"eeskin" ranks 1 for failed and 2 for job_number. "hdquant" ranks 2 for failed and 1 for job_number.



I am able to update the rank values for job_number with this code:



if feat == 'job_number':
    grouped = grouped.sort_values("#_jobs", ascending=False)
    grouped['rank'] = grouped.index + 1


     group     feature      #_cats  #_jobs  rank    top_value  freq
10   eeskin    failed            1       6    -1          100     6
21   eeskin    exit_status       1       6    -1            0     6
0    eeskin    job_number        1       6     2  4.08219e+06     6
21   hduquant  exit_status       5      64    -1           37    58
11   hduquant  failed            2      64    -1            0    63
1    hduquant  job_number        2      64     1  4.07192e+06    61


But when I try to update both, neither is updated:



feat = ['job_number', 'failed']

for f in feat:
    if f == 'job_number':
        grouped = grouped.sort_values("#_jobs", ascending=False)
        grouped['rank'] = grouped.index + 1
    elif f == 'failed':  # or f == 'exit_status'
        x = len(not grouped[f] == 0)
        grouped['x'] = x
        grouped = grouped.sort_values("x", ascending=False)
        grouped['rank'] = grouped.index + 1
        del grouped['x']


     group     feature      #_cats  #_jobs  rank    top_value  freq
10   eeskin    failed            1       6    -1          100     6
21   eeskin    exit_status       1       6    -1            0     6
0    eeskin    job_number        1       6    -1  4.08219e+06     6
21   hduquant  exit_status       5      64    -1           37    58
11   hduquant  failed            2      64    -1            0    63
1    hduquant  job_number        2      64    -1  4.07192e+06    61


I've tried to implement the suggestion from Matt W. but so far without success:



df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)


I modified his code a bit as follows but also without success:



df2.loc[df2['feature' == 'job_number'] & df2['rank']] = (df2.loc[df2['#_jobs']].rank(ascending=False))
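For reference, the mask here probably needs to be built on the column itself: df2['feature' == 'job_number'] first evaluates 'feature' == 'job_number' (two string literals, giving False) and only then indexes df2 with it. A minimal sketch of the intended masked assignment, assuming df2 has the columns shown in the tables above:

# Minimal sketch, assuming df2 has the columns shown above.
# Build the boolean mask from the 'feature' column, then write only the
# 'rank' cells of the matching rows; ascending=False makes the largest
# '#_jobs' value rank 1.
mask = df2['feature'] == 'job_number'
df2.loc[mask, 'rank'] = df2.loc[mask, '#_jobs'].rank(ascending=False)

The same pattern would apply to 'failed', ranking by 'freq' (optionally restricted to rows whose top_value is non-zero, depending on the rule chosen).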



Addendum @Matt W.



Input:



import pandas as pd

df = pd.DataFrame([['g1', 'u1', 3902779, '2018-09-27 21:38:06', '2018-10-01 07:24:38', '2018-10-01 08:00:42', 0, 0, 'single', 1, 55696, 609865728.0, 4.0, 6.0, 0, 0, 4.0, 0, 'single', 1, 0, pd.Timedelta('3 days 09:46:32'), pd.Timedelta('00:36:04')]],
                  columns=['group', 'owner', 'job_number', 'submission_time', 'start_time', 'end_time', 'failed', 'exit_status', 'granted_pe', 'slots', 'task_number', 'maxvmem', 'h_data', 'h_rt', 'highp', 'exclusive', 'h_vmem', 'gpu', 'pe', 'slot', 'campus', 'wait_time', 'wtime'])
df = df.astype(dtype={'group': 'str', 'owner': 'str', 'job_number': 'int', 'submission_time': 'datetime64[ns]', 'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]', 'failed': 'int', 'exit_status': 'int', 'granted_pe': 'str', 'slots': 'int', 'task_number': 'int', 'maxvmem': 'float', 'h_data': 'float', 'h_rt': 'float', 'highp': 'int', 'exclusive': 'int', 'h_vmem': 'float', 'gpu': 'int', 'pe': 'str', 'slot': 'int', 'campus': 'int', 'wait_time': 'timedelta64[ns]', 'wtime': 'timedelta64[ns]'})
df


Output:



 group owner job_number submission_time start_time end_time failed exit_status granted_pe slots task_number maxvmem h_data h_rt highp exclusive h_vmem gpu pe slot campus wait_time wtime
0 g1 u1 3902779 2018-09-27 21:38:06 2018-10-01 07:24:38 2018-10-01 08:00:42 0 0 single 1 55696 609865728.0 4.0 6.0 0 0 4.0 0 single 1 0 3 days 09:46:32 00:36:04
4080243 g50 u92 4071923 2018-10-25 02:08:14 2018-10-27 01:41:58 2018-10-27 02:08:50 0 0 shared 1 119 7.654482e+08 2.5 1.5 0 1 16.0 0 shared 1 0 1 days 23:33:44 00:26:52
4080244 g50 u92 4071922 2018-10-25 02:08:11 2018-10-27 01:46:53 2018-10-27 02:08:53 0 0 shared 1 2208 1.074463e+09 2.5 1.5 0 10 24.0 0 shared 1 0 1 days 23:38:42 00:22:00


The code produces the first line. I tacked on a couple more lines just for variety.



There are 203 groups and 699 owners. There are thousands of jobs: a "job" is defined as a unique combination of job_number, task_number, and submission_time.
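As a side note, counting "jobs" under that definition can be sketched with drop_duplicates plus groupby; this is only an illustration, and it assumes every job belongs to a single group:

# Minimal sketch: count unique job_number/task_number/submission_time
# combinations per group (assumes a job never spans groups).
job_cols = ['job_number', 'task_number', 'submission_time']
jobs_per_group = (df.drop_duplicates(subset=['group'] + job_cols)
                    .groupby('group')
                    .size()
                    .rename('#_jobs'))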



I want to create an overall report and one report per group, both focused on resource usage.



Components of overall report:



General stats:



  • count, mean, std, min, 25%, 50%, 75%, max (numerical)

  • count, unique, top, freq (string)

  • count, first, last (time delta columns)

Jobs:



  • job_number with the most task_numbers, job_number with the most submission_times

  • job (as defined above) with earliest/latest

    • submission_time, start_time, and end_time


  • job with most

    • failed != 0

    • exit_status != 0


  • job with most (sum of values)

    • granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu


  • job with most (count/len)

    • pe == single

    • pe == shared

    • pe == for each addtl category of pe


  • job with longest/shortest cumulative

    • wait_time and wtime


Owners:



  • owner with the most jobs

  • owner with earliest/latest

    • submission_time, start_time, end_time


  • owner with most

    • failed != 0

    • exit_status != 0


  • owner with most (sum of values)

    • granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, gpu


  • owner with most (count/len)

    • pe == single

    • pe == shared

    • pe == for each addtl category of pe


  • owner with longest/shortest cumulative

    • wait_time and wtime


Groups:



  • group with the most jobs

  • group with the most owners

  • group with earliest/latest

    • submission_time, start_time, and end_time


  • group with most

    • failed != 0

    • exit_status != 0


  • group with most (sum of values) of each

    • granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu


  • group with most

    • pe == single

    • pe == shared

    • pe == for each addtl category of pe


  • group with longest/shortest cumulative

    • wait_time and wtime


Components of individual "by group" reports:



By feature (column in df):



General stats:



  • count, mean, std, min, 25%, 50%, 75%, max (numerical)

  • count, unique, top, freq (string)

  • count, first, last (time delta columns)

The group's stats:



By job:



  • job_number with the most task_numbers, job_number with the most submission_times

  • job (as defined above) with earliest/latest

    • submission_time, start_time, and end_time


  • job with most

    • failed != 0

    • exit_status != 0


  • job with most

    • granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu


  • job with most

    • pe == single (count/len)

    • pe == shared (count/len)

    • pe == for each addtl category of pe (count/len)


  • job with longest/shortest cumulative

    • wait_time and wtime


By owner:



  • owner with the most jobs

  • owner with earliest/latest

    • submission_time, start_time, end_time


  • owner with most

    • failed != 0

    • exit_status != 0


  • owner with most

    • granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu


  • owner with most

    • pe == single (count/len)

    • pe == shared (count/len)

    • pe == for each addtl category of pe (count/len)


  • owner with longest/shortest cumulative

    • wait_time and wtime


By ranking:



Here I want each group ranked against all other groups, from 1 for the highest usage or most "fails" down to 203 for the lowest usage. I will use these values to plot a graph for each group. (A minimal sketch of this kind of group ranking follows the list below.)



Rank for:



  • number of

    • jobs, job_numbers, task_numbers, submission times


  • time of first

    • submission_time, start_time, end_time


  • time of last

    • submission_time, start_time, end_time


  • number of

    • failed != 0

    • exit_status != 0


  • number of

    • granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu


  • number of

    • pe == single

    • pe == shared

    • pe == for each addtl category of pe


  • total cumulative for all jobs

    • wait_time and wtime
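As mentioned above, here is a rough sketch of the "by ranking" idea: aggregate one number per group for each metric, then turn each aggregate column into a 1-to-203 rank. The aggregations below are illustrative stand-ins, not the exact report definitions:

# Rough sketch (assumes df has the sample columns shown earlier; the
# aggregations are placeholders for the real report metrics).
per_group = df.groupby('group').agg(
    n_job_numbers=('job_number', 'nunique'),
    n_failed=('failed', lambda s: (s != 0).sum()),
    total_wait_seconds=('wait_time', lambda s: s.sum().total_seconds()),
)
# Rank 1 = highest value; method='dense' keeps ties compact.
group_ranks = per_group.rank(ascending=False, method='dense').astype(int)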










          1 Answer
          You can do this more easily using pandas .loc.



          Initialize dataframe:



          df = pd.DataFrame({'group': ['e', 'e', 'e', 'h', 'h', 'h'],
                             'feature': ['fail', 'exit', 'job', 'exit', 'fail', 'job'],
                             'cats': [1, 1, 1, 5, 2, 2],
                             'jobs': [1, 1, 1, 64, 64, 64],
                             'rank': [-1, -1, -1, -1, -1, -1],
                             'topvalue': [100, 0, 4, 37, 0, 3.9],
                             'freq': [1, 1, 1, 58, 63, 61]})


          We want to rank the jobs feature, so we isolate the rank locations using .loc; on the right side of the assignment, we isolate the jobs column using .loc and apply .rank().



          Rank job feature, by jobs value:



          df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)


          Rank failure feature by frequency where top value is not 0:



          For this one you do rank the rows where topvalue is 0, which seems to go against what you said, so we'll do it two ways.



          This way we filter out the 0s to start and rank everything else; the topvalue == 0 ranks stay at -1.



          df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'rank'] = (
          df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'freq']).rank(ascending=True)


          This way we don't filter out the 0s.



          df.loc[df.feature == 'fail', 'rank'] = (
          df.loc[df.feature == 'fail', 'freq']).rank(ascending=True)
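          For reference, on the toy frame above the first assignment gives the 'job' row with jobs == 64 rank 1 and the jobs == 1 row rank 2; the filtered 'fail' variant ranks only the topvalue == 100 row (rank 1) and leaves the topvalue == 0 row at -1, while the unfiltered variant ranks both 'fail' rows by freq. The 'exit' rows keep their -1 placeholder either way.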





          • This looks promising but when I tested it I realized my df is actually a GroupBy object. Error message suggests using .apply() method. How would I go about that?

            – Karl Baker
            Mar 7 at 2:06











          • for feat2 in features2: grouped = df.groupby('group')[feat2] grouped.loc[grouped.feature == 'job', 'rank'] = grouped.loc[grouped.feature == 'job', 'jobs'].rank(ascending=False) --> "AttributeError: Cannot access callable attribute 'loc' of 'SeriesGroupBy' objects, try using the 'apply' method"

            – Karl Baker
            Mar 7 at 2:10











          • Don't do the for feat in features part. Don't groupby the dataframe. Leave it as is.

            – Matt W.
            Mar 7 at 2:18











          • If you have a ton of features (20+) that you need to do this with, and they each use a different column for rank, I would suggest creating one column with the ranking values and then grouping by and ranking based on that one column.

            – Matt W.
            Mar 7 at 2:20











          • I found an answer that works for me here. It uses groupby and loc/iloc.

            – Karl Baker
            Mar 10 at 5:00
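          The linked answer isn't reproduced here, but a rough, hypothetical sketch of a groupby-based variant (not the linked code; the ranking column per feature and the sort direction would need adjusting to the exact rules in the question) could look like:

          # Rough sketch only: pick a ranking column per feature (hypothetical
          # mapping), then rank that feature's rows across all groups.
          rank_by = {'job_number': '#_jobs', 'failed': 'freq'}

          def add_rank(sub):
              col = rank_by.get(sub.name)
              if col is not None:
                  sub = sub.copy()
                  sub['rank'] = sub[col].rank(ascending=False, method='dense')
              return sub

          df2 = df2.groupby('feature', group_keys=False).apply(add_rank)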










