pandas python Update subset of column A based on subset of one or more other columns
Edit: I've revised portions of the description below to clarify what I mean by "feature" and "group", fix a typo, and include additional code I've tried.
My pandas df has 4.5 million rows and 23 columns. The table below shows a few lines from df2, which is generated from df. It shows two groups (eeskin and hduquant) and three features (failed, exit_status, and job_number):
# report by group
group feature #_cats #_jobs rank top_value freq
10 eeskin failed 1 6 -1 100 6
21 eeskin exit_status 1 6 -1 0 6
0 eeskin job_number 1 6 -1 4.08219e+06 6
21 hduquant exit_status 5 64 -1 37 58
11 hduquant failed 2 64 -1 0 63
1 hduquant job_number 2 64 -1 4.07192e+06 61
The "rank" column value of -1 is a placeholder.
I want to update each group's rank for each feature. "Feature" in this case means each unique value in the "feature" column: failed, exit_status, and job_number. Updating the rankings for job_number, for example, means modifying the values in column "rank" only on rows where column "feature" equals job_number. As it turns out, each of these rows also corresponds to a different group value in the "group" column.
So rather than update all values in column "rank" at once, I want to do them feature by feature, where each write updates the values for all groups on a single feature.
The rank for feature "job_number" is based on the value of the "#_jobs" column (the highest number of jobs is rank 1). For feature "failed", rank is based on the "freq" of "top_value". exit_status can remain -1 for now.
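In pandas terms, a minimal sketch of the intended per-feature update might look like this (assuming df2 has the columns shown above; the mask variable is illustrative and the ranking directions are inferred from the expected output below):
# Hedged sketch: update 'rank' one feature at a time using boolean masks.
mask = df2['feature'] == 'job_number'
df2.loc[mask, 'rank'] = df2.loc[mask, '#_jobs'].rank(ascending=False).astype(int)
# For "failed", the expected output suggests ranking by 'freq' ascending.
mask = df2['feature'] == 'failed'
df2.loc[mask, 'rank'] = df2.loc[mask, 'freq'].rank(ascending=True).astype(int)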
The result should look like this:
group feature #_cats #_jobs rank top_value freq
10 eeskin failed 1 6 1 100 6
21 eeskin exit_status 1 6 -1 0 6
0 eeskin job_number 1 6 2 4.08219e+06 6
21 hduquant exit_status 5 64 -1 37 58
11 hduquant failed 2 64 2 0 63
1 hduquant job_number 2 64 1 4.07192e+06 61
"eeskin" ranks 1 for failed
and 2 for job_number
. "hdquant" ranks 2 for failed
and 1 for job_number
.
I am able to update the rank values for job_number with this code:
if feat == 'job_number':
    # sort so the group with the most jobs comes first
    grouped = grouped.sort_values("#_jobs", ascending=False)
    # assign ranks from the index (assumes a 0..n-1 index after sorting)
    grouped['rank'] = grouped.index + 1
group feature #_cats #_jobs rank top_value freq
10 eeskin failed 1 6 -1 100 6
21 eeskin exit_status 1 6 -1 0 6
0 eeskin job_number 1 6 2 4.08219e+06 6
21 hduquant exit_status 5 64 -1 37 58
11 hduquant failed 2 64 -1 0 63
1 hduquant job_number 2 64 1 4.07192e+06 61
But when I try to update both, neither is updated:
feat = ['job_number', 'failed']
for f in feat:
    if f == 'job_number':
        grouped = grouped.sort_values("#_jobs", ascending=False)
        grouped['rank'] = grouped.index + 1
    elif f == 'failed':  # or f == 'exit_status'
        x = len(not grouped[f] == 0)
        grouped['x'] = x
        grouped = grouped.sort_values("x", ascending=False)
        grouped['rank'] = grouped.index + 1
        del grouped['x']
group feature #_cats #_jobs rank top_value freq
10 eeskin failed 1 6 -1 100 6
21 eeskin exit_status 1 6 -1 0 6
0 eeskin job_number 1 6 -1 4.08219e+06 6
21 hduquant exit_status 5 64 -1 37 58
11 hduquant failed 2 64 -1 0 63
1 hduquant job_number 2 64 -1 4.07192e+06 61
I've tried to implement the suggestion from Matt W. but so far without success:
df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)
I modified his code a bit as follows but also without success:
df2.loc[df2['feature' == 'job_number'] & df2['rank']] = (df2.loc[df2['#_jobs']].rank(ascending=False))
Addendum @Matt W.
Input:
import pandas as pd
df = pd.DataFrame([['g1', 'u1', 3902779, '2018-09-27 21:38:06', '2018-10-01 07:24:38', '2018-10-01 08:00:42', 0, 0, 'single', 1, 55696, 609865728.0, 4.0, 6.0, 0, 0, 4.0, 0, 'single', 1, 0, pd.Timedelta('3 days 09:46:32'), pd.Timedelta('00:36:04')]],
columns=['group', 'owner', 'job_number', 'submission_time', 'start_time', 'end_time', 'failed', 'exit_status', 'granted_pe', 'slots', 'task_number', 'maxvmem', 'h_data', 'h_rt', 'highp', 'exclusive', 'h_vmem', 'gpu', 'pe', 'slot', 'campus', 'wait_time', 'wtime'])
df = df.astype(dtype={'group':'str', 'owner':'str', 'job_number':'int', 'submission_time':'datetime64[ns]', 'start_time':'datetime64[ns]', 'end_time':'datetime64[ns]', 'failed':'int', 'exit_status':'int', 'granted_pe':'str', 'slots':'int', 'task_number':'int', 'maxvmem':'float', 'h_data':'float', 'h_rt':'float', 'highp':'int', 'exclusive':'int', 'h_vmem':'float', 'gpu':'int', 'pe':'str', 'slot':'int', 'campus':'int', 'wait_time':'timedelta64[ns]', 'wtime':'timedelta64[ns]'})
df
Output:
group owner job_number submission_time start_time end_time failed exit_status granted_pe slots task_number maxvmem h_data h_rt highp exclusive h_vmem gpu pe slot campus wait_time wtime
0 g1 u1 3902779 2018-09-27 21:38:06 2018-10-01 07:24:38 2018-10-01 08:00:42 0 0 single 1 55696 609865728.0 4.0 6.0 0 0 4.0 0 single 1 0 3 days 09:46:32 00:36:04
4080243 g50 u92 4071923 2018-10-25 02:08:14 2018-10-27 01:41:58 2018-10-27 02:08:50 0 0 shared 1 119 7.654482e+08 2.5 1.5 0 1 16.0 0 shared 1 0 1 days 23:33:44 00:26:52
4080244 g50 u92 4071922 2018-10-25 02:08:11 2018-10-27 01:46:53 2018-10-27 02:08:53 0 0 shared 1 2208 1.074463e+09 2.5 1.5 0 10 24.0 0 shared 1 0 1 days 23:38:42 00:22:00
The code produces the first line. I tacked on a couple more lines just for variety.
There are 203 groups, 699 owners. There are thousands of jobs: a "job" is defined as a unique combination of job_number, task_number, and submission_time.
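As a hedged illustration of that definition (assuming df as constructed above), the distinct jobs could be counted like this:
# A "job" is a unique (job_number, task_number, submission_time) triple.
n_jobs = df.drop_duplicates(subset=['job_number', 'task_number', 'submission_time']).shape[0]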
I want to create an overall report and one report per group, both focused on resource usage.
Components of overall report (a few illustrative one-liners follow these lists):
General stats (see the describe() sketch below):
- count, mean, std, min, 25%, 50%, 75%, max (numerical)
- count, unique, top, freq (string)
- count, first, last (time delta)
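These per-type summaries are roughly what pandas describe() produces; a minimal sketch, assuming df as above:
# Summarize numerical, string, and timedelta columns in one pass;
# include='all' is an assumption to cover the mixed dtypes here.
summary = df.describe(include='all')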
Jobs:
- job_number with the most task_numbers; job_number with the most submission_times
- job (as defined above) with earliest/latest
- submission_time, start_time, and end_time
- job with most
- failed != 0
- exit_status != 0
- job with most (sum of values)
- granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu
- job with most (count/len)
- pe == single
- pe == shared
- pe == (each additional category of pe)
- job with longest/shortest cumulative
- wait_time and wtime
Owners:
- owner with the most jobs
- owner with earliest/latest
- submission_time, start_time, end_time
- owner with most
- failed != 0
- exit_status != 0
- owner with most (sum of values)
- granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, gpu
- owner with most (count/len)
- pe == single
- pe == shared
- pe == (each additional category of pe)
- owner with longest/shortest cumulative
- wait_time and wtime
Groups:
- group with the most jobs
- group with the most owners
- group with earliest/latest
- submission_time, start_time, and end_time
- group with most
- failed != 0
- exit_status != 0
- group with most (sum of values) of each
- granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu
- group with most
- pe == single
- pe == shared
- pe == (each additional category of pe)
- group with longest/shortest cumulative
- wait_time and wtime
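A few hedged one-liners for items in these lists (column names come from the sample df; variable names are illustrative):
# Group with the most distinct owners.
most_owners_group = df.groupby('group')['owner'].nunique().idxmax()
# Owner with the most rows where failed != 0.
most_failed_owner = df.loc[df['failed'] != 0, 'owner'].value_counts().idxmax()
# Group with the longest cumulative wait_time.
longest_wait_group = df.groupby('group')['wait_time'].sum().idxmax()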
Components of individual "by group" reports:
By feature (column in df):
General stats:
- count, mean, std, min, 25%, 50%, 75%, max (numerical)
- count, unique, top, freq (string)
- count, first, last (time delta)
The group's stats:
By job:
- job_number with the most task_numbers; job_number with the most submission_times
- job (as defined above) with earliest/latest
- submission_time, start_time, and end_time
- job with most
- failed != 0
- exit_status != 0
- job with most
- granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu
- job with most
- pe == single (count/len)
- pe == shared (count/len)
- pe == (each additional category of pe) (count/len)
- job with longest/shortest cumulative
- wait_time and wtime
By owner:
- owner with the most jobs
- owner with earliest/latest
- submission_time, start_time, end_time
- owner with most
- failed != 0
- exit_status != 0
- owner with most
- granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu
- owner with most
- pe == single (count/len)
- pe == shared (count/len)
- pe == (each additional category of pe) (count/len)
- owner with longest/shortest cumulative
- wait_time and wtime
By ranking:
Here I want each group ranked against all other groups, from 1 (highest usage or most "fails") to 203 (lowest usage). I will use these values to plot a graph for each group; a sketch of one such ranking follows the list below.
Rank for:
- number of
- jobs, job_numbers, task_numbers, submission times
- time of first
- submission_time, start_time, end_time
- time of last
- submission_time, start_time, end_time
- number of
- failed != 0
- exit_status != 0
- number of
- granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu
- number of
- pe == single
- pe == shared
- pe == (each additional category of pe)
- total cumulative for all jobs
- wait_time and wtime
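A sketch of one such 1-to-203 ranking (assuming df as above; the dense ranking method and variable names are illustrative):
# Count distinct jobs per group, then rank: 1 = most jobs, 203 = fewest.
jobs_per_group = (df.drop_duplicates(subset=['job_number', 'task_number', 'submission_time'])
                    .groupby('group')
                    .size())
group_rank = jobs_per_group.rank(ascending=False, method='dense').astype(int)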
python pandas
asked Mar 7 at 1:19 by Karl Baker; edited Mar 8 at 18:07
1 Answer
You can do this an easier way by using pandas .loc.
Initialize the dataframe:
df = pd.DataFrame({'group':['e','e','e','h','h','h'],
                   'feature':['fail', 'exit', 'job', 'exit', 'fail', 'job'],
                   'cats':[1, 1, 1, 5, 2, 2],
                   'jobs':[1, 1, 1, 64, 64, 64],
                   'rank':[-1, -1, -1, -1, -1, -1],
                   'topvalue':[100, 0, 4, 37, 0, 3.9],
                   'freq':[1, 1, 1, 58, 63, 61]
                   })
We want to rank the jobs feature, so we just isolate the rank locations using .loc; then, on the right side of the assignment, we isolate the jobs column using .loc and use the .rank() function.
Rank job feature, by jobs value:
df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)
Rank failure feature by frequency where top value is not 0:
For this one you do rank the ones that are 0, which seems to go against what you said, so we'll do this two ways.
This way we filter out the 0s to start and rank everything else, so the top_value == 0 ranks stay at -1:
df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'rank'] = (
    df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'freq']).rank(ascending=True)
This way we don't filter out the 0s:
df.loc[df.feature == 'fail', 'rank'] = (
    df.loc[df.feature == 'fail', 'freq']).rank(ascending=True)
answered Mar 7 at 1:47 by Matt W.

This looks promising, but when I tested it I realized my df is actually a GroupBy object. The error message suggests using the .apply() method. How would I go about that? – Karl Baker, Mar 7 at 2:06
for feat2 in features2:
    grouped = df.groupby('group')[feat2]
    grouped.loc[grouped.feature == 'job', 'rank'] = grouped.loc[grouped.feature == 'job', 'jobs'].rank(ascending=False)
--> "AttributeError: Cannot access callable attribute 'loc' of 'SeriesGroupBy' objects, try using the 'apply' method" – Karl Baker, Mar 7 at 2:10
Don't do the for feat in features part; don't groupby the dataframe. Leave it as is. – Matt W., Mar 7 at 2:18
If you have a ton of features (20+) that you need to do this with, and they each use a different column for rank, I would suggest creating one column with the ranking values and then grouping by and ranking based on that one column. (A sketch follows below.) – Matt W., Mar 7 at 2:20
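For what it's worth, a minimal sketch of that one-column suggestion (the rank_value name, the per-feature source columns, and the ranking direction are assumptions, not Matt W.'s code):
import numpy as np

# Build a single column of ranking values, choosing the source column per
# feature, then rank within each feature. Source columns are illustrative.
df['rank_value'] = np.select(
    [df['feature'] == 'job', df['feature'] == 'fail'],
    [df['jobs'], df['freq']],
    default=np.nan)
df['rank'] = df.groupby('feature')['rank_value'].rank(ascending=False, method='dense')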
I found an answer that works for me here. It uses groupby and loc/iloc. – Karl Baker, Mar 10 at 5:00