pandas python Update subset of column A based on subset of one or more other columns
Edit: I've revised portions of the description below to clarify what I mean by "feature" and "group", fix a typo, and include additional code I've tried.
My pandas df has 4.5 million rows and 23 columns. The table below shows a few lines from df2, which is generated from df. It shows two groups (eeskin and hduquant) and three features (failed, exit_status, and job_number):
# report by group
group feature #_cats #_jobs rank top_value freq
10 eeskin failed 1 6 -1 100 6
21 eeskin exit_status 1 6 -1 0 6
0 eeskin job_number 1 6 -1 4.08219e+06 6
21 hduquant exit_status 5 64 -1 37 58
11 hduquant failed 2 64 -1 0 63
1 hduquant job_number 2 64 -1 4.07192e+06 61
The "rank" column value of -1 is a placeholder.
I want to update each group's rank for each feature. "Feature" in this case means each unique value in the "feature" column: failed, exit_status, and job_number. Updating the rankings for job_number, for example, means modifying the values in column "rank" only on rows where column "feature" equals job_number. As it turns out, each of these rows also corresponds to a different group value in the "group" column.
So rather than update all values in column "rank" at once, I want to do them feature by feature, where each write updates the values for all groups on a single feature.
The rank for feature "job_number" is based on the value of the "#_jobs" column (the highest number of jobs is rank 1). For feature "failed", rank is based on the "freq" of "top_value". exit_status can remain -1 for now.
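In pandas terms, a minimal sketch of the intended per-feature update might look like this (assuming df2 has the columns shown above; the mask variable is illustrative and the ranking directions are inferred from the expected output below):
# Hedged sketch: update 'rank' one feature at a time using boolean masks.
mask = df2['feature'] == 'job_number'
df2.loc[mask, 'rank'] = df2.loc[mask, '#_jobs'].rank(ascending=False).astype(int)
# For "failed", the expected output suggests ranking by 'freq' ascending.
mask = df2['feature'] == 'failed'
df2.loc[mask, 'rank'] = df2.loc[mask, 'freq'].rank(ascending=True).astype(int)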
The result should look like this:
group feature #_cats #_jobs rank top_value freq
10 eeskin failed 1 6 1 100 6
21 eeskin exit_status 1 6 -1 0 6
0 eeskin job_number 1 6 2 4.08219e+06 6
21 hduquant exit_status 5 64 -1 37 58
11 hduquant failed 2 64 2 0 63
1 hduquant job_number 2 64 1 4.07192e+06 61
"eeskin" ranks 1 for failed
and 2 for job_number
. "hdquant" ranks 2 for failed
and 1 for job_number
.
I am able to update the rank values for job_number with this code:
if feat == 'job_number':
    # sort so the group with the most jobs comes first
    grouped = grouped.sort_values("#_jobs", ascending=False)
    # assign ranks from the index (assumes a 0..n-1 index after sorting)
    grouped['rank'] = grouped.index + 1
group feature #_cats #_jobs rank top_value freq
10 eeskin failed 1 6 -1 100 6
21 eeskin exit_status 1 6 -1 0 6
0 eeskin job_number 1 6 2 4.08219e+06 6
21 hduquant exit_status 5 64 -1 37 58
11 hduquant failed 2 64 -1 0 63
1 hduquant job_number 2 64 1 4.07192e+06 61
But when I try to update both, neither is updated:
feat = ['job_number', 'failed']
for f in feat:
    if f == 'job_number':
        grouped = grouped.sort_values("#_jobs", ascending=False)
        grouped['rank'] = grouped.index + 1
    elif f == 'failed':  # or f == 'exit_status'
        x = len(not grouped[f] == 0)
        grouped['x'] = x
        grouped = grouped.sort_values("x", ascending=False)
        grouped['rank'] = grouped.index + 1
        del grouped['x']
group feature #_cats #_jobs rank top_value freq
10 eeskin failed 1 6 -1 100 6
21 eeskin exit_status 1 6 -1 0 6
0 eeskin job_number 1 6 -1 4.08219e+06 6
21 hduquant exit_status 5 64 -1 37 58
11 hduquant failed 2 64 -1 0 63
1 hduquant job_number 2 64 -1 4.07192e+06 61
I've tried to implement the suggestion from Matt W. but so far without success:
df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)
I modified his code a bit as follows but also without success:
df2.loc[df2['feature' == 'job_number'] & df2['rank']] = (df2.loc[df2['#_jobs']].rank(ascending=False))
Addendum @Matt W.
Input:
import pandas as pd
df = pd.DataFrame([['g1', 'u1', 3902779, '2018-09-27 21:38:06', '2018-10-01 07:24:38', '2018-10-01 08:00:42', 0, 0, 'single', 1, 55696, 609865728.0, 4.0, 6.0, 0, 0, 4.0, 0, 'single', 1, 0, pd.Timedelta('3 days 09:46:32'), pd.Timedelta('00:36:04')]],
columns=['group', 'owner', 'job_number', 'submission_time', 'start_time', 'end_time', 'failed', 'exit_status', 'granted_pe', 'slots', 'task_number', 'maxvmem', 'h_data', 'h_rt', 'highp', 'exclusive', 'h_vmem', 'gpu', 'pe', 'slot', 'campus', 'wait_time', 'wtime'])
df = df.astype(dtype={'group':'str', 'owner':'str', 'job_number':'int', 'submission_time':'datetime64[ns]', 'start_time':'datetime64[ns]', 'end_time':'datetime64[ns]', 'failed':'int', 'exit_status':'int', 'granted_pe':'str', 'slots':'int', 'task_number':'int', 'maxvmem':'float', 'h_data':'float', 'h_rt':'float', 'highp':'int', 'exclusive':'int', 'h_vmem':'float', 'gpu':'int', 'pe':'str', 'slot':'int', 'campus':'int', 'wait_time':'timedelta64[ns]', 'wtime':'timedelta64[ns]'})
df
Output:
group owner job_number submission_time start_time end_time failed exit_status granted_pe slots task_number maxvmem h_data h_rt highp exclusive h_vmem gpu pe slot campus wait_time wtime
0 g1 u1 3902779 2018-09-27 21:38:06 2018-10-01 07:24:38 2018-10-01 08:00:42 0 0 single 1 55696 609865728.0 4.0 6.0 0 0 4.0 0 single 1 0 3 days 09:46:32 00:36:04
4080243 g50 u92 4071923 2018-10-25 02:08:14 2018-10-27 01:41:58 2018-10-27 02:08:50 0 0 shared 1 119 7.654482e+08 2.5 1.5 0 1 16.0 0 shared 1 0 1 days 23:33:44 00:26:52
4080244 g50 u92 4071922 2018-10-25 02:08:11 2018-10-27 01:46:53 2018-10-27 02:08:53 0 0 shared 1 2208 1.074463e+09 2.5 1.5 0 10 24.0 0 shared 1 0 1 days 23:38:42 00:22:00
The code produces the first line. I tacked on a couple more lines just for variety.
There are 203 groups, 699 owners. There are thousands of jobs: a "job" is defined as a unique combination of job_number, task_number, and submission_time.
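As a hedged illustration of that definition (assuming df as constructed above), the distinct jobs could be counted like this:
# A "job" is a unique (job_number, task_number, submission_time) triple.
n_jobs = df.drop_duplicates(subset=['job_number', 'task_number', 'submission_time']).shape[0]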
I want to create an overall report and one report per group, both focused on resource usage.
Components of overall report (a few illustrative one-liners follow these lists):
General stats (see the describe() sketch below):
- count, mean, std, min, 25%, 50%, 75%, max (numerical)
- count, unique, top, freq (string)
- count, first, last (time delta)
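These per-type summaries are roughly what pandas describe() produces; a minimal sketch, assuming df as above:
# Summarize numerical, string, and timedelta columns in one pass;
# include='all' is an assumption to cover the mixed dtypes here.
summary = df.describe(include='all')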
Jobs:
- job_number with the most task_numbers; job_number with the most submission_times
- job (as defined above) with earliest/latest
- submission_time, start_time, and end_time
- job with most
- failed != 0
- exit_status != 0
- job with most (sum of values)
- granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu
- job with most (count/len)
- pe == single
- pe == shared
- pe == (each additional category of pe)
- job with longest/shortest cumulative
- wait_time and wtime
Owners:
- owner with the most jobs
- owner with earliest/latest
- submission_time, start_time, end_time
- owner with most
- failed != 0
- exit_status != 0
- owner with most (sum of values)
- granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, gpu
- owner with most (count/len)
- pe == single
- pe == shared
- pe == (each additional category of pe)
- owner with longest/shortest cumulative
- wait_time and wtime
Groups:
- group with the most jobs
- group with the most owners
- group with earliest/latest
- submission_time, start_time, and end_time
- group with most
- failed != 0
- exit_status != 0
- group with most (sum of values) of each
- granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu
- group with most
- pe == single
- pe == shared
- pe == (each additional category of pe)
- group with longest/shortest cumulative
- wait_time and wtime
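A few hedged one-liners for items in these lists (column names come from the sample df; variable names are illustrative):
# Group with the most distinct owners.
most_owners_group = df.groupby('group')['owner'].nunique().idxmax()
# Owner with the most rows where failed != 0.
most_failed_owner = df.loc[df['failed'] != 0, 'owner'].value_counts().idxmax()
# Group with the longest cumulative wait_time.
longest_wait_group = df.groupby('group')['wait_time'].sum().idxmax()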
Components of individual "by group" reports:
By feature (column in df):
General stats:
- count, mean, std, min, 25%, 50%, 75%, max (numerical)
- count, unique, top, freq (string)
- count, first, last (time delta)
The group's stats:
By job:
- job_number with the most task_numbers; job_number with the most submission_times
- job (as defined above) with earliest/latest
- submission_time, start_time, and end_time
- job with most
- failed != 0
- exit_status != 0
- job with most
- granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu
- job with most
- pe == single (count/len)
- pe == shared (count/len)
- pe == (each additional category of pe) (count/len)
- job with longest/shortest cumulative
- wait_time and wtime
By owner:
- owner with the most jobs
- owner with earliest/latest
- submission_time, start_time, end_time
- owner with most
- failed != 0
- exit_status != 0
- owner with most
- granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu
- owner with most
- pe == single (count/len)
- pe == shared (count/len)
- pe == (each additional category of pe) (count/len)
- owner with longest/shortest cumulative
- wait_time and wtime
By ranking:
Here I want each group ranked against all other groups, from 1 (highest usage or most "fails") to 203 (lowest usage). I will use these values to plot a graph for each group; a sketch of one such ranking follows the list below.
Rank for:
- number of
- jobs, job_numbers, task_numbers, submission times
- time of first
- submission_time, start_time, end_time
- time of last
- submission_time, start_time, end_time
- number of
- failed != 0
- exit_status != 0
- number of
- granted_pe, slots, maxvmem, h_data, h_rt, exclusive, h_vmem, and gpu
- number of
- pe == single
- pe == shared
- pe == (each additional category of pe)
- total cumulative for all jobs
- wait_time and wtime
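A sketch of one such 1-to-203 ranking (assuming df as above; the dense ranking method and variable names are illustrative):
# Count distinct jobs per group, then rank: 1 = most jobs, 203 = fewest.
jobs_per_group = (df.drop_duplicates(subset=['job_number', 'task_number', 'submission_time'])
                    .groupby('group')
                    .size())
group_rank = jobs_per_group.rank(ascending=False, method='dense').astype(int)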
python pandas
asked Mar 7 at 1:19 by Karl Baker; edited Mar 8 at 18:07
1 Answer
You can do this an easier way by using pandas .loc.
Initialize the dataframe:
df = pd.DataFrame({'group':['e','e','e','h','h','h'],
                   'feature':['fail', 'exit', 'job', 'exit', 'fail', 'job'],
                   'cats':[1, 1, 1, 5, 2, 2],
                   'jobs':[1, 1, 1, 64, 64, 64],
                   'rank':[-1, -1, -1, -1, -1, -1],
                   'topvalue':[100, 0, 4, 37, 0, 3.9],
                   'freq':[1, 1, 1, 58, 63, 61]
                   })
We want to rank the jobs feature, so we just isolate the rank locations using .loc; then, on the right side of the assignment, we isolate the jobs column using .loc and use the .rank() function.
Rank job feature, by jobs value:
df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)
Rank failure feature by frequency where top value is not 0:
For this one you do rank the ones that are 0, which seems to go against what you said, so we'll do this two ways.
This way we filter out the 0s to start and rank everything else, so the top_value == 0 ranks stay at -1:
df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'rank'] = (
    df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'freq']).rank(ascending=True)
This way we don't filter out the 0s:
df.loc[df.feature == 'fail', 'rank'] = (
    df.loc[df.feature == 'fail', 'freq']).rank(ascending=True)
answered Mar 7 at 1:47 by Matt W.

This looks promising, but when I tested it I realized my df is actually a GroupBy object. The error message suggests using the .apply() method. How would I go about that? – Karl Baker, Mar 7 at 2:06
for feat2 in features2:
    grouped = df.groupby('group')[feat2]
    grouped.loc[grouped.feature == 'job', 'rank'] = grouped.loc[grouped.feature == 'job', 'jobs'].rank(ascending=False)
--> "AttributeError: Cannot access callable attribute 'loc' of 'SeriesGroupBy' objects, try using the 'apply' method" – Karl Baker, Mar 7 at 2:10
Don't do the for feat in features part; don't groupby the dataframe. Leave it as is. – Matt W., Mar 7 at 2:18
If you have a ton of features (20+) that you need to do this with, and they each use a different column for rank, I would suggest creating one column with the ranking values and then grouping by and ranking based on that one column. (A sketch follows below.) – Matt W., Mar 7 at 2:20
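For what it's worth, a minimal sketch of that one-column suggestion (the rank_value name, the per-feature source columns, and the ranking direction are assumptions, not Matt W.'s code):
import numpy as np

# Build a single column of ranking values, choosing the source column per
# feature, then rank within each feature. Source columns are illustrative.
df['rank_value'] = np.select(
    [df['feature'] == 'job', df['feature'] == 'fail'],
    [df['jobs'], df['freq']],
    default=np.nan)
df['rank'] = df.groupby('feature')['rank_value'].rank(ascending=False, method='dense')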
I found an answer that works for me here. It uses groupby and loc/iloc. – Karl Baker, Mar 10 at 5:00