Nested cross-validation: How does cross_validate handle GridSearchCV as its input estimator?
The following code combines cross_validate with GridSearchCV to perform nested cross-validation for an SVC on the iris dataset. (It is a modified version of the example on this documentation page: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py.)
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_validate, KFold
import numpy as np

np.set_printoptions(precision=2)

# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10],
          "gamma": [.01, .1]}

# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")

# Choose techniques for the inner and outer loop of nested cross-validation
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

# Perform nested cross-validation
# (note: the iid parameter was deprecated in scikit-learn 0.22 and later removed)
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv, iid=False)
clf.fit(X_iris, y_iris)
best_estimator = clf.best_estimator_

cv_dic = cross_validate(clf, X_iris, y_iris, cv=outer_cv, scoring=['accuracy'],
                        return_estimator=False, return_train_score=True)
mean_val_score = cv_dic['test_accuracy'].mean()

print('nested_train_scores: ', cv_dic['train_accuracy'])
print('nested_val_scores: ', cv_dic['test_accuracy'])
print('mean score: {0:.2f}'.format(mean_val_score))
cross_validate splits the data set into a training and a test set in each fold. In each fold, the input estimator is then trained on the training set associated with that fold. The estimator passed in here is clf, a parameterized GridSearchCV estimator, i.e. an estimator that cross-validates itself again.
I have three questions about the whole thing:

1. If clf is used as the estimator for cross_validate, does it (in the course of the GridSearchCV cross-validation) split the above-mentioned training set into a sub-training set and a validation set in order to determine the best hyperparameter combination?

2. Out of all models tested via GridSearchCV, does cross_validate validate only the model stored in the best_estimator_ attribute?

3. Does cross_validate train a model at all (if so, why?), or is the model stored in best_estimator_ validated directly via the test set?
To make it clearer how the questions are meant, here is an illustration of how I currently imagine the double cross-validation.
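In pseudocode, that imagined flow looks roughly like this (a sketch of my mental model only, not of scikit-learn's internals; the fold counts match the code above):

# Outer loop (outer_cv): 4 folds over the full data set
#   -> split into outer training set and outer test set
#   Inner loop (GridSearchCV with inner_cv): 5 folds over the outer training set
#     -> split into sub-training set and validation set
#     -> evaluate every (C, gamma) combination from p_grid
#   Refit the best (C, gamma) on the entire outer training set
#   Score that refitted model on the outer test set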
python python-3.x scikit-learn nested cross-validation
asked Mar 6 at 18:48 by zwithouta (edited Mar 8 at 18:36)
1 Answer
If clf is used as the estimator for cross_validate, does it split the above-mentioned training set into a sub-training set and a validation set in order to determine the best hyperparameter combination?
Yes, as you can see here at line 230, the training set is again split into a sub-training set and a validation set (specifically at line 240).
Update: Yes, when you pass the GridSearchCV classifier into cross_validate, it will again split the training set into a train and a test set. Here is a link describing this in more detail. Your diagram and assumption are correct.
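To make this concrete, here is a minimal hand-written loop that is conceptually equivalent to what cross_validate does with your clf in each outer fold. It is a sketch for illustration, not scikit-learn's actual implementation, and it assumes the objects from your code (clf, outer_cv, X_iris, y_iris) are in scope:

from sklearn.base import clone

for train_idx, test_idx in outer_cv.split(X_iris):
    # outer training set and outer test set for this fold
    X_tr, X_te = X_iris[train_idx], X_iris[test_idx]
    y_tr, y_te = y_iris[train_idx], y_iris[test_idx]

    # start from a fresh, unfitted copy of the GridSearchCV estimator
    fold_clf = clone(clf)

    # the inner 5-fold search runs only on the outer training set;
    # with refit=True (the default) the best (C, gamma) combination
    # is then refitted on the whole outer training set
    fold_clf.fit(X_tr, y_tr)

    # the refitted best model is scored on the outer test set
    print(fold_clf.best_params_, fold_clf.score(X_te, y_te))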
Out of all models tested via GridSearchCV, does cross_validate train & validate only the model stored in the variable best_estimator_?
Yes, as you can see from the answers here and here, GridSearchCV returns the best_estimator_ in your case (since the refit parameter is True by default). However, this best estimator has to be trained again.
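If you want to see which model each outer fold actually selects, set return_estimator=True in your cross_validate call and inspect the fitted GridSearchCV object of each fold; the chosen hyperparameters can differ from fold to fold:

cv_dic = cross_validate(clf, X_iris, y_iris, cv=outer_cv,
                        scoring=['accuracy'], return_estimator=True,
                        return_train_score=True)

# one fitted GridSearchCV per outer fold, each with its own
# best_params_ and refitted best_estimator_
for i, est in enumerate(cv_dic['estimator']):
    print('fold', i, '->', est.best_params_)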
Does cross_validate train a model at all (if so, why?) or is the model stored in best_estimator_ validated directly via the test set?
As per your third and final question: yes, it trains an estimator and returns it if return_estimator is set to True (see this line). This makes sense, since how else would it be able to return scores without training an estimator in the first place?
Update
The reason the model is trained again is that the default use case for cross_validate does not assume that you pass in the best classifier with the optimum parameters. In this specific case, you are sending in a classifier from GridSearchCV, but if you send in any untrained classifier, it is supposed to be trained. What I mean to say is that, yes, in your case it shouldn't need to train it again, since you are already doing cross-validation with GridSearchCV and using the best estimator. However, there is no way for cross_validate to know this, so it assumes that you are sending in an un-optimized, or rather untrained, estimator; thus it has to train it again and return the scores for the same.
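You can observe this directly: cross_validate calls sklearn.base.clone on the estimator it receives, which returns an unfitted copy with the same parameters, so the fit you performed beforehand is simply discarded. A small demonstration, assuming the already-fitted clf from your code:

from sklearn.base import clone

print(hasattr(clf, 'best_params_'))    # True  - clf was fitted via clf.fit(...)
fresh = clone(clf)                     # this is what cross_validate works with
print(hasattr(fresh, 'best_params_'))  # False - the clone is unfitted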
answered Mar 6 at 19:17 by Mohammed Kashif (edited Mar 7 at 22:26)