Nested cross-validation: How does cross_validate handle GridSearchCV as its input estimator?



The following code combines cross_validate with GridSearchCV to perform a nested cross-validation for an SVC on the iris dataset.



(Adapted from the following scikit-learn documentation example:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py)




from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_validate, KFold
import numpy as np
np.set_printoptions(precision=2)

# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10],
          "gamma": [.01, .1]}

# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")

# Choose techniques for the inner and outer loop of nested cross-validation
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

# Perform nested cross-validation
# (iid=False silences a deprecation warning here; the parameter was removed
# in scikit-learn 0.24, so drop it on newer versions)
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv, iid=False)
clf.fit(X_iris, y_iris)  # fits once on the full data; not required for the nested CV below
best_estimator = clf.best_estimator_

cv_dic = cross_validate(clf, X_iris, y_iris, cv=outer_cv, scoring=['accuracy'], return_estimator=False, return_train_score=True)
mean_val_score = cv_dic['test_accuracy'].mean()

print('nested_train_scores: ', cv_dic['train_accuracy'])
print('nested_val_scores: ', cv_dic['test_accuracy'])
print('mean score: {0:.2f}'.format(mean_val_score))


In each fold, cross_validate splits the data set into a training and a test set. The input estimator is then fitted on the training set associated with that fold. The estimator passed in here is clf, a parameterized GridSearchCV estimator, i.e. an estimator that cross-validates itself again.
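To put my current mental model into code (a rough sketch of what I assume cross_validate does internally, ignoring scoring details and parallelism; clone is from sklearn.base):

from sklearn.base import clone

# Assumed behavior: each outer fold gets a fresh, unfitted copy of the
# GridSearchCV object, which then runs its own inner 5-fold search on
# that fold's training data.
outer_scores = []
for train_idx, test_idx in outer_cv.split(X_iris, y_iris):
    fold_clf = clone(clf)                                # unfitted copy
    fold_clf.fit(X_iris[train_idx], y_iris[train_idx])   # inner CV + refit happen here
    outer_scores.append(fold_clf.score(X_iris[test_idx], y_iris[test_idx]))
print(outer_scores)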



I have three questions about the whole thing:



  1. If clf is used as the estimator for cross_validate, does it (in the course of the GridSearchCV cross-validation) split the above-mentioned training set into a sub-training set and a validation set in order to determine the best hyperparameter combination?

  2. Out of all models tested via GridSearchCV, does cross_validate validate only the model stored in the best_estimator_ attribute?

  3. Does cross_validate train a model at all (if so, why?), or is the model stored in best_estimator_ validated directly via the test set?

To clarify how the questions are meant, here is an illustration of how I currently imagine the double cross-validation.



[Diagram: how I currently imagine the double (nested) cross-validation]
python python-3.x scikit-learn nested cross-validation

asked Mar 6 at 18:48 by zwithouta, edited Mar 8 at 18:36

1 Answer

If clf is used as the estimator for cross_validate, does it split the above-mentioned training set into a sub-training set and a validation set in order to determine the best hyperparameter combination?




Yes. As you can see here at line 230 of the scikit-learn source, the training set is split again into a sub-training set and a validation set (specifically at line 240).



Update: Yes, when you pass the GridSearchCV classifier into cross_validate, it will again split the training set of each outer fold into a train and a validation set. Here is a link describing this in more detail. Your diagram and assumption are correct.
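As a quick sanity check (a sketch reusing X_iris, inner_cv, and outer_cv from your code; the fold sizes are approximate for the 150 iris samples), you can reproduce the split that happens inside one outer fold:

# One outer fold: 150 samples -> roughly 112 train / 38 test with n_splits=4.
outer_train_idx, outer_test_idx = next(outer_cv.split(X_iris, y_iris))

# Inside GridSearchCV, inner_cv re-splits only the outer training portion,
# giving roughly 90 sub-training / 22 validation samples per inner fold.
for sub_train_idx, val_idx in inner_cv.split(X_iris[outer_train_idx]):
    print(len(sub_train_idx), len(val_idx))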




Out of all models tested via GridSearchCV, does cross_validate train & validate only the model stored in best_estimator_?




Yes. As you can see from the answers here and here, GridSearchCV returns the best_estimator_ in your case (since the refit parameter is True by default). However, this best estimator still has to be trained again.




          Does cross_validate train a model at all (if so, why?) or is the model stored in best_estimator_ validated directly via the test set?




As for your third and final question: yes, it trains an estimator, and returns it if return_estimator is set to True (see this line). This makes sense, since how else could it produce the scores without training an estimator in the first place?
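For example (a small variation on your call, switching return_estimator to True), you get back the fitted GridSearchCV of each outer fold and can inspect which hyperparameters won in each one:

cv_dic = cross_validate(clf, X_iris, y_iris, cv=outer_cv,
                        scoring=['accuracy'], return_estimator=True,
                        return_train_score=True)
# One fitted GridSearchCV per outer fold; best_params_ may differ between folds.
for fitted_search in cv_dic['estimator']:
    print(fitted_search.best_params_)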



          Update
          The reason the model is trained again is because the default use case for cross-validate does not assume that you give in the best classfier with the optimum parameters. In this case specifically, you are sending in a classifier from the GridSearchCV but if you send any untrained classifier it is supposed to be trained. What I mean to say here is that, yes, in your case it shouldn't train it again since you are already doing cross-validation using GridSearchCV and using the best estimator. However, there is no way for cross-validate to know this, hence, it assumes that you are sending in an un-optimized or rather untrained estimator, thus it has to train it again and return the scores for the same.






answered Mar 6 at 19:17 by Mohammed Kashif, edited Mar 7 at 22:26