Question

After identifying the best parameters using a pipeline and GridSearchCV, how do I pickle/joblib this process to re-use later? I see how to do this when it's a single classifier...

import joblib
joblib.dump(clf, 'filename.pkl')

But how do I save this overall pipeline with the best parameters after performing and completing a gridsearch?

I tried:

joblib.dump(grid, 'output.pkl') - But that dumped every gridsearch attempt (many files)
joblib.dump(pipeline, 'output.pkl') - But I don't think that contains the best parameters

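from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# df is a pandas DataFrame with one text column and one label column
X_train = df['Keyword']
y_train = df['Ad Group']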

pipeline = Pipeline([
  ('tfidf', TfidfVectorizer()),
  ('sgd', SGDClassifier())
  ])

parameters = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'tfidf__max_df': [0.25, 0.5, 0.75, 1.0],
              'tfidf__max_features': [10, 50, 100, 250, 500, 1000, None],
              'tfidf__stop_words': ('english', None),
              'tfidf__smooth_idf': (True, False),
              'tfidf__norm': ('l1', 'l2', None),
              }
              
grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1)
grid.fit(X_train, y_train)

# This was the best combination of tuning parameters discovered
##best_params = {'tfidf__max_features': None, 'tfidf__use_idf': False,
##               'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 2),
##               'tfidf__max_df': 1.0, 'tfidf__stop_words': 'english',
##               'tfidf__norm': 'l2'}

Answer 1

import joblib
joblib.dump(grid.best_estimator_, 'filename.pkl')

If you want to dump your object into one file - use:

joblib.dump(grid.best_estimator_, 'filename.pkl', compress=1)
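
To show how the saved object is used afterwards, here is a minimal sketch of the full round trip: fit the grid search, dump grid.best_estimator_, load it back, and predict. The toy keyword data, the reduced parameter grid, random_state=0, and the temporary file path are all assumptions made for illustration; they are not from the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data standing in for df['Keyword'] and df['Ad Group'] (hypothetical values).
X_train = ["red shoes", "blue shoes", "cheap flights", "flights to paris",
           "running shoes", "book flights", "leather shoes", "last minute flights"]
y_train = ["shoes", "shoes", "travel", "travel",
           "shoes", "travel", "shoes", "travel"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("sgd", SGDClassifier(random_state=0)),
])
# A deliberately small grid so the example runs quickly.
parameters = {"tfidf__ngram_range": [(1, 1), (1, 2)]}

grid = GridSearchCV(pipeline, parameters, cv=2)
grid.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is the whole pipeline,
# refitted on all training data with the best parameters found.
path = os.path.join(tempfile.mkdtemp(), "best_pipeline.pkl")
joblib.dump(grid.best_estimator_, path, compress=1)

# Later, possibly in another process: load and predict on raw text.
loaded = joblib.load(path)
predictions = loaded.predict(["red shoes", "flights to paris"])
print(predictions)
```

The loaded object is an ordinary fitted Pipeline, so it accepts raw strings and applies the TF-IDF step itself; no separate vectorizer needs to be saved.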

Answer 2

I just want to point out that when it comes to the size on disk, saving the GridSearchCV or its best estimator doesn't differ much (for my personal project, it was 1865 KB vs 1801 KB) but compressing makes a world of difference. In other words, passing compress=True (or an integer between 1 and 9) is important.

In the following example, case1.pkl will have a much smaller size on disk than case2.pkl and case3.pkl, while case2.pkl and case3.pkl will have very similar sizes.

import joblib
joblib.dump(grid, 'case1.pkl', compress=True)     # <--- good

joblib.dump(grid, 'case2.pkl')
joblib.dump(grid.best_estimator_, 'case3.pkl')
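
To check the size difference on your own model, a rough sketch like the following writes the three cases and prints their sizes. The toy data, the small parameter grid, and the temporary directory are assumptions for illustration; on a model this small the numbers will not resemble the ones quoted above, so treat the output as a way to measure rather than as a result.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (hypothetical) so the example is self-contained.
X_train = ["red shoes", "blue shoes", "cheap flights", "flights to paris",
           "running shoes", "book flights", "leather shoes", "last minute flights"]
y_train = ["shoes", "shoes", "travel", "travel",
           "shoes", "travel", "shoes", "travel"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("sgd", SGDClassifier(random_state=0)),
])
grid = GridSearchCV(pipeline, {"tfidf__use_idf": [True, False]}, cv=2)
grid.fit(X_train, y_train)

out_dir = tempfile.mkdtemp()
# file name -> (object to dump, compress flag), mirroring case1..case3 above
candidates = {
    "case1.pkl": (grid, True),
    "case2.pkl": (grid, False),
    "case3.pkl": (grid.best_estimator_, False),
}

sizes = {}
for name, (obj, compress) in candidates.items():
    path = os.path.join(out_dir, name)
    joblib.dump(obj, path, compress=compress)
    sizes[name] = os.path.getsize(path)

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```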

If you want to use pickle instead of joblib, you can combine it with the built-in gzip to compress it:

import pickle
import gzip

with gzip.open('case4.pkl', 'wb') as f:
    pickle.dump(grid, f)
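
Reading the file back uses the same gzip.open call in 'rb' mode. The sketch below shows the round trip; to keep it runnable without a fitted model, a plain dictionary stands in for grid (an assumption for illustration), and the file is written to a temporary directory.

```python
import gzip
import os
import pickle
import tempfile

# Stand-in for the fitted `grid`; any picklable object round-trips the same way.
model = {"best_params": {"tfidf__ngram_range": (1, 2), "tfidf__norm": "l2"}}

path = os.path.join(tempfile.mkdtemp(), "case4.pkl")

# Write: gzip.open returns a file-like object that pickle can write to.
with gzip.open(path, "wb") as f:
    pickle.dump(model, f)

# Read: open in binary read mode and unpickle.
with gzip.open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

As with any pickle, only load files from a source you trust, since unpickling can execute arbitrary code.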

On a side note, when you load the pickled model, make sure the joblib version is at least as recent as the joblib version that was used to dump the model in the first place. Otherwise, a KeyError may be raised.
