How do I properly match models to this dataset?
from time import process_time

import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# 1000 samples, 5000 features, of which only 200 are informative
X, y = make_classification(
    n_samples=1000, n_features=5000, n_redundant=2, n_informative=200, random_state=1
)
# hold out a test set for scoring (test_size here is an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

names = [
    "lightGBM",
    "Decision Tree",
    "Random Forest",
    "Nearest Neighbors",
    "Neural Net",
    "AdaBoost",
    "Naive Bayes",
]
classifiers = [
    LGBMClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    KNeighborsClassifier(3),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
]

scores, times = [], []
for name, clf in zip(names, classifiers):
    # time only the fit; score on the held-out test set
    start = process_time()
    clf.fit(X_train, y_train)
    end = process_time()
    scores.append(clf.score(X_test, y_test))
    times.append(end - start)

df = pd.DataFrame({"runtime": times, "score": scores}).T
df.columns = names
print(df)
The results for the models I tried are shown below (all parameters left at their defaults).
           lightGBM  Decision Tree  Random Forest  Nearest Neighbors  Neural Net   AdaBoost  Naive Bayes
runtime  123.859375       3.359375           3.75           0.015625   87.703125  27.734375     0.046875
score      0.560000       0.470000           0.54           0.726667    0.813333   0.533333     0.546667
As the results show, the tree-based classifiers perform poorly on this dataset. However, when I reduced n_informative from 200 to 20, the predictive performance of the tree models improved substantially.
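For reference, a minimal sketch of that variant run (the only change I made was n_informative; I assume the same split and loop as above):

# variant dataset: same shape, but only 20 informative features
X2, y2 = make_classification(
    n_samples=1000, n_features=5000, n_redundant=2, n_informative=20, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(X2, y2, test_size=0.3, random_state=1)
# rerun the same fit/score loop as above; the tree models now score much higher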
Is this a problem with the structure of tree models themselves, or with the parameters? I want to know why tree models behave so poorly on this dataset and how I can improve them, other than changing the dataset. I have tried adjusting some lightGBM parameters such as reg_lambda, max_depth, and num_leaves, but it hasn't helped improve the performance (see the sketch below).
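For concreteness, my tuning attempts looked roughly like this; the specific values are illustrative, not the exact ones I used:

clf = LGBMClassifier(reg_lambda=1.0, max_depth=6, num_leaves=31)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy stays in the same range as the defaults

Any help is much appreciated.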