Question

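I have a year's worth of data from a website. I would like to train a machine learning algorithm to predict the success of new content based on certain variables (e.g. word count, posting day, etc.).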

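I want to take a new piece of data, supply certain characteristics of it, and get an estimate of how well it is likely to do on the site.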

In addition, I want to keep adding future data to the training set and keep retraining the algorithm so that it improves over time.

My question is: how can I accomplish this using scikit-learn?

Answer 1

What you have is a binary classification problem, i.e. you have to decide whether a given input is good or not.

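scikit-learn makes it very easy to switch between algorithms, including the regression-based ones, so you can see what works and what doesn't.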

Off the top of my head, some methods I would try:

SVM
Random forests (Forest of randomized trees in scikits)
Regression (Ridge, Lasso, IRLS, logistic)
Naive Bayes
k nearest neighbors
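
As a rough sketch of how cheap it is to swap between the methods listed above, the snippet below fits each one on the same data through the shared `fit` / `score` interface. It assumes a recent scikit-learn release (not the 0.10 version linked in this answer), uses synthetic data from `make_classification` as a stand-in for the site's real features, and the dictionary keys and hyperparameters are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the site's data (an assumption): each row is one
# piece of content, each column a feature such as word count or posting day,
# and y is 1 for "successful" content and 0 otherwise.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every estimator exposes the same fit/score methods, so switching
# algorithms only means changing an entry in this dictionary.
models = {
    "svm": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)  # held-out accuracy

for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")

# Estimated probability that the first held-out item is "successful".
success_prob = models["logistic_regression"].predict_proba(X_test[:1])[0, 1]
print(f"estimated probability of success: {success_prob:.3f}")
```

A single train/test split like this is only a quick first look; the ranking it gives can change with a different split.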

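How do you evaluate the quality of a given method? Use cross-validation (10-fold if you have enough data, 5-fold otherwise). There is a section (5.1) about it in the manual.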
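
Here is a minimal sketch of k-fold cross-validation, assuming a recent scikit-learn where `cross_val_score` lives in `sklearn.model_selection`. The data is synthetic and logistic regression is just a placeholder for whichever method you are evaluating.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix and success labels.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

model = LogisticRegression(max_iter=1000)

# cv=10 splits the data into 10 folds; each fold is held out once while the
# model is trained on the other nine. Use cv=5 when data is scarce.
scores_10 = cross_val_score(model, X, y, cv=10)
scores_5 = cross_val_score(model, X, y, cv=5)

print(f"10-fold accuracy: {scores_10.mean():.3f} +/- {scores_10.std():.3f}")
print(f" 5-fold accuracy: {scores_5.mean():.3f} +/- {scores_5.std():.3f}")
```

Comparing the mean and spread of these scores across methods gives a steadier picture than a single train/test split.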

Adding new data to the training set means retraining your model. Depending on the computing power you have at hand, that may or may not be a problem. If you have a lot of examples, adding one won't change much, so retrain only after you have collected a handful of new examples rather than after each one. That will save computation time.
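
One way to sketch that "retrain only every so often" idea is a small buffer around the model. `BatchRetrainer`, its `batch_size` parameter, and the synthetic data are all hypothetical names made up for this illustration; they are not part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


class BatchRetrainer:
    """Hypothetical helper: refit the model only after `batch_size` new examples."""

    def __init__(self, model, X, y, batch_size=50):
        self.model = model
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        self.batch_size = batch_size
        self.pending_X, self.pending_y = [], []
        self.n_fits = 0
        self._fit()  # initial fit on the existing training set

    def _fit(self):
        self.model.fit(self.X, self.y)
        self.n_fits += 1

    def add_example(self, x, label):
        # Buffer the example; a full refit happens only when the buffer is full.
        self.pending_X.append(x)
        self.pending_y.append(label)
        if len(self.pending_X) >= self.batch_size:
            self.X = np.vstack([self.X, np.asarray(self.pending_X, dtype=float)])
            self.y = np.concatenate([self.y, np.asarray(self.pending_y)])
            self.pending_X, self.pending_y = [], []
            self._fit()


# Synthetic stand-in data: three features, label depends on the first two.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(200, 3))
y0 = (X0[:, 0] + X0[:, 1] > 0).astype(int)

retrainer = BatchRetrainer(LogisticRegression(max_iter=1000), X0, y0, batch_size=50)
for _ in range(120):
    x = rng.normal(size=3)
    retrainer.add_example(x, int(x[0] + x[1] > 0))

# 120 new examples with batch_size=50 trigger two refits on top of the
# initial fit; the last 20 examples are still waiting in the buffer.
print(retrainer.n_fits, retrainer.X.shape, len(retrainer.pending_X))
```

The right `batch_size` depends on how expensive a full refit is and how quickly the data drifts, so treat 50 as a placeholder.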

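Algorithms that learn from a fixed training set are called offline algorithms. Online algorithms, on the other hand, learn each time a new example is presented. Try an online method, such as nearest neighbors, if you really need one.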
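
For a rough idea of what online learning looks like in code, the sketch below uses `GaussianNB.partial_fit`, which updates a Naive Bayes model one mini-batch at a time. Note that this swaps in Naive Bayes (from the list above) rather than nearest neighbors, because it is one of the scikit-learn estimators that supports incremental updates through `partial_fit`; the `make_batch` helper and the data it produces are synthetic assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)


def make_batch(n):
    """Hypothetical data source: n new items with 3 features and a 0/1 label."""
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y


model = GaussianNB()
classes = np.array([0, 1])  # every possible label must be declared up front

# Each call updates the model with the new batch only; earlier batches do
# not need to be kept in memory or revisited.
for _ in range(10):
    X_batch, y_batch = make_batch(20)
    model.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on fresh examples the model has never seen.
X_new, y_new = make_batch(200)
online_accuracy = model.score(X_new, y_new)
success_probability = model.predict_proba(X_new[:1])[0, 1]

print(f"accuracy after 10 incremental updates: {online_accuracy:.3f}")
print(f"estimated probability of success: {success_probability:.3f}")
```

Whether this beats periodic full retraining depends on your data volume; with only a year of website data, a full refit is often cheap enough.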

If you need example code, the scikit-learn documentation is very helpful:

- http://scikit-learn.org/0.10/auto_examples/linear_model/logistic_l1_l2_sparsity.html#example-linear-model-logistic-l1-l2-sparsity-py
- http://scikit-learn.org/0.10/modules/linear_model.html#ridge-regression

http://scikit-learn.org/0.10/user_guide.html
