Can a hobbyist with only a PC and no high-end machine keep up? Which classification algorithms are used for data mining?

Published: 2017-12-22 16:12:01


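It is not much of a problem. Kaggle datasets are generally small, usually a few hundred MB or less, and a typical single machine today has something like 8 cores and 32 GB of RAM, which is more than enough to process them. Competitions with very large datasets are rare; Microsoft Malware Classification Challenge (BIG 2015) is about the only one that is not image-based and still has a very large dataset.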

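That said, Kaggle image competitions demand much more in both data volume and compute. With deep learning and GPU computing now everywhere, anyone who wants to enter an image competition can consider buying a graphics card (a GTX 980 or similar) or renting a GPU instance on AWS.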

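For competitors in mainland China, the real problem with large Kaggle datasets may not be compute but downloading. A common complaint from Chinese competitors on the forums is that downloading Kaggle competition data from within China is too slow. After all, the data has to get over the Great Wall to reach that corner of the world, doesn't it?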

Good luck, competitors.

Reposting an answer from Quora directly:

What are the advantages of different classification algorithms?

Here are some general guidelines I've found over the years.

How large is your training set?

If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN or logistic regression), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren't powerful enough to provide accurate models.

You can also think of this as a generative model vs. discriminative model distinction.

Advantages of some particular algorithms

Advantages of Naive Bayes: Super simple, you're just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn't hold, a NB classifier still often performs surprisingly well in practice. A good bet if you want to do some kind of semi-supervised learning, or want something embarrassingly simple that performs pretty well.
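To make "just doing a bunch of counts" concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. The function names, the toy spam/ham documents, and the choice of add-one (Laplace) smoothing are assumptions made for illustration; they are not taken from the answer above.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Training is only counting: documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing so unseen words do not zero out the product.
            count = word_counts[label][w]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data, for illustration only.
docs = [
    ["cheap", "pills", "now"],
    ["meeting", "at", "noon"],
    ["cheap", "offer", "now"],
    ["lunch", "meeting", "today"],
]
labels = ["spam", "ham", "spam", "ham"]

model = train_nb(docs, labels)
print(predict_nb(model, ["cheap", "now"]))
print(predict_nb(model, ["meeting", "today"]))
```

The per-word log terms are simply added up, which is exactly where the conditional independence assumption enters.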

Advantages of Logistic Regression: Lots of ways to regularize your model, and you don't have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you're unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.
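The online-update property can be sketched as follows: a small logistic regression model whose `update` method takes one stochastic gradient step per example, so new data can be folded in without retraining from scratch. The class name, learning rate, L2 strength, and synthetic one-feature data are all assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

class OnlineLogisticRegression:
    """Hypothetical minimal model: weights updated one example at a time."""

    def __init__(self, n_features, lr=0.1, l2=0.0):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr
        self.l2 = l2  # L2 regularization strength

    def predict_proba(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return sigmoid(z)

    def update(self, x, y):
        """One SGD step on the log loss for a single example, y in {0, 1}."""
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * (err * xi + self.l2 * wi)
                  for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

# Synthetic data (an assumption): label is 1 when the single feature is positive.
rng = random.Random(0)
stream = [[rng.uniform(-3, 3)] for _ in range(200)]

model = OnlineLogisticRegression(n_features=1, lr=0.1, l2=0.001)
for x in stream:
    model.update(x, 1 if x[0] > 0 else 0)

# Later, a new labeled example arrives: update in place, no retraining.
model.update([2.5], 1)

print(model.predict_proba([2.0]), model.predict_proba([-2.0]))
```

The output of `predict_proba` is the probability the paragraph above refers to, which is what makes threshold adjustment straightforward.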

Advantages of Decision Trees: Easy to interpret and explain (for some people -- I'm not sure I fall into this camp). Non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). Their main disadvantage is that they easily overfit, but that's where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they're fast and scalable, and you don't have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.
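The A / B / A example along a single feature x can be illustrated with a tiny decision tree that picks thresholds by Gini impurity. This is a simplified sketch for one numeric feature; the function names, the depth limit, and the nine data points are assumptions for illustration, not a production tree learner.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(xs, ys, depth=0, max_depth=3):
    """Return a leaf label or a (threshold, left_subtree, right_subtree) tuple."""
    if depth == max_depth or len(set(ys)) == 1:
        return majority(ys)
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t)
    if best is None:  # all x values identical, cannot split
        return majority(ys)
    t = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return (
        t,
        build_tree([x for x, _ in left], [y for _, y in left], depth + 1, max_depth),
        build_tree([x for x, _ in right], [y for _, y in right], depth + 1, max_depth),
    )

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree

# Class A at the low end, B in the middle, A again at the high end.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = ["A", "A", "A", "B", "B", "B", "A", "A", "A"]

tree = build_tree(xs, ys)
print(tree)
print([predict_tree(tree, x) for x in (2, 5, 8.5)])
```

No single linear threshold on x separates these classes, but two stacked splits do, which is the point of the example in the paragraph above.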

Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.

To go back to the particular question of logistic regression vs. decision trees (which I'll assume to be a question of logistic regression vs. random forests) and summarize a bit: both are fast and scalable, random forests tend to beat out logistic regression in terms of accuracy, but logistic regression can be updated online and gives you useful probabilities. And since you're at Square (not quite sure what an inference scientist is, other than the embodiment of fun) and possibly working on fraud detection: having probabilities associated to each classification might be useful if you want to quickly adjust thresholds to change false positive/false negative rates, and regardless of the algorithm you choose, if your classes are heavily imbalanced (as often happens with fraud), you should probably resample the classes or adjust your error metrics to make the classes more equal.
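The threshold and class-imbalance points can be sketched in a few lines. The scores and labels below are made up for illustration (6 legitimate transactions, 2 fraudulent), and the class-weight formula shown, n_samples / (n_classes * class_count), is one common heuristic for reweighting, used here as an assumption rather than something the answer prescribes.

```python
from collections import Counter

def error_rates(probs, labels, threshold):
    """False positive rate and false negative rate at a given threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Hypothetical predicted fraud probabilities and true labels (1 = fraud).
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.40, 0.80]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    fpr, fnr = error_rates(probs, labels, threshold)
    print(f"threshold={threshold}: FPR={fpr:.2f}, FNR={fnr:.2f}")

print(balanced_class_weights(labels))
```

Lowering the threshold catches more fraud at the cost of more false alarms, and raising it does the reverse; with a probabilistic classifier this is a one-line change rather than a retraining job.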

But...

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, your choice of classification algorithm might not really matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).

And if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize and Middle Earth, just use an ensemble method to choose them all!
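Selecting a classifier by cross-validation can be sketched as below. The two candidate models (a majority-class baseline and 1-nearest-neighbor), the helper names, and the synthetic two-cluster data are assumptions chosen to keep the example self-contained; a real comparison would use the classifiers discussed above.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle indices once and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(fit, X, y, k=5, seed=0):
    """Average held-out accuracy of `fit` over k folds."""
    correct = 0
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_X = [x for i, x in enumerate(X) if i not in held]
        train_y = [t for i, t in enumerate(y) if i not in held]
        predict = fit(train_X, train_y)
        correct += sum(predict(X[i]) == y[i] for i in held_out)
    return correct / len(X)

def fit_majority(X, y):
    """Baseline: always predict the most common training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label

def fit_1nn(X, y):
    """1-nearest-neighbor with squared Euclidean distance."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in X]
        return y[dists.index(min(dists))]
    return predict

# Synthetic data (an assumption): two well-separated clusters.
X = [(0.1 * i, 0.2 * i) for i in range(10)] + \
    [(5 + 0.1 * i, 5 - 0.2 * i) for i in range(10)]
y = [0] * 10 + [1] * 10

candidates = {"majority": fit_majority, "1-NN": fit_1nn}
scores = {name: cross_val_accuracy(fit, X, y) for name, fit in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

Each candidate is a function that takes training data and returns a predictor, so adding another classifier to the comparison only requires another entry in `candidates`.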
