Common NLP Tools and a Comparison of Machine Learning Toolkits



1. Common NLP tools (reposted from http://www.cppblog.com/baby-fly/archive/2010/10/08/129003.html)

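Making good use of existing toolkits lets researchers get far more done with far less effort.
Below are NLP research toolkits collected and contributed by members of the NLP board.
Suggestions of more and better toolkits are welcome, for the benefit of NLP research in China.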

*NLP Toolbox
  CLT http://complingone.georgetown.edu/~linguist/compling.html
  GATE http://gate.ac.uk/
  Natural Language Toolkit (NLTK) http://nltk.org
  MALLET http://mallet.cs.umass.edu/index.php/Main_Page
  OpenNLP http://opennlp.sourceforge.net/

*English Stemmer
  Snowball http://snowball.tartarus.org/

*English POS Tagger
  Stanford POS Tagger http://nlp.stanford.edu/software/tagger.shtml
  TreeTagger http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
  TnT http://www.coli.uni-saarland.de/~thorsten/tnt/

*English & Chinese Parser
  Stanford Parser http://nlp.stanford.edu/software/lex-parser.shtml
  Berkeley Parser http://nlp.cs.berkeley.edu/Main.html#Parsing

*English Keyphrase Extractor
  KEA http://www.nzdl.org/Kea/index_old.html

*English Named Entity Recognizer
  Stanford NER http://nlp.stanford.edu/software/CRF-NER.shtml

*Chinese Word Segmenter
  ICTCLAS (Chinese Academy of Sciences) http://www.nlp.org.cn/project/project.php?proj_id=6
  Stanford Word Segmenter http://nlp.stanford.edu/software/segmenter.shtml

*Topic Modeling Tools
  Matlab Topic Modeling Toolbox http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
  GibbsLDA++ http://gibbslda.sourceforge.net/
  GLDA http://code.google.com/p/glda/

*Conditional Random Fields
  FlexCRFs http://flexcrfs.sourceforge.net/ (includes an MPI parallel version)
  CRF++ http://crfpp.sourceforge.net/
  CRF Package http://crf.sourceforge.net/
  CRF Matlab http://www.cs.ubc.ca/~murphyk/Software/CRFall.zip
  CRFsuite http://www.chokkan.org/software/crfsuite/
  SGD with CRF http://leon.bottou.org/projects/sgd
  HCRF http://sourceforge.net/projects/hcrf/

*Support Vector Machine
  LIBSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  LIBLINEAR http://www.csie.ntu.edu.tw/~cjlin/liblinear/
  Pegasos http://www.cs.huji.ac.il/~shais/code/index.html

*Search Engines
  Lucene http://lucene.apache.org/
  FirteX (Chinese Academy of Sciences) http://www.firtex.org/

*Machine Learning and Data Mining Toolbox
  Weka http://www.cs.waikato.ac.nz/ml/weka/

  
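2. The best libraries for four languages, plus Java on Hadoop, should be enough to help turn machine learning into a dependable business tool. Below is an introduction to some of the most common and practically useful open source machine learning tools (reposted from http://www.open-open.com/news/view/15b88af)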

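Python: Data scientists have flocked to Python as an open-ended alternative to analytical languages such as R, and many employers are now actively looking for people with big-data experience, with Python proficiency among the most important required skills. As a result, the ever-growing list of Python software has come to include a large number of libraries closely tied to machine learning.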

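The first recommendation is scikit-learn (official site: http://scikit-learn.org/stable/). It comes loaded with algorithms and modules, is widely appreciated on GitHub (close to 2,000 forks), and has won over a number of major industry players. Close behind is PyBrain (official site: http://www.pybrain.org/), which is designed to be easy to use and to interface with other powerful tools. As the name suggests, PyBrain focuses on neural networks and unsupervised learning, and it also provides a mechanism for training and refining algorithms.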

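Go: As a systems language created by Google, Go was designed with parallelism in mind, which would seem to make it an ideal environment for writing machine learning libraries. The related libraries are still small and arguably only just emerging, but a few general-purpose options already deserve attention. The most highly regarded is GoLearn (official site: https://github.com/sjwhitworth/golearn), which its developers describe as a "batteries included" machine learning library. It provides tools for filtering, classification, and regression. A smaller and more basic library is mlgo (official site: https://code.google.com/p/mlgo/); it currently offers very few algorithms, but more are planned.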

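Java on Hadoop: Mahout (Hindi for "elephant rider"; official site: https://mahout.apache.org/) includes a number of common machine learning methods that run on everyone's favorite big data framework. The package is organized around algorithms rather than methodologies, so users need some understanding of the algorithms themselves. That said, with a little study it is not hard to see how the pieces fit together; for example, a user-based recommender system can be built in just a few lines of code.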
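To make the "user-based recommender" idea concrete, here is a minimal sketch in plain Python. It does not use Mahout or its API; the ratings, the user names, and the helper names `cosine` and `recommend` are all made up for illustration, and the similarity is a simplified cosine computed only over co-rated items.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}. Not taken from Mahout.
RATINGS = {
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob":   {"a": 5.0, "b": 3.0, "c": 4.0, "d": 5.0},
    "carol": {"a": 1.0, "b": 5.0, "d": 2.0},
}

def cosine(u, v):
    """Cosine similarity restricted to the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, k=2):
    """Score unseen items by the similarity-weighted ratings of the k nearest users."""
    target = ratings[user]
    neighbours = sorted(
        ((cosine(target, r), name) for name, r in ratings.items() if name != user),
        reverse=True,
    )[:k]
    scores, weights = {}, {}
    for sim, name in neighbours:
        if sim <= 0:
            continue
        for item, rating in ratings[name].items():
            if item not in target:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    # Highest predicted rating first.
    return sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)

recommendations = recommend("alice", RATINGS)
```

Here `recommendations` is a list of `(predicted_rating, item)` pairs for items that "alice" has not rated yet. A production system would add mean-centering, a minimum-overlap threshold, and distributed computation, which is where a framework like Mahout comes in.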

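Oryx (official site: https://github.com/cloudera/oryx), promoted by Cloudera, is another Hadoop-based machine learning project. Its distinguishing feature is that it takes the analysis of Mahout results a step further by delivering results as real-time streams rather than processing batch jobs. Unfortunately the project is still at an early stage (note that it is a project, not a finished product), but its steady evolution and improvement have earned it attention.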

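Java: Unlike the Hadoop-oriented Mahout mentioned above, several other machine learning libraries for Java also have wide user bases. Weka (official site: http://www.cs.waikato.ac.nz/~ml/weka/), created at the University of Waikato in New Zealand, is a workbench-style application that adds visualization and data mining features to the usual collection of algorithms. For those who want to build a front end for their work and intend to do their initial development in Java, Weka may be the best place to start. A more traditional library, Java-ML (official site: http://java-ml.sourceforge.net/), is also serviceable, but it is better suited to those already comfortable using Java for machine learning.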

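JavaScript: You may have heard the joke known as "Atwood's Law": anything that can be written in JavaScript will eventually be written in JavaScript. The same holds for machine learning libraries. For now, relatively few options written in JavaScript exist in this area, and most are single algorithms rather than complete libraries, but some useful tools are beginning to stand out. ConvNetJS (official site: http://cs.stanford.edu/people/karpathy/convnetjs/) lets you train neural networks for deep learning directly in the browser, while the project named brain (official site: https://github.com/harthur/brain) delivers neural networks as an installable NPM module. The Encog library (official site: https://github.com/encog/encog-javascript) is also worth a look, and it is available for several platforms: Java, C#, C/C++, and JavaScript.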


 

3. A feature comparison of machine learning toolkits such as Weka and scikit-learn (reposted from http://blog.csdn.net/waleking/article/details/7584147 and http://www.shogun-toolbox.org/page/features/)


Legend: Y = tick (supported), N = cross (not supported), ? = untested

feature | shogun | weka | kernlab | dlib | nieme | orange | java-ml | pyML | mlpy | pybrain | torch3 | scikit-learn

General Features
 Graphical User Interface | N | Y | N | Y | Y | Y | N | N | N | Y | Y | N
 One Class Classification | Y | Y | Y | Y | N | N | N | Y | N | N | N | Y
 Classification | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y
 Multiclass classification | Y | Y | Y | N | Y | N | Y | Y | Y | Y | Y | Y
 Regression | Y | Y | Y | Y | Y | Y | N | Y | N | Y | Y | Y
 Structured Output Learning | Y | N | N | N | Y | N | N | N | N | N | N | N
 Pre-Processing | Y | Y | Y | Y | Y | Y | Y | Y | Y | N | Y | Y
 Built-in Model Selection Strategies | Y | Y | Y | Y | N | Y | Y | Y | N | N | N | Y
 Visualization | N | Y | N | N | Y | Y | N | Y | Y | Y | Y | Y
 Test Framework | Y | Y | N | Y | Y | ? | Y | N | N | N | N | Y
 Large Scale Learning | Y | N | N | Y | Y | N | N | N | Y | N | N | N
 Semi-supervised Learning | N | N | N | N | N | N | N | N | N | N | N | N
 Multitask Learning | Y | N | N | N | N | N | N | N | N | N | N | N
 Domain Adaptation | Y | N | N | N | N | N | N | N | N | N | N | N
 Serialization | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | N | Y
 Parallelized Code | Y | Y | N | Y | N | N | N | N | N | N | N | Y
 Performance Measures (auROC etc) | Y | Y | N | Y | Y | Y | Y | Y | Y | Y | Y | Y
 Image Processing | N | N | N | Y | N | N | N | N | N | N | N | N

Supported Operating Systems
 Linux | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y
 Windows | Y | Y | Y | Y | Y | Y | Y | N | Y | Y | Y | Y
 Mac OSX | Y | Y | Y | Y | Y | Y | Y | Y | Y | N | Y | Y
 Other Unix | Y | Y | Y | Y | Y | Y | Y | N | Y | N | Y | Y

Language Bindings
 Python | Y | N | N | N | Y | Y | N | Y | Y | Y | N | Y
 R | Y | N | Y | N | N | N | N | N | N | N | N | N
 Matlab | Y | N | N | N | N | N | N | N | N | N | N | N
 Octave | Y | N | N | N | N | N | N | N | N | N | N | N
 C/C++ | Y | N | N | Y | Y | N | N | N | N | N | Y | N
 Command Line | Y | N | N | N | N | N | N | N | Y | Y | Y | N
 Java | Y | Y | N | N | Y | N | Y | N | N | N | N | N
 C# | Y | N | N | N | N | N | N | N | N | N | N | N
 Lua | Y | N | N | N | N | N | N | N | N | N | N | N
 Ruby | Y | N | N | N | N | N | N | N | N | N | N | N

SVM Solvers
 SVMLight | Y | Y | N | N | N | N | N | N | N | N | N | N
 LibSVM | Y | Y | Y | Y | Y | Y | Y | Y | N | Y | N | Y
 SVM Ocas | Y | N | N | Y | N | N | N | N | N | N | N | N
 LibLinear | Y | Y | N | N | N | N | N | N | N | N | N | Y
 BMRM | Y | N | N | N | N | N | N | N | N | N | N | N
 LaRank | Y | N | N | N | N | N | N | N | N | N | N | N
 SVMPegasos | N | Y | N | Y | Y | N | N | N | N | N | N | N
 SVM SGD | Y | N | N | N | N | N | N | N | N | N | N | Y
 other | Y | N | Y | N | N | N | N | Y | Y | N | Y | N

Regression
 Kernel Ridge Regression | Y | N | N | N | N | N | N | Y | N | N | N | Y
 Support Vector Regression | Y | Y | Y | N | N | N | N | Y | N | N | Y | Y
 Gaussian Processes | N | Y | Y | N | N | N | N | N | N | N | N | Y
 Relevance Vector Machine | N | Y | Y | Y | N | N | N | N | N | N | N | N

Multiple Kernel Learning
 MKL | Y | N | N | N | N | N | N | N | N | N | N | N
 q-norm MKL | Y | N | N | N | N | N | N | N | N | N | N | N

Classifiers
 Naive Bayes | Y | Y | N | N | N | Y | N | N | N | Y | Y | Y
 Bayesian Networks | N | Y | N | Y | N | N | N | N | N | Y | N | N
 Multi Layer Perceptron | N | Y | N | Y | Y | N | N | N | N | Y | Y | N
 RBF Networks | N | Y | N | Y | N | N | N | N | N | Y | N | N
 Logistic Regression | Y | Y | ? | N | Y | Y | N | N | N | N | N | Y
 LASSO | N | N | ? | N | Y | N | N | N | N | N | N | Y
 Decision Trees | N | Y | N | N | N | Y | Y | N | N | N | N | N
 k-NN | Y | Y | Y | Y | N | Y | Y | Y | Y | Y | Y | Y

Linear Classifiers
 Linear Programming Machine | Y | N | N | N | N | N | N | N | N | N | N | N
 LDA | Y | N | N | N | N | N | N | N | Y | N | N | Y

Distributions
 Markov Chains | Y | N | N | N | N | N | Y | N | N | N | N | N
 Hidden Markov Models | Y | N | N | N | N | N | N | N | N | N | Y | Y

Kernels
 Linear | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y
 Gaussian | Y | Y | Y | Y | N | Y | Y | Y | Y | Y | Y | Y
 Polynomial | Y | Y | Y | Y | N | Y | Y | Y | Y | Y | Y | Y
 String Kernels | Y | Y | Y | N | N | N | N | Y | N | N | N | N
 Sigmoid Kernel | Y | Y | N | Y | N | Y | N | N | N | N | N | Y
 Kernel Normalizer | Y | ? | Y | N | N | N | N | Y | N | N | N | ?

Feature Selection
 Forward | N | Y | N | ? | N | Y | Y | Y | Y | N | N | Y
 Wrapper methods | N | Y | N | ? | N | ? | Y | Y | Y | N | N | N
 Recursive Feature Selection | N | Y | N | Y | N | ? | Y | Y | Y | N | N | Y

Missing Features
 Mean value imputation | N | Y | N | N | N | Y | Y | N | Y | N | N | N
 EM-based/model based imputation | N | Y | N | N | N | Y | N | N | N | N | N | N

Clustering
 Hierarchical Clustering | Y | Y | N | N | N | Y | N | N | Y | N | N | Y
 k-means | Y | Y | Y | Y | N | Y | Y | Y | Y | Y | Y | Y

Optimization
 BFGS | N | Y | N | Y | Y | N | N | N | N | N | N | N
 conjugate gradient | N | N | N | Y | N | N | N | N | N | N | N | N
 gradient descent | Y | Y | Y | N | Y | N | N | N | Y | Y | Y | Y
 bindings to CPLEX | Y | N | N | N | N | N | N | N | N | N | N | N
 bindings to Mosek | N | N | N | N | N | N | N | N | N | N | N | N
 bindings to other solver | Y | N | Y | N | N | Y | N | Y | N | N | N | Y

Supported File Formats
 Binary | Y | Y | N | N | N | N | N | N | N | Y | N | Y
 Arff | N | Y | N | N | N | N | Y | N | N | N | N | N
 HDF5 | Y | N | Y | N | N | N | N | N | N | N | N | N
 CSV | N | Y | Y | N | N | Y | Y | Y | Y | N | Y | Y
 libSVM/SVMLight format | Y | Y | N | Y | Y | N | N | Y | N | Y | N | Y
 Excel | N | N | Y | N | N | Y | N | N | N | N | N | N

Supported Data Types
 Sparse Data Representation | Y | Y | N | Y | Y | Y | Y | Y | Y | Y | N | Y
 Dense Matrices | Y | Y | Y | Y | N | Y | Y | Y | Y | Y | Y | Y
 Strings | Y | Y | Y | Y | N | N | N | N | N | N | Y | Y
 Support for native (e.g. C) types (char, signed and unsigned int8, int16, int32, int64, float, double, long double) | Y | N | N | Y | N | N | N | N | Y | N | N | Y

4. An introduction to open source tools for machine learning

I. Information Retrieval
1. Lemur/Indri
The Lemur Toolkit for Language Modeling and Information Retrieval
http://www.lemurproject.org/
Indri:
Lemur's latest search engine

2. Lucene/Nutch
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
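Lucene is a top-level Apache open source project released under the Apache License 2.0. It is written entirely in Java and has ports to Perl, C/C++, .NET, and other languages.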
http://lucene.apache.org/
http://www.nutch.org/
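
The core data structure behind a full-text search library such as Lucene is an inverted index, which maps each term to the documents containing it. The following is a minimal sketch of that idea in plain Python, not Lucene's API; the sample documents and the function names `tokenize`, `build_index`, and `search` are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical mini-collection.
DOCS = {
    1: "Lucene is a text search engine library",
    2: "Nutch is a web crawler built on Lucene",
    3: "Wget retrieves files over HTTP",
}
INDEX = build_index(DOCS)
```

For example, `search(INDEX, "lucene library")` intersects the posting sets of both terms. A real engine adds relevance ranking, on-disk index segments, and language-specific analyzers on top of this.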

3. WGet
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
http://www.gnu.org/software/wget/wget.html

II. Natural Language Processing
1. EGYPT: A Statistical Machine Translation Toolkit
http://www.clsp.jhu.edu/ws99/projects/mt/
Includes four tools, among them GIZA.

2. GIZA++ (Statistical Machine Translation)
http://www.fjoch.com/GIZA++.html
GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT) which was developed by the Statistical Machine Translation team during the summer workshop in 1999 at the Center for Language and Speech Processing at Johns-Hopkins University (CLSP/JHU). GIZA++ includes a lot of additional features. The extensions of GIZA++ were designed and written by Franz Josef Och.
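Franz Josef Och has worked at RWTH Aachen University in Germany, at ISI (the Information Sciences Institute of the University of Southern California), and at Google. GIZA++ now has a Windows port and provides good support for IBM Models 1-5.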
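
To give a feel for what "IBM Model 1" means, here is a minimal sketch of its EM training loop in plain Python. This is not GIZA++ code: the three-sentence toy corpus and the function name `train_ibm_model1` are assumptions for illustration, and the NULL source word used in the full model is left out for brevity.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate lexical translation probabilities t(f|e) with EM.

    pairs: list of (foreign_tokens, english_tokens) sentence pairs.
    """
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned
        # E-step: distribute each foreign word over the English words.
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Hypothetical toy parallel corpus (German -> English).
PAIRS = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
T = train_ibm_model1(PAIRS)
```

After a few iterations `T[("das", "the")]` dominates the other candidates for "the", because "das" co-occurs with "the" in two sentence pairs. Models 2-5 add position, fertility, and distortion parameters on top of these lexical probabilities.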

3. PHARAOH (Statistical Machine Translation)
http://www.isi.edu/licensed-sw/pharaoh/
a beam search decoder for phrase-based statistical machine translation models

4. OpenNLP:
http://opennlp.sourceforge.net/
Includes more than 20 tools, such as Maxent.

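btw: these SMT tools all seem to like Egypt-themed names, such as GIZA, PHARAOH, and Cairo. Och developed GIZA++ while at ISI, and PHARAOH was developed by Philipp Koehn, who also came from ISI, so the connections are quite tangled.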

5. MINIPAR by Dekang Lin (Univ. of Alberta, Canada)
MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient, on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
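
The precision and recall figures quoted above can be combined into a single F1 score (the harmonic mean of the two). A small worked calculation, with `f1_score` as an illustrative helper name:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# MINIPAR figures quoted above: about 88% precision and 80% recall.
minipar_f1 = f1_score(0.88, 0.80)
print(f"{minipar_f1:.3f}")  # 0.838
```

So the quoted numbers correspond to an F1 of roughly 0.84 on dependency relationships.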
The binary can be downloaded free of charge after filling in a form.
http://www.cs.ualberta.ca/~lindek/minipar.htm

6. WordNet
http://wordnet.princeton.edu/
WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
WordNet was developed by the Cognitive Science Laboratory at Princeton University under the direction of Professor George A. Miller (Principal Investigator).
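The latest version of WordNet at the time of writing is 2.1 (for Windows and Unix-like OS), with binary, source, and documentation packages.
The online version of WordNet is at http://wordnet.princeton.edu/perl/webwn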

7. HowNet
http://www.keenage.com/
HowNet is an on-line common-sense knowledge base unveiling inter-conceptual relations and inter-attribute relations of concepts as connoting in lexicons of the Chinese and their English equivalents.
Developed by Zhendong Dong and Qiang Dong of CAS; a resource similar to WordNet.

8. Statistical Language Modeling Toolkit
http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
The CMU-Cambridge Statistical Language Modeling toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models.

9. SRI Language Modeling Toolkit
www.speech.sri.com/projects/srilm/
SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995.
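
As a rough idea of what the two language modeling toolkits above build, here is a minimal bigram model with add-one (Laplace) smoothing in plain Python. It is only a sketch: it is not the SRILM or CMU-Cambridge API, real toolkits use stronger smoothing and higher-order n-grams, and the corpus and function names are made up.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(sentence, unigrams, bigrams):
    """Per-token perplexity of one sentence under the smoothed model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(
        math.log(bigram_prob(p, w, unigrams, bigrams))
        for p, w in zip(tokens, tokens[1:])
    )
    return math.exp(-log_prob / (len(tokens) - 1))

# Hypothetical training corpus.
CORPUS = ["the cat sat", "the dog sat", "the cat ran"]
UNI, BI = train_bigram(CORPUS)
```

A sentence in a word order seen during training, such as "the cat sat", gets a lower perplexity than a scrambled one such as "sat cat the", which is the property speech recognizers and taggers rely on.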

10. ReWrite Decoder
http://www.isi.edu/licensed-sw/rewrite-decoder/
The ISI ReWrite Decoder Release 1.0.0a by Daniel Marcu and Ulrich Germann. It is a program that translates from one natural language into another using statistical machine translation.

11. GATE (General Architecture for Text Engineering)
http://gate.ac.uk/
A Java Library for Text Engineering

III. Machine Learning
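1. YASMET: Yet Another Small MaxEnt Toolkit (Statistical Machine Learning)
http://www.fjoch.com/YASMET.html
Written by Franz Josef Och. In addition, the OpenNLP project has a Java MaxEnt tool that estimates parameters with GIS; it was ported to C++ by Zhang Le of Northeastern University (currently studying in the UK).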

2. LibSVM
Developed by Chih-Jen Lin at National Taiwan University (NTU); available in C++, Java, Perl, C#, and other languages.
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM is an integrated software for support vector classification, (C-SVC, nu-SVC ), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM ). It supports multi-class classification.
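
LIBSVM reads training data in a sparse text format, one example per line: `<label> <index>:<value> <index>:<value> ...`, with feature indices starting at 1. The sketch below parses such lines in plain Python; the helper names `parse_libsvm_line` and `to_dense` and the sample lines are illustrative and are not part of LIBSVM itself.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features

def to_dense(features, dimension):
    """Expand a sparse {index: value} map (1-based) into a dense list."""
    return [features.get(i, 0.0) for i in range(1, dimension + 1)]

# Hypothetical two-example data set.
SAMPLE = ["+1 1:0.5 3:-1.2", "-1 2:1.0"]
PARSED = [parse_libsvm_line(line) for line in SAMPLE]
```

Features that are absent from a line are implicitly zero, which is what keeps the format compact for high-dimensional text data.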

3. SVM Light
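Developed by Thorsten Joachims (now at Cornell) while he was at the University of Dortmund; the best-known SVM package after LibSVM. It is open source, written in C, and can be used for ranking problems.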
http://svmlight.joachims.org/

4. CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
a software package for clustering low- and high-dimensional datasets
The package is distributed only as executables and a library; the source code is not available for download.

5. CRF++
http://chasen.org/~taku/software/CRF++/
Yet Another CRF toolkit for segmenting/labelling sequential data
CRFs (Conditional Random Fields) grew out of HMMs and MEMMs and are widely used in IE, IR, and NLP.
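
HMMs and linear-chain CRFs share the same decoding step: given per-position and transition scores, find the best label sequence with the Viterbi algorithm. Below is a minimal sketch in plain Python. It is not CRF++ code; in a real CRF the scores come from learned feature weights, whereas every number and label here is a hypothetical value chosen for illustration.

```python
def viterbi(observations, states, start, trans, emit):
    """Return the highest-scoring state sequence for additive (log-space) scores."""
    # best[t][s] = (score of the best path ending in state s at time t, previous state)
    best = [{s: (start[s] + emit[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] + trans[p][s] + emit[s][obs], p) for p in states
            )
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Hypothetical log-space scores for a two-tag problem (N = noun, V = verb).
STATES = ["N", "V"]
START = {"N": 0.0, "V": -2.0}
TRANS = {"N": {"N": -1.5, "V": -0.5}, "V": {"N": -0.5, "V": -2.0}}
EMIT = {"N": {"dogs": -0.5, "bark": -2.0}, "V": {"dogs": -3.0, "bark": -0.5}}

TAGS = viterbi(["dogs", "bark"], STATES, START, TRANS, EMIT)
```

The dynamic program costs O(T * S^2) for T tokens and S labels, which is why sequence labelling stays tractable even though the number of possible label sequences grows exponentially.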

6. SVM Struct
http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
Like SVM Light, developed by Thorsten Joachims at Cornell.
SVMstruct is a Support Vector Machine (SVM) algorithm for predicting multivariate outputs. It performs supervised learning by approximating a mapping
h: X --> Y
using labeled training examples (x1,y1), ..., (xn,yn).
Unlike regular SVMs, however, which consider only univariate predictions like in classification and regression, SVMstruct can predict complex objects y like trees, sequences, or sets. Examples of problems with complex outputs are natural language parsing, sequence alignment in protein homology detection, and markov models for part-of-speech tagging.
SVMstruct can be thought of as an API for implementing different kinds of complex prediction algorithms. Currently, we have implemented the following learning tasks:
SVMmulticlass: Multi-class classification. Learns to predict one of k mutually exclusive classes. This is probably the simplest possible instance of SVMstruct and serves as a tutorial example of how to use the programming interface.
SVMcfg: Learns a weighted context free grammar from examples. Training examples (e.g. for natural language parsing) specify the sentence along with the correct parse tree. The goal is to predict the parse tree of new sentences.
SVMalign: Learning to align sequences. Given examples of how sequence pairs align, the goal is to learn the substitution matrix as well as the insertion and deletion costs of operations so that one can predict alignments of new sequences.
SVMhmm: Learns a Markov model from examples. Training examples (e.g. for part-of-speech tagging) specify the sequence of words along with the correct assignment of tags (i.e. states). The goal is to predict the tag sequences for new sentences.

IV. Misc:
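1. Notepad++: an open source editor that supports the keywords of dozens of languages, including C#, Perl, and CSS; its features rival recent versions of UltraEdit and Visual Studio .NET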
http://notepad-plus.sourceforge.net

2. WinMerge: compares text content and finds the differences between two versions of a program
winmerge.sourceforge.net/
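
The same kind of version comparison can be scripted with Python's standard `difflib` module. This is only a sketch of the idea and has nothing to do with WinMerge's own implementation; the two code snippets being compared and the helper name `diff_versions` are made up.

```python
import difflib

def diff_versions(old_text, new_text, old_name="old", new_name="new"):
    """Return a unified diff between two versions of a text as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )

# Hypothetical old and new versions of a small program.
OLD = "def add(a, b):\n    return a - b\n"
NEW = "def add(a, b):\n    return a + b\n"
DIFF = diff_versions(OLD, NEW)
print("\n".join(DIFF))
```

Lines starting with `-` exist only in the old version and lines starting with `+` only in the new one, the same information a graphical diff tool shows side by side.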

3. OpenPerlIDE: an open source Perl editor with built-in compilation and line-by-line debugging
open-perl-ide.sourceforge.net/
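ps: the best editor I have come across is still VS .NET. It puts +/- markers in front of each function for expand/collapse, supports block copy/cut/paste, lets ctrl+c/ctrl+x/ctrl+v grab a whole line at a time, and comments/uncomments multiple lines with ctrl+k+c/ctrl+k+u, and more besides... Visual Studio .NET is really kool:D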

4. Berkeley DB
http://www.sleepycat.com/
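Berkeley DB is not a relational database. It is described as an embedded database: in client/server terms, its client and server share one address space. Because databases originally grew out of file systems, it is closer to a dictionary-style database of key-value pairs. The database files are serialized to disk, so it is not limited by memory size. BDB has a sub-edition, Berkeley DB XML, which is an XML database (presumably storing data as XML documents). BDB has been embedded in products by companies including Microsoft, Google, HP, Ford, and Motorola.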
Berkeley DB (libdb) is a programmatic toolkit that provides embedded database support for both traditional and client/server applications. It includes b+tree, queue, extended linear hashing, fixed, and variable-length record access methods, transactions, locking, logging, shared memory caching, database recovery, and replication for highly available systems. DB supports C, C++, Java, PHP, and Perl APIs.
It turns out that at a basic level Berkeley DB is just a very high performance, reliable way of persisting dictionary style data structures - anything where a piece of data can be stored and looked up using a unique key. The key and the value can each be up to 4 gigabytes in length and can consist of anything that can be crammed in to a string of bytes, so what you do with it is completely up to you. The only operations available are "store this value under this key", "check if this key exists" and "retrieve the value for this key" so conceptually it's pretty simple - the complicated stuff all happens under the hood.
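
The three operations described above (store, check, retrieve) can be tried out with Python's standard `dbm` package, which offers a similar on-disk key-value interface. Note that this sketch uses `dbm.dumb`, a pure-Python store, as a stand-in; it is not Berkeley DB, and the file name and keys are hypothetical.

```python
import dbm.dumb
import os
import tempfile

def demo_key_value_store(directory):
    """Store, test for, and fetch byte-string values in an on-disk key-value file."""
    path = os.path.join(directory, "example_store")  # hypothetical file name
    with dbm.dumb.open(path, "c") as db:
        db[b"user:42"] = b"alice"       # store this value under this key
        has_key = b"user:42" in db      # check if this key exists
        missing = b"user:99" in db
        value = db[b"user:42"]          # retrieve the value for this key
    # Reopen read-only to confirm the data was persisted to disk.
    with dbm.dumb.open(path, "r") as db:
        persisted = db[b"user:42"]
    return has_key, missing, value, persisted

with tempfile.TemporaryDirectory() as tmp:
    RESULT = demo_key_value_store(tmp)
```

Keys and values are plain byte strings, so any structure (JSON, serialized objects, fixed-width records) is the application's responsibility, just as the description above says.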
case study:
Ask Jeeves uses Berkeley DB to provide an easy-to-use tool for searching the Internet.
Microsoft uses Berkeley DB for the Groove collaboration software
AOL uses Berkeley DB for search tool meta-data and other services.
Hitachi uses Berkeley DB in its directory services server product.
Ford uses Berkeley DB to authenticate partners who access Ford's Web applications.
Hewlett Packard uses Berkeley DB in several products, including storage, security and wireless software.
Google uses Berkeley DB High Availability for Google Accounts.
Motorola uses Berkeley DB to track mobile units in its wireless radio network products.

5. R
http://www.r-project.org/
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
The R statistical package is similar to MATLAB; both are used for scientific computing.


