R in Business Intelligence

1 篇文章 0 订阅
January 18, 2015
By Jan Górecki - R

(This article was first published on  Jan Gorecki - R, and kindly contributed to R-bloggers)     

Business Intelligence (BI) can be simply described as extracting useful informations from the data. This is quite a broad process as the source data structure (and quality) can vary, as well the useful information structure can vary. More technically process of such transformation can be described as ETL (extract, transform, load), plus presentation of the useful information

The idea to review usefulness of R in those processes came after I saw Wes McKinney's slides about python pandas and BI: Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013).
I recommend to preview the slides, just for better understanding where pandas is (been in 2013) in BI. Python pandas is presented more or less as BI ready processing engine.

R in BI

In my opinion R is fully capable (even more than pandas) to serve as engine for BI related processes.
R has naturally (developed for decades) broad range of statistical tools available (multiple repositories with thousands of packages). I will skip this enormous feature of R and just focus on simple BI case of extraction, transformation, loading and presentation.
Below are listed packages which directly address the steps in basic BI process.

Extraction

  • DBI - native database drivers for multiple vendors, top performance.
  • RODBC - ODBC database driver connection.
  • RJDBC - JDBC database driver connection.
  • data.table's fread - very fast csv files reader.
  • tons other packages to support different format of data (e.g.: xlsx, xml, json, sas, spss, stata).

Good overview of R database interfaces available on Burns Statistics.

Transformation

  • data.table - powerful data transformation tool, uses from[ where, select|update, group by ][...] syntax.
  • dplyr - also powerful, but less scalable, data transformation tool, uses from %>% where %>% group by %>% select %>% ... syntax. Pivot and unpivot (cast and melt) are located in tidyr package.

Loading

  • DBI - native database drivers for multiple vendors, top performance.
  • RODBC - ODBC database driver connection.
  • RJDBC - JDBC database driver connection.

Presentation

Presentation of useful information is totallly different task than ETL process, it can be easily outsourced to any BI dashboard tool by simply populating the data structure expected by particular tool. Yet when using R you don't even need to push prepared data to external presentation tool. You can produce a web application dashboard directly from R.

  • shiny - Web Application Framework for R.
  • opencpu - HTTP API to R.
  • httpuv - HTTP and WebSocket server library, also the core of shiny package.
  • Rook - web server interface for R.

Using mentioned packages you are capable to host interactive web applications. Those can generate interactive plots, interactivly query the data. All is working on engine of R session, can execute R functions so it can also utilize huge plotting and statistical analysis capabilities available in any R packages. As of now I haven't heard of any open source end-to-end dashboard product which could allow to simply populate data marts and query against them in shiny, but as you can see in Extras section below it is not so hard to build a fully customized one. Yet there exists some big dashboard products for R, e.g. Mango Navigator, but I'm not sure how well they can cover BI needs.

Extras

Packages mentioned above are core "engines" for each of the section, there are multiple packages which simply wraps functions in those packages, unify or simplify its usage. Those are listed below, surely not all which exists:

  • ETLUtils - read (only) data from multiple database. Unify DBI, RODBC, RJDBC calls.
  • db.r - database schema explorer.
  • dplyr - it can use DBI database connections behind the scene.
  • sqldf - helper to operate on R data.frame using SQL statements, not so efficient as data.table or dplyr, yet may be quite useful for SQL folks.
  • dwtools - data warehouse related functions. Database queries details logging (statement, nrow in, nrow out, timing in ns, etc.), processing details logging. Unify DBI, RODBC calls. (self-promotion)
  • shinyBI - shiny application as package, can perform interactive aggregation on user provided dataset (using data.table as backend for top performance), and interactive plotting of the aggregated data. (self-promotion)

The sentence from pandas slides "Stop believing in the 'one tool to rule them all'" is very true, but if we consider R as container for tools, which are the packages, then we might end up with one container to rule them all.

Crucial factors for business to adopt R

Performance and scalability

Extraction / Loading

I doubt if DBI performance of extraction or loading can be easily beaten. Still the most common interface for ETL tools is ODBC connection which is dramatically slower than native drivers available via DBI. Similarly data.table's fread (fast read for csv files) can be hard to beat.
Detailed benchmark of DBI-compliant package ROracle is available at Oracle blogs:

Transformation

This may suprise you but R beats python pandas in terms of performance. There are benchmarks in the web which may state the opposite, but in fact they are old, or they were made against the bug in the opposite tool.
Linking the recent benchmarks of R vs pandas:

Presentation

Measuring performance of web apps seems to be totally different story which I will not cover. Important factor is that within you web application you can make all the computation in any R package you prefer as the web application is already working in R. Good example is shinyBI shiny app mentioned above in Extras which uses data.table backend for best performance.

Support and licensing

Open source is often considered as problem in terms of support and licensing.
On the enterprise level it is usually necessary to implement the solutions for which support is already contracted.
It is very resonable, but people should be aware of very good alternatives.
This gap has been filled by companies like Revolution Analytics or RStudio.
Besides of big R companies there are other ways to get (even more) effective support like contracting support from:

  • packages authors, contributors
  • stackoverflow top answerers for a tag
  • tons of R consultancy companies

In terms of licensing, there are many products which are available under dual license. If there is no such for a product you are interested in, then you can try to integrate to the current license, it is often possible, or ask the author for a dual license version.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值