Look! It’s your data!

Reposted November 17, 2015, 16:25:12
December 10, 2014
By 

(This article was first published on Learning Slowly » R, and kindly contributed to R-bloggers)

If a picture is worth a thousand words, then how many tables are a single visualization worth? Exploratory data analysis is a great way to see what is and is not in your dataset.

I work in a hospital research group. Most of my colleagues are more comfortable in SAS than in R. One of the first ideas I had to help the workflow here was to create an R script that generated a series of data plots to visualize variables within our analysis dataset. This allowed our statistical programmers to do some data checking for quality, outliers and missing data before handing the data set to our biostatisticians. This moved the error checking step up in the project workflow, hopefully saving some time and making us all more productive.

The problem was that this script required custom edits for every data set. The stat programmers wanted to learn how to make those edits, but in the heat of (deadline) battle, time for learning new things rarely materializes, so it gets put off until “later”.

I’d been contemplating how to use Shiny apps for a long time, and this finally made me jump in. It was actually pretty simple to get started: I just dumped a bunch of code I had been using (read: copying and pasting into lots of jobs) into the server.R file. I created a ui.R and BAM! The xportEDA Shiny app was born!

xportEDA: A Shiny app for data visualization

xportEDA is a Shiny app that generates graphics used to explore your data set. It started out for SAS xport files only, but we’ve added csv and some rdata file support. Written in [R](http://cran.r-project.org/), this shiny app requires the following packages:

  • shiny
  • foreign (to load SAS xport files)
  • ggplot2 (nice figures)
  • reshape2
  • RColorBrewer (color palettes are my friend)

Description

The xportEDA app makes it easy to visualize your data quickly, giving you a jump on your data wrangling without any programming effort.

You supply the app with a data file. The app can read in a data.frame from a SAS xpt, csv or rdata file, and generates a set of data visualizations.
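As a rough sketch of this loading step (not the app's actual code), a reader function could dispatch on the file extension. The helper name `read_any_data` is hypothetical; `foreign::read.xport`, `read.csv` and `load` are the standard R functions for these three formats.

```r
# Hypothetical helper: load a data.frame from an xpt, csv or rdata file.
read_any_data <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext == "xpt") {
    foreign::read.xport(path)
  } else if (ext == "csv") {
    # Character columns become factors, matching the app's described default
    read.csv(path, stringsAsFactors = TRUE)
  } else if (ext %in% c("rda", "rdata")) {
    # load() restores objects by name; keep the first data.frame found
    env <- new.env()
    objs <- load(path, envir = env)
    is_df <- vapply(objs, function(nm) is.data.frame(env[[nm]]), logical(1))
    if (!any(is_df)) stop("no data.frame found in ", path)
    env[[objs[is_df][1]]]
  } else {
    stop("unsupported file type: ", ext)
  }
}
```

An rdata file can hold several objects, which is one plausible reason support for that format is only partial; this sketch simply takes the first data.frame it finds.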

The app first classifies the variables as continuous, logical or categorical. Any variable with only 2 unique values is interpreted as logical. An ad-hoc definition for categorical variables is any factor, plus any variable with more than 2 and fewer than 10 unique values. By default, character variables are converted to factors; however, if a variable has more than 20 levels, we do not show a panel figure for it.
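These rules can be sketched in a few lines of base R. This is an illustration of the description above, not the app's code: the function name `classify_variable` is made up, and ignoring missing values when counting unique values is my assumption.

```r
# Sketch of the classification rules described above (not the app's code).
classify_variable <- function(x) {
  if (is.character(x)) x <- factor(x)       # characters become factors
  n_unique <- length(unique(x[!is.na(x)]))  # assumption: NAs are not counted
  if (n_unique == 2) {
    "logical"
  } else if (is.factor(x) || (n_unique > 2 && n_unique < 10)) {
    "categorical"
  } else {
    "continuous"
  }
}

# Toy data: 12 rows so that age has more than 10 unique values
dat <- data.frame(
  sex   = rep(c("F", "M"), 6),
  stage = factor(rep(c("I", "II", "III"), 4)),
  age   = seq(40, 73, by = 3)
)
vapply(dat, classify_variable, character(1))
```

Note that a numeric variable with only a handful of distinct values (a small sample of ages, say) lands in the categorical group under these rules, which is part of what makes the definition ad hoc.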

The app creates a faceted set of histograms for all categorical and logical variables, and another set of scatter plots for all continuous variables. Since we often work in time-to-event settings, the app searches the variable names for some of our “standard” time-related variable names to use for the x-axis. Typically, we use a “date of procedure” for this. However, if your data does not have a “time” variable name, we select the first continuous variable for the x-axis. This variable can be changed through the Shiny interface.
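The x-axis fallback might look something like the sketch below. The function name and the candidate time-variable names are invented for illustration; the app's real list of “standard” names is not given here.

```r
# Sketch: pick an x-axis variable by name, else fall back to the first
# continuous variable. The candidate names are made up for illustration.
pick_x_variable <- function(var_names, continuous_vars,
                            time_names = c("proc_date", "surgery_date")) {
  hit <- var_names[tolower(var_names) %in% time_names]
  if (length(hit) > 0) hit[1] else continuous_vars[1]
}

pick_x_variable(c("age", "proc_date", "bmi"), c("age", "bmi"))
pick_x_variable(c("age", "bmi"), c("age", "bmi"))
```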

A separate page is set up for visualizing individual variables, making it easy to export a single figure for use in reports or other communications. This is useful when your collaborators do not believe you are missing large chunks of data in a variable, or that there are negative values for strictly positive variables, like height.

We also include a data summary page for further data debugging purposes.

Examples

This example is from our research data.

Categorical Data

The first figure shows all the categorical data in this dataset. Each of the histograms looks pretty uniformly distributed, though there are some missing values shown in the hx_fcad variable.

Continuous variables

Continuous variables are also informative. Here we see the data uniformly distributed over the time of interest (about 7 years of follow up). However, there is a series of extreme values in many variables, like bun_pr or BMI. BMI is a variable derived from height and weight, so extremely short or extremely large people can make for very strange BMI measures. This is typically a problem with units of measurement. What should we do with these extreme values?
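To see how a units problem produces a strange BMI, here is a small worked example. It uses the standard definition, BMI = weight in kilograms divided by height in metres squared; the numbers are made up and are not from our data.

```r
# BMI = weight (kg) / height (m)^2
bmi <- function(weight_kg, height_cm) weight_kg / (height_cm / 100)^2

bmi(80, 177.8)  # height correctly recorded in centimetres
bmi(80, 70)     # same person, height entered in inches but read as cm
```

The second call treats a 70-inch person as 70 cm tall, which inflates the BMI several times over and shows up as exactly the kind of outlier seen in the scatter plot.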

The bottom panel also shows a simple way to check goodness of follow up for time-to-event data. We expect the triangular shape, as subjects that entered the study early should have the longest follow up. The internal part of the triangle should be mostly red xs, indicating an event occurred (death), and most of the blue circles should be on the hypotenuse of the triangle, as they indicate censored or alive cases, occurring hopefully at the end of the study as opposed to lost-to-follow-up cases in the interior.
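The triangle is easy to reproduce with simulated data. The sketch below is only an illustration with made-up parameters (uniform entry over 7 years, exponential time to death), and it assumes no subject is lost to follow up, so every censored point falls exactly on the hypotenuse.

```r
# Simulated data only: a 7-year study with uniform entry times.
set.seed(1)
n <- 200
study_length <- 7
entry <- runif(n, 0, study_length)         # years since study start
max_follow_up <- study_length - entry      # time until study close
event_time <- rexp(n, rate = 0.2)          # hypothetical time to death
follow_up <- pmin(event_time, max_follow_up)
died <- event_time <= max_follow_up

# Events (red x) fill the interior; censored subjects (blue circles)
# sit on the hypotenuse follow_up = study_length - entry.
plot(entry, follow_up,
     pch = ifelse(died, 4, 1),
     col = ifelse(died, "red", "blue"),
     xlab = "Entry time (years)", ylab = "Follow up (years)")
```

In real data, blue circles that fall inside the triangle are the lost-to-follow-up cases worth chasing.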

Single variable

A single plot for bun_pr makes it pretty clear there are only 3 values that are suspect. Since there are quite a few observations, we may want to make these missing and use imputation, or return these observations for data correction.
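Either option starts with flagging the suspect rows. A minimal sketch on toy values follows; the cutoff of 150 is an arbitrary illustration, not a clinical rule, and the numbers are not from our data.

```r
# Toy data; the cutoff of 150 is an arbitrary illustration.
bun_pr <- c(12, 18, 25, 9, 640, 31, 15, 999, 22, 510)
suspect <- !is.na(bun_pr) & bun_pr > 150

which(suspect)                              # rows to send back for correction
bun_clean <- replace(bun_pr, suspect, NA)   # or set to missing before imputation
```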

Data summary

This is not the optimal way to view a data summary, but it may help the user understand where issues with the app are coming from.

Running the App

To try it out, you can see it on the shinyapps.io site:
https://ehrlinger.shinyapps.io/xportEDA/

I have also posted the app code to a GitHub repository where you can download it, and try it out. Let me know how it goes, report bugs or contribute back. I’d love to make this better, and learn more Shiny tricks along the way.

The easiest way to run the app locally is to download it from the https://github.com/ehrlinger/xportEDA repository.

Then run it from R, with the working directory set to the downloaded app folder:
R> library(shiny)
R> runApp()

or run it in R directly from the repository:

R> shiny::runGitHub("ehrlinger/xportEDA")

Issues and Caveats

I could put the standard “use at your own risk” disclaimer here. I will also add:

The app is written with my specific problem domain in mind, though I am open to suggestions on how to improve it.

We tend to use time to event data (working in a hospital after all).

Our group uses SAS predominantly, hence the “xport” functionality and the app naming structure.

xportEDA will have trouble with large p data sets (those with many variables), as I have not figured out how to make Shiny extend the figures indefinitely down the page. I do dynamically set the number of columns in an effort to control how small the panel plots get. But if you get into the range of 75 categorical or continuous variables, the figure may become illegible.
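One simple way to choose a column count from the number of panels is sketched below. This is a hypothetical rule (a roughly square grid with a cap), not necessarily the one the app uses.

```r
# Hypothetical rule: roughly square grids, capped at max_cols columns.
panel_columns <- function(n_panels, max_cols = 6) {
  min(max_cols, max(1, ceiling(sqrt(n_panels))))
}

panel_columns(4)
panel_columns(75)
```

With a fixed figure height, capping the columns just trades narrow panels for short ones, which is why 75 variables still ends up illegible.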

Progress, not perfection! The best way to start is to start.
