Data Wrangling Digest: Non-tidy Data

Original post: https://simplystatistics.org/2016/02/17/non-tidy-data/

During the discussion that followed the ggplot2 posts from David and me last week, we started talking about tidy data. The man himself noted that matrices are often useful instead of "tidy data", and I mentioned there might be other data that are usefully "non-tidy". Here I will be using tidy/non-tidy according to Hadley's definition. So tidy data have:

  • One variable per column
  • One observation per row
  • Each type of observational unit forms a table

I push this approach in my guide to data sharing and in a lot of my personal work. But note that non-tidy data can definitely be already processed, cleaned, organized and ready to use.
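To make the distinction concrete, here is a minimal sketch (with made-up numbers) of the same measurements stored both ways: as a genes-by-samples matrix, and as a tidy data frame with one variable per column and one observation per row.

library(tidyr)

# Matrix form: rows are genes, columns are samples (made-up values)
expr_mat = matrix(c(1.2, 0.3, -0.5, 2.1), nrow = 2,
                  dimnames = list(c("gene1", "gene2"), c("sampleA", "sampleB")))

# Tidy form: one row per (gene, sample) observation,
# with columns gene, sample, expression
tidy_df = data.frame(gene = rownames(expr_mat), expr_mat)
tidy_df = gather(tidy_df, sample, expression, -gene)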

This led to a very specific blog request.

So I thought I'd talk about a couple of reasons why data are usefully non-tidy. The basic reason is that I usually take a problem-first, not solution-backward, approach to my scientific research. In other words, the goal is to solve a particular problem, and the format I choose is the one that makes it most direct/easy to solve that problem, rather than one that is theoretically optimal. To illustrate these points I'll use an example from my area.

Example data

Often you want data in a matrix format. One good example is gene expression data or data from another high-dimensional experiment. David talks about one such example in his post here. He makes the (valid) point that for students who aren’t going to do genomics professionally, it may be more useful to learn an abstract tool such as tidy data/dplyr. But for those working in genomics, this can make you do unnecessary work in the name of theory/abstraction.

He analyzes the data in that post by first tidying the data.

library(dplyr)
library(tidyr)
library(stringr)
library(readr)
library(broom)
 
# original_data is the yeast gene expression dataset loaded in David's post
cleaned_data = original_data %>%
  separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
  mutate_each(funs(trimws), name:systematic_name) %>%
  select(-number, -GID, -YORF, -GWEIGHT) %>%
  gather(sample, expression, G0.05:U0.3) %>%
  separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE)
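After this pipeline each row of cleaned_data is one gene-by-sample measurement. A rough sketch of the resulting columns (names taken from the code above):

head(cleaned_data)
# one row per (gene, sample) pair, with columns:
#   name, BP, MF, systematic_name -- gene annotation split out of NAME
#   nutrient, rate                -- parsed from the original sample column names
#   expression                    -- the measured expression value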

It isn't 100% tidy, since data of different types are in the same data frame (gene expression and metadata/phenotype data belong in different tables). But it's close enough for our purposes. Now suppose you wanted to fit a model and test for association between the "rate" variable and gene expression for each gene. You can do this with David's tidy data set, dplyr, and the broom package like so:

# Fit one linear model per gene, then keep only the "rate" coefficient
rate_coeffs = cleaned_data %>% group_by(name) %>%
     do(fit = lm(expression ~ rate + nutrient, data = .)) %>%
     tidy(fit) %>% 
     dplyr::filter(term == "rate")

On my computer we get something like:

system.time( cleaned_data %>% group_by(name) %>%
+               do(fit = lm(expression ~ rate + nutrient, data = .)) %>%
+                tidy(fit) %>% 
+                dplyr::filter(term=="rate"))
|==========================================================|100% ~0 s remaining 
user  system elapsed 
 12.431   0.258  12.364

Let's now try that analysis a little differently. As a first step, let's store the data in two separate tables: a table of "phenotype information" and a matrix of "expression levels". This is the more common format used for this type of data. Here is the code to do that:

# Expression matrix: keep only the measurement columns (names contain digits)
expr = original_data %>% 
  select(grep("[0-9]", names(original_data)))
 
# Row names are the systematic gene names, trimmed of whitespace
rownames(expr) = original_data %>%
  separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
  select(systematic_name) %>% mutate_each(funs(trimws), systematic_name) %>% as.matrix()
 
# Phenotype table: one row per sample, parsed from the column names
vals = data.frame(vals = names(expr))
pdata = separate(vals, vals, c("nutrient", "rate"), sep = 1, convert = TRUE)
 
expr = as.matrix(expr)
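A quick sanity check on the two objects (the exact values depend on the dataset, so treat the comments as assumptions):

dim(expr)    # genes x samples: one row per gene, one column per sample
head(pdata)  # one row per sample: nutrient (a letter like "G" or "U") and rate (0.05 to 0.3)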

If we leave the data in this format, we can get the model fits (and, with a bit more work, the p-values) using some simple linear algebra:

mod = model.matrix(~ rate + as.factor(nutrient), data = pdata)

# All the per-gene least squares fits in one shot
rate_betas = expr %*% mod %*% solve(t(mod) %*% mod)
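To spell out the algebra (this is just standard least squares, not anything specific to this dataset): for a single gene with expression vector $y$ (one row of expr) and design matrix $X$ (mod), the least squares estimate is

$$\hat{\beta} = (X^\top X)^{-1} X^\top y.$$

Stacking all genes as the rows of a matrix $E$ (expr) and using the symmetry of $(X^\top X)^{-1}$, every gene's coefficient vector comes out of one multiplication,

$$B = E\,X\,(X^\top X)^{-1},$$

which is exactly expr %*% mod %*% solve(t(mod) %*% mod). Column 2 of $B$ holds each gene's rate coefficient, since rate is the second column of the design matrix.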

This gives the same answer after re-ordering:

# ind is an index that aligns the gene order of rate_coeffs with the rows of expr
all(abs(rate_betas[,2] - rate_coeffs$estimate[ind]) < 1e-5, na.rm = TRUE)
[1] TRUE

But this approach is much faster.

 system.time(expr %*% mod %*% solve(t(mod) %*% mod))
   user  system elapsed 
  0.015   0.000   0.015

This requires some knowledge of linear algebra and isn't pretty. But it brings us to the first general point: you might not use tidy data because some computations are more efficient if the data are in a different format.

Many examples, from graphical models to genomics to neuroimaging to the social sciences, rely on linear-algebra-based computations (matrix multiplication, singular value decomposition, eigendecomposition, etc.), which are all optimized to work on matrices, not tidy data frames. There are certainly ways to improve performance with tidy data, but they would require an equal amount of custom code to take advantage of, say, C, or of vectorization properties in R.

OK, so the linear regressions here are all treated independently. But it is well known that you get much better performance, in terms of the false positive/true positive tradeoff, if you use an empirical Bayes approach for this calculation, where you pool variances across genes.
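As a sketch of the idea (this is the moderated variance from Smyth's limma model, not code from this post): each gene's residual variance $s_g^2$, on $d_g$ degrees of freedom, is shrunk toward a prior variance $s_0^2$, on $d_0$ degrees of freedom, estimated from all the genes at once,

$$\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g},$$

and the moderated t-statistic uses $\tilde{s}_g$ in place of $s_g$, which stabilizes the per-gene tests.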

If the data are in this matrix format you can do it with R like so:

library(limma)
fit_limma = lmFit(expr, mod)       # fit the same linear model to every gene
ebayes_limma = eBayes(fit_limma)   # shrink the per-gene variances via empirical Bayes
topTable(ebayes_limma)             # ranked table of the top genes
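One optional refinement (a usage note, not from the original post): topTable can rank genes by a single coefficient, here the rate column of the design matrix:

topTable(ebayes_limma, coef = "rate")  # genes ranked by the moderated t-statistic for rate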

This approach is again very fast, is optimized for the calculations being performed, and performs much better than the one-by-one regression approach. But it requires the data in matrix or ExpressionSet format. Which brings us to the second general point: you might not use tidy data because many functions require a different, also very clean and useful, data format, and you don't want to have to constantly switch back and forth. Again, this requires you to be more specific to your application, but the potential payoffs can be really big, as in the case of limma.

I’m showing an example here with expression sets and matrices, but in NLP the data are often input in the form of lists, in graphical analyses as matrices, in genomic analyses as GRanges lists, etc. etc. etc. One option would be to rewrite all infrastructure in your area of interest to accept tidy data formats but that would be going against conventions of a community and would ultimately cost you a lot of work when most of that work has already been done for you.

The final point, which I won't discuss in detail here, is that data are often usefully represented in a non-tidy way. Examples include the aforementioned GRanges list, which consists of (potentially) ragged arrays of intervals and quantitative measurements about them. You could force these data to be tidy by the definition above, but again, most of the infrastructure is built around a different format that is much more intuitive for that type of data. Similarly, data from other applications may be better suited to application-specific formats.
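To see why the ragged representation is natural, here is a minimal sketch using Bioconductor's GenomicRanges package (the genes and coordinates are invented):

library(GenomicRanges)

# Two "genes" with different numbers of exons -- a ragged structure
gene_a = GRanges(seqnames = "chr1", ranges = IRanges(start = c(100, 300, 900), width = 50))
gene_b = GRanges(seqnames = "chr2", ranges = IRanges(start = c(2000, 2500), width = 80))
grl = GRangesList(geneA = gene_a, geneB = gene_b)

width(grl)   # per-gene exon widths, returned as a ragged list
reduce(grl)  # merge overlapping intervals within each gene

Interval-aware operations like these come for free in this format; a tidy one-row-per-interval table would need custom grouping logic to reproduce them.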

In summary, tidy data is a useful conceptual idea and is often the right way to go for general, small data sets, but may not be appropriate for all problems. Here are some examples of data formats (biased toward my area, but there are others) that have been widely adopted, have a ton of useful software, but don't meet the tidy data definition above. I will define these as "processed data" as opposed to "tidy data":

  • Gene expression matrices and ExpressionSets, as used by limma and much of Bioconductor
  • GRanges/GRangesList objects for genomic intervals
  • List-based corpora, as used in NLP

I’m sure there are a ton more I’m missing and would be happy to get some suggestions on Twitter too.

Reposted from: https://www.cnblogs.com/chickenwrap/p/10005278.html
