Better modelling and visualisation of newspaper count data

47 篇文章 2 订阅
19 篇文章 1 订阅
(This article was first published on  Quantifying Memory, and kindly contributed to R-bloggers)     

  <!-- Styles for R syntax highlighter       

In this post I outline how count data may be modelled using a negative binomial distribution in order to more accurately present trends in time series count data than using linear methods. I also show how to use ANOVA to identify the point at which one model gains explanatory power, and how confidence intervals may be calculated and plotted around the predicted values. The resulting illustration gives a robust visualisation of how the Beslan Hostage crisis has taken on features of a memory event
Recently I wrote up a piece about quantifying memory and news, and proposed that two distinct linear models might be the way to go about it. However, the problem with linear models is they by their nature don't take into account the ways in which trends may be non-linear.They also lead to nonsense predictions, such as negative values.
Generally, then, linear models should be avoided when mapping count data. What are the alternatives? Typically, a Poisson distribution would be ideal way to capture the probability of a clustering of observations being non-random. A feature of the Poisson distribution is that it assumes the sample mean equals the sample variance; this is very frequently violated when dealing with news data, as a story will have a small number of large values, followed by a large number of small values, resulting in a low mean and a high variance. Instead a negative binomial distribution may be used, which takes a value theta specifying the degree to which the variance and mean are unequal. Estimates provided by a negative binomial model are the same as Poisson estimates, but probability values tend to be more conservative.
One strength of R is its ability to model so called generalised linear models. The negative binomial distribution comes from the package MASS; theta may be calculated using glm.nb:
library(MASS)
fmla <- as.formula("count~date+e_news+a1+a2+elections")
theta <- glm.nb(fmla, data = mdb)$theta
results <- glm(fmla, data = mdb, family = negative.binomial(theta))
The above results may be considered in ANOVA to identify which variables contribute significantly to the model.
anova(results)
Analysis of Deviance Table

Model: Negative Binomial(10.85), link: log

Response: count

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev
NULL                         96        994
date       1      731        95        263
e_news     1       69        94        194
a1         1       52        93        142
a2         1        1        92        141
elections  1       12        91        129
Details about the coding of the variables, and the logic behind models contrasting stories as news or memory events may be found here.
From the ANOVA results I can identify which group of variables are contributing the most substantially to describing the data distribution: memory variables, or news variables. As I am interested in distributions where memory effects are apparent, and these develop only over time, I loop through the data deleting the first month's values, until such a time as there is no data left, or the memory variables have greater explanatory power than the news estimator:
mdb2 <- mdb  #copy the data
n <- 0
news <- 0
memory <- 0
while (n == 0) {
    aov3 <- glm(fmla, data = mdb2, family = negative.binomial(glm.nb(fmla, data = mdb2)$theta))
    anova(aov3)
    t <- data.frame(t(anova(aov3)[2]))
    news <- sum(t[, grep("date|news", colnames(t))])
    memory <- sum(t[, grep("a1|a1", colnames(t))])
    if (news > memory) {
        n <- 0
        mdb2 <- mdb2[mdb2$date > min(mdb2$date), ]
    } else (n <- 1)
}
From the negative binomial model predictions and confidence intervals for the period identified as of potential memory significance may be created. The 95% confidence interval either side of the predicted value is calculated by multiplying the standard error by 1.96. I want to plot the whole data, but only predicted values for the period identified as significant, so next I removed predictions for the data up until the period with memory potential. Finally I created a data frame containing the interval for which no estimates were calculated (this will be used to blur out data in ggplot).
estimate <- (predict.glm(aov3, newdata = mdb, type = "response", se.fit = T))
mdb$estimate <- estimate$fit
mdb$se <- estimate$se.fit
mdb$estimate[mdb$date < min(mdb2$date)] <- NA
mdb$se[mdb$date < min(mdb2$date)] <- NA
mdb$date <- as.Date(mdb$date)
mdb$upper <- mdb$estimate + (1.96 * mdb$se)
mdb$lower <- mdb$estimate - (1.96 * mdb$se)
mdb$lower[mdb$lower < 0] <- 0
rect <- data.frame(min(mdb$date) - months(1), min(as.Date(mdb2$date)))
colnames(rect) <- c("one", "two")
In the plot below I visualise the square root of articles about the Beslan hostage tragedy in the Russian press. The square root is chosen to prevent the high initial interest from obscuring the trend that emerged over time. To create the plot I
  • add a ribbon representing the confidence interval
  • plot the observed values
  • add a dotted line representing the fitted values
  • edit the formatting and add a title
  • add a shaded rectangle over the area I wish to ignore:
ggplot(mdb, aes(date, sqrt(count), ymax = sqrt(upper), ymin = sqrt(lower)), 
    environment = environment()) + geom_ribbon(colour = "red", fill = "light grey", 
    alpha = 0.4, linetype = 2) + geom_point(colour = "dark green", size = 3, 
    alpha = 0.8) + geom_line(aes(date, sqrt(estimate))) + theme_bw() + ggtitle(paste0("Regression graph for the Beslan Hostage crisis, exhibiting possible features of memory event since ", 
    as.Date(min(mdb2$date)))) + geom_rect(aes(xmin = rect$one, xmax = rect$two, 
    ymin = -Inf, ymax = +Inf), fill = "light grey", colour = "grey", linetype = 2, 
    alpha = 0.015)
plot of chunk unnamed-chunk-7
anova(aov3)
Analysis of Deviance Table

Model: Negative Binomial(6.449), link: log

Response: count

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev
NULL                         70      135.3
date       1    19.18        69      116.1
e_news     1     0.73        68      115.4
a1         1    24.54        67       90.9
a2         1     0.65        66       90.2
elections  1     0.29        65       89.9
Notice in the above table how the anniversaries variables exceed the explanatory power of the news and date variables. This indicates that by the end of 2006 Beslan was increasingly featuring as a memory event and less as a news story. Also notice how the remaining deviance is quite large - this model apparently fits the data less well than the model for the entire data (it explained 85% of the deviance), but this is due to the original estimate being biased by the accurate prediction of a few outliers.
CSDN海神之光上传的代码均可运行,亲测可用,直接替换数据即可,适合小白; 1、代码压缩包内容 主函数:main.m; 调用函数:其他m文件;无需运行 运行结果效果图; 2、代码运行版本 Matlab 2019b或2023b;若运行有误,根据提示修改;若不会,私信博主; 3、运行操作步骤 步骤一:将所有文件放到Matlab的当前文件夹中; 步骤二:双击打开main.m文件; 步骤三:点击运行,等程序运行完得到结果; 4、仿真咨询 如需其他服务,可私信博主或扫描博客文章底部QQ名片; 4.1 博客或资源的完整代码提供 4.2 期刊或参考文献复现 4.3 Matlab程序定制 4.4 科研合作 功率谱估计: 故障诊断分析: 雷达通信:雷达LFM、MIMO、成像、定位、干扰、检测、信号分析、脉冲压缩 滤波估计:SOC估计 目标定位:WSN定位、滤波跟踪、目标定位 生物电信号:肌电信号EMG、脑电信号EEG、心电信号ECG 通信系统:DOA估计、编码译码、变分模态分解、管道泄漏、滤波器、数字信号处理+传输+分析+去噪(CEEMDAN)、数字信号调制、误码率、信号估计、DTMF、信号检测识别融合、LEACH协议、信号检测、水声通信 1. EMD(经验模态分解,Empirical Mode Decomposition) 2. TVF-EMD(时变滤波的经验模态分解,Time-Varying Filtered Empirical Mode Decomposition) 3. EEMD(集成经验模态分解,Ensemble Empirical Mode Decomposition) 4. VMD(变分模态分解,Variational Mode Decomposition) 5. CEEMDAN(完全自适应噪声集合经验模态分解,Complementary Ensemble Empirical Mode Decomposition with Adaptive Noise) 6. LMD(局部均值分解,Local Mean Decomposition) 7. RLMD(鲁棒局部均值分解, Robust Local Mean Decomposition) 8. ITD(固有时间尺度分解,Intrinsic Time Decomposition) 9. SVMD(逐次变分模态分解,Sequential Variational Mode Decomposition) 10. ICEEMDAN(改进的完全自适应噪声集合经验模态分解,Improved Complementary Ensemble Empirical Mode Decomposition with Adaptive Noise) 11. FMD(特征模式分解,Feature Mode Decomposition) 12. REMD(鲁棒经验模态分解,Robust Empirical Mode Decomposition) 13. SGMD(辛几何模态分解,Spectral-Grouping-based Mode Decomposition) 14. RLMD(鲁棒局部均值分解,Robust Intrinsic Time Decomposition) 15. ESMD(极点对称模态分解, extreme-point symmetric mode decomposition) 16. CEEMD(互补集合经验模态分解,Complementary Ensemble Empirical Mode Decomposition) 17. SSA(奇异谱分析,Singular Spectrum Analysis) 18. SWD(群分解,Swarm Decomposition) 19. RPSEMD(再生相移正弦辅助经验模态分解,Regenerated Phase-shifted Sinusoids assisted Empirical Mode Decomposition) 20. EWT(经验小波变换,Empirical Wavelet Transform) 21. DWT(离散小波变换,Discraete wavelet transform) 22. TDD(时域分解,Time Domain Decomposition) 23. MODWT(最大重叠离散小波变换,Maximal Overlap Discrete Wavelet Transform) 24. MEMD(多元经验模态分解,Multivariate Empirical Mode Decomposition) 25. MVMD(多元变分模态分解,Multivariate Variational Mode Decomposition
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值