2020 MCM Problem C A Wealth of Data Notes

2020 MCM Problem C A Wealth of Data

In the online marketplace it created, Amazon provides customers with an opportunity to rate and review purchases.(bk1) Individual ratings – called “star ratings” – allow purchasers to express their level of satisfaction with a product using a scale of 1 (low rated, low satisfaction) to 5 (highly rated, high satisfaction).(bk1’) Additionally, customers can submit text-based messages – called “reviews” – that express further opinions and information about the product.(bk1’’) Other customers can submit ratings on these reviews as being helpful or not – called a “helpfulness rating” – towards assisting their own product purchasing decision.(bk1’’’) Companies use these data to gain insights into the markets in which they participate, the timing of that participation, and the potential success of product design feature choices.(bk2)

Sunshine Company is planning to introduce and sell three new products in the online marketplace: a microwave oven, a baby pacifier, and a hair dryer.(pr1) They have hired your team as consultants to identify key patterns, relationships, measures, and parameters in past customer-supplied ratings and reviews associated with other competing products to 1) inform their online sales strategy and 2) identify potentially important design features that would enhance product desirability.(pr1’) Sunshine Company has used data to inform sales strategies in the past, but they have not previously used this particular combination and type of data. Of particular interest to Sunshine Company are time-based patterns in these data, and whether they interact in ways that will help the company craft successful products.(imp1)

To assist you, Sunshine’s data center has provided you with three data files for this project: hair_dryer.tsv, microwave.tsv, and pacifier.tsv. These data represent customer-supplied ratings and reviews for microwave ovens, baby pacifiers, and hair dryers sold in the Amazon marketplace over the time period(s) indicated in the data. A glossary of data label definitions is provided as well. THE DATA FILES PROVIDED CONTAIN THE ONLY DATA YOU SHOULD USE FOR THIS PROBLEM.(rsc1s)

Requirements

1.      Analyze the three product data sets provided to identify, describe, and support with mathematical evidence, meaningful quantitative and/or qualitative patterns, relationships, measures, and parameters within and between star ratings, reviews, and helpfulness ratings that will help Sunshine Company succeed in their three new online marketplace product offerings.

2.      Use your analysis to address the following specific questions and requests from the Sunshine Company Marketing Director:

·        Identify data measures based on ratings and reviews that are most informative for Sunshine Company to track, once their three products are placed on sale in the online marketplace.

·        Identify and discuss time-based measures and patterns within each data set that might suggest that a product’s reputation is increasing or decreasing in the online marketplace.

·        Determine combinations of text-based measure(s) and ratings-based measures that best indicate a potentially successful or failing product.

·        Do specific star ratings incite more reviews? For example, are customers more likely to write some type of review after seeing a series of low star ratings?

·        Are specific quality descriptors of text-based reviews such as ‘enthusiastic’, ‘disappointed’, and others, strongly associated with rating levels?

3.      Write a one- to two-page letter to the Marketing Director of Sunshine Company summarizing your team’s analysis and results. Include specific justification(s) for the result that your team most confidently recommends to the Marketing Director.(mss1)

Your submission should consist of:

·        One-page Summary Sheet

·        Table of Contents

·        One- to Two-page Letter

·        Your solution of no more than 20 pages, for a maximum of 24 pages with your summary sheet, table of contents, and two-page letter.

Note: Reference List and any appendices do not count toward the page limit and should appear after your completed solution. You should not make use of unauthorized images and materials whose use is restricted by copyright laws. Ensure you cite the sources for your ideas and the materials used in your report.

Glossary

·        Helpfulness Rating: an indication of how valuable a particular product review is when making a decision whether or not to purchase that product.

·        Pacifier: a rubber or plastic soothing device, often nipple shaped, given to a baby to suck or bite on.

·        Review: a written evaluation of a product.

·        Star Rating: a score given in a system that allows people to rate a product with a number of stars.

The Problem Datasets

Problem_C_Data.zip

The three data sets provided contain product user ratings and reviews extracted from the Amazon Customer Reviews Dataset through Amazon Simple Storage Service (Amazon S3).

hair_dryer.tsv
microwave.tsv
pacifier.tsv

Data Set Definitions: Each row represents data partitioned into the following columns.

·        marketplace (string): 2 letter country code of the marketplace where the review was written.

·        customer_id (string): Random identifier that can be used to aggregate reviews written by a single author.

·        review_id (string): The unique ID of the review.

·        product_id (string): The unique Product ID the review pertains to.

·        product_parent (string): Random identifier that can be used to aggregate reviews for the same product.

·        product_title (string): Title of the product.

·        product_category (string): The major consumer category for the product.

·        star_rating (int): The 1-5 star rating of the review.

·        helpful_votes (int): Number of helpful votes.

·        total_votes (int): Number of total votes the review received.

·        vine (string): Customers are invited to become Amazon Vine Voices based on the trust that they have earned in the Amazon community for writing accurate and insightful reviews. Amazon provides Amazon Vine members with free copies of products that have been submitted to the program by vendors. Amazon doesn't influence the opinions of Amazon Vine members, nor do they modify or edit reviews.

·        verified_purchase (string): A “Y” indicates Amazon verified that the person writing the review purchased the product at Amazon and didn't receive the product at a deep discount.

·        review_headline (string): The title of the review.

·        review_body (string): The review text.

·        review_date (bigint): The date the review was written.
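As a starting point, the column definitions above can be turned into a small loader. The following is only a minimal sketch using the Python standard library: the function name `load_reviews`, the derived field `helpful_ratio`, and the two sample rows are illustrative assumptions, not part of the problem statement, and the sample uses only a subset of the columns.

```python
import csv
import io

# Columns the glossary above declares as integers.
INT_COLUMNS = ("star_rating", "helpful_votes", "total_votes")

def load_reviews(handle):
    """Read tab-separated review rows and cast the integer columns."""
    rows = []
    # QUOTE_NONE: treat quote characters inside review text as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for col in INT_COLUMNS:
            row[col] = int(row[col])
        # Hypothetical derived measure: share of votes marked helpful.
        # It is undefined (None) when a review received no votes.
        total = row["total_votes"]
        row["helpful_ratio"] = row["helpful_votes"] / total if total else None
        rows.append(row)
    return rows

# Invented two-row sample (NOT real data), subset of the columns only.
SAMPLE = (
    "review_id\tstar_rating\thelpful_votes\ttotal_votes\tverified_purchase\treview_body\n"
    "R1\t5\t3\t4\tY\tWorks great\n"
    "R2\t1\t0\t0\tN\tStopped working\n"
)

reviews = load_reviews(io.StringIO(SAMPLE))
print(reviews[0]["star_rating"], reviews[0]["helpful_ratio"])
```

For the real files, the same function could presumably be called on `open("hair_dryer.tsv", encoding="utf-8", newline="")`; the encoding is an assumption and should be checked against the actual data.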

 

 

Analysis:

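This year's MCM really broke new ground! It is almost the first time in MCM history that such a purely data-analytic problem has appeared, and a text-analysis one at that! I had assumed the MCM had gone down the road of nebulous mechanism-based modeling for good, but this year is a signal: data-driven problems will probably not be rare from now on. I don't expect the contest to turn into Kaggle, but data analysis surely deserves half the field!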

 

In addition, the setting of this problem closely resembles the wine-evaluation problem from CUMCM 2012, which is worth consulting.

 

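I won't dwell on the background and data conditions in the first three paragraphs. Two hints are obvious: one is to mine the relationships among the data, and the other is to pay attention to the effect of time. Both come up again in the specific questions below.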

 

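The first requirement asks about the relationships between the product measures and the feedback measures. For numerical variables you can compute correlation coefficients, the goodness of fit after a regression, and so on; for categorical variables you can compute mutual information and the like. This amounts to a basic quantitative analysis.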
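The two basic quantities mentioned here can be sketched in a few lines. This is a minimal standard-library illustration, not a prescribed method: the function names are my own, the mutual information is the plain empirical (plug-in) estimate in bits, and the toy lists are invented rather than taken from the data files.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

# Invented toy values (NOT real data): star ratings and verified-purchase flags.
stars = [5, 4, 1, 2, 5, 1]
verified = ["Y", "Y", "N", "N", "Y", "N"]

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(mutual_information(stars, verified))
```

In this toy case the flag is fully determined by the rating, so the mutual information equals the entropy of the flag. On real data the plug-in estimate is biased upward for small samples, so comparisons across variables with very different numbers of categories should be read with care.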

 

From the second requirement onward, the questions largely have to be answered in statistical terms:

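  1. The rating and review measures that are most informative for sales: candidates are mutual information, the fit of a single-variable model, or a correlation coefficient;

  2. Along the time dimension, construct samples to analyze how ratings and reviews affect product reputation. This is still association analysis; nothing requires forecasting over time;

  3. Estimate success or failure from two factors, the text and the rating values. If you know neural networks, now is the time to show off!

  4. The relationship between star ratings and the number of reviews, which is no different from item 1;

  5. The relationship between keywords in the reviews and the ratings, which should be keyword extraction plus association analysis.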
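Items 2 and 5 can be illustrated with two small aggregations. This is only a rough sketch under stated assumptions: the function names are hypothetical, `review_date` is assumed to have already been parsed into `datetime.date` objects (the raw format in the files is not assumed here), descriptor matching is a crude case-insensitive substring test, and the four rows are invented.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def monthly_mean_rating(reviews):
    """Map (year, month) -> mean star rating, in chronological order."""
    buckets = defaultdict(list)
    for r in reviews:
        d = r["review_date"]
        buckets[(d.year, d.month)].append(r["star_rating"])
    return {ym: mean(stars) for ym, stars in sorted(buckets.items())}

def descriptor_mean_rating(reviews, descriptors):
    """Map each descriptor -> mean rating of reviews whose text contains it."""
    out = {}
    for word in descriptors:
        stars = [r["star_rating"] for r in reviews
                 if word in r["review_body"].lower()]
        # None marks a descriptor that never occurs in the sample.
        out[word] = mean(stars) if stars else None
    return out

# Invented rows (NOT real data).
reviews = [
    {"review_date": date(2015, 1, 5), "star_rating": 5, "review_body": "Love it, great dryer"},
    {"review_date": date(2015, 1, 20), "star_rating": 3, "review_body": "Okay"},
    {"review_date": date(2015, 2, 2), "star_rating": 1, "review_body": "Very disappointed"},
    {"review_date": date(2015, 2, 9), "star_rating": 2, "review_body": "Disappointed, broke fast"},
]

print(monthly_mean_rating(reviews))
print(descriptor_mean_rating(reviews, ["great", "disappointed"]))
```

A falling monthly mean is one possible signal of declining reputation, and a descriptor whose mean rating sits far from the overall mean is a candidate for the association asked about in item 5. Both would need review counts and some significance check before being reported.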

 

The third requirement is the routine task.

 

If you know statistical machine learning and have some NLP background, pick this problem right away. This is your chance!
