Part Ⅳ Tweet Tweet
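This lab was a hassle to write. Just understanding the problem statement was hard: I had never used Twitter, so I did not know what concepts such as timestamp and mention meant, and it took me a long time to work them out. Here is a walkthrough of the series in the order I did the lab.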
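First comes the Extract class, which implements methods for pulling information out of a list of tweets.
Let's look at the contents of the Tweet class first: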
private final long id;
private final String author;
private final String text;
private final Instant timestamp;
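id is the tweet's unique identifier (not the account name), author is the username of the person who wrote the tweet, text is the tweet's content, and timestamp is the time the tweet was sent.
The fields of the Timespan class: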
private final Instant start;
private final Instant end;
That is, two points in time: a start and an end.
getTimespan finds the shortest interval that covers the timestamp of every tweet in a list.
/**
 * Get the time period spanned by tweets.
 *
 * @param tweets
 *            list of tweets with distinct ids, not modified by this method.
 * @return a minimum-length time interval that contains the timestamp of every
 *         tweet in the list.
 */
public static Timespan getTimespan(List<Tweet> tweets) {
    // Note: this assumes a non-empty list; get(0) throws on an empty one.
    int n = tweets.size();
    Instant min = tweets.get(0).getTimestamp();
    Instant max = tweets.get(0).getTimestamp();
    for (int i = 1; i < n; i++) {
        Instant now = tweets.get(i).getTimestamp();
        if (now.isBefore(min))
            min = now; // earliest timestamp seen so far
        if (now.isAfter(max))
            max = now; // latest timestamp seen so far
    }
    return new Timespan(min, max);
}
Once you understand what the function is supposed to do, it is easy to implement.
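The getTimespan above calls tweets.get(0), so it throws on an empty list. As a rough sketch of one way to handle that case, here is the same min/max idea on bare Instant values. TimespanSketch and span are made-up names for illustration, and returning null for an empty list is my own convention, not something taken from the handout.

```java
import java.time.Instant;
import java.util.Collections;
import java.util.List;

class TimespanSketch {
    /**
     * Returns {start, end} covering every instant in the list, or null for
     * an empty list (an assumed convention for this sketch only).
     */
    static Instant[] span(List<Instant> stamps) {
        if (stamps.isEmpty()) {
            return null;
        }
        // Instant is Comparable, so the library can find both extremes.
        return new Instant[] { Collections.min(stamps), Collections.max(stamps) };
    }

    public static void main(String[] args) {
        List<Instant> stamps = List.of(
                Instant.parse("2016-02-17T11:00:00Z"),
                Instant.parse("2016-02-17T10:00:00Z"),
                Instant.parse("2016-02-17T12:30:00Z"));
        Instant[] s = span(stamps);
        System.out.println(s[0] + " .. " + s[1]);
    }
}
```

With a single tweet, start and end come out equal, which matches what the loop version does.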
/**
 * Get usernames mentioned in a list of tweets.
 *
 * @param tweets
 *            list of tweets with distinct ids, not modified by this method.
 * @return the set of usernames who are mentioned in the text of the tweets. A
 *         username-mention is "@" followed by a Twitter username (as defined by
 *         Tweet.getAuthor()'s spec). The username-mention cannot be immediately
 *         preceded or followed by any character valid in a Twitter username.
 *         For this reason, an email address like bitdiddle@mit.edu does NOT
 *         contain a mention of the username mit. Twitter usernames are
 *         case-insensitive, and the returned set may include a username at most
 *         once.
 */
public static Set<String> getMentionedUsers(List<Tweet> tweets) {
    Set<String> usernameInText = new HashSet<String>();
    int n = tweets.size();
    // Regex for a valid username-mention: "@" at the start of the text or after
    // a non-username character, followed by the username itself (group 2).
    Pattern pattern = Pattern.compile("(^|[^A-Za-z0-9_-]+)@([A-Za-z0-9_-]+)");
    for (int i = 0; i < n; i++) {
        String text = tweets.get(i).getText();
        Matcher usernameMention = pattern.matcher(text);
        while (usernameMention.find()) {
            // Store everything in lowercase, since usernames are case-insensitive.
            String now = usernameMention.group(2).toLowerCase();
            usernameInText.add(now); // add every matched username to the set
        }
    }
    return usernameInText;
}
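This function collects every username that is @-mentioned anywhere in the tweets. I built a regular expression from the requirements and matched it against each tweet's text. usernameInText is a Set, so duplicates take care of themselves, but case still has to be handled, so I convert each name to lowercase before adding it.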
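To see the matching rule in action without the Tweet class, here is a small sketch that runs on plain strings. It keeps the same username character class as the regex in the post, but swaps the leading "(^|[^...]+)" group for a negative lookbehind, which I believe accepts the same mentions. That swap, and the names MentionSketch and mentions, are mine.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MentionSketch {
    // "@" must not be preceded by a username character; the username is group 1.
    private static final Pattern MENTION =
            Pattern.compile("(?<![A-Za-z0-9_-])@([A-Za-z0-9_-]+)");

    static Set<String> mentions(List<String> texts) {
        Set<String> found = new HashSet<>();
        for (String text : texts) {
            Matcher m = MENTION.matcher(text);
            while (m.find()) {
                // Lowercase so that @Alice and @alice collapse into one entry.
                found.add(m.group(1).toLowerCase());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(mentions(List.of(
                "mail bitdiddle@mit.edu",
                "hi @Alice and @alice, cc @bob_1!")));
    }
}
```

The email address contributes nothing because its "@" directly follows a username character, while "@Alice" and "@alice" end up as a single lowercase entry.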
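Next is the Filter class, which selects tweets from a list by author, time, or content. These functions are all easy to write: writtenBy checks the author of each tweet, inTimespan checks each tweet's timestamp, and containing checks each tweet's text, ignoring case.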
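The post does not show the Filter code, so the following is only a guess at what containing might look like, written against plain strings instead of Tweet objects. It assumes that words are delimited by whitespace and that a tweet qualifies if any one of the search words appears in it. ContainingSketch and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ContainingSketch {
    /** Keeps the texts that contain at least one of the words, ignoring case. */
    static List<String> containing(List<String> texts, List<String> words) {
        Set<String> wanted = new HashSet<>();
        for (String w : words) {
            wanted.add(w.toLowerCase(Locale.ROOT));
        }
        List<String> result = new ArrayList<>();
        for (String text : texts) {
            // Assumption: a "word" is a run of characters between whitespace.
            for (String token : text.split("\\s+")) {
                if (wanted.contains(token.toLowerCase(Locale.ROOT))) {
                    result.add(text);
                    break; // one hit is enough; keep the original order
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(containing(
                List.of("Talk about RIVEST today", "nothing here", "rivests"),
                List.of("rivest")));
    }
}
```

Lowercasing both sides before comparing is what makes the match case-insensitive; writtenBy can use the same trick on the author name.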
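The SocialNetwork class is harder to write. It has to infer relationships between users from the content of their tweets. I took the simplest approach: if A @-mentions B in a tweet, A is assumed to follow B. The goal is a Map<String, Set<String>>, in which each person maps to the set of everyone they have @-mentioned. With the writtenBy and getMentionedUsers functions from earlier, this is easy to build.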
public static Map<String, Set<String>> guessFollowsGraph(List<Tweet> tweets) {
    Map<String, Set<String>> followMap = new HashMap<String, Set<String>>();
    Set<String> author = new HashSet<String>();
    int n = tweets.size();
    for (int i = 0; i < n; i++) {
        Tweet now = tweets.get(i);
        author.add(now.getAuthor().toLowerCase());
    }
    for (String str : author) {
        List<Tweet> fromStr = Filter.writtenBy(tweets, str);
        Set<String> usernameInText = Extract.getMentionedUsers(fromStr);
        followMap.put(str, usernameInText);
    }
    return followMap;
}
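First, loop over all the tweets once to collect the set of authors, then loop over that set. For each user, writtenBy finds the tweets they wrote, and getMentionedUsers returns the people they @-mentioned.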
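The version above filters the whole tweet list once per author. As an alternative, here is a single-pass sketch that builds the same kind of map. It does not use the lab's Tweet, Filter, or Extract classes: Post is a made-up stand-in with only an author and a text, and the mention regex is my lookbehind variant of the one in Extract.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FollowsGraphSketch {
    /** Hypothetical stand-in for the lab's Tweet class. */
    static final class Post {
        final String author;
        final String text;

        Post(String author, String text) {
            this.author = author;
            this.text = text;
        }
    }

    private static final Pattern MENTION =
            Pattern.compile("(?<![A-Za-z0-9_-])@([A-Za-z0-9_-]+)");

    static Map<String, Set<String>> guessFollows(List<Post> posts) {
        Map<String, Set<String>> graph = new HashMap<>();
        for (Post p : posts) {
            // One entry per author (lowercased), created on first sight.
            Set<String> followed =
                    graph.computeIfAbsent(p.author.toLowerCase(), k -> new HashSet<>());
            Matcher m = MENTION.matcher(p.text);
            while (m.find()) {
                followed.add(m.group(1).toLowerCase());
            }
        }
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(guessFollows(List.of(
                new Post("Alice", "hi @Bob"),
                new Post("alice", "@carol yo"),
                new Post("bob", "no mentions"))));
    }
}
```

Like the version in the post, an author who mentions nobody still gets an entry with an empty set.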
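Then comes a test: given a (large) set of tweets, find who has been @-mentioned by the most people (who has the greatest influence). A Map<String, Integer> records each person together with their influence value. Once that map is built, the users are sorted by it; I chose insertion sort, keeping the list sorted as it is built.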
public static List<String> influencers(Map<String, Set<String>> followsGraph) {
    Map<String, Integer> influenceMap = new HashMap<>();
    List<String> influencers = new LinkedList<>();
    Set<String> authors = followsGraph.keySet();
    // Count, for each mentioned user, how many authors mention them.
    for (String author : authors) {
        for (String user : followsGraph.get(author)) {
            if (!influenceMap.containsKey(user))
                influenceMap.put(user, 1);
            else
                influenceMap.put(user, influenceMap.get(user) + 1);
        }
    }
    // Insertion sort: place each user so the list stays in descending order.
    Set<String> users = influenceMap.keySet();
    for (String user : users) {
        int i = 0;
        while (i < influencers.size() && influenceMap.get(user) < influenceMap.get(influencers.get(i))) {
            i++;
        }
        influencers.add(i, user);
    }
    return influencers;
}
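For comparison, here is a sketch of the same ranking that uses the library sort in place of a hand-written insertion sort. Two things differ from the code above, and both are my own choices: users that nobody mentions are included with a count of zero, and ties are broken alphabetically so the result is deterministic. InfluencerSketch is a made-up name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class InfluencerSketch {
    static List<String> influencers(Map<String, Set<String>> followsGraph) {
        Map<String, Integer> followers = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : followsGraph.entrySet()) {
            // Assumption: authors nobody mentions still appear, with count 0.
            followers.putIfAbsent(e.getKey(), 0);
            for (String followed : e.getValue()) {
                followers.merge(followed, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(followers.keySet());
        // Descending by follower count, then alphabetical to break ties.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(followers.get(b), followers.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> graph = new HashMap<>();
        graph.put("a", Set.of("b", "c"));
        graph.put("b", Set.of("c"));
        graph.put("c", Set.of());
        System.out.println(influencers(graph));
    }
}
```

Map.merge replaces the containsKey/put pair, and sorting an ArrayList avoids the repeated LinkedList.get(i) calls, each of which walks the list from one end.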