How I used Python to find interesting people to follow on Medium

Original article: https://www.freecodecamp.org/news/how-i-used-python-to-find-interesting-people-on-medium-be9261b924b0/

by Radu Raicea

Medium has a large amount of content, a large number of users, and an almost overwhelming number of posts. When you try to find interesting users to interact with, you’re flooded with visual noise.

I define an interesting user as someone who is from your network, who is active, and who writes responses that are generally appreciated by the Medium community.

I was looking through the latest posts from users I follow to see who had responded to those users. I figured that if they responded to someone I’m following, they must have similar interests to mine.

The process was tedious. And that’s when I remembered the most valuable lesson I learned during my last internship:

Any tedious task can and should be automated.

I wanted my automation to do the following things:

Get all the users from my “Followings” list
Get the latest posts of each user
Get all the responses to each post
Filter out responses that are older than 30 days
Filter out responses that have fewer than a minimum number of recommendations
Get the username of the author of each response

Let’s start pokin’

I initially looked at Medium’s API, but found it limiting. It didn’t give me much to work with. I could only get information about my own account, not about other users.

On top of that, the last change to Medium’s API was over a year ago. There was no sign of recent development.

I realized that I would have to rely on HTTP requests to get my data, so I started to poke around using Chrome DevTools.

The first goal was to get my list of Followings.

I opened DevTools and went to the Network tab. I filtered out everything but XHR to see where Medium gets my list of Followings from. I hit the reload button on my profile page and got nothing interesting.

What if I clicked the Followings button on my profile? Bingo.

Inside the link, I found a very big JSON response. It was well-formed JSON, except for a string of characters at the beginning of the response: ])}while(1);</x>

I wrote a function to clean that up and turn the JSON into a Python dictionary.

import json

def clean_json_response(response):
    # Drop everything up to and including the ])}while(1);</x> prefix, then parse the rest
    return json.loads(response.text.split('])}while(1);</x>')[1])
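The split-based cleanup above raises an IndexError if the prefix is ever missing. A slightly more defensive variant could look like the sketch below. It assumes the prefix is exactly the string shown above; the name parse_medium_json and the shortened sample payload are illustrative and not part of the original script.

```python
import json

# Prefix observed at the start of Medium's JSON responses.
JSON_PREFIX = '])}while(1);</x>'

def parse_medium_json(text):
    """Strip the prefix if it is present, then parse the remainder as JSON."""
    if text.startswith(JSON_PREFIX):
        text = text[len(JSON_PREFIX):]
    return json.loads(text)

# Hypothetical, shortened payload used for illustration only.
sample = '])}while(1);</x>{"success": true, "payload": {"user": {"userId": "d540942266d0"}}}'
print(parse_medium_json(sample)['payload']['user']['userId'])
```

Because the function takes plain text rather than a response object, it can be tried out without making any HTTP request.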

I had found an entry point. Let the coding begin.

Getting all the users from my Followings list

To query that endpoint, I needed my User ID (I know that I already had it, but this is for educational purposes).

While looking for a way to get a user’s ID, I found out that you can add ?format=json to most Medium URLs to get a JSON response from that page. I tried that out on my profile page.

Oh look, there’s the user ID.

])}while(1);</x>{"success":true,"payload":{"user":{"userId":"d540942266d0","name":"Radu Raicea","username":"Radu_Raicea",...

I wrote a function to pull the user ID from a given username. Again, I had to use clean_json_response to remove the unwanted characters at the beginning of the response.

I also made a constant called MEDIUM that contains the base for all the Medium URLs.

import requests

MEDIUM = 'https://medium.com'

def get_user_id(username):
    print('Retrieving user ID...')

    url = MEDIUM + '/@' + username + '?format=json'
    response = requests.get(url)
    response_dict = clean_json_response(response)
    return response_dict['payload']['user']['userId']
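The two steps inside get_user_id (building the ?format=json URL and reading the ID out of the parsed payload) can be tried separately without touching the network. The helper names profile_json_url and extract_user_id below are hypothetical, and the payload shape is assumed from the sample response shown earlier.

```python
from urllib.parse import quote

MEDIUM = 'https://medium.com'

def profile_json_url(username):
    """Build the profile URL with ?format=json appended."""
    # quote() escapes characters that are not safe in a URL path.
    return MEDIUM + '/@' + quote(username) + '?format=json'

def extract_user_id(response_dict):
    """Return payload.user.userId, or None if any level is missing."""
    user = response_dict.get('payload', {}).get('user', {})
    return user.get('userId')

print(profile_json_url('Radu_Raicea'))
print(extract_user_id({'payload': {'user': {'userId': 'd540942266d0'}}}))
```

Returning None instead of raising KeyError is a design choice of this sketch; the original function lets the exception propagate.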

With the User ID, I queried the /_/api/users/<user_id>/following endpoint and got the list of usernames from my Followings list.

When I did it in DevTools, I noticed that the JSON response only had eight usernames. Weird.

After I clicked on “Show more people,” I saw what was missing. Medium uses pagination for the list of Followings.

Pagination works by specifying a limit (elements per page) and to (first element of the next page). I had to find a way to get the ID of that next element.

At the end of the JSON response from /_/api/users/<user_id>/following, I saw an interesting key.

..."paging":{"path":"/_/api/users/d540942266d0/followers","next":{"limit":8,"to":"49260b62a26c"}}},"v":3,"b":"31039-15ed0e5"}
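The limit and to values in that paging object are enough to build the URL of the next page. The following is a minimal sketch under that assumption; next_following_url is a hypothetical helper, and the default limit of 8 simply mirrors the page size observed above.

```python
from urllib.parse import urlencode

MEDIUM = 'https://medium.com'

def next_following_url(user_id, paging):
    """Return the URL of the next page, or None when there is no next page."""
    next_page = paging.get('next') or {}
    if 'to' not in next_page:
        return None
    query = urlencode({'limit': next_page.get('limit', 8), 'to': next_page['to']})
    return MEDIUM + '/_/api/users/' + user_id + '/following?' + query

# Values copied from the paging object shown above.
paging = {'next': {'limit': 8, 'to': '49260b62a26c'}}
print(next_following_url('d540942266d0', paging))
```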

From here, writing a loop to get all the usernames from my Followings list was easy.

def get_list_of_followings(user_id):
    print('Retrieving users from Followings...')

    next_id = False
    followings = []

    while True:
        if next_id:
            # Not the first page of the followings list
            url = (MEDIUM + '/_/api/users/' + user_id
                   + '/following?limit=8&to=' + next_id)
        else:
            # First page of the followings list
            url = MEDIUM + '/_/api/users/' + user_id + '/following'

        response = requests.get(url)
        response_dict = clean_json_response(response)
        payload = response_dict['payload']

        for user in payload['value']:
            followings.append(user['username'])

        try:
            # If the "to" key is missing, we've reached the end
            # of the list and an exception is raised
            next_id = payload['paging']['next']['to']
        except (KeyError, TypeError):
            break

    return followings
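The pagination loop can be exercised offline by separating it from the HTTP call. In the sketch below, fetch_page is a hypothetical callable standing in for the request, and the two fake pages are invented data, not real Medium output.

```python
def collect_usernames(fetch_page):
    """Follow paging.next.to until it disappears.

    fetch_page(next_id) is a hypothetical callable that returns the
    payload dictionary for one page (next_id is None for the first page).
    """
    usernames = []
    next_id = None
    while True:
        payload = fetch_page(next_id)
        usernames.extend(user['username'] for user in payload['value'])
        next_id = payload.get('paging', {}).get('next', {}).get('to')
        if next_id is None:
            return usernames

# Two fake pages standing in for real HTTP responses.
PAGES = {
    None: {'value': [{'username': 'alice'}], 'paging': {'next': {'to': 'p2'}}},
    'p2': {'value': [{'username': 'bob'}], 'paging': {}},
}
print(collect_usernames(PAGES.get))
```

Passing the fetcher in as an argument keeps the loop logic testable; in the real script the fetcher would wrap requests.get and clean_json_response.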

Getting the latest posts from each user

Once I had the list of users I follow, I wanted to get their latest posts. I could do that with a request to https://medium.com/@<username>/latest?format=json

I wrote a function that takes a list of usernames and returns a list of post IDs for the latest posts from all the usernames on the input list.

def get_list_of_latest_posts_ids(usernames):
    print('Retrieving the latest posts...')

    post_ids = []

    for username in usernames:
        url = MEDIUM + '/@' + username + '/latest?format=json'
        response = requests.get(url)
        response_dict = clean_json_response(response)

        try:
            posts = response_dict['payload']['references']['Post']
        except KeyError:
            # No posts found in the response for this user
            posts = []

        if posts:
            for key in posts.keys():
                post_ids.append(posts[key]['id'])

    return post_ids
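The extraction step on its own can also be written with dictionary lookups that fall back to empty values instead of a try/except. This is only a sketch: extract_post_ids is a hypothetical name, and the sample payload is invented to match the shape the function above expects.

```python
def extract_post_ids(response_dict):
    """Return the IDs under payload.references.Post, or [] if absent."""
    references = response_dict.get('payload', {}).get('references', {})
    posts = references.get('Post', {})
    return [post['id'] for post in posts.values()]

# Hypothetical payload shaped like the one the function above reads.
sample = {'payload': {'references': {'Post': {
    'abc123': {'id': 'abc123'},
    'def456': {'id': 'def456'},
}}}}
print(extract_post_ids(sample))
```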

Getting all the responses from each post

With the list of posts, I extracted all the responses using https://medium.com/_/api/posts/<post_id>/responses

This function takes a list of post IDs and returns a list of responses.

def get_post_responses(posts):
    print('Retrieving the post responses...')

    responses = []

    for post in posts:
        url = MEDIUM + '/_/api/posts/' + post + '/responses'
        response = requests.get(url)
        response_dict = clean_json_response(response)
        responses += response_dict['payload']['value']

    return responses

Filtering the responses

At first, I wanted responses that had gotten a minimum number of claps. But I realized that this might not be a good representation of the community’s appreciation of the response: a user can give more than one clap for the same article.

Instead, I filtered by the number of recommendations. It measures the same thing as claps, but it doesn’t take duplicates into account.

I wanted the minimum to be dynamic, so I passed a variable named recommend_min around.

The following function takes a response and the recommend_min variable. It checks if the response meets that minimum.

def check_if_high_recommends(response, recommend_min):
    # True when the response has at least recommend_min recommends
    return response['virtuals']['recommends'] >= recommend_min

I also wanted recent responses. I filtered out responses that were older than 30 days using this function.

from datetime import datetime, timedelta

def check_if_recent(response):
    limit_date = datetime.now() - timedelta(days=30)
    # createdAt is divided by 1000 to convert milliseconds to seconds
    creation_epoch_time = response['createdAt'] / 1000
    creation_date = datetime.fromtimestamp(creation_epoch_time)

    return creation_date >= limit_date
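Both checks can be combined into a single predicate whose reference time is passed in, which makes the 30-day rule easy to verify. In this sketch, is_interesting, its now and max_age_days parameters, and the two sample responses are all hypothetical; only the createdAt and virtuals.recommends fields come from the functions above.

```python
from datetime import datetime, timedelta

def is_interesting(response, recommend_min, now=None, max_age_days=30):
    """True if the response is recent enough and has enough recommends."""
    now = now or datetime.now()
    # createdAt is in milliseconds since the epoch, as in check_if_recent.
    created = datetime.fromtimestamp(response['createdAt'] / 1000)
    recent = created >= now - timedelta(days=max_age_days)
    popular = response['virtuals']['recommends'] >= recommend_min
    return recent and popular

# Invented sample responses relative to a fixed reference date.
now = datetime(2018, 1, 31)
fresh = {'createdAt': datetime(2018, 1, 20).timestamp() * 1000,
         'virtuals': {'recommends': 12}}
stale = {'createdAt': datetime(2017, 11, 1).timestamp() * 1000,
         'virtuals': {'recommends': 50}}
print(is_interesting(fresh, 10, now=now), is_interesting(stale, 10, now=now))
```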

Getting the username of the author of each response

Once I had all the filtered responses, I grabbed all the authors’ user IDs using the following function.

def get_user_ids_from_responses(responses, recommend_min):
    print('Retrieving user IDs from the responses...')

    user_ids = []

    for response in responses:
        recent = check_if_recent(response)
        high = check_if_high_recommends(response, recommend_min)

        if recent and high:
            user_ids.append(response['creatorId'])

    return user_ids
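The same person may respond to several posts, so the list returned above can contain a user ID more than once. The original script does not deduplicate; one possible way to do it while keeping the original order is sketched here, with unique_in_order as a hypothetical helper.

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict preserves insertion order (Python 3.7+), so keys come back in order.
    return list(dict.fromkeys(items))

print(unique_in_order(['u1', 'u2', 'u1', 'u3', 'u2']))
```

Deduplicating before the username lookup would also avoid repeated requests for the same user.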

User IDs are useless when you’re trying to access someone’s profile. I made this next function query the /_/api/users/<user_id> endpoint to get the usernames.

def get_usernames(user_ids):
    print('Retrieving usernames of interesting users...')

    usernames = []

    for user_id in user_ids:
        url = MEDIUM + '/_/api/users/' + user_id
        response = requests.get(url)
        response_dict = clean_json_response(response)
        payload = response_dict['payload']

        usernames.append(payload['value']['username'])

    return usernames

Putting it all together

After I finished all the functions, I created a pipeline to get my list of recommended users.

def get_interesting_users(username, recommend_min):
    print('Looking for interesting users for %s...' % username)

    user_id = get_user_id(username)
    usernames = get_list_of_followings(user_id)
    posts = get_list_of_latest_posts_ids(usernames)
    responses = get_post_responses(posts)
    users = get_user_ids_from_responses(responses, recommend_min)

    return get_usernames(users)

The script was finally ready! To run it, you have to call the pipeline.

interesting_users = get_interesting_users('Radu_Raicea', 10)
print(interesting_users)

Finally, I added an option to append the results to a CSV with a timestamp.

import csv

def list_to_csv(interesting_users_list):
    # newline='' lets the csv module control line endings
    with open('recommended_users.csv', 'a', newline='') as file:
        writer = csv.writer(file)

        now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        interesting_users_list.insert(0, now)

        writer.writerow(interesting_users_list)

interesting_users = get_interesting_users('Radu_Raicea', 10)
list_to_csv(interesting_users)
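To see what one appended row looks like without creating a file, the same idea can be written against any file-like object. This is a sketch, not the original function: write_users_row and its now parameter are hypothetical, and it builds a new row instead of inserting the timestamp into the caller's list.

```python
import csv
import io
from datetime import datetime

def write_users_row(file, users, now=None):
    """Write one CSV row: a timestamp followed by the usernames."""
    now = now or datetime.now()
    row = [now.strftime('%Y-%m-%d %H:%M:%S')] + list(users)
    csv.writer(file).writerow(row)

# An in-memory buffer stands in for recommended_users.csv.
buffer = io.StringIO()
write_users_row(buffer, ['alice', 'bob'], now=datetime(2018, 1, 31, 12, 0, 0))
print(buffer.getvalue())
```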

The project’s source code is on GitHub.

If you don’t know Python, go read TK’s Learning Python: From Zero to Hero.

If you have suggestions on other criteria that make users interesting, please write them below!

In summary…

I made a Python script for Medium.
The script returns a list of interesting users who are active and post interesting responses on the latest posts of people you are following.
You can take users from the list and run the script with their username instead of yours.