Project: crawl the titles of all free games on Steam (Part I)
Python modules: requests, re
Objective: extract the information we need from websites
When a website presents information that is repetitive but simple, clicking, Ctrl-C, and Ctrl-V over and over quickly becomes frustrating. Simple crawling techniques can do this work for us.
This post involves only two Python modules: requests and re. requests (a third-party package, installable with pip install requests) fetches the HTML of a web page, and re (the built-in regular expression module) extracts the information we need from it.
Example: get the titles of all free games on Steam
import requests
import re

freegame_url = 'https://store.steampowered.com/genre/Free%20to%20Play/#p=0&tab=TopSellers'

# headers = {'User-Agent': ''}
freegame_content = requests.get(freegame_url)
# freegame_content = requests.get(freegame_url, headers=headers)

freegame_html = freegame_content.text  # the HTML of the page as a string

# every title on the page sits inside <div class="tab_item_name">...</div>
title_pat = '<div class="tab_item_name">(.*?)</div>'
titles = re.compile(title_pat).findall(freegame_html)
print(titles)
The headers argument is not necessary for Steam for the time being; a User-Agent header is often added to get around a site's anti-crawling measures.
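If a site does reject bare requests, here is a minimal sketch of sending a User-Agent header; the header string below is just an illustrative browser string, not something Steam specifically requires:

import requests

url = 'https://store.steampowered.com/genre/Free%20to%20Play/#p=0&tab=TopSellers'
# an illustrative browser User-Agent string; any common one will do
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get(url, headers=headers)
print(response.status_code)  # 200 means the request went through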
requests.get(url): the get() function returns a Response object, not the page itself (if you print the result, it shows <Response [200]> when the call succeeds). We want the HTML as text, so we add .text after freegame_content to get the page content as a string.
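A quick sketch of poking at the Response object, so you can see what each piece is:

import requests

response = requests.get('https://store.steampowered.com')
print(response)              # <Response [200]> when the call succeeds
print(response.status_code)  # 200
print(type(response.text))   # <class 'str'> -- the HTML as plain text
print(response.text[:100])   # the first 100 characters of the HTML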
I would recommend printing each intermediate result, or testing the calls directly in the Python shell, so you can see what is going on at every step.
Then we write the regular expression pattern. First we need to know where the titles live in the HTML file, so we open the URL in a browser, press F12 to open the developer tools, and inspect one of the titles to locate it in the page source.
So, based on what we find, we write the pattern. Apparently, every title sits inside a div with class="tab_item_name". The (.*?) group matches a string of any length non-greedily, stopping at the first </div> that follows. findall then collects every string matching this pattern in the HTML file, and we print the resulting list.
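To see why the non-greedy (.*?) matters, here is a small self-contained comparison on a made-up two-title snippet:

import re

html = '<div class="tab_item_name">Game A</div><div class="tab_item_name">Game B</div>'
greedy = re.findall('<div class="tab_item_name">(.*)</div>', html)
lazy = re.findall('<div class="tab_item_name">(.*?)</div>', html)
print(greedy)  # ['Game A</div><div class="tab_item_name">Game B'] -- (.*) runs to the last </div>
print(lazy)    # ['Game A', 'Game B'] -- (.*?) stops at the first </div>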
However, we are not done yet. You may notice that this only lets us crawl the titles on the first page. How do we crawl the titles on all the pages? I will show you how in Part II.