Project: crawl the titles of all free games on Steam (Part I)

Python modules: requests, re 

Objective: extract the information we need from a web page

When a website shows repetitive but simple information, it would be tedious to click, Ctrl-C, and Ctrl-V over and over. Some basic crawling techniques can do the work for us.

This post uses only two Python modules: requests and re. requests fetches the HTML of a page for us (note that requests is a third-party package, installed with pip install requests, while re, the regular-expression module, ships with Python), and re then extracts the information we need.

Example: get the titles of all free games on Steam

import requests
import re

# URL of Steam's "Free to Play" genre page
freegame_url = 'https://store.steampowered.com/genre/Free%20to%20Play/#p=0&tab=TopSellers'
#headers = {'User-Agent': ''}

freegame_content = requests.get(freegame_url)
#freegame_content = requests.get(freegame_url, headers=headers)

# .text gives the page's HTML as a string
freegame_html = freegame_content.text

# the non-greedy group captures each title between the div tags
title_pat = '<div class="tab_item_name">(.*?)</div>'
titles = re.compile(title_pat).findall(freegame_html)
print(titles)

The headers argument is not necessary for Steam for the time being; it is often added to get around a site's anti-crawling measures.
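If a site does start blocking plain requests, the usual first step is to send a browser-like User-Agent. A minimal sketch; the User-Agent string below is just an example copied from a desktop browser, and you would normally paste in whatever your own browser sends:

import requests

freegame_url = 'https://store.steampowered.com/genre/Free%20to%20Play/#p=0&tab=TopSellers'

# example browser-like User-Agent (copy the real one from your own browser)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/120.0.0.0 Safari/537.36'
}

freegame_content = requests.get(freegame_url, headers=headers)
print(freegame_content.status_code)   # 200 if the request was accepted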

requests.get(url): get() returns a Response object (if you print the result of get(), it shows <Response [200]> when the call succeeds). We want the HTML as text, so we append .text to freegame_content to get the page content as a string.

I would recommend printing each intermediate result, or testing the steps directly in the Python shell, so you can see what is going on.
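For instance, a quick session in the Python shell might look like this (a sketch; the status shown assumes the request succeeds):

>>> import requests
>>> r = requests.get('https://store.steampowered.com/genre/Free%20to%20Play/#p=0&tab=TopSellers')
>>> r                 # printing the Response object shows its status code
<Response [200]>
>>> r.status_code
200
>>> type(r.text)      # .text is the whole page as one string
<class 'str'>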

Then we write the regular-expression pattern. First we need to know where the titles sit in the HTML, so we open the URL in a browser and inspect it with F12 (the developer tools). Selecting one title, we find that each one is wrapped in a tag like <div class="tab_item_name">...</div>.

So, according to what we find, we write the pattern. All titles sit inside a div with class="tab_item_name". (.*?) matches a string of any length non-greedily, meaning it stops as soon as the rest of the pattern can match. findall() then collects every match of this pattern in the HTML file, and we print the resulting list.
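To see why the non-greedy group matters, here is a self-contained sketch on a made-up snippet of HTML (the two game titles are invented for illustration):

import re

sample_html = ('<div class="tab_item_name">Game A</div>'
               '<div class="tab_item_name">Game B</div>')

# non-greedy (.*?) stops at the first </div>, so each title is captured on its own
print(re.compile('<div class="tab_item_name">(.*?)</div>').findall(sample_html))
# -> ['Game A', 'Game B']

# a greedy (.*) runs to the last </div> and swallows both divs into one match
print(re.compile('<div class="tab_item_name">(.*)</div>').findall(sample_html))
# -> ['Game A</div><div class="tab_item_name">Game B']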

However, this is not the end yet. You may notice that this only lets us crawl the titles on the first page. How can we crawl the titles on all pages? I will show you how in Part II.
