Project: crawl the titles of all free games on Steam (Part I)
Python modules: requests, re
Objective: extract the information we need from websites
When a website presents information that is repetitive but simple, clicking, Ctrl-C, and Ctrl-V over and over quickly becomes frustrating. Simple crawling techniques can do this work for us.
This post involves only two Python modules: requests and re. requests (a third-party package, installable with pip install requests) fetches the HTML of a web page, and re (the built-in regular expression module) extracts the information we need from it.
Example: get the titles of all free games on Steam
import requests
import re

freegame_url = 'https://store.steampowered.com/genre/Free%20to%20Play/#p=0&tab=TopSellers'

# headers = {'User-Agent': ''}
freegame_content = requests.get(freegame_url)
# freegame_content = requests.get(freegame_url, headers=headers)

freegame_html = freegame_content.text  # the HTML of the page as a string

# every title on the page sits inside <div class="tab_item_name">...</div>
title_pat = '<div class="tab_item_name">(.*?)</div>'
titles = re.compile(title_pat).findall(freegame_html)
print(titles)
The headers argument is not necessary for Steam for the time being; a User-Agent header is often added to get around a site's anti-crawling measures.
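If a site does reject bare requests, here is a minimal sketch of sending a User-Agent header; the header string below is just an illustrative browser string, not something Steam specifically requires:

import requests

url = 'https://store.steampowered.com/genre/Free%20to%20Play/#p=0&tab=TopSellers'
# an illustrative browser User-Agent string; any common one will do
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get(url, headers=headers)
print(response.status_code)  # 200 means the request went through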
requests.get(url): the get() function returns a Response object, not the page itself (if you print the result, it shows <Response [200]> when the call succeeds). We want the HTML as text, so we add .text after freegame_content to get the page content as a string.
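A quick sketch of poking at the Response object, so you can see what each piece is:

import requests

response = requests.get('https://store.steampowered.com')
print(response)              # <Response [200]> when the call succeeds
print(response.status_code)  # 200
print(type(response.text))   # <class 'str'> -- the HTML as plain text
print(response.text[:100])   # the first 100 characters of the HTML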
I would recommend printing each intermediate result, or testing the calls directly in the Python shell, so you can see what is going on at every step.
Then we write the regular expression pattern. First we need to know where the titles live in the HTML file, so we open the URL in a browser, press F12 to open the developer tools, and inspect one of the titles to locate it in the page source.
So, based on what we find, we write the pattern. Apparently, every title sits inside a div with class="tab_item_name". The (.*?) group matches a string of any length non-greedily, stopping at the first </div> that follows. findall then collects every string matching this pattern in the HTML file, and we print the resulting list.
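To see why the non-greedy (.*?) matters, here is a small self-contained comparison on a made-up two-title snippet:

import re

html = '<div class="tab_item_name">Game A</div><div class="tab_item_name">Game B</div>'
greedy = re.findall('<div class="tab_item_name">(.*)</div>', html)
lazy = re.findall('<div class="tab_item_name">(.*?)</div>', html)
print(greedy)  # ['Game A</div><div class="tab_item_name">Game B'] -- (.*) runs to the last </div>
print(lazy)    # ['Game A', 'Game B'] -- (.*?) stops at the first </div>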
However, we are not done yet. You may notice that this only lets us crawl the titles on the first page. How do we crawl the titles on all the pages? I will show you how in Part II.