In this tutorial, we are going to talk about web scraping using Python.


First, what is web scraping? It is a technique for pulling data (text, images, links, or videos) from the web into our own database. Let's look at where web scraping is needed in the real world.

  1. Nowadays we have many competitors in every field. To surpass them, we need data from their websites or blogs to learn about their products, customers, and facilities.

  2. Admins of websites, blogs, and YouTube channels often want their customers' reviews in a database that they can keep up to date. In this situation they use web scraping.


There are many other areas where web scraping is needed; we discussed only two to keep this article short.

You only need basic knowledge of Python, nothing else, so get ready to learn web scraping.


Which technology should we use for web scraping?

We can do this with JavaScript or Python, but in my opinion, and that of many others, it is easiest with Python. You only need the basics of Python; the rest we will learn in this article.


Python Web Scraping Tutorial

1. Retrieving Links and Text from Website and YouTube Channel through Web Scraping

  • In this first point, we will learn how to get the text and the links of any webpage with a few methods and classes.

We are going to do this with the Beautiful Soup library.


1. Install BS4 and the lxml Parser

  • To install BS4 on Windows, open your command prompt or Windows shell and type: pip install bs4

  • To install lxml on Windows, open your command prompt or Windows shell and type: pip install lxml

Note: If the error “pip is not recognized” occurs, it usually means pip is not on your PATH; consult any reference on setting up pip for your Python installation.


To install BS4 on Ubuntu, open your terminal:

  • If you are using Python version 2, type: pip install bs4

  • If you are using Python version 3, type: pip3 install bs4

To install lxml on Ubuntu, open your terminal:

  • If you are using Python version 2, type: pip install lxml

  • If you are using Python version 3, type: pip3 install lxml
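Once the installs finish, a quick check from Python can confirm that the packages are importable. The sketch below is only an illustration: the helper name missing_packages is hypothetical, and it relies on nothing but the standard library.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# bs4 and lxml are installed above; requests is imported later in the article.
missing = missing_packages(["bs4", "lxml", "requests"])
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All packages are installed.")
```

If a name is reported as missing, rerun the matching pip command for the Python version you are actually using.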

2. Open PyCharm and Import Modules

Import the modules we need:

import bs4
import requests

Then take the URL of the website you want to scrape, for example www.thecrazyprogrammer.com:

url = "https://www.thecrazyprogrammer.com/"

Now, with a few lines of code that request this URL and parse the response into a soup object, you get the HTML of the page. This is the same data you see in the page source of the webpage, so you can check it there.
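The original screenshot of this step is not available, so the following is a minimal sketch of what such code might look like, not the author's exact program. The helper fetch_soup and the sample_html string are hypothetical; the built-in "html.parser" is used so the offline part runs without lxml, and "lxml" can be passed instead once it is installed.

```python
import bs4
import requests


def fetch_soup(url, parser="html.parser"):
    """Download `url` and return the parsed page (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on a 4xx/5xx answer
    return bs4.BeautifulSoup(response.text, parser)


# Offline stand-in for a downloaded page, so the parsing step can be tried
# without a network connection.
sample_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = bs4.BeautifulSoup(sample_html, "html.parser")
print(soup.prettify())

# With a network connection, the real page could be loaded like this:
# soup = fetch_soup("https://www.thecrazyprogrammer.com/", "lxml")
```

Printing the soup shows the page's HTML, which is what you would compare against the browser's page source.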

Now let's talk about the find() function. With find() we can get text, links, and much more from our webpage. We add one loop to the program and comment out the previous print line:

for para in soup.find('p'):
    print(para)  # prints each piece inside the first <p> tag

This gives us the first paragraph of the webpage. If you compare the original website view with the PyCharm output, you will see the same text in both.
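To see what find() returns without depending on a live site, here is a small sketch on a made-up HTML snippet. The sample_html content is an assumption for illustration only.

```python
import bs4

# Made-up markup standing in for a downloaded page.
sample_html = """
<div>
  <p>First paragraph with a <a href="/about">link</a>.</p>
  <p>Second paragraph.</p>
</div>
"""
soup = bs4.BeautifulSoup(sample_html, "html.parser")

# find() returns only the first matching tag, or None when nothing matches.
first_para = soup.find("p")
print(first_para)

# Looping over a tag walks its children: here two strings and one <a> tag.
for child in first_para:
    print(repr(child))

# A tag that is not on the page gives None instead of raising an error.
print(soup.find("table"))
```

Because find() can return None, it is worth checking the result before looping over it on pages you do not control.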

Now, if you want all the paragraphs of this webpage, you only need a small change in the code: use the find_all() function instead of find(). You will then get all the paragraphs of the web page.
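A minimal sketch of the find_all() version, again on a made-up snippet (sample_html is an assumption, not the article's page):

```python
import bs4

sample_html = "<p>One</p><div><p>Two</p></div><p>Three</p>"
soup = bs4.BeautifulSoup(sample_html, "html.parser")

# find_all() returns every matching tag, including nested ones.
paragraphs = soup.find_all("p")
for para in paragraphs:
    print(para)  # still printed with the surrounding <p> tags
```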


One problem remains: the <p> tags are printed along with the text. To remove them, we change the code again by adding “.text” to para inside the print function. This gives us only the text, without any tags, so the <p> tags no longer appear in the output.
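The effect of “.text” can be sketched like this. The sample markup is made up, and the extra strip() call is an optional addition that is not part of the original article.

```python
import bs4

sample_html = "<p>Hello <b>world</b></p><p>  Second paragraph  </p>"
soup = bs4.BeautifulSoup(sample_html, "html.parser")

# .text returns the text of a tag and of everything nested inside it.
texts = [para.text for para in soup.find_all("p")]

# Optional: trim the surrounding whitespace that pages often contain.
cleaned = [text.strip() for text in texts]

for text in cleaned:
    print(text)
```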

With that, we have completed our first point: how to get the data (text) and the HTML of a webpage. In the second point we will learn how to get the hyperlinks from a webpage.

2. How to Get All the Links of a Webpage through Web Scraping



In this part, we will learn how to get the links of a webpage, a YouTube channel, or any other web page you want.

The imports stay the same. The only changes are:

Write a for loop over the anchor tags ‘a’, read each link from its href attribute, assign it to a variable inside the loop, and print it. You will then get all the links of the webpage.

You will get all the links, including some with extra stuff (like “../” or “#” at the start of the link).
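The loop over anchor tags can be sketched as follows. The sample links are invented to show the kinds of values that come back; only the site URL is taken from the article.

```python
import bs4

# Made-up anchors: one absolute link, two relative ones, a bare "#",
# and an anchor without any href attribute.
sample_html = """
<a href="https://www.thecrazyprogrammer.com/">Home</a>
<a href="../category/python">Python</a>
<a href="/contact">Contact</a>
<a href="#">Top</a>
<a>No href here</a>
"""
soup = bs4.BeautifulSoup(sample_html, "html.parser")

links = []
for anchor in soup.find_all("a"):
    href = anchor.get("href")  # None when the attribute is missing
    if href is not None:
        links.append(href)
        print(href)
```

Using get("href") rather than anchor["href"] avoids a KeyError on anchors that have no href.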
  • Only some of the links on the console screen are valid as printed. The rest are also links, but because of the extra stuff they are not treated as links. To fix this, we have to change our Python code.

  • We need an if and else condition, and we will also use Python slicing. If we replace “../” with our URL, i.e. https://www.thecrazyprogrammer.com/, we get the valid links of the page in the output console.

Here the if condition checks whether the link (a string) starts with “../”. If it does, we use slicing to take the string from position 3 to the end, with the len() function giving the end index, and add our webpage URL as a prefix to produce the link. Extra stuff like “#” is of no use to us, so we don't include it in the output.

In your case, you can write your own condition according to your output.
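The slicing step described above might look like the sketch below. It is a reconstruction under assumptions, since the original screenshot is not available; raw_links is a made-up list standing in for the scraped href values.

```python
url = "https://www.thecrazyprogrammer.com/"

# Made-up href values of the kinds discussed above.
raw_links = ["../category/python", "#", "https://www.thecrazyprogrammer.com/about"]

fixed_links = []
for link in raw_links:
    if link.startswith("#"):
        continue  # "#" entries are of no use to us
    if link[0:3] == "../":
        # Drop the first three characters and prefix the site URL.
        # link[3:len(link)] is the same slice as link[3:].
        link = url + link[3:len(link)]
    fixed_links.append(link)

for link in fixed_links:
    print(link)
```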


Now the if condition gives us more than one link. We get many links, but there is still one problem: we are not getting the links that start with “/”. To get these as well, we have to change the code further.

So we add an elif condition for “/”, and here too we must check for “#”, otherwise we will get the extra stuff again.
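One way to write the if and elif conditions is as a small function. The name normalize_link is hypothetical, and the sketch assumes the base URL ends with “/”, as the article's URL does.

```python
def normalize_link(link, base_url):
    """Return a full URL for `link`, or None for "#" entries we skip."""
    if link.startswith("#"):
        return None
    if link.startswith("../"):
        return base_url + link[3:]  # drop "../" and prefix the site URL
    elif link.startswith("/"):
        return base_url + link[1:]  # drop "/" because base_url ends with one
    return link  # already a full link


url = "https://www.thecrazyprogrammer.com/"
for raw in ["../category/python", "/contact", "#comments", "https://example.com/page"]:
    fixed = normalize_link(raw, url)
    if fixed is not None:
        print(fixed)
```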

After putting this if and elif condition in our program, we get all the links of the webpage without any errors, and noticeably more of them than the program without the if and elif conditions produced.

In this way we can get all the links and text of a particular page or website. You can find the links of a YouTube channel in the same manner.

Note: If you have a problem getting the links, change the conditions in the program to suit your case, as I have done for mine.

So now we have seen how to get the links of any webpage or YouTube channel page.
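As an alternative to hand-written conditions, the standard library's urllib.parse.urljoin can resolve relative links against a base URL. This is a different technique from the slicing used in this article, shown only as a sketch; the helper name absolute_links and the sample links are assumptions.

```python
from urllib.parse import urljoin


def absolute_links(hrefs, base_url):
    """Resolve each href against base_url, skipping empty and "#" entries."""
    result = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue
        result.append(urljoin(base_url, href))
    return result


base = "https://www.thecrazyprogrammer.com/"
sample = ["../category/python", "/contact", "#", "https://example.com/page"]
for link in absolute_links(sample, base):
    print(link)
```

With urljoin there is no need to adjust the conditions for each new prefix, which may help when a page mixes several styles of relative link.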


3. Log In to Facebook through Web Scraping



With this method we can log in to a Facebook account using scraping.

Conditions: Because of Facebook's high security level, we cannot scrape data from it directly.

So instead of www.facebook.com, we should use the m.facebook.com or mbasic.facebook.com URL.

Let's start scraping.

Let's start with Python. First import all these modules:

import http.cookiejar
import urllib.request
import requests
import bs4


Then create an object with the CookieJar class, which stores the cookies for your Python “browser”.

Create another object, called opener, and build it with urllib.request so that it uses the cookie jar.

Note: Do all of this at your own risk, and don't hack anyone's ID or anything else.
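The two objects described above could be created as in this sketch. The original screenshot is not available, so the exact code is an assumption; the calls themselves are standard-library functions, and the variable name cookie_jar is chosen for illustration.

```python
import http.cookiejar
import urllib.request

# The jar collects cookies that servers send back, like a browser would.
cookie_jar = http.cookiejar.CookieJar()

# The opener sends requests and stores or returns cookies through the jar.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

# Optional: make plain urllib.request.urlopen() calls use this opener too.
urllib.request.install_opener(opener)

print(len(cookie_jar))  # the jar is empty until a request has been made
```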

After this code, find the link of the login form by inspecting the m.facebook.com page. Put the link in quotes, remove all the text after the word “login”, and add “.php” to it. Then type the rest of the code.

payload = {
	'email': "(enter the email of id)",
	'pass': "(enter the password of id)"
}
After this, use the get function and pass the cookie to it.
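How the payload reaches the login link can be sketched as below. This is a guess at the shape of the missing code, not the author's program: login_url follows the "login" plus ".php" description above, send_login is a hypothetical helper, and nothing is sent until that helper is called with your own account details.

```python
import urllib.parse
import urllib.request

# Assumed from the description above: the login link cut after "login" + ".php".
login_url = "https://m.facebook.com/login.php"

payload = {
    "email": "(enter the email of id)",
    "pass": "(enter the password of id)",
}

# Form fields must be URL-encoded and converted to bytes before sending.
data = urllib.parse.urlencode(payload).encode("utf-8")
login_request = urllib.request.Request(login_url, data=data)
print(login_request.get_method())  # a request that carries data is sent as POST


def send_login(opener, request):
    """Send `request` through a cookie-aware opener and return the reply bytes."""
    with opener.open(request) as response:
        return response.read()
```

The opener passed to send_login would be the cookie-aware one created earlier, so that the session cookie from the reply is kept for later requests.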

This code logs us in to Facebook. As I wrote above, do all of this at your own risk and don't hack anyone.

We can't learn the full concept of web scraping from this article alone, but I hope you learned the basics of Python web scraping.

Translated from: https://www.thecrazyprogrammer.com/2019/03/python-web-scraping-tutorial.html

