How to Build a Job Scraper Using Python

Building a job scraper was quite technical and involved a fair amount of brain work. With a little knowledge of HTML, CSS, and Python, it took me about 3 to 5 days to get the code running well. I never knew I could build something like this until I took the first step, with the help of my code coach. The job scraper can get the following from the website and store it in an Excel sheet:

1) job title
2) link that redirects you to apply for the job
3) function
4) timestamp
5) salary
6) location, etc.

CHOOSE AN EDITOR
The first thing to do is choose an editor you are comfortable with; I used VS Code and Sublime Text. Then create a folder in the editor that will contain the project files, and save your script (example.py).

REVIEW THE WEBSITE
Review the website you are about to scrape and write down the content you would like to scrape, like I listed above. To view the HTML behind the website, right-click on any space on the page and you will get a drop-down menu; then click on "inspect". When you scroll through the HTML file, you will notice that any line you click on highlights something on the main website.

Image for post

DOWNLOAD LIBRARIES
The first thing you will need for your code to run is Python libraries such as Beautiful Soup, pandas, and Requests. Quickly go on YouTube and check out how you can install them from your command prompt, so the editor will be able to recognize them and scrape the website efficiently. So install Beautiful Soup, Requests, and pandas.

APPLYING THE LIBRARIES TO THE CODE
I'm going to explain what is required and what every piece of code written performs, so the above project is broken down into 6 sections for better understanding:

1) The first section brings in all the Python libraries needed for the code to work. A library contains modules that provide access to functionality, such as file I/O, that would otherwise be inaccessible to Python programmers.

Image for post
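
The original post shows this section as a screenshot; here is a minimal sketch of what it might look like (the exact imports depend on your own script):

# Install the libraries first from the command prompt if you haven't already:
#   pip install requests beautifulsoup4 pandas
import re                      # regular expressions, used later to clean up whitespace
import requests                # fetches the HTML page from the website
from bs4 import BeautifulSoup  # pulls data out of the HTML file
import pandas as pd            # stores the scraped data and writes it out as a CSV file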

2) The second section of the code gets the link of the website you want to scrape, saves it in a variable, and can print out the HTML file in the editor. The editor can do that because a library like Requests helps to get the HTML file from the browser, and Beautiful Soup helps to pull data out of the HTML file.

Image for post
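
A rough sketch of that section; the URL here is only a placeholder, not the job board from the original post:

url = "https://www.example.com/jobs"                 # placeholder; use the site you want to scrape
response = requests.get(url)                         # fetch the raw HTML from the site
soup = BeautifulSoup(response.text, "html.parser")   # parse it so it can be searched
print(soup.prettify())                               # print the HTML in a readable, indented form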

3) The next part of the code gets the container (in my project the main container is 'search-result') holding all the content I wish to scrape from the website, such as the job title, location, salary, timestamp, etc.

Image for post
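
As a sketch, grabbing every container might look like this, using the 'search-result' class named above (your site's class name will differ):

items = soup.find_all(class_="search-result")   # one element per job posting on the page
print(len(items))                                # how many postings were found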

4) In the fourth part of the project, a sub-container from the whole container on the website is reviewed; it has index 0 because it is the first item in the container. In the code below, the first line finds and prints the job-title item from that sub-container of the main container "search-result", and the same is done for the other items in the sub-container:

Image for post
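
A sketch of pulling fields out of that first card; apart from 'search-result__job-title', which the post names explicitly, the link selector here is an illustrative guess:

first = items[0]                                                  # the first job card on the page
title = first.find(class_="search-result__job-title").get_text().strip()
link = first.find("a")["href"]                                    # the link that redirects you to apply
print(title)
print(link)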

5) After getting all the required items for the first sub-container, the next thing you do is get every single item in the sub-containers from the main container. You can do this using a for loop, a while loop, or even a list comprehension. This list comprehension loops through each item in each container, gets it, and puts it in the form of a list (e.g. artist, musician, teacher):

titles = [item.find(class_='search-result__job-title').get_text().strip() for item in items]

Image for post

The image above shows that the code finds all the job titles in the main container ("search-result"), gets every item that is a job title in the container, and prints it out; the same is done for the other contents like salaries and locations.

Image for post
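
Repeating the same pattern for the other fields might look roughly like this; apart from the job-title class, these class names are illustrative assumptions and must match the actual site:

links = [item.find("a")["href"] for item in items]
locations = [item.find(class_="search-result__location").get_text().strip() for item in items]
salaries = [item.find(class_="search-result__salary").get_text().strip() for item in items]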

6) The next thing to be done is to put each item in the form of a dictionary and store it in a variable, which is then converted to a CSV file that can in turn be opened in an Excel sheet (to display it in a readable format).
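
A minimal sketch of that last step, assuming the lists built above and an illustrative output filename:

jobs = {
    "title": titles,
    "link": links,
    "location": locations,
    "salary": salaries,
}
df = pd.DataFrame(jobs)             # build a table from the dictionary of lists
df.to_csv("jobs.csv", index=False)  # write it out; the CSV opens directly in Excel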

MAJOR PROBLEM ENCOUNTERED
One of the major problems I encountered was that the code printed out excess white space and was not readable, so I used the .strip() function to get rid of the excess space, and I used regex to substitute multiple white spaces with one white space.

IMAGE
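
A short sketch of that cleanup, using a made-up string for illustration:

raw = "Software   Engineer \n   Lagos"
clean = re.sub(r"\s+", " ", raw).strip()   # collapse runs of whitespace into single spaces
print(clean)                               # Software Engineer Lagos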

POINTS TO NOTE

1) You can use the prettify() function to print out your HTML file in a readable form, as shown in fig 2 line 10.

IMAGE

2) Ensure you install and import the required Python libraries (pandas, beautifulsoup, etc.)

Translated from: https://medium.com/swlh/how-to-build-a-job-scraper-using-python-144620bf5ca3
