Help wanted: scraping Tianyancha for the contact information of every company whose name contains a given keyword

The script below was generated with ChatGPT; replace the Cookie value in the headers with your own cookie.

import requests
import pandas as pd
from lxml import etree

headers = {
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7","Accept-Encoding":"gzip, deflate, br","Accept-Language":"zh-CN,zh;q=0.9","Cache-Control":"max-age=0","Connection":"keep-alive","Cookie":"TYCID=cd3d5790932011ee882895fe7d7a2c87; ssuid=6862026551; bannerFlag=true; _ga=GA1.2.2117757969.1701747960; _gid=GA1.2.1331589730.1701747960; HWWAFSESID=85ca94d5f851ae5a2e9; HWWAFSESTIME=1701755123973; csrfToken=EhED4KLQR4rKmWIy4IBH91iA; Hm_lvt_e92c8d65d92d534b0fc290df538b4758=1701755130; jsid=http%3A%2F%2Fwww.tianyancha.com%2F%3Fjsid%3DSEM-BAIDU-PZ-SY-2021112-BEIJING; tyc-user-info-save-time=1701844372763; searchSessionId=1701845421.29031029; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22297365569%22%2C%22first_id%22%3A%2218c3815267d9da-01b50fa97903052-26031051-2073600-18c3815267e91f%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22identities%22%3A%22eyIkaWRlbnRpdHlfY29va2llX2lkIjoiMThjMzgxNTI2N2Q5ZGEtMDFiNTBmYTk3OTAzMDUyLTI2MDMxMDUxLTIwNzM2MDAtMThjMzgxNTI2N2U5MWYiLCIkaWRlbnRpdHlfbG9naW5faWQiOiIyOTczNjU1NjkifQ%3D%3D%22%2C%22history_login_id%22%3A%7B%22name%22%3A%22%24identity_login_id%22%2C%22value%22%3A%22297365569%22%7D%2C%22%24device_id%22%3A%2218c3815267d9da-01b50fa97903052-26031051-2073600-18c3815267e91f%22%7D; Hm_lpvt_e92c8d65d92d534b0fc290df538b4758=1701845422","Host":"www.tianyancha.com","Referer":"https://www.tianyancha.com/?jsid=SEM-BAIDU-PZ-SY-2021112-BEIJING","Sec-Fetch-Dest":"document","Sec-Fetch-Mode":"navigate","Sec-Fetch-Site":"same-origin","Sec-Fetch-User":"?1","Upgrade-Insecure-Requests":"1","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36","sec-ch-ua":"\"Google Chrome\";v=\"119\", \"Chromium\";v=\"119\", \"Not?A_Brand\";v=\"24\"","sec-ch-ua-mobile":"?0","sec-ch-ua-platform":"\"Windows\""
}
url = 'https://www.tianyancha.com/search?key=%E7%BB%A7%E7%BB%AD%E6%95%99%E8%82%B2'
# req=requests.get(url=url,headers=headers).status_code
# print(req)


def get_company_info(key):
    url = "https://www.tianyancha.com/search?key=" + key
    res = requests.get(url, headers=headers).text
    res = etree.HTML(res)
    # take the href of the first matching result link; the original xpath
    # returned <span> elements, which have no .split() method
    result = res.xpath("//a[@class='index_alink__zcia5 link-click']/@href")

    if result:
        company_url = "https://www.tianyancha.com/company/" + result[0].split("-c")[-1]
        return company_url
    else:
        return None


def get_company_contact(url):
    res = requests.get(url, headers=headers).text
    res = etree.HTML(res)
    # take the 4th span inside the div with class 'f0' (the contact field);
    # return None instead of crashing when the page lacks that element
    spans = res.xpath("//div[@class='f0']//span")
    contact_info = spans[3].text if len(spans) > 3 else None
    return contact_info

def main():
    keyword = "继续教育"  # 设置你想要的关键字
    output_data = []

    # iterate over the search-result pages
    for page in range(1, 251):  # adjust the range to the actual number of result pages
        search_url = f"https://www.tianyancha.com/search/p{page}?key={keyword}"
        search_res = requests.get(search_url, headers=headers).text
        search_html = etree.HTML(search_res)
        company_links = search_html.xpath("//div[@class='scroll-list']//a/@href")
        # visit each company link and collect its contact information
        for link in company_links:
            company_url = "https://www.tianyancha.com/company/" + link.split("-c")[-1]
            contact_info = get_company_contact(company_url)
            output_data.append({"公司链接": company_url, "联系信息": contact_info})

    # convert the data to a DataFrame and save it to an Excel file
    df = pd.DataFrame(output_data)
    df.to_excel("company_contacts.xlsx", index=False)
    print("数据提取并保存到Excel完成")

if __name__ == '__main__':
    main()
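
A practical caveat: Tianyancha throttles rapid or anonymous crawling quite aggressively, so requesting 250 result pages back-to-back will likely get the session cookie blocked. Below is a minimal throttled sketch of the main loop. The name `main_throttled`, the delay value, and the deduplication set are my own additions rather than part of the original script; it reuses the `headers`, `get_company_contact`, `requests`, `etree`, and `pd` objects defined above.

```python
import time

def main_throttled(keyword="继续教育", pages=250, delay=1.5):
    """Like main(), but pauses between requests and skips duplicate links."""
    output_data = []
    seen = set()
    for page in range(1, pages + 1):
        search_url = f"https://www.tianyancha.com/search/p{page}?key={keyword}"
        search_html = etree.HTML(requests.get(search_url, headers=headers).text)
        for link in search_html.xpath("//div[@class='scroll-list']//a/@href"):
            company_url = "https://www.tianyancha.com/company/" + link.split("-c")[-1]
            if company_url in seen:  # the same company can appear on several pages
                continue
            seen.add(company_url)
            output_data.append({"Company URL": company_url,
                                "Contact Info": get_company_contact(company_url)})
            time.sleep(delay)  # pause between company pages to stay under the rate limit
    pd.DataFrame(output_data).to_excel("company_contacts.xlsx", index=False)
```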
Scrapy is an open-source web-crawling framework for Python. It provides a simple yet powerful API that lets developers crawl web pages quickly and efficiently; with Scrapy you can easily write spiders that automate page access, data extraction, and persistence.

To crawl Tianyancha data with Scrapy, first install the Scrapy library with pip:

```
pip install scrapy
```

Next, create a Scrapy project with the scrapy startproject command-line tool:

```
scrapy startproject project_name
```

where project_name is whatever name you choose for the project.

After the project is created, enter the project directory; you will see a number of auto-generated files and folders. The spiders folder is where the spider code goes.

Create a new Python file in the spiders folder, for example tianyancha_spider.py. In that file, define a Spider class that inherits from scrapy.Spider; inside it you declare the URLs to crawl, the data-extraction rules, and so on.

Here is a minimal example spider for crawling Tianyancha company information:

```python
import scrapy

class TianyanchaSpider(scrapy.Spider):
    name = 'tianyancha'
    start_urls = ['https://www.tianyancha.com/']

    def parse(self, response):
        # write the data-extraction code here
        pass
```

In the parse method you can extract data from the page with XPath or CSS selectors, then process and store it.

To run the spider, use the scrapy crawl command-line tool:

```
scrapy crawl tianyancha
```

This is a minimal example of crawling Tianyancha with Scrapy; you can extend and optimize it to fit your own needs.
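
To make the empty parse method a little more concrete, here is a rough sketch of how the keyword search from the requests version above could be expressed as a Scrapy spider. The spider name, the page range, and the field names are my own placeholders, and the XPath expressions simply mirror the ones used earlier in this post rather than a verified current page layout; a logged-in Cookie would still have to be supplied (for example via DEFAULT_REQUEST_HEADERS in settings.py).

```python
import scrapy

KEYWORD = "继续教育"  # illustrative keyword; change to your own search term

class TianyanchaSearchSpider(scrapy.Spider):
    name = "tianyancha_search"
    # crawl only the first few result pages until the selectors are confirmed
    start_urls = [
        f"https://www.tianyancha.com/search/p{page}?key={KEYWORD}"
        for page in range(1, 6)
    ]

    def parse(self, response):
        # follow every company link found on a search-result page
        for href in response.xpath("//div[@class='scroll-list']//a/@href").getall():
            yield response.follow(href, callback=self.parse_company)

    def parse_company(self, response):
        # same selector as the requests-based script above; index 3 is assumed
        # to be the contact span and may need adjusting to the real layout
        spans = response.xpath("//div[@class='f0']//span/text()").getall()
        yield {
            "company_url": response.url,
            "contact_info": spans[3] if len(spans) > 3 else None,
        }
```

Running `scrapy crawl tianyancha_search -o contacts.csv` from the project directory would then write the yielded items to a CSV file.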