Scraping forum posts with Python: web scraping every forum post (Python, BeautifulSoup)

Hello once again, fellow stack'ers. Short description: I am web scraping some data from an automotive forum using Python and saving all the data into CSV files. With some help from other Stack Overflow members, I managed to get as far as mining through all the pages for a certain topic, gathering the date, title and link for each post.
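Side note: since the scraped fields end up in CSV files, writing rows with Python's csv module is safer than joining strings with commas, because titles and post text often contain commas themselves. A minimal sketch, using the hypothetical title, link and post names from the snippets below:

import csv

# Hypothetical sketch: one row per post; csv.writer handles quoting,
# so embedded commas in the title or post body don't break the file.
with open("posts.csv", "a", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow([title, link, post])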

I also have a separate script that I am now struggling to integrate (for every link found, Python creates a new soup for it, scrapes through all the posts and then goes back to the previous link).

I would really appreciate any other tips or advice on how to make this better, as it's my first time working with Python. I think it might be my nested loop logic that's messed up, but it seems right to me after checking through it multiple times.

Here's the code snippet:

link += (div.get('href'))
savedData += "\n" + title + ", " + link
tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link)

while tempNumber < 3:
    for tempRow in tempSoup.find_all(id=re.compile("^td_post_")):
        # the "Next Page" lookup is nested inside the per-post loop
        for tempNext in tempSoup.find_all(title=re.compile("^Next Page -")):
            tempNextPage = ""
            tempNextPage += (tempNext.get('href'))
        post = ""
        post += tempRow.get_text(strip=True)
        postData += post + "\n"
    tempNumber += 1
    tempNewUrl = "http://www.automotiveforums.com/vbulletin/" + tempNextPage
    tempSoup = make_soup(tempNewUrl)
    print(tempNewUrl)
tempNumber = 1
number += 1
print(number)
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)
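For reference, the snippets assume a make_soup helper whose definition isn't shown in the post; a minimal sketch of what it presumably looks like, using urllib and BeautifulSoup:

import re  # used by the re.compile filters in the snippets
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    # Assumed implementation: fetch the page and return a parsed tree.
    html = urllib.request.urlopen(url).read()
    return BeautifulSoup(html, "html.parser")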

My main issue with it so far is that tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link) does not seem to create a new soup after it has finished scraping all the posts of a forum thread.

This is the output I'm getting:

http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=2
http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=3
1

So it does seem to find the correct links for the new pages and scrape them, but on the next iteration it prints the new dates AND the exact same pages. There's also a really weird 10-12 second delay after the last link is printed; only then does it hop down to print the number 1, and then bash out all the new dates.

But after moving on to the next forum thread's link, it scrapes the exact same data every time.

Sorry if it looks really messy; it is sort of a side project and my first attempt at doing something useful, so I am very new at this. Any advice or tips would be much appreciated. I'm not asking you to solve the code for me; even some pointers about my possibly wrong logic would be greatly appreciated!

Kind regards, and thanks for reading such an annoyingly long post!

EDIT: I've cut out the majority of the post / code snippet, as I believe people were getting overwhelmed. I've left just the essential bit I am trying to work with. Any help would be much appreciated!

Solution

So after spending a little more time, I have managed to ALMOST crack it. It's now at the point where Python finds every thread and its link on the forum, then goes to each link, reads all its pages and continues on to the next link.

This is the fixed code, if anyone can make any use of it:

link += (div.get('href'))
savedData += "\n" + title + ", " + link
soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + link)

while tempNumber < 4:
    # first loop: scrape every post on the current page
    for postScrape in soup3.find_all(id=re.compile("^td_post_")):
        post = ""
        post += postScrape.get_text(strip=True)
        postData += post + "\n"
        print(post)
    # second loop (no longer nested): find the "Next Page" link
    for tempNext in soup3.find_all(title=re.compile("^Next Page -")):
        tempNextPage = ""
        tempNextPage += (tempNext.get('href'))
        print(tempNextPage)
    soup3 = ""
    soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + tempNextPage)
    tempNumber += 1
tempNumber = 1
number += 1
print(number)
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)

All I had to do was separate the two for loops that were nested within each other into their own loops. Still not a perfect solution, but hey, it ALMOST works.

The non-working bit: the first 2 threads of the provided link have multiple pages of posts, while the following 10+ threads do not. I cannot figure out a way to check the value of for tempNext in soup3.find_all(title=re.compile("^Next Page -")): outside the loop to see whether it's empty or not. If no next-page element / href is found, it just reuses the last one; but if I reset the value after each run, it no longer mines each page =l A solution that just created another problem :D.
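For what it's worth, one way around this is BeautifulSoup's find(), which returns the first match or None, so the next-page check can happen outside any loop and the paging stops cleanly when a thread has no further pages. A sketch of the pagination loop rewritten this way, under the same assumptions as above (same make_soup helper and selectors):

while True:
    # scrape every post on the current page
    for postScrape in soup3.find_all(id=re.compile("^td_post_")):
        postData += postScrape.get_text(strip=True) + "\n"
    # find() returns None when there is no "Next Page" link, so
    # single-page threads stop here instead of reusing a stale href
    tempNext = soup3.find(title=re.compile("^Next Page -"))
    if tempNext is None:
        break
    soup3 = make_soup("http://www.automotiveforums.com/vbulletin/" + tempNext.get('href'))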
