How to filter stop words in Python: parsing HTML with Beautiful Soup and filtering stop words

I am parsing specific information from a website into a file. Right now my program looks at a webpage, finds the right HTML tag, and parses out the right contents. Now I want to filter these "results" further.

I am parsing out the ingredients, which are located in the <div class="ingredients"...> tag. This parser does the job nicely, but I want to process the results further.

When I run this parser, it removes numbers, symbols, commas, and slashes (\ or /) but leaves all the text. When I run it on the website I get results like:

cup olive oil

cup chicken broth

cloves garlic minced

tablespoon paprika

Now I want to process this further by removing stop words like "cup", "cloves", "minced", and "tablespoon", among others. How exactly do I do this? The code is written in Python, which I am not very good at; I am only using this parser to collect information that I could enter manually but would rather not.

Any detailed help on how to do this would be appreciated! My code is below.

Code:

import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip('123456789.,/\ ') for s in ingreds.findAll('li')]

    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

Solution:

import urllib2
import BeautifulSoup
import string

badwords = set([
    'cup','cups',
    'clove','cloves',
    'tsp','teaspoon','teaspoons',
    'tbsp','tablespoon','tablespoons',
    'minced'
])

def cleanIngred(s):
    # remove leading and trailing whitespace
    s = s.strip()
    # remove numbers and punctuation at the ends of the string
    s = s.strip(string.digits + string.punctuation)
    # remove unwanted words
    return ' '.join(word for word in s.split() if word not in badwords)

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]

    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

results in

olive oil

chicken broth

garlic,

paprika

garlic powder

poultry seasoning

dried oregano

dried basil

thick cut boneless pork chops

salt and pepper to taste

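Note the comma left after "garlic": str.strip only removes characters from the two ends of the string, so s.strip(string.punctuation) never touches a comma in the middle of the line, and it is still attached to "garlic" once "minced" is dropped.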
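One way to get rid of the stray comma is to remove digits and punctuation everywhere in the string rather than only at its ends. The following is a minimal sketch, assuming Python 3 and a regular expression in place of str.strip; the names clean_ingredient and STOP_WORDS are made up for illustration and are not part of the answer above.

```python
import re
import string

# Hypothetical stop-word set for illustration; extend it as needed.
STOP_WORDS = {
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced',
}

# Character class matching any single ASCII digit or punctuation character.
_JUNK = re.compile('[' + re.escape(string.digits + string.punctuation) + ']')

def clean_ingredient(text):
    """Drop digits and punctuation anywhere in text, then drop stop words."""
    # Replace each junk character with a space so neighbouring words stay separate.
    text = _JUNK.sub(' ', text)
    # split() with no arguments also collapses the extra whitespace.
    words = text.split()
    # Compare in lower case so "Cup" and "cup" are both filtered.
    return ' '.join(w for w in words if w.lower() not in STOP_WORDS)

if __name__ == '__main__':
    for line in ['1/4 cup olive oil', '2 cloves garlic, minced']:
        print(clean_ingredient(line))
```

The trade-off is that punctuation inside a word is removed too, so a hyphenated term such as "thick-cut" would come out as two words. If that matters, a narrower character class (for example only digits, commas, periods, and slashes) may be a better fit.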
