How to filter stop words in Python: parsing HTML with Beautiful Soup and removing stop words

Latest recommended article published 2024-03-26 13:59:48

weixin_39863155


Article link: https://blog.csdn.net/weixin_39863155/article/details/112839665


I am parsing specific information from a website into a file. Right now my program fetches a webpage, finds the right HTML tag, and parses out the right contents. Now I want to filter these "results" further.

I am parsing out the ingredients, which are located in a <div class="ingredients"...> tag. This parser does the job nicely, but I want to process the results further.

When I run this parser, it removes numbers, symbols, commas, and slashes (\ or /) but leaves all the text. When I run it on the website I get results like:

cup olive oil

cup chicken broth

cloves garlic minced

tablespoon paprika
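
As a minimal sketch of why the output looks like this, the snippet below applies the same `strip(...)` character set to a few sample lines. The sample strings and the helper name `strip_ends` are hypothetical and not taken from the recipe page. The backslash is written as `\\` so the snippet also runs on Python 3.

```python
# Character set from the question's code: digits 1-9, '.', ',', '/', '\' and space.
# Note that '0' is not part of the set.
STRIP_CHARS = '123456789.,/\\ '


def strip_ends(line):
    # str.strip(chars) removes any of the given characters from both ends only;
    # characters in the middle of the string are never touched.
    return line.strip(STRIP_CHARS)


# Hypothetical input lines, for illustration only.
samples = ['1/4 cup olive oil', '1 tablespoon paprika', '10 peppercorns']
for sample in samples:
    print(repr(strip_ends(sample)))
```

Because `0` is missing from the set, a quantity such as `10` would leave a stray `0` behind. Using `string.digits` instead covers all ten digits.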

Now I want to process this further by removing stop words like "cup", "cloves", "minced", and "tablespoon", among others. How exactly do I do this? This code is written in Python and I am not very good at it; I am just using this parser to get information that I could enter manually, but would rather not.

Any detailed help on how to do this would be appreciated! My code is below.

Code:

import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip('123456789.,/\ ') for s in ingreds.findAll('li')]

    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

Solution:

import urllib2
import BeautifulSoup
import string

badwords = set([
    'cup','cups',
    'clove','cloves',
    'tsp','teaspoon','teaspoons',
    'tbsp','tablespoon','tablespoons',
    'minced'
])

def cleanIngred(s):
    # remove leading and trailing whitespace
    s = s.strip()
    # remove digits and punctuation from both ends of the string
    s = s.strip(string.digits + string.punctuation)
    # remove unwanted words
    return ' '.join(word for word in s.split() if word not in badwords)

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]

    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

results in

olive oil

chicken broth

garlic,

paprika

garlic powder

poultry seasoning

dried oregano

dried basil

thick cut boneless pork chops

salt and pepper to taste

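Note the comma left after "garlic". I expected s.strip(string.punctuation) to take care of it, but strip only trims characters from the two ends of the whole string. A comma in the middle of the line (presumably something like "garlic, minced") survives and only ends up trailing once "minced" has been filtered out.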
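
One way to get rid of the stray comma is to trim digits and punctuation from each word rather than from the whole line. The following is a minimal sketch under that assumption, not the original answer's code. The name `clean_ingredient`, the case-insensitive comparison, and the sample inputs are additions for illustration. It uses only the standard library and runs on Python 3.

```python
import string

# Same stop-word list as in the solution code.
BADWORDS = {
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced',
}

# Characters trimmed from both ends of every individual word.
TRIM_CHARS = string.digits + string.punctuation


def clean_ingredient(text):
    words = []
    for raw in text.split():
        # Trim per word, so "garlic," becomes "garlic" and a quantity
        # such as "1/4" becomes an empty string.
        word = raw.strip(TRIM_CHARS)
        # Skip empty leftovers and stop words (compared case-insensitively,
        # which is an assumption beyond the original code).
        if word and word.lower() not in BADWORDS:
            words.append(word)
    return ' '.join(words)


# Hypothetical input lines, for illustration only.
print(clean_ingredient('3 cloves garlic, minced'))
print(clean_ingredient('1/4 cup olive oil'))
```

In the solution code, this function could take the place of `cleanIngred` in the list comprehension. Punctuation inside a word, such as a hyphen, is left alone because only the ends of each word are trimmed.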
