Applied Text Mining in Python, Week 1 (notes)

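These notes cover basic text processing in Python: splitting text, finding specific words, parsing more complex text with regular expressions, and cleaning and analyzing text data with pandas. The examples show how to extract, filter, and manipulate text data, which is useful for natural language processing, social media analysis, and data science projects.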


Working With Text

text1 = "Ethics are built right into the ideals and objectives of the United Nations "
len(text1) # The length of text1
76
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.
len(text2)
14
text2
['Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations', '']
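The empty string at the end of the list appears because text1 ends with a space and split(' ') treats every single space as a separator. As a small sketch (not part of the original notes), calling split() with no argument splits on runs of whitespace and drops that empty entry:

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

# split(' ') treats every single space as a separator, so the trailing
# space produces an empty string at the end of the list.
words_with_empty = text1.split(' ')

# split() with no argument splits on runs of whitespace and ignores
# leading and trailing whitespace.
words = text1.split()

print(len(words_with_empty))  # 14
print(len(words))             # 13
```

Which form to use depends on whether empty tokens carry meaning for the task; for word counts, split() is usually the safer choice.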
List comprehension allows us to find specific words:
[w for w in text2 if len(w) > 3] # Words that are greater than 3 letters long in text2
['Ethics', 'built', 'right', 'into', 'ideals', 'objectives', 'United', 'Nations']
[w for w in text2 if w.istitle()] # Capitalized words in text2
['Ethics', 'United', 'Nations']
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'
['Ethics', 'ideals', 'objectives', 'Nations']
We can find unique words using set().
text3 = 'To be or not to be'
text4 = text3.split(' ')
len(text4)
6
len(set(text4))
5
set(text4)
{'To', 'be', 'not', 'or', 'to'}
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.
4
set([w.lower() for w in text4])
{'be', 'not', 'or', 'to'}
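The same case-insensitive unique-word lookup can also be written as a set comprehension. This is a minimal sketch; the name unique_words is chosen here for illustration and does not come from the course material:

```python
text3 = 'To be or not to be'

# A set comprehension builds the set directly, without an intermediate list.
unique_words = {w.lower() for w in text3.split()}

# Sets are unordered, so sort before printing to get a stable result.
print(sorted(unique_words))  # ['be', 'not', 'or', 'to']
```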

Processing free-text

text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')
text6
['"Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations"', '#UNSG', '@', 'NY', 'Society', 'for', 'Ethical', 'Culture', 'bit.ly/2guVelr']
Finding hashtags:
[w for w in text6 if w.startswith('#')]
['#UNSG']
Finding callouts:
[w for w in text6 if w.startswith('@')]
['@']
We can use regular expressions to help us with more complex parsing.

For example, the pattern '@[A-Za-z0-9_]+' matches any word that contains '@' immediately followed by one or more characters, each of which is a:

  • capital letter ('A-Z')
  • lowercase letter ('a-z')
  • digit ('0-9')
  • or underscore ('_')
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')
import re # import re - a module that provides support for regular expressions
[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]
['@UN', '@UN_Women']
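Filtering split words with re.search keeps the whole word whenever the pattern occurs anywhere inside it. As an alternative sketch (not from the original notes), re.findall can be applied to the full string to return only the matched text, which also skips the bare '@' without splitting first. The names callouts and hashtags, and the hashtag pattern, are assumptions made for this example:

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and '
         'objectives of the United Nations" '
         '#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# findall scans the whole string and returns every non-overlapping match.
# The bare '@' is skipped because '+' requires at least one character after it.
callouts = re.findall(r'@[A-Za-z0-9_]+', text7)

# The same idea works for hashtags by swapping the leading character.
hashtags = re.findall(r'#[A-Za-z0-9_]+', text7)

print(callouts)  # ['@UN', '@UN_Women']
print(hashtags)  # ['#UNSG']
```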

Working with Text Data in pandas

import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df


# find the number of characters for each string in df['text']
df['text'].str.len()
0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

# find the number of tokens for each string in df['text']
df['text'].str.split().str.len()
0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')
0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

# find how many times a digit occurs in each string
df['text'].str.count(r'\d')
0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

# find all occurrences of the digits
df['text'].str.findall(r'\d')
0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')
0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

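# replace weekdays with '???'
# (regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default)
df['text'].str.replace(r'\w+day\b', '???', regex=True)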
0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

# replace weekdays with 3 letter abbreviations
# (a callable replacement requires regex=True)
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)
0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
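
A callable replacement behaves the same way in the standard re module, which can make it easier to see what the lambda receives. The following is a minimal sketch on a single sentence, not the pandas code path itself; the names sentence and shortened are chosen for illustration:

```python
import re

sentence = "Monday: The doctor's appointment is at 2:45pm."

# The callable receives a re.Match object; group(1) is the text captured
# by the first pair of parentheses, here the full weekday name.
shortened = re.sub(r'(\w+day\b)', lambda m: m.group(1)[:3], sentence)

print(shortened)  # Mon: The doctor's appointment is at 2:45pm.
```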

# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')
    0   1
0   2  45
1  11  30
2   7  00
3  11  15
4  08  10

# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')


# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')
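
str.extractall returns one row per match, indexed by the original row and a match number, with one column per capture group (named after the groups when names are given). As a rough plain-Python analogue, and not how pandas implements it, the same pattern can be run with re.finditer to see which values would be collected. The dictionary keys 'row' and 'match' are labels chosen for this sketch:

```python
import re

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.",
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

pattern = re.compile(
    r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

rows = []
for row_index, sentence in enumerate(time_sentences):
    # finditer yields every match in the sentence, so Friday produces two rows.
    for match_index, m in enumerate(pattern.finditer(sentence)):
        # groupdict() maps each group name to the text it captured.
        rows.append({'row': row_index, 'match': match_index, **m.groupdict()})

for r in rows:
    print(r)
```

The Friday sentence contributes two entries (match 0 and match 1), which is the difference between extractall and extract, where only the first match per row is kept.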

