Applied Text Mining in Python Week 1 (notes)

Working With Text

text1 = "Ethics are built right into the ideals and objectives of the United Nations "
len(text1) # The length of text1
76
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.
len(text2)
14
text2
['Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations', '']
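Note the empty string at the end of the list: splitting on a single space keeps the empty piece after the trailing space in text1. A minimal sketch of the difference, assuming plain whitespace-delimited words are all that is wanted (the names by_space and by_whitespace are illustrative, not from the course):

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

# split(' ') splits on every single space, so the trailing space yields ''
by_space = text1.split(' ')

# split() with no argument splits on runs of whitespace and drops empty strings
by_whitespace = text1.split()

print(len(by_space))       # 14
print(len(by_whitespace))  # 13
print(by_space[-1] == '')  # True
```

The rest of these notes keep split(' ') as in the course, which is why len(text2) is 14 rather than 13.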
List comprehension allows us to find specific words:
[w for w in text2 if len(w) > 3] # Words that are greater than 3 letters long in text2
['Ethics', 'built', 'right', 'into', 'ideals', 'objectives', 'United', 'Nations']
[w for w in text2 if w.istitle()] # Capitalized words in text2
['Ethics', 'United', 'Nations']
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'
['Ethics', 'ideals', 'objectives', 'Nations']
We can find unique words using set().
text3 = 'To be or not to be'
text4 = text3.split(' ')
len(text4)
6
len(set(text4))
5
set(text4)
{'To', 'be', 'not', 'or', 'to'}
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.
4
set([w.lower() for w in text4])
{'be', 'not', 'or', 'to'}

Processing free-text

text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')
text6
['"Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations"', '#UNSG', '@', 'NY', 'Society', 'for', 'Ethical', 'Culture', 'bit.ly/2guVelr']
Finding hashtags:
[w for w in text6 if w.startswith('#')]
['#UNSG']
Finding callouts:
[w for w in text6 if w.startswith('@')]
['@']
We can use regular expressions to help us with more complex parsing.

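For example, '@[A-Za-z0-9_]+' matches an '@' followed by one or more characters, each of which is:

  • an uppercase letter ('A-Z')
  • a lowercase letter ('a-z')
  • a digit ('0-9')
  • or an underscore ('_')

re.search(pattern, w) succeeds if the pattern matches anywhere in w, so the list comprehension keeps every word that contains such a callout.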
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')
import re # import re - a module that provides support for regular expressions
[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]
['@UN', '@UN_Women']
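As an alternative to splitting first and filtering afterwards, the same pattern can be applied to the whole string with re.findall. This is only a sketch of that variation, not the course's approach; the hashtag pattern and the names callouts and hashtags are assumptions added for illustration:

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# findall returns only the matched substrings, scanning the whole text
callouts = re.findall(r'@[A-Za-z0-9_]+', text7)

# Same idea for hashtags (this pattern is an assumption, not from the course)
hashtags = re.findall(r'#[A-Za-z0-9_]+', text7)

print(callouts)  # ['@UN', '@UN_Women']
print(hashtags)  # ['#UNSG']
```

The two approaches differ when a callout is attached to other characters: the comprehension with re.search returns the whole word (for example a trailing comma would be kept), while re.findall returns just the part that matches the pattern. The lone '@' is skipped in both cases because '+' requires at least one following character.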

Working with Text Data in pandas

import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df


# find the number of characters for each string in df['text']
df['text'].str.len()
0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

# find the number of tokens for each string in df['text']
df['text'].str.split().str.len()
0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')
0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

# find how many times a digit occurs in each string
df['text'].str.count(r'\d')
0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

# find all occurrences of the digits
df['text'].str.findall(r'\d')
0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')
0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

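# replace weekdays with '???' (regex=True is needed in pandas 2.0 and later, where str.replace treats the pattern literally by default)
df['text'].str.replace(r'\w+day\b', '???', regex=True)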
0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

# replace weekdays with 3 letter abbreviations (a callable replacement requires regex=True)
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)
0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
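The regex argument of str.replace decides whether the pattern is treated as a regular expression or as a literal string, and its default has changed between pandas versions. A small self-contained sketch of the difference, using two of the sentences above (the Series name days and the result names are illustrative assumptions):

```python
import pandas as pd

days = pd.Series(["Monday: The doctor's appointment is at 2:45pm.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."])

# regex=True: the pattern is a regular expression, so the weekday is replaced
masked = days.str.replace(r'\w+day\b', '???', regex=True)

# regex=False: the pattern is searched for as literal text, which never occurs here
literal = days.str.replace(r'\w+day\b', '???', regex=False)

# A callable replacement receives the match object and only works with regex=True
abbreviated = days.str.replace(r'(\w+day)\b', lambda m: m.group(1)[:3], regex=True)

print(masked.tolist())
print(literal.tolist())
print(abbreviated.tolist())
```

Passing regex explicitly keeps the behaviour the same regardless of the installed pandas version.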

# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')
    0   1
0   2  45
1  11  30
2   7  00
3  11  15
4  08  10

# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')


# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')
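To make the shape of the extractall result concrete, here is a minimal sketch on a two-row frame built from the sentences above. It assumes the default integer index; the names pattern and times are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'text': ["Monday: The doctor's appointment is at 2:45pm.",
                            "Friday: Take the train at 08:10 am, arrive at 09:00am."]})

pattern = r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))'
times = df['text'].str.extractall(pattern)

# One row per match; columns are named after the groups
print(list(times.columns))  # ['time', 'hour', 'minute', 'period']
print(len(times))           # 3 matches in total across the two sentences

# The row index has two levels: the original row label and a match counter,
# so selecting label 1 gives every time found in the second sentence
print(times['time'].loc[1].tolist())  # ['08:10 am', '09:00am']
```

Unlike extract, which keeps only the first match per row, extractall returns every match, which is why the second sentence contributes two rows.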


Text mining, also known as text analytics, is the process of extracting useful information from unstructured or semi-structured text data. This involves using various natural language processing (NLP) techniques to analyze and understand the content of the text. Text mining can be applied to a wide range of text data sources, including social media posts, customer reviews, news articles, and scientific papers.

The primary goal of text mining is to uncover insights and patterns that can be used to inform decision-making and improve business outcomes. For example, a company may use text mining to analyze customer feedback and identify common themes and issues that need to be addressed. A healthcare organization may use text mining to analyze patient records and identify patterns in disease diagnosis and treatment.

Text mining involves several steps, including data collection, preprocessing, analysis, and visualization. The data is usually first cleaned and preprocessed to remove noise and irrelevant information. NLP techniques are then used to tokenize the text, identify parts of speech, and extract entities and sentiment. The resulting data is analyzed using statistical and machine learning techniques to uncover patterns and relationships.

Text mining has numerous applications in industries such as marketing, finance, healthcare, and government. It helps organizations to gain insights into customer behavior, market trends, and public opinion. It is also used to detect fraud, identify security threats, and monitor social media for crisis management.
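As a very small illustration of the preprocessing and analysis steps described above, the sketch below lowercases some text, tokenizes it with a regular expression, and counts the tokens. The review strings are made up for illustration, the tokenize helper is hypothetical, and real pipelines typically use far richer tokenization than this:

```python
import re
from collections import Counter

# Hypothetical feedback snippets, invented for this example only
reviews = ["Delivery was slow, but the support team was helpful.",
           "Slow delivery again. Support never replied!"]

def tokenize(text):
    """Lowercase the text and return each run of letters as a token."""
    return re.findall(r'[a-z]+', text.lower())

# Count how often each token appears across all reviews
counts = Counter(token for review in reviews for token in tokenize(review))

print(counts['delivery'])  # 2
print(counts['slow'])      # 2
print(counts['helpful'])   # 1
```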