1. Install the required packages
```
# Read .docx files
!pip install python-docx
# Or install from the Tsinghua PyPI mirror instead
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple python-docx
# Chinese and English word segmentation
!pip install jieba
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple jieba
# Export to Excel
!pip install pandas
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas
```
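After installing, it can help to confirm that the packages are actually importable before moving on. The sketch below is only an illustration: `REQUIRED` and `missing_packages` are hypothetical names, not part of the original notes. It relies on the fact that python-docx is imported under the module name `docx`.

```python
import importlib.util

# Map pip package names to import names (python-docx is imported as "docx").
# REQUIRED and missing_packages are illustrative names, not from the original notes.
REQUIRED = {"python-docx": "docx", "jieba": "jieba", "pandas": "pandas"}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


print("Missing packages:", missing_packages() or "none")
```

If anything is listed as missing, rerun the matching `pip install` line above.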
2. Read the docx file into one large string
```python
import docx

# Open the Word document (python-docx is imported as `docx`)
document = docx.Document("Python.docx")
# Join the text of all paragraphs into one space-separated string
content = " ".join(para.text for para in document.paragraphs)
```
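Word documents often contain empty paragraphs, which leave runs of extra spaces in the joined string. One possible variant, sketched here under that assumption, skips blank paragraphs before joining. `join_paragraphs` and `read_docx_text` are hypothetical helper names, not part of the original notes. Note also that `document.paragraphs` covers body paragraphs only, so text inside tables is not included.

```python
def join_paragraphs(paragraphs, sep=" "):
    """Join paragraph strings, skipping ones that are empty or whitespace only."""
    stripped = (p.strip() for p in paragraphs)
    return sep.join(p for p in stripped if p)


def read_docx_text(path):
    """Read all body paragraphs of a .docx file into one string."""
    # Imported lazily so join_paragraphs can be used without python-docx installed.
    import docx

    document = docx.Document(path)
    return join_paragraphs(para.text for para in document.paragraphs)


# Demonstration on plain strings; with a real file use read_docx_text("Python.docx").
print(join_paragraphs(["First paragraph.", "", "   ", "Second paragraph."]))
```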
3. Chinese word segmentation
```
import jieba
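# Precise mode (cut_all=False); jieba.cut returns a generator, not a list
seg_list = jieba.cut(content, cut_all=False)
print(type(seg_list))
# Filter out punctuation and meaningless single characters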