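Reference: Knowledge Graph: Data Science Technique to Mine Information from Text (with Python code)
The link may need a proxy (VPN) to open.
The article is fairly easy to follow, and by working through the author's approach and examples I was able to reproduce the results.
Dataset link:
When implementing the code, watch out for version compatibility between your IDE/Python interpreter and the libraries. This matters most for spacy, whose parameters differ between old and new versions. For reference, I used Python 3.8.19.
To build a fresh environment, write the following into requirements.txt and run pip install -r requirements.txt in a terminal. Note that this is a full dump of my environment, so a few entries (for example pywin32, which is Windows-only, and torch==2.1.0+cu121, which is a CUDA build) may need to be adjusted or removed on other machines.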
# Python version: 3.8.19
absl-py==2.0.0
accelerate==0.23.0
aiofiles==23.2.1
aiohttp==3.8.6
aiosignal==1.3.1
aliyun-python-sdk-core==2.14.0
aliyun-python-sdk-kms==2.16.2
altair==5.1.2
annotated-types==0.6.0
anyio==3.7.1
asgiref==3.7.2
astor==0.8.1
async-timeout==4.0.3
attrdict==2.0.1
attrs==23.1.0
Babel==2.13.1
backports.zoneinfo==0.2.1
bce-python-sdk==0.8.95
beautifulsoup4==4.12.2
blinker==1.6.3
blis==0.7.11
boto3==1.28.82
botocore==1.31.82
bottle==0.12.25
cachetools==5.3.1
catalogue==2.0.10
certifi==2023.7.22
cffi==1.16.0
charset-normalizer==3.3.0
click==8.1.7
cloudpathlib==0.16.0
colorama==0.4.6
common==0.1.2
confection==0.1.3
ConfigArgParse==1.7
contourpy==1.1.1
cpm-kernels==1.0.11
crcmod==1.7
cryptography==41.0.7
cssselect==1.2.0
cssutils==2.9.0
ctranslate2==3.20.0
cycler==0.12.1
cymem==2.0.8
Cython==3.0.5
data==0.4
datasets==2.19.0
decorator==4.4.2
dill==0.3.7
docopt==0.6.2
dual==0.0.10
dynamo3==0.4.10
easydict==1.11
en-core-web-sm==3.7.1
et-xmlfile==1.1.0
evaluate==0.4.1
exceptiongroup==1.1.3
faiss-cpu==1.7.1.post2
fastapi==0.103.2
fasttext-wheel==0.9.2
ffmpy==0.3.1
filelock==3.12.4
fire==0.5.0
Flask==3.0.0
flask-babel==4.0.0
flatbuffers==23.5.26
flywheel==0.5.4
fonttools==4.43.1
frozenlist==1.4.0
fsspec==2023.6.0
funcsigs==1.0.2
future==0.18.3
gast==0.3.3
gitdb==4.0.10
GitPython==3.1.37
google-auth==2.23.4
google-auth-oauthlib==1.0.0
gradio==3.47.1
gradio_client==0.6.0
grpcio==1.59.2
h11==0.14.0
httpcore==0.18.0
httpx==0.25.0
huggingface-cli==0.1
huggingface-hub==0.22.2
icetk==0.0.4
idna==3.4
imageio==2.32.0
imbalanced-learn==0.12.2
imgaug==0.4.0
importlib-metadata==6.8.0
importlib-resources==6.1.0
iopath==0.1.10
itsdangerous==2.1.2
jieba==0.42.1
Jinja2==3.1.2
jmespath==0.10.0
joblib==1.3.2
jsonify==0.5
jsonschema==4.19.1
jsonschema-specifications==2023.7.1
kiwisolver==1.4.5
langcodes==3.3.0
latex2mathml==3.75.2
layoutparser==0.3.4
Levenshtein==0.23.0
libaio==0.9.1
llvmlite==0.41.1
lmdb==1.4.1
lxml==4.9.3
Markdown==3.5
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.7.3
mdtex2html==1.2.0
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
murmurhash==1.0.10
networkx==3.1
nltk==3.8.1
numba==0.58.1
numpy==1.21.0
oauthlib==3.2.2
onnxruntime==1.10.0
opencv-contrib-python==4.2.0.32
opencv-python==4.6.0.66
OpenNMT-py==2.3.0
openpyxl==3.1.2
openxlab==0.0.29
opt-einsum==3.3.0
orjson==3.9.7
oss2==2.17.0
packaging==23.2
paddle==1.0.2
paddle-bfloat==0.1.7
paddleclas==2.5.1
paddleocr==2.7.0.3
paddlepaddle==2.4.1
pandas==2.0.3
pdf2docx==0.5.5
pdf2image==1.17.0
pdfminer.six==20231228
pdfplumber==0.11.0
peewee==3.17.0
peft==0.5.0
Pillow==10.0.0
pip==23.3.1
pipreqs==0.4.13
pkgutil_resolve_name==1.3.10
portalocker==2.8.2
premailer==3.10.0
preshed==3.0.9
prettytable==3.9.0
protobuf==3.20.0
prox==0.0.17
psutil==5.9.5
pyahocorasick==2.0.0
pyarrow==13.0.0
pyarrow-hotfix==0.6
pyasn1==0.5.0
pyasn1-modules==0.3.0
pybind11==2.11.1
pyclipper==1.3.0.post5
pycparser==2.21
pycryptodome==3.19.0
pydantic==2.4.2
pydantic_core==2.10.1
pydeck==0.8.1b0
pydub==0.25.1
Pygments==2.16.1
PyMuPDF==1.20.2
PyMuPDFb==1.23.6
pynndescent==0.5.12
pyonmttok==1.37.1
pyparsing==3.1.1
pypdfium2==4.29.0
PySocks==1.7.1
python-dateutil==2.8.2
python-docx==1.1.0
python-geoip-python3==1.3
python-Levenshtein==0.23.0
python-multipart==0.0.6
pytz==2023.3.post1
PyWavelets==1.4.1
pywin32==306
PyYAML==6.0.1
rapidfuzz==3.5.2
rarfile==4.1
referencing==0.30.2
regex==2023.10.3
requests==2.28.2
requests-oauthlib==1.3.1
responses==0.18.0
rich==13.4.2
rouge-chinese==1.0.3
rpds-py==0.10.4
rsa==4.9
s3transfer==0.7.0
sacrebleu==2.3.1
safetensors==0.4.3
scikit-image==0.17.2
scikit-learn==1.3.2
scipy==1.10.1
semantic-version==2.10.0
sentencepiece==0.1.95
setuptools==60.2.0
shapely==2.0.2
six==1.16.0
smart-open==6.4.0
smmap==5.0.1
sniffio==1.3.0
soupsieve==2.5
spacy==3.7.2
spacy-legacy==3.0.12
spacy-loggers==1.0.5
sqlparse==0.4.4
srsly==2.4.8
sse-starlette==1.6.5
starlette==0.27.0
streamlit==1.27.2
sympy==1.12
tabulate==0.9.0
tenacity==8.2.3
tensorboard==2.14.0
tensorboard-data-server==0.7.2
termcolor==2.3.0
thinc==8.2.1
threadpoolctl==3.2.0
tifffile==2023.7.10
tight==0.1.0
tokenizers==0.13.3
toml==0.10.2
toolz==0.12.0
torch==2.1.0+cu121
torchaudio==2.1.0
torchtext==0.5.0
torchvision==0.16.0
tornado==6.3.3
tqdm==4.65.2
transformers==4.26.1
typer==0.9.0
typing_extensions==4.8.0
tzdata==2023.3
tzlocal==5.1
ujson==5.8.0
umap==0.1.1
umap-learn==0.5.6
urllib3==1.26.18
uvicorn==0.23.2
validators==0.22.0
visualdl==2.5.3
waitress==2.1.2
wasabi==1.1.2
watchdog==3.0.0
wcwidth==0.2.9
weasel==0.3.3
websockets==11.0.3
Werkzeug==3.0.1
wheel==0.41.2
xxhash==3.4.1
yarg==0.1.9
yarl==1.9.2
zipp==3.17.0
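To confirm that the environment really ended up with the pinned versions, a small check like the one below may help. It is only a sketch: parse_pins and find_mismatches are helper names I made up, and the package names in the usage note are my guess at what the article's code imports.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines from requirements.txt content into a dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, names):
    """Return {name: (pinned, installed)} for packages whose versions differ.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name in names:
        wanted = pins.get(name)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

For example, find_mismatches(parse_pins(open("requirements.txt").read()), ["spacy", "pandas", "networkx", "matplotlib", "tqdm"]) should return an empty dict when those packages match the list above.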
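You also need a pretrained English language model, en_core_web_sm:
You can install it from a terminal with python -m spacy download en_core_web_sm, which fetches the model release that matches your installed spacy. The model version has to correspond to the spacy version.
Alternatively, download the model to your machine:
third-party-oneoffs/en-core-web-sm: spacy-models en-core-web-sm (github.com)
or use the resource attached at the top of this post.
Then run this command in a terminal:
pip install en_core_web_sm-2.3.0.tar.gz
Replace 2.3.0 with the version of the file you actually downloaded. With spacy==3.7.2 from the list above, that should be a 3.7.x model such as en_core_web_sm-3.7.1.tar.gz, since a 2.3.0 model is meant for spacy 2.3.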
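Since the model version has to line up with spacy, here is a rough way to check the pairing. It assumes the usual spaCy convention that a model a.b.x is built for spaCy a.b. model_matches_spacy and check_installed_model are hypothetical helper names of mine, not part of spaCy.

```python
def model_matches_spacy(spacy_version, model_version):
    """Return True when both versions share the same major.minor prefix."""
    return spacy_version.split(".")[:2] == model_version.split(".")[:2]


def check_installed_model(name="en_core_web_sm"):
    """Load the installed model and compare its version with spaCy's."""
    import spacy  # imported here so the helper above works without spaCy

    nlp = spacy.load(name)
    # nlp.meta holds the model's metadata, including its version string.
    return model_matches_spacy(spacy.__version__, nlp.meta["version"])
```

model_matches_spacy("3.7.2", "3.7.1") is True, while model_matches_spacy("3.7.2", "2.3.0") is False. spaCy also ships its own check, python -m spacy validate, which lists installed models and whether they are compatible.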
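You can run this test code to check that the model loads:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("The 22-year-old recently won ATP Challenger tournament.")
# Print each token together with its dependency label
for tok in doc:
    print(tok.text, "...", tok.dep_)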
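After that, you can take the code from the article linked at the top and run it. Read the article carefully: both the dark-background and the light-background boxes contain code you need.
If you try it out, feel free to message me to share your experience and thoughts~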
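One place where the old and new spacy parameters differ is Matcher.add. As far as I can tell, the article was written for spaCy 2.x, where the call looks like matcher.add(key, None, pattern). In spaCy 3.x the patterns go in a list instead: matcher.add(key, [pattern]). The sketch below shows both forms. matcher_add_args and demo are names I invented, and the one-token pattern is only for illustration, not the pattern used in the article.

```python
def matcher_add_args(spacy_version, key, pattern):
    """Return the positional arguments for Matcher.add for a given spaCy version."""
    major = int(spacy_version.split(".")[0])
    if major >= 3:
        # spaCy 3.x: add(key, patterns), where patterns is a list of patterns
        return (key, [pattern])
    # spaCy 2.x: add(key, on_match, *patterns), with on_match=None
    return (key, None, pattern)


def demo():
    """Run a tiny match with whichever spaCy version is installed."""
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank("en")  # tokenizer only, no model download needed
    matcher = Matcher(nlp.vocab)
    pattern = [{"LOWER": "won"}]  # illustrative pattern: the single token "won"
    matcher.add(*matcher_add_args(spacy.__version__, "matching_1", pattern))
    doc = nlp("The 22-year-old recently won ATP Challenger tournament.")
    # Each match is (match_id, start, end); turn it into the matched text.
    return [doc[start:end].text for _, start, end in matcher(doc)]
```

Calling demo() should run without a TypeError on either major version. If you paste the article's code unchanged into a spaCy 3.x environment and see an error from matcher.add, this call is the first thing I would check.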