python-pptx: Extract text from slide titles

I am building a document retrieval engine in Python that returns documents ranked by their relevance to a user-submitted query. My document collection includes PowerPoint files. For these, I want the results page to show the first few slide titles so the user gets a clearer picture of the content (much like the snippets in Google search results).

So basically, I want to extract the text of the slide titles from the PPT files using Python. I am using the python-pptx package for that. Currently my implementation looks something like this:

from pptx import Presentation

prs = Presentation(filepath)  # load the ppt
slide_titles = []  # container for slide titles

# Characters stripped from both ends of a title. The original listing is
# cut off after the "/", so only the visible characters are kept here.
STRIP_CHARS = """ !@#$%^&*)(_-+=}{][:;.?"'/"""

for slide in prs.slides:  # iterate over each slide
    title_shape = slide.shapes[0]  # consider the zeroth indexed shape as the title
    if title_shape.has_text_frame:  # only shapes with a text frame carry text
        title = title_shape.text.strip(STRIP_CHARS)
        # add the title only if it is not already in the slide_titles container
        if title not in slide_titles:
            slide_titles.append(title)

But as you can see, I am assuming that the zero-indexed shape on each slide is the slide title, which is obviously not always the case. Any ideas on how to accomplish this?

Thanks in advance.
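
One possible direction, offered as a sketch rather than a definitive answer: python-pptx exposes the title placeholder of a slide as `slide.shapes.title`, which is `None` when the slide has no title placeholder. The helper names `extract_slide_titles` and `load_slide_titles` and the `limit` parameter below are hypothetical, introduced only for illustration. Slides whose "title" is just an ordinary text box (not a title placeholder) would still be missed by this approach.

```python
def extract_slide_titles(prs, limit=None):
    """Return unique, non-empty slide titles in slide order.

    `prs` is expected to be a python-pptx Presentation. `limit` (hypothetical
    parameter) caps how many titles are collected, e.g. for a results page.
    """
    titles = []
    for slide in prs.slides:
        if limit is not None and len(titles) >= limit:
            break
        # shapes.title is the title placeholder, or None if the slide has none
        title_shape = slide.shapes.title
        if title_shape is None or not title_shape.has_text_frame:
            continue
        text = title_shape.text_frame.text.strip()
        # skip empty titles and titles that were already collected
        if text and text not in titles:
            titles.append(text)
    return titles


def load_slide_titles(filepath, limit=None):
    """Open a .pptx file and return its slide titles."""
    # Imported here so extract_slide_titles can be tried out with stand-in
    # objects even where python-pptx is not installed.
    from pptx import Presentation

    return extract_slide_titles(Presentation(filepath), limit)
```

Calling `load_slide_titles(filepath, limit=3)` would then give at most the first three distinct titles to show next to a search result.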
