python-pptx: Extract text from slide titles

Latest recommended article published on 2021-12-08 12:00:00

weixin_39819576

I am building a document retrieval engine in Python that returns documents ranked by their relevance to a user-submitted query. My document collection also includes PowerPoint files. For the PPTs, I want the results page to show the first few slide titles, to give the user a clearer picture of the content (much like the snippets in Google search results).

So basically, I want to extract the text of the slide titles from PPT files using Python. I am using the python-pptx package for that. Currently my implementation looks something like this:

from pptx import Presentation

# characters trimmed from both ends of a title
# (the list is cut off in the post; only the part that survives is kept here)
STRIP_CHARS = """ !@#$%^&*)(_-+=}{][:;.?"'/"""

prs = Presentation(filepath)  # load the ppt
slide_titles = []  # container for slide titles

for slide in prs.slides:  # iterate over each slide
    title_shape = slide.shapes[0]  # consider the zeroth-indexed shape as the title
    if title_shape.has_text_frame:  # only shapes with a text frame carry text
        title = title_shape.text.strip(STRIP_CHARS)
        # add the title only if it is not already in the slide_titles container
        if title not in slide_titles:
            slide_titles.append(title)

But as you can see, I am assuming that the zero-indexed shape on each slide is the slide title, which is obviously not always the case. Any ideas on how to accomplish this?

Thanks in advance.
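
One possible direction, offered as a sketch rather than a definitive answer: python-pptx exposes `slide.shapes.title`, which returns the slide's title placeholder, or `None` when the slide has no title placeholder. Relying on it avoids guessing from the shape order. The helper names `collect_titles` and `titles_from_file` and the `limit` parameter are hypothetical, introduced only for illustration. The sketch also assumes that the decks actually use title placeholders; a title typed into an ordinary text box would not be found this way.

```python
def collect_titles(slides, limit=3):
    """Return up to `limit` distinct, non-empty slide titles, in slide order."""
    titles = []
    for slide in slides:
        title_shape = slide.shapes.title  # None if the slide has no title placeholder
        if title_shape is None:
            continue
        text = title_shape.text_frame.text.strip()
        # skip empty titles and titles that were already collected
        if text and text not in titles:
            titles.append(text)
        if len(titles) >= limit:
            break
    return titles


def titles_from_file(path, limit=3):
    """Open a .pptx file and collect its first few slide titles."""
    from pptx import Presentation  # imported here so collect_titles has no hard dependency
    return collect_titles(Presentation(path).slides, limit)
```

A call such as `titles_from_file("deck.pptx")` (the file name is a placeholder) would then return a list of the first few titles for the results page.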