Extract the paragraph under a heading that contains a given set of words

I have a text file containing data like the following:

History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms

Now I would like to extract the paragraph or main section that contains a particular set of words, such as {"software", "open source"}.

I tried a regexp and an if loop but could not get the required output. Can anyone help me?


Asked by surya vamsi, 18.09.2017


Answers (2)


Use a regular expression:

import re
my_string = """History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms
"""
pattern = r'\n.+(?:software|open\s?source).+\n'  # raw string avoids an invalid-escape warning for \s
paragraph_list = re.findall(pattern, my_string)
print(paragraph_list)

As a result, paragraph_list will contain every paragraph that includes one of the keywords you mentioned.

EDIT

If you want the keywords to be dynamic or supplied as a list/tuple:

import re
keywords = ('software', 'open source')

my_string = """History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms
"""
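# re.escape makes each keyword match literally, even if it contains regex metacharacters
pattern = r'\n.+(?:' + '|'.join(map(re.escape, keywords)) + r').+\n'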
paragraph_list = re.findall(pattern, my_string)
print(paragraph_list)
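
The keyword-driven pattern above can also be wrapped in a small helper. The following is only a sketch under a few assumptions: each paragraph sits on a single line (as in the sample data), matching should ignore case (so "Software" and "software" both count), and the name `find_paragraphs` is made up for illustration. It anchors on line boundaries with `re.MULTILINE` instead of requiring literal newlines around the match, so a paragraph on the first or last line is not missed.

```python
import re

def find_paragraphs(text, keywords):
    """Return every line (one-line paragraph) of text that contains any keyword."""
    if not keywords:
        return []
    # Escape each keyword so it is matched literally, then join into an alternation.
    alternation = '|'.join(re.escape(k) for k in keywords)
    # ^ and $ match at line boundaries because of re.MULTILINE;
    # '.' never crosses a newline, so each match is exactly one line.
    pattern = re.compile(r'^.*(?:' + alternation + r').*$',
                         re.IGNORECASE | re.MULTILINE)
    return pattern.findall(text)

sample = (
    "History\n\n"
    "The term data science has existed for over thirty years.\n\n"
    "Application\n\n"
    "Open Source Software started supplanting proprietary tools.\n"
)
print(find_paragraphs(sample, ('software', 'open source')))
```

Unlike the original pattern, the returned strings carry no leading or trailing newline, and a line that begins or ends with the keyword is matched as well.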
Answered by francisco sollima, 18.09.2017

You can easily check whether a substring is part of a larger string:

>>> text='In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms'
>>> "software" in text
True

You can extract the lines of your file that contain a particular word:

>>> with open('yourfile.txt', 'r') as f:
...     result = [line for line in f if 'software' in line]
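
The line filter above returns matching lines only. If the heading is wanted together with its paragraph, one possible sketch is to split the text on blank lines and pair each heading with the block that follows it. This assumes the file strictly alternates heading, blank line, paragraph, blank line, as in the question's sample; the function name `sections_with_keywords` is hypothetical.

```python
import re

def sections_with_keywords(text, keywords):
    """Return (heading, paragraph) pairs whose paragraph contains any keyword.

    Assumes blocks separated by blank lines alternate heading, paragraph, ...
    """
    # Split on one or more blank lines and drop empty blocks.
    blocks = [b.strip() for b in re.split(r'\n\s*\n', text.strip()) if b.strip()]
    # Even-indexed blocks are headings, odd-indexed blocks are their paragraphs.
    pairs = zip(blocks[0::2], blocks[1::2])
    lowered = [k.lower() for k in keywords]
    # Case-insensitive substring test against each paragraph.
    return [(h, p) for h, p in pairs if any(k in p.lower() for k in lowered)]

sample = (
    "History\n\n"
    "The term data science has existed for over thirty years.\n\n"
    "Application \n\n"
    "Data science software reached an inflection point where open source "
    "software started supplanting proprietary software.\n"
)
for heading, paragraph in sections_with_keywords(sample, ["software", "open source"]):
    print(heading)
    print(paragraph)
```

To run it on a file, read the contents first, for example `with open('yourfile.txt') as f: text = f.read()`, and pass `text` in place of `sample`.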
Answered by Dadep, 18.09.2017