Extract the paragraph under a heading that contains a given set of words

I have a text file containing data like this:

History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms

Now I would like to extract the paragraph, or the section under a heading, that contains any of a given set of words, such as {"software", "open source"}.

I tried a regexp and an if loop, but could not get the required output. Can anyone help me?

python information-extraction grep

surya vamsi 18.09.2017

Answers (2)

Use a regular expression:

import re
my_string = """History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms
"""
# Raw string, so the regex engine receives \s and \n without an invalid-escape warning
pattern = r'\n.+(?:software|open\s?source).+\n'
paragraph_list = re.findall(pattern, my_string)
print(paragraph_list)

As a result, paragraph_list will contain every paragraph that includes one of the keywords you mentioned.

EDIT

If you want the keywords to be dynamic or supplied as a list/tuple:

import re
keywords = ('software', 'open source')

my_string = """History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms
"""
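# re.escape keeps any regex metacharacters in a keyword from changing the pattern
pattern = r'\n.+(?:' + '|'.join(re.escape(k) for k in keywords) + r').+\n'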
paragraph_list = re.findall(pattern, my_string)
print(paragraph_list)
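
The pattern above is case-sensitive and returns each matching line together with its surrounding newlines, without the heading it belongs to. If the heading is wanted as well, one possible approach is to split the text into blocks first. The following is only a minimal sketch: it assumes that blocks are separated by blank lines and strictly alternate heading, paragraph, heading, paragraph, as in the sample, and the name find_sections is hypothetical.

```python
import re

def find_sections(text, keywords):
    """Return (heading, paragraph) pairs whose paragraph mentions any keyword.

    Assumes the layout from the question: blocks separated by blank lines,
    alternating heading, paragraph, heading, paragraph, ...
    """
    # Split on blank lines and drop empty blocks.
    blocks = [b.strip() for b in re.split(r'\n\s*\n', text) if b.strip()]
    # Escape the keywords so they match literally, and ignore case.
    matcher = re.compile('|'.join(re.escape(k) for k in keywords), re.IGNORECASE)
    # Pair even-indexed blocks (headings) with odd-indexed blocks (paragraphs).
    pairs = zip(blocks[0::2], blocks[1::2])
    return [(heading, para) for heading, para in pairs if matcher.search(para)]

sample = (
    "History\n\n"
    "The term data science has existed for over thirty years.\n\n"
    "Application \n\n"
    "Open source software started supplanting proprietary software.\n"
)
sections = find_sections(sample, ('software', 'open source'))
print(sections)
```

If the file has headings without a paragraph, or several paragraphs per heading, the even/odd pairing no longer holds and the split step would need a different rule for recognising headings.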

francisco sollima 18.09.2017

You can easily check whether a substring is part of a larger string:

>>> text='In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms'
>>> "software" in text
True

You can extract the lines of your file that contain a particular word:

>>> with open('yourfile.txt', 'r') as f:
...     result = [line for line in f if 'software' in line]
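
Filtering line by line returns whole paragraphs only when each paragraph sits on a single line. A rough sketch of the same idea applied to paragraphs, assuming they are separated by an empty line and that matching should ignore case (the function name paragraphs_with_keywords and the temporary demo file are made up for illustration):

```python
import os
import tempfile

def paragraphs_with_keywords(path, keywords):
    """Return the paragraphs of the file at `path` that contain any keyword."""
    with open(path, 'r', encoding='utf-8') as f:
        content = f.read()
    # Paragraphs are taken to be blocks separated by an empty line.
    paragraphs = [p.strip() for p in content.split('\n\n') if p.strip()]
    lowered = [k.lower() for k in keywords]
    # Keep a paragraph if at least one keyword occurs in it, ignoring case.
    return [p for p in paragraphs if any(k in p.lower() for k in lowered)]

# Demo on a throwaway file so the sketch runs on its own.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("History\n\nData science is an old term.\n\n"
              "Application\n\nOpen source software\nis widely used.\n")
    demo_path = tmp.name

found = paragraphs_with_keywords(demo_path, ['Software', 'open source'])
os.remove(demo_path)
print(found)
```

Plain substring tests also match inside longer words, so a keyword like "soft" would match "software"; a regular expression with word boundaries would be needed to avoid that.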

Dadep 18.09.2017
