Word frequency using a dictionary

My problem is that I can't figure out how to display word counts using a dictionary whose keys are word lengths. For example, consider the following piece of text:

   "This is the sample text to get an idea!. "

Then the required output would be

3 2
2 3
0 5

because the given sample text contains 3 words of length 2, 2 words of length 3, and 0 words of length 5.

I have got as far as displaying the frequency of each word:

def word_frequency(filename):
    word_count_list = []
    word_freq = {}
    text = open(filename, "r").read().lower().split()
    word_freq = [text.count(p) for p in text]
    dictionary = dict(zip(text,word_freq))
    return dictionary

print(word_frequency("text.txt"))

which displays the dict in this format:

{'all': 3, 'show': 1, 'welcomed': 1, 'not': 2, 'availability': 1, 'television,': 1, '28': 1, 'to': 11, 'has': 2, 'ehealth,': 1, 'do': 1, 'get': 1, 'they': 1, 'milestone': 1, 'kroes,': 1, 'now': 3, 'bringing': 2, 'eu.': 1, 'like': 1, 'states.': 1, 'them.': 1, 'european': 2, 'essential': 1, 'available': 4, 'because': 2, 'people': 3, 'generation': 1, 'economic': 1, '99.4%': 1, 'are': 3, 'eu': 1, 'achievement,': 1, 'said': 3, 'for': 3, 'broadband': 7, 'networks': 2, 'access': 2, 'internet': 1, 'across': 2, 'europe': 1, 'subscriptions': 1, 'million': 1, 'target.': 1, '2020,': 1, 'news': 1, 'neelie': 1, 'by': 1, 'improve': 1, 'fixed': 2, 'of': 8, '100%': 1, '30': 1, 'affordable': 1, 'union,': 2, 'countries.': 1, 'products': 1, 'or': 3, 'speeds': 1, 'cars."': 1, 'via': 1, 'reached': 1, 'cloud': 1, 'from': 1, 'needed': 1, '50%': 1, 'been': 1, 'next': 2, 'households': 3, 'commission': 5, 'live': 1, 'basic': 1, 'was': 1, 'said:': 1, 'more': 1, 'higher.': 1, '30mbps': 2, 'that': 4, 'but': 2, 'aware': 1, '50mbps': 1, 'line': 1, 'statement,': 1, 'with': 2, 'population': 1, "europe's": 1, 'target': 1, 'these': 1, 'reliable': 1, 'work': 1, '96%': 1, 'can': 1, 'ms': 1, 'many': 1, 'further.': 1, 'and': 6, 'computing': 1, 'is': 4, 'it': 2, 'according': 1, 'have': 2, 'in': 5, 'claimed': 1, 'their': 1, 'respective': 1, 'kroes': 1, 'areas.': 1, 'responsible': 1, 'isolated': 1, 'member': 1, '100mbps': 1, 'digital': 2, 'figures': 1, 'out': 1, 'higher': 1, 'development': 1, 'satellite': 4, 'who': 1, 'connected': 2, 'coverage': 2, 'services': 2, 'president': 1, 'a': 1, 'vice': 1, 'mobile': 2, "commission's": 1, 'points': 1, '"access': 1, 'rural': 1, 'the': 16, 'agenda,': 1, 'having': 1}

Jakubee 05.05.2014

comment

Your sample output doesn't look right. - Adam Smith 06.05.2014

comment

Those are just the first three keys. I get them from a text file as input. - Jakubee 06.05.2014

comment

How does 3 2\n2 3\n0 5 relate in any way to the input from your example? - Adam Smith 06.05.2014

comment

I entered it as a sample. I just noticed that I hadn't corrected "as there are 2 words of length 2, 3 words of length 3, and 3 words of length 5 in the given sample text.", which may have confused you (I'll change it as soon as possible). Sorry about that. I just wanted to give you a rough idea! - Jakubee 06.05.2014

comment

Then I have no idea what you expect to get. If I understood you correctly, "This is the sample text to get an idea!. " should return {2: 3, 3: 2, 4: 3, 6: 1}. Why would 5: 0 be a key in that dictionary? - Adam Smith 06.05.2014

comment

Because there is no word of length 5 in that string. The program asks for only 3 keys: 2, 3 and 5. - Jakubee 07.05.2014

Answers (3)


def freqCounter(infilepath):
    answer = {}
    with open(infilepath) as infile:
        for line in infile:  # iterate over the open file, not the path string
            for word in line.strip().split():
                l = len(word)
                if l not in answer:
                    answer[l] = 0
                answer[l] += 1
    return answer

Alternatively:

import collections
def freqCounter(infilepath):
    with open(infilepath) as infile:
        return collections.Counter(len(word) for line in infile for word in line.strip().split())

inspectorG4dget 05.05.2014

comment

Calling .strip() before .split() is pointless. .split() doesn't work if the input text contains punctuation, as in the example in the question. You forgot the .lower() (or better, .casefold()) call. You could use a regular expression to extract the words from the text. - jfs 06.05.2014


Use collections.Counter

import collections

sentence = "This is the sample text to get an idea"

Count = collections.Counter([len(a) for a in sentence.split()])

print(Count)
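
The question asks for the counts at three specific lengths (2, 3 and 5), including a zero for a length that does not occur. Here is a minimal sketch, assuming the lengths of interest are known in advance (the name `wanted_lengths` is hypothetical and not from the question). A `Counter` returns 0 for a missing key instead of raising `KeyError`, so the zero row needs no special handling:

```python
import collections

sentence = "This is the sample text to get an idea"

# Count how many words have each length.
length_counts = collections.Counter(len(word) for word in sentence.split())

# Hypothetical list of lengths to report; the question mentions 2, 3 and 5.
wanted_lengths = [2, 3, 5]

# Counter returns 0 for missing keys, so length 5 is reported as 0.
report = ["%d %d" % (length_counts[length], length) for length in wanted_lengths]
print("\n".join(report))
```

For this sentence the three printed lines are `3 2`, `2 3` and `0 5`, which is the output requested in the question.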

Steve Barnes 05.05.2014


To count how many words in a text have a given length (a size -> frequency distribution), you can use a regular expression to extract the words:

#!/usr/bin/env python3
import re
from collections import Counter

text = "This is the sample text to get an idea!. "
words = re.findall(r'\w+', text.casefold())
frequencies = Counter(map(len, words)).most_common() 
print("\n".join(["%d word(s) of length %d" % (n, length) 
                 for length, n in frequencies]))

Output

3 word(s) of length 2
3 word(s) of length 4
2 word(s) of length 3
1 word(s) of length 6

Note: it automatically ignores punctuation such as !. after 'idea', unlike the .split()-based solutions.
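
To make the difference concrete, here is a small sketch that compares the last token produced by each approach on the same sample text (the variable names are illustrative only):

```python
import re

text = "This is the sample text to get an idea!. "

# Whitespace splitting keeps punctuation attached to the last word.
split_words = text.split()
# \w+ matches only runs of word characters (letters, digits, underscore).
regex_words = re.findall(r'\w+', text.casefold())

print(split_words[-1], len(split_words[-1]))  # the token 'idea!.' has length 6
print(regex_words[-1], len(regex_words[-1]))  # the token 'idea' has length 4
```

With .split(), the last word is counted as length 6 instead of 4, which shifts the resulting distribution.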

To read words from a file, you can read it line by line and extract the words from each line the same way it was done for text in the first code example:

from itertools import chain

with open(filename) as file:
    words = chain.from_iterable(re.findall(r'\w+', line.casefold())
                                for line in file)
    # use words here.. (the same as above)
    frequencies = Counter(map(len, words)).most_common()

print("\n".join(["%d word(s) of length %d" % (n, length) 
                 for length, n in frequencies]))
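
The snippet above relies on `re`, `Counter` and `filename` being defined earlier. A self-contained sketch of the same idea, wrapped in a hypothetical helper named `length_frequencies` (the name and signature are assumptions, not part of the original answer), might look like this:

```python
import re
from collections import Counter
from itertools import chain


def length_frequencies(path):
    """Return a Counter mapping word length -> number of words in the file."""
    with open(path) as file:
        # Lazily extract the words from each line.
        words = chain.from_iterable(re.findall(r'\w+', line.casefold())
                                    for line in file)
        # Consume the lazy iterator while the file is still open.
        return Counter(map(len, words))
```

For example, `length_frequencies("text.txt")` would return the counts for the question's input file, assuming that file exists in the working directory.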

In practice, you could use a list to find the frequency distribution of lengths, provided you ignore words whose length exceeds a threshold:

def count_lengths(words, maxlen=100):
    frequencies = [0] * (maxlen + 1)
    for length in map(len, words):
        if length <= maxlen:
            frequencies[length] += 1
    return frequencies

Example

import re

text = "This is the sample text to get an idea!. "
words = re.findall(r'\w+', text.casefold())
frequencies = count_lengths(words)
print("\n".join(["%d word(s) of length %d" % (n, length) 
                 for length, n in enumerate(frequencies) if n > 0]))

Output

3 word(s) of length 2
2 word(s) of length 3
3 word(s) of length 4
1 word(s) of length 6

jfs 06.05.2014
