Previous word length

Question

I need to write a function that takes a single word as its argument and returns the average length (in characters) of the words preceding that word in the text. If the word is the first word in the text, the length of the preceding word for that occurrence must be zero. For instance:

    >>> average_length("the")
    4.4
    >>> average_length('whale')
    False
    >>> average_length('ship.')
    3.0

This is what I wrote so far,

    def average_length(word):
        text = "Call me Ishmael. Some years ago - never mind how long..........."
        words = text.split()
        wordCount = len(words)
        Sum = 0
        for word in words:
            ch = len(word)
            Sum = Sum + ch
        avg = Sum/wordCount
        return avg

I know this is not correct at all, but I am having trouble working out how to approach it properly. The question asks me to find every occurrence of the word in the text and, for each occurrence, take the length of the word immediately before it. Not every word from the beginning up to this word, only the one directly before it.

I should also mention that all tests will only validate my code using the first paragraph from Moby Dick:

"Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."

+6

python

Roadrunner Mar 21 '16 at 9:41

5 answers

It seems you can save a lot of computation time by going over your data only once:

    from collections import defaultdict

    prec = defaultdict(list)
    text = "Call me Ishmael. Some years ago..".split()

Create two iterators over the list. We call next on the second one, so that from then on, whenever we take an element from each iterator, we get a word and its successor.

    first, second = iter(text), iter(text)
    next(second)

Zipping the two iterators ("abc", "def" → "ad", "be", "cf"), we append the length of the first word to the list of predecessor lengths of the second. This works because we use defaultdict(list), which returns an empty list for any key that does not exist yet.

    for one, two in zip(first, second):  # pairwise
        prec[two].append(len(one))
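To see what the pairing produces, here is a minimal sketch on a made-up six-word list. The sample words are an assumption chosen for illustration, not the test text.

```python
from collections import defaultdict

# Hypothetical sample words, chosen only to keep the output short.
words = "a quick fox saw a dog".split()

first, second = iter(words), iter(words)
next(second)  # advance the second iterator by one word

# zip stops as soon as the second iterator is exhausted.
pairs = list(zip(first, second))
print(pairs)

prec = defaultdict(list)
for one, two in pairs:
    prec[two].append(len(one))
print(dict(prec))
```

Note that the first word gets no entry for its first occurrence, because nothing precedes it; the zero length the question asks for would have to be added separately.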

Finally, we can create a new dictionary mapping each word to the average length of its predecessors: the sum of the list divided by its length. Instead of the dict comprehension, you can also use a regular loop.

    # avg_prec_len = {key: sum(prec[key]) / len(prec[key]) for key in prec}
    avg_prec_len = {}
    for key in prec:
        # prec[key] is a list of lengths
        avg_prec_len[key] = sum(prec[key]) / len(prec[key])

Then you can just look the word up in this dictionary.

(If you are using Python 2, use itertools.izip instead of zip and add from __future__ import division.)
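Putting the steps together, a minimal Python 3 sketch could look like the following. The function name build_average_table and the sample sentence are hypothetical, and the explicit zero entry for the first word is an addition taken from the question's requirement, not part of the snippets above.

```python
from collections import defaultdict


def build_average_table(text):
    """Map each word to the average length of the words preceding it."""
    words = text.split()
    prec = defaultdict(list)
    if words:
        # Requirement from the question: the first word counts as being
        # preceded by a word of length zero.
        prec[words[0]].append(0)
    first, second = iter(words), iter(words)
    next(second, None)  # the default avoids StopIteration on empty text
    for one, two in zip(first, second):
        prec[two].append(len(one))
    return {word: sum(lengths) / len(lengths) for word, lengths in prec.items()}


table = build_average_table("a quick fox saw a dog")
print(table["a"])                 # averages 0 (first word) and len("saw")
print(table.get("whale", False))  # words not in the text fall back to False
```

Building the table once and then using dict.get for each lookup keeps repeated queries cheap.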

+3

L3viathan Mar 21 '16 at 10:01

I suggest breaking this task down into some atomic parts:

    from __future__ import division  # int / int should result in float

    # Input data:
    text = "Lorem ipsum dolor sit amet dolor ..."
    word = "dolor"

    # First of all, let's extract words from string
    words = text.split()

    # Find indices of picked word in words
    indices = [i for i, some_word in enumerate(words) if some_word == word]

    # Find indices of preceding words
    preceding_indices = [i-1 for i in indices]

    # Find preceding words, handle first word case
    preceding_words = [words[i] if i != -1 else "" for i in preceding_indices]

    # Calculate mean of words length
    mean = sum(len(w) for w in preceding_words) / len(preceding_words)

    # Check if result is correct
    # (len('ipsum') + len('amet')) / 2 = 9 / 2 = 4.5
    assert mean == 4.5

Obviously, we can wrap this in a function. I removed the comments here:

    def mean_length_of_preceding_words(word, text):
        words = text.split()
        indices = [i for i, some_word in enumerate(words) if some_word == word]
        preceding_indices = [i-1 for i in indices]
        preceding_words = [words[i] if i != -1 else "" for i in preceding_indices]
        mean = sum(len(w) for w in preceding_words) / len(preceding_words)
        return mean

Obviously, performance is not the priority here. I tried to use only built-ins (from __future__ ... counts as built-in, in my opinion) and to keep the intermediate steps clean and clear.

Some test cases:

    assert mean_length_of_preceding_words("Lorem", "Lorem ipsum dolor sit amet dolor ...") == 0.0
    assert mean_length_of_preceding_words("dolor", "Lorem ipsum dolor sit amet dolor ...") == 4.5
    mean_length_of_preceding_words("E", "ABCD")  # ZeroDivisionError - average length of zero words does not exist

The splitting step (words = ...) needs to be changed if you want to handle punctuation in some way. The specification does not mention this, so I kept it plain and simple.
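If punctuation should be ignored, one possible replacement for the splitting step is sketched below. It assumes that punctuation attached to the ends of a token is not part of the word and that punctuation-only tokens should be dropped. The specification confirms neither choice, and split_words is a hypothetical helper name.

```python
import string


def split_words(text):
    """Split on whitespace and strip punctuation from both ends of each token."""
    stripped = (token.strip(string.punctuation) for token in text.split())
    # Tokens made only of punctuation (such as a lone "-") become empty; drop them.
    return [word for word in stripped if word]


print(split_words("Call me Ishmael. Some years ago - never mind"))
```

Keep in mind that the question's own example looks up 'ship.' with the full stop attached, which suggests the tests expect plain whitespace splitting.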

I don’t like changing the return type for a special case, but if you insist, you can make an early exit.

    def mean_length_of_preceding_words(word, text):
        words = text.split()
        if word not in words:
            return False
        indices = [i for i, some_word in enumerate(words) if some_word == word]
        preceding_indices = [i-1 for i in indices]
        preceding_words = [words[i] if i != -1 else "" for i in preceding_indices]
        mean = sum(len(w) for w in preceding_words) / len(preceding_words)
        return mean

The last test case changes to:

    assert mean_length_of_preceding_words("E", "ABCD") is False

+1

Łukasz Rogalski Apr 01 '16 at 13:02

This answer is based on the assumption that you want to remove all punctuation to only have words ...

I play dirty by prepending an empty string to the list of words, so that your requirement for the predecessor of the first word of the text is met.

The result is computed using some smart indexing, which numpy makes possible.

    import numpy as np


    class Preceding_Word_Length():
        def __init__(self, text):
            self.words = np.array(
                [''] + [w.strip(''',.?!'":''') for w in text.split() if w != '-'])
            self.indices = np.arange(len(self.words))
            self.lengths = np.fromiter((len(w) for w in self.words), float)

        def mean(self, word):
            if word not in self.words:
                return 0.0
            return np.average(self.lengths[self.indices[word == self.words] - 1])


    text = '''Call me Ishmael. Some years ago - never mind how long precisely -
    having little or no money in my purse, and nothing particular to interest me
    on shore, I thought I would sail about a little and see the watery part of
    the world. It is a way I have of driving off the spleen and regulating the
    circulation. Whenever I find myself growing grim about the mouth; whenever it
    is a damp, drizzly November in my soul; whenever I find myself involuntarily
    pausing before coffin warehouses, and bringing up the rear of every funeral I
    meet; and especially whenever my hypos get such an upper hand of me, that it
    requires a strong moral principle to prevent me from deliberately stepping
    into the street, and methodically knocking people hats off - then, I account
    it high time to get to sea as soon as I can. This is my substitute for pistol
    and ball. With a philosophical flourish Cato throws himself upon his sword; I
    quietly take to the ship. There is nothing surprising in this. If they but
    knew it, almost all men in their degree, some time or other, cherish very
    nearly the same feelings towards the ocean with me.'''

    ishmael = Preceding_Word_Length(text)

    print(ishmael.mean('and'))   # -> 6.28571428571
    print(ishmael.mean('Call'))  # -> 0.0
    print(ishmael.mean('xyz'))   # -> 0.0

I would like to stress that implementing this behaviour inside a class gives you simple caching of the computations that would otherwise be repeated when you analyse the same text several times.

+1

gboffi Apr 01 '16 at 13:14

Very similar to my previous answer, but without importing numpy:

    def average_length(text, word):
        words = [''] + [w.strip(''',.?!'":''') for w in text.split() if w != '-']
        if word not in words:
            return False
        match = [len(prev) for prev, curr in zip(words[:-1], words[1:]) if curr == word]
        return 1.0 * sum(match) / len(match)

+1

gboffi Apr 01 '16 at 13:42

Padraic Cunningham · Accepted Answer · 2016-04-01T11:04:31+0000

Based on your requirements of no imports and a simple approach, the following function does the job without any. The comments and variable names should make the logic of the function pretty clear:

    def match_previous(lst, word):
        # keep matches_count of how many times we find a match and total lengths
        matches_count = total_length_sum = 0.0
        # pull first element from list to use as preceding word
        previous_word = lst[0]
        # slice rest of words from the list
        # so we always compare two consecutive words
        rest_of_words = lst[1:]
        # catch where first word is "word" and add 1 to matches_count
        if previous_word == word:
            matches_count += 1
        for current_word in rest_of_words:
            # if the current word matches our "word"
            # add length of previous word to total_length_sum
            # and increase matches_count.
            if word == current_word:
                total_length_sum += len(previous_word)
                matches_count += 1
            # always update to keep track of word just seen
            previous_word = current_word
        # if matches_count is 0 we found no word in the text that matched "word"
        return total_length_sum / matches_count if matches_count else False

It takes two arguments: the list of words from the split text and the word to search for:

    In [41]: text = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."

    In [42]: match_previous(text.split(),"the")
    Out[42]: 4.4

    In [43]: match_previous(text.split(),"ship.")
    Out[43]: 3.0

    In [44]: match_previous(text.split(),"whale")
    Out[44]: False

    In [45]: match_previous(text.split(),"Call")
    Out[45]: 0.0

You can obviously do the same as in your own function: take one argument and split the text inside the function. The only time False is returned is when we found no match for the word. You can see that "Call" returns 0.0, since it is the first word in the text.
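A minimal sketch of such a one-argument version follows. As a small variation on the function above, it starts previous_word as an empty string instead of pulling the first word off the list. TEXT here is a shortened excerpt used only for illustration; the real tests use the whole first paragraph.

```python
# Shortened excerpt used only for illustration (an assumption, not the
# full test text).
TEXT = ("With a philosophical flourish Cato throws himself upon his sword; "
        "I quietly take to the ship. There is nothing surprising in this.")


def average_length(word, text=TEXT):
    """Average length of the words preceding each occurrence of `word`."""
    matches = 0
    total_length = 0
    previous_word = ""  # nothing precedes the first word, so its length is 0
    for current_word in text.split():
        if current_word == word:
            total_length += len(previous_word)
            matches += 1
        previous_word = current_word
    return total_length / matches if matches else False


print(average_length("the"))    # preceded by "to"
print(average_length("ship."))  # preceded by "the"
print(average_length("With"))   # first word of the excerpt
print(average_length("whale"))  # not in the excerpt
```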

If we add some prints to the code and use enumerate:

    def match_previous(lst, word):
        matches_count = total_length_sum = 0.0
        previous_word = lst[0]
        rest_of_words = lst[1:]
        if previous_word == word:
            print("First word matches.")
            matches_count += 1
        for ind, current_word in enumerate(rest_of_words, 1):
            print("On iteration {}.\nprevious_word = {} and current_word = {}.".format(ind, previous_word, current_word))
            if word == current_word:
                total_length_sum += len(previous_word)
                matches_count += 1
                print("We found a match at index {} in our list of words.".format(ind-1))
            print("Updating previous_word from {} to {}.".format(previous_word, current_word))
            previous_word = current_word
        return total_length_sum / matches_count if matches_count else False

And run it with a small sample list, we can see what happens:

    In [59]: match_previous(["bar","foo","foobar","hello", "world","bar"],"bar")
    First word matches.
    On iteration 1.
    previous_word = bar and current_word = foo.
    Updating previous_word from bar to foo.
    On iteration 2.
    previous_word = foo and current_word = foobar.
    Updating previous_word from foo to foobar.
    On iteration 3.
    previous_word = foobar and current_word = hello.
    Updating previous_word from foobar to hello.
    On iteration 4.
    previous_word = hello and current_word = world.
    Updating previous_word from hello to world.
    On iteration 5.
    previous_word = world and current_word = bar.
    We found a match at index 4 in our list of words.
    Updating previous_word from world to bar.
    Out[59]: 2.5

The advantage of using iter is that we do not need to create a new list by slicing off the remainder. To use it, you just need to change the start of the function:

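    def match_previous(lst, word):
        matches_count = total_length_sum = 0.0
        # create an iterator
        _iterator = iter(lst)
        # pull first word from iterator
        previous_word = next(_iterator)
        if previous_word == word:
            matches_count += 1
        # _iterator will give us all bar the first word we consumed with next(_iterator)
        for current_word in _iterator:
            # the loop body and the return are unchanged from the first version
            if word == current_word:
                total_length_sum += len(previous_word)
                matches_count += 1
            previous_word = current_word
        return total_length_sum / matches_count if matches_count else False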

Each time we consume an element from an iterator, we move on to the next element:

    In [61]: l = [1,2,3,4]

    In [62]: it = iter(l)

    In [63]: next(it)
    Out[63]: 1

    In [64]: next(it)
    Out[64]: 2

    # consumed two of four so we are left with two
    In [65]: list(it)
    Out[65]: [3, 4]
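One caveat: next raises StopIteration when the iterator is already exhausted, so the iterator version of match_previous fails on an empty list (the slicing version fails too, with an IndexError from lst[0]). Below is a minimal sketch of guarding against that with the optional default argument of next. Returning False for empty input is an assumption here, chosen to match the no-match case.

```python
def match_previous(lst, word):
    matches_count = total_length_sum = 0.0
    _iterator = iter(lst)
    # next() with a second argument returns that default when the
    # iterator is empty instead of raising StopIteration.
    previous_word = next(_iterator, None)
    if previous_word is None:
        return False  # assumption: an empty word list is treated as "no match"
    if previous_word == word:
        matches_count += 1
    for current_word in _iterator:
        if word == current_word:
            total_length_sum += len(previous_word)
            matches_count += 1
        previous_word = current_word
    return total_length_sum / matches_count if matches_count else False


print(match_previous([], "bar"))
print(match_previous(["bar", "foo", "foobar", "hello", "world", "bar"], "bar"))
```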

The only way this really makes sense is if your function takes several words, which you can do with *args:

    def sum_previous(text):
        _iterator = iter(text.split())
        previous_word = next(_iterator)
        # set first k/v pairing with the first word
        # if "total_lengths" is 0 at the end we know there
        # was only one match at the very start
        avg_dict = {previous_word: {"count": 1.0, "total_lengths": 0.0}}
        for current_word in _iterator:
            # if key does not exist, it creates a new key/value pairing
            avg_dict.setdefault(current_word, {"count": 0.0, "total_lengths": 0.0})
            # update value adding word length and increasing the count
            avg_dict[current_word]["total_lengths"] += len(previous_word)
            avg_dict[current_word]["count"] += 1
            previous_word = current_word
        # return the dict so we can use it outside the function.
        return avg_dict


    def match_previous_generator(*args):
        # create our dict mapping words to sum of all lengths of their preceding words.
        # note: text is the module-level variable defined earlier
        d = sum_previous(text)
        # for every word we pass to the function.
        for word in args:
            # use dict.get with a default of an empty dict
            # to catch when a word is not in our text.
            count = d.get(word, {}).get("count")
            # yield each word and its avg or False for non existing words.
            yield (word, d[word]["total_lengths"] / count if count else False)

Then just pass all the words you want to look up (the generator reads the text variable defined earlier). You can call list on the generator function.
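A minimal sketch of the list call follows. It is a variant that takes the text as an explicit first parameter instead of reading a global, and it starts previous_word as an empty string; both changes, and the short sample sentence, are assumptions made to keep the example self-contained.

```python
def sum_previous(text):
    """Map each word to its occurrence count and total preceding-word length."""
    totals = {}
    previous_word = ""  # the first word is preceded by nothing (length 0)
    for current_word in text.split():
        entry = totals.setdefault(current_word, {"count": 0, "total_lengths": 0})
        entry["total_lengths"] += len(previous_word)
        entry["count"] += 1
        previous_word = current_word
    return totals


def match_previous_generator(text, *words):
    # Variant: the text is passed in explicitly rather than read from a global.
    totals = sum_previous(text)
    for word in words:
        entry = totals.get(word)
        yield (word, entry["total_lengths"] / entry["count"] if entry else False)


sample = "I quietly take to the ship."  # short excerpt for illustration
result = list(match_previous_generator(sample, "the", "I", "whale", "ship."))
print(result)
```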

Or iterate over it:

    In [70]: for tup in match_previous_generator("the","Call", "whale", "ship."):
       ....:     print(tup)
       ....:
    ('the', 4.4)
    ('Call', 0.0)
    ('whale', False)
    ('ship.', 3.0)
