Unable to understand a line in Programming Collective Intelligence

I am working through Programming Collective Intelligence. In chapter 4, Toby Segaran builds an artificial neural network. The following function appears in the book:

    def generatehiddennode(self,wordids,urls):
        if len(wordids)>3: return None
        # Check if we already created a node for this set of words
        sorted_words=[str(id) for id in wordids]
        sorted_words.sort()
        createkey='_'.join(sorted_words)
        res=self.con.execute(
            "select rowid from hiddennode where create_key='%s'" % createkey).fetchone()

        # If not, create it
        if res==None:
            cur=self.con.execute(
                "insert into hiddennode (create_key) values ('%s')" % createkey)
            hiddenid=cur.lastrowid
            # Put in some default weights
            for wordid in wordids:
                self.setstrength(wordid,hiddenid,0,1.0/len(wordids))
            for urlid in urls:
                self.setstrength(hiddenid,urlid,1,0.1)
            self.con.commit()

What I cannot understand is the reason for the first line in this function: "if len(wordids)>3: return None". Is this debugging code that should be removed later?

P.S. This is not homework.

3 answers

For a published book, this is pretty awful code! (All the examples for the book are available for download; the corresponding file is chapter4/nn.py.)

  • No docstring. What is this function supposed to do? From its name one can guess that it generates one of the nodes in the "hidden layer" of a neural network, but what roles do wordids and urls play?
  • The database query uses string substitution and is therefore vulnerable to SQL injection attacks (especially since this is about web search, so the wordids probably come from a user query and may be untrustworthy; then again, they are probably IDs rather than words, so in practice it may be fine, but it is still a very bad habit to get into).
  • It does not use the expressive power of the database: if all you want to do is determine whether a key exists in the database, you probably want to use SELECT EXISTS(...) rather than asking the database to send you a bunch of records that you will then ignore.
  • The function does nothing if there is already a record with createkey. No error is raised. Is that correct? Who can say?
  • The weights for the words are scaled to the number of words, but the weights for the URLs are a constant 0.1 (maybe there are always 10 URLs, but it would be better to scale by len(urls) here).
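As a rough illustration of the second and third points, here is a minimal sketch using the standard sqlite3 module. The helper names node_exists and insert_node are hypothetical, and the one-column hiddennode table is an assumption made for the example, not the book's code.

```python
import sqlite3

# In-memory database with an assumed one-column hiddennode table.
con = sqlite3.connect(":memory:")
con.execute("create table hiddennode (create_key)")

def node_exists(con, createkey):
    """Return True if a hidden node with this create_key is stored."""
    # The ? placeholder makes sqlite3 bind the value as data,
    # so it is never parsed as part of the SQL statement.
    row = con.execute(
        "select exists(select 1 from hiddennode where create_key = ?)",
        (createkey,),
    ).fetchone()
    return bool(row[0])

def insert_node(con, createkey):
    """Insert a hidden node and return its rowid."""
    cur = con.execute(
        "insert into hiddennode (create_key) values (?)", (createkey,)
    )
    con.commit()
    return cur.lastrowid

# Build the key the same way the book does: IDs as strings, sorted, joined.
createkey = "_".join(sorted(str(i) for i in [12, 7, 3]))
before = node_exists(con, createkey)
hiddenid = insert_node(con, createkey)
after = node_exists(con, createkey)
print(before, hiddenid, after)
```

With bound parameters, a key containing quote characters is simply compared as a string instead of changing the query.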

I could go on and on, but it's better not to.

In any case, to answer your question, it looks like this function adds a database entry for a node in the hidden layer of a neural network. This neural network has, I think, words in the input layer and URLs in the output layer. The idea of the application is to train the neural network to find good search results (URLs) based on the words in the query. See the trainquery function, which takes the arguments (wordids, urlids, selectedurl). Presumably (since there is no docstring, I have to guess) wordids are the words the user searched for, urlids are the URLs that the search engine offered to the user, and selectedurl is the one that the user selected. The idea is to train the neural network to better predict which URLs users will select, and therefore to rank those URLs higher in future search results.

Thus, the mysterious line of code prevents the creation of hidden-layer nodes with links to more than three nodes in the input layer. In the context of a search application, this makes sense: there is no point in training the network on overly specialized queries, because those queries will not recur often enough for the training to be worthwhile.


You should probably have posted a little more context for the code. Here is the paragraph in Programming Collective Intelligence that immediately precedes this code:

This function will create a new node in the hidden layer every time it is passed a combination of words that it has never seen together before. The function then creates default-weighted links between the words and the hidden node, and between the query node and the URL results returned by this query.

I realize that this still does not answer your question, but it would have helped Gareth Rhys with his answer by leaving him less to guess. In any case, Gareth still got it right, because he is smart. The goal is to limit the number of word nodes that a hidden node can be linked to, and the author chose the arbitrary number 3.

Just to agree with Gareth again: that paragraph should have been included in full in the docstring, and the purpose of this line should have been explained in a comment above it. I hope the next edition is not so messy.
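A minimal sketch of what that documentation could look like, applied only to the key-building part of the function. The names MAX_WORDS_PER_HIDDEN_NODE and hidden_node_key are hypothetical, and the rationale in the comment is the interpretation given in these answers, not something the book states.

```python
# Hypothetical name; the book hard-codes the value 3 inline.
MAX_WORDS_PER_HIDDEN_NODE = 3

def hidden_node_key(wordids):
    """Return the create_key identifying the hidden node for wordids.

    The key is the word IDs converted to strings, sorted, and joined
    with underscores, so the same combination of words always maps to
    the same hidden node regardless of order.

    Returns None when there are more than MAX_WORDS_PER_HIDDEN_NODE
    words, in which case no hidden node should be created.
    """
    # Assumed rationale: long queries rarely repeat, so a dedicated
    # hidden node for them would almost never be trained again.
    if len(wordids) > MAX_WORDS_PER_HIDDEN_NODE:
        return None
    return "_".join(sorted(str(wordid) for wordid in wordids))

print(hidden_node_key([4, 2, 3]))
print(hidden_node_key([2, 3, 4, 5]))
```

Naming the limit also gives a reader one obvious place to change it.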


To expand on the comments above, take a look at this simple script:

    def doSomething(wordids):
        if len(wordids)>3: return None
        print("The rest of the function executes")

    blah = [2,3,4]; doSomething(blah)
    blah = [2,3,4,5]; doSomething(blah)

So if wordids contains more than 3 items, the function does nothing: only the first call prints the message. It is common to check the inputs to a function like this, but in more complex cases errors are usually handled with exceptions.
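For comparison, here is a sketch of the exception-based style of input checking. The function name do_something and the max_words parameter are made up for this example; the book's function returns None silently instead.

```python
def do_something(wordids, max_words=3):
    """Process wordids, rejecting inputs with too many words."""
    if len(wordids) > max_words:
        # Raising makes the rejected input visible to the caller,
        # whereas "return None" skips the work silently.
        raise ValueError(
            "expected at most %d word ids, got %d" % (max_words, len(wordids))
        )
    return "The rest of the function executes"

results = []
for blah in ([2, 3, 4], [2, 3, 4, 5]):
    try:
        results.append(do_something(blah))
    except ValueError as exc:
        results.append("rejected: %s" % exc)

for line in results:
    print(line)
```

Whether an exception is appropriate depends on whether a long query counts as an error; in the book's function it is arguably just a case to skip.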
