Add items from shuffled list to new list

For a text classification project (age) I am making a subset of my data. I made 3 lists of file names, sorted by age. I want to shuffle these lists and then add 5000 file names from each shuffled list to a new list. The result should be a subset of the data with 15,000 files (5000 10s, 5000 20s, 5000 30s). Below you can see what I have written so far. But I know that random.shuffle returns None, and an object of type None is not iterable. How can I solve this problem?

    def seed():
        return 0.47231099848

    teens = [list of files]
    tweens = [list of files]
    thirthies = [list of files]

    data = []
    for categorie in random.shuffle([teens, tweens, thirthies],seed):
        data.append(teens[:5000])
        data.append(tweens[:5000])
        data.append(thirthies[:5000])
4 answers

You are correct that random.shuffle returns None. That is because it shuffles its list argument in place, and it is a Python convention that functions which take a mutable argument and mutate it return None. However, you misunderstand the random argument of random.shuffle: it should be a random number generator, not a function like your seed, which always returns the same number.
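To make the in-place behaviour concrete, here is a minimal sketch. The file names are made up for illustration and are not from the question.

```python
import random

# Hypothetical stand-in for a list of file names.
names = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]
original = names[:]  # keep a copy to compare against

result = random.shuffle(names)

print(result)                             # None: shuffle has no useful return value
print(sorted(names) == sorted(original))  # True: same items, only the order may differ
```

Also note that the optional second argument of random.shuffle was deprecated in Python 3.9 and removed in Python 3.11, so on current versions the one-argument form is the only one available.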

BTW, you can seed the standard random number generator provided by the random module using its seed function. random.seed takes any hashable object as its argument, although it is customary to pass it a number or a string. You can also pass it None (which is equivalent to not passing an argument at all), and it will seed the randomizer from a system random source (if there is no system random source, the system time is used as the seed). If you don't explicitly call seed after importing the random module, that is equivalent to calling seed().

The advantage of supplying a seed is that every time you run the program with the same seed, the random numbers generated by the various functions of the random module will be exactly the same. This is very useful when developing and debugging your code: it can be difficult to track down errors when the output keeps changing. :)
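A small sketch of that reproducibility follows. The helper sample_with_seed and the file names are hypothetical, chosen only for this illustration.

```python
import random

def sample_with_seed(seed, population, k):
    """Reseed the module-level generator, then draw k items."""
    random.seed(seed)
    return random.sample(population, k)

# Hypothetical file names.
names = [f"file_{i}.txt" for i in range(100)]

first = sample_with_seed(42, names, 5)
second = sample_with_seed(42, names, 5)

print(first == second)  # True: same seed, same selection
```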


There are two main ways to do what you want. You can shuffle the lists and then slice the first 5000 file names from them. Or you can use the random.sample function to get 5,000 random samples. This way you don't have to shuffle the whole list.

    import random

    random.seed(0.47231099848)

    # teens, tweens, thirties are lists of file names
    file_lists = [teens, tweens, thirties]

    # Shuffle
    data = []
    for flist in file_lists:
        random.shuffle(flist)
        data.append(flist[:5000])

Using sample

    # Sample
    data = []
    for flist in file_lists:
        data.append(random.sample(flist, 5000))
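One caveat, added here as an aside: random.sample raises ValueError when asked for more items than the list contains. If any of the lists might hold fewer than 5000 names, a guard along these lines could help. take_sample is a hypothetical helper name, and the lists are made-up stand-ins.

```python
import random

def take_sample(flist, k=5000):
    """Return k random items from flist, or all of them in random order if it is shorter than k."""
    return random.sample(flist, min(k, len(flist)))

# Hypothetical stand-ins for a short and a long list of file names.
small = [f"file_{i}.txt" for i in range(10)]
large = [f"file_{i}.txt" for i in range(20000)]

print(len(take_sample(small)))  # 10
print(len(take_sample(large)))  # 5000
```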

I have not performed speed tests on this code, but I suspect that sample will be faster, since it just needs to randomly select items rather than move all the items in the list. shuffle is pretty efficient, though, so you probably won't notice much difference in run time unless your teens, tweens and thirties file lists each have a lot more than 5000 file names.
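If you want to check this on your own machine, a rough timing sketch with timeit might look like the following. The list of integers is a hypothetical stand-in for the file names, and the shuffle version copies the list first so that each run starts from the same input, which adds some cost of its own. The numbers will vary by machine and Python version.

```python
import random
import timeit

# Hypothetical stand-in for one list of file names.
population = list(range(100_000))
k = 5000

def by_shuffle():
    copy = population[:]   # copy so the original order is left untouched
    random.shuffle(copy)
    return copy[:k]

def by_sample():
    return random.sample(population, k)

shuffle_time = timeit.timeit(by_shuffle, number=5)
sample_time = timeit.timeit(by_sample, number=5)

print(f"shuffle + slice: {shuffle_time:.4f} s")
print(f"sample:          {sample_time:.4f} s")
```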

Both of these loops make data a nested list containing 3 sublists, with 5,000 file names in each sublist. However, if you want it to be a flat list of 15,000 file names, you just need to use the list.extend method instead of list.append. For example,

    data = []
    for flist in file_lists:
        data.extend(random.sample(flist, 5000))

Or we can do this using a list comprehension with a double for loop:

    data = [fname for flist in file_lists
            for fname in random.sample(flist, 5000)]
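The difference between append and extend is easiest to see on a small scale. In this sketch the three short lists and k = 5 are hypothetical stand-ins for the real lists and 5000.

```python
import random

# Hypothetical small lists standing in for teens, tweens, thirties.
file_lists = [
    [f"teen_{i}" for i in range(20)],
    [f"tween_{i}" for i in range(20)],
    [f"thirty_{i}" for i in range(20)],
]
k = 5  # stands in for 5000

nested, flat = [], []
for flist in file_lists:
    picked = random.sample(flist, k)
    nested.append(picked)  # adds the whole sublist as a single element
    flat.extend(picked)    # adds each file name individually

print(len(nested))  # 3
print(len(flat))    # 15
```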

If you need to filter the contents of data to create a final list of files, the easiest way is to add an if condition to the list comprehension.

Let's say we have a function that can test whether a file name is one that we want to keep:

    def keep_file(fname):
        # if we want to keep fname, return True, otherwise return False
        ...

Then we can do

    data = [fname for flist in file_lists
            for fname in random.sample(flist, 5000)
            if keep_file(fname)]

and data will only contain the names of the files that pass the keep_file test.

Another way to do this is to produce the file names using a generator expression instead of a list comprehension, and then pass that to the built-in filter function:

    data_gen = filter(keep_file, (fname for flist in file_lists
                                  for fname in random.sample(flist, 5000)))

data_gen itself is an iterator. You can create a list from it as follows:

    data_final = list(data_gen)

Or, if you really do not need all the names in a collection, and you can just process them one by one, you can put it in a for loop, for example:

    for fname in data_gen:
        print(fname)
        # Do other stuff with fname

In this case, less RAM is used, but the disadvantage is that the loop "consumes" the file names, so after the for loop has finished, data_gen will be empty.
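Here is a short sketch of that one-shot behaviour in Python 3, where filter returns an iterator. The keep_file rule and the file names are hypothetical, chosen only to make the example runnable.

```python
def keep_file(fname):
    # Hypothetical rule for illustration: keep only .txt files.
    return fname.endswith(".txt")

names = ["a.txt", "b.csv", "c.txt"]
data_gen = filter(keep_file, names)

first_pass = list(data_gen)   # consumes the iterator
second_pass = list(data_gen)  # nothing is left to yield

print(first_pass)   # ['a.txt', 'c.txt']
print(second_pass)  # []
```

If the names are needed more than once, building a list up front (as with data_final above) avoids the problem at the cost of holding everything in memory.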

Suppose you wrote a function that retrieves the necessary data from each file:

    def age_and_text(fname):
        # Do stuff that extracts the age and desired text from the file
        return fname, age, text

You can create a list of these tuples (filename, age, text) as follows:

    data_gen = (fname for flist in file_lists
                for fname in random.sample(flist, 5000)
                if keep_file(fname))
    final_data = [age_and_text(fname) for fname in data_gen]

Notice the slice in my first snippet: flist[:5000]. This takes the first 5000 elements of flist, the elements with indices from 0 to 4999 inclusive. Your version had teens[:5001], which is an "off by one" error. Slices work just like ranges. Thus, range(5000) gives the 5000 numbers from 0 to 4999. This works because Python (like most modern programming languages) uses zero-based indexing.
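The parallel between slices and ranges can be checked on a small list. This sketch uses a list of ten integers as a stand-in for the file names.

```python
items = list(range(10))  # [0, 1, ..., 9]

first_five = items[:5]
print(first_five)                      # [0, 1, 2, 3, 4]
print(list(range(5)))                  # [0, 1, 2, 3, 4]
print(len(items[:5]), len(items[:6]))  # 5 6: a stop of n + 1 yields one element too many
```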


The first problem is that you are shuffling a list of 3 elements, [teens, tweens, thirthies] (each of which is itself a list), instead of shuffling each sublist.

Secondly, you can use random.sample instead of random.shuffle:

    for categ in [teens, tweens, thirthies]:
        data.append(random.sample(categ, 5000))

or, as @JonClements suggested in the comments, you can use a list comprehension:

    categories = [teens, tweens, thirthies]
    data = [e for categ in categories for e in random.sample(categ, 5000)]

shuffle returns None, which is not iterable.

You have to do:

    data = []
    for category in [teens, tweens, thirthies]:
        category_copy = category[:]
        random.shuffle(category_copy, seed)
        data.append(category_copy[:5000])

random.shuffle changes the list itself (it shuffles it in place). It sounds like you want something like this:

    teens = [list of files]
    tweens = [list of files]
    thirthies = [list of files]

    random.shuffle(teens)
    random.shuffle(tweens)
    random.shuffle(thirthies)

    data = []
    for categorie in [teens, tweens, thirthies]:
        data.append(categorie[:5000])

BTW, somelist[:n] gives the first n elements of the list. Check this:

    >>> [1,2,3,4,5][:3]
    [1, 2, 3]
