You are correct that random.shuffle returns None. That is because it shuffles its list argument in place, and it is a Python convention that functions which take a mutable argument and mutate it return None. However, you misunderstand the random argument of random.shuffle: it has to be a random number generator (a function that returns a fresh random float on each call), not a function like your seed, which always returns the same number. Note that this optional argument was deprecated in Python 3.9 and removed in 3.11.
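To make the in-place behaviour concrete, here is a minimal sketch. The file names are made up for illustration.

```python
import random

names = ["a.txt", "b.txt", "c.txt", "d.txt"]  # made-up file names
original = list(names)                        # copy kept for comparison

result = random.shuffle(names)  # reorders names in place

print(result)                             # None
print(sorted(names) == sorted(original))  # True: same items, possibly reordered
```

So the shuffled data lives in names itself; assigning the return value to a variable just gives you None.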
BTW, you can seed the standard random number generator provided by the random module using its seed function. On Python 3.11 and later, random.seed accepts None, an int, a float, a str, bytes, or a bytearray (older versions accepted any hashable object); it is customary to pass it a number or a string. If you pass it None, which is the same as passing no argument at all, it seeds the generator from the system's randomness source, or from the system time if no such source is available. If you never call seed explicitly after importing the random module, that is equivalent to calling seed().
The advantage of supplying a seed is that every time you run the program with the same seed, the random numbers generated by the various functions of the random module will be exactly the same. This is very useful when developing and debugging your code: it can be hard to track down errors when the output keeps changing. :)
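A small sketch of that reproducibility. The seed value 12345 is arbitrary and only chosen for illustration.

```python
import random

random.seed(12345)  # arbitrary seed value
first_run = [random.randint(0, 99) for _ in range(5)]

random.seed(12345)  # reseed with the same value
second_run = [random.randint(0, 99) for _ in range(5)]

print(first_run == second_run)  # True
```

Reseeding with the same value restarts the generator from the same state, so both lists hold the same numbers.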
There are two main ways to do what you want. You can shuffle each list and then slice the first 5000 file names from it. Or you can use the random.sample function to pick 5000 random file names from each list; that way you don't have to shuffle the whole list.
    import random

    random.seed(0.47231099848)
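The shuffle-and-slice approach could be sketched roughly like this. The three lists and their contents are stand-ins invented for the example; file_lists is assumed to be a list holding your three lists of file names.

```python
import random

# Stand-in data (assumption): three made-up lists of file names.
teens = [f"teens_{i}.txt" for i in range(6000)]
group_b = [f"group_b_{i}.txt" for i in range(7000)]
group_c = [f"group_c_{i}.txt" for i in range(8000)]
file_lists = [teens, group_b, group_c]

# Shuffle
data = []
for flist in file_lists:
    random.shuffle(flist)      # reorders flist in place, returns None
    data.append(flist[:5000])  # keep the first 5000 shuffled names
```

Be aware that this changes the order of the original lists, which may or may not matter for the rest of your program.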
Using sample
    # Sample
    data = []
    for flist in file_lists:
        data.append(random.sample(flist, 5000))
I have not run speed tests on this code, but I suspect that sample will be faster, since it only needs to pick items at random rather than move every item in the list. shuffle is fairly efficient, though, so you probably won't notice much difference in run time unless each of your three file lists has far more than 5000 file names.
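If you want to check this yourself, something like the following timeit sketch should do. The list size of 100,000 and the helper names by_shuffle and by_sample are assumptions made for the example, and the numbers will vary from machine to machine.

```python
import random
import timeit

# Made-up list of 100,000 file names; your real lists may differ in size.
flist = [f"file_{i}.txt" for i in range(100_000)]

def by_shuffle():
    copy = list(flist)  # copy so every run starts from the same order
    random.shuffle(copy)
    return copy[:5000]

def by_sample():
    return random.sample(flist, 5000)

shuffle_time = timeit.timeit(by_shuffle, number=10)
sample_time = timeit.timeit(by_sample, number=10)
print(f"shuffle + slice: {shuffle_time:.4f} s")
print(f"sample:          {sample_time:.4f} s")
```

Note that by_shuffle also pays for copying the list, which the in-place version in your program would not need, so treat the comparison as a rough guide only.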
Both of these loops make data a nested list containing 3 sublists, with 5000 file names in each sublist. However, if you want it to be a flat list of 15000 file names, you just need to use the list.extend method instead of list.append. For example,
    data = []
    for flist in file_lists:
        data.extend(random.sample(flist, 5000))
Or we can do this using a double for list comprehension:
data = [fname for flist in file_lists for fname in random.sample(flist, 5000)]
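To see the difference between append and extend, here is a tiny sketch with made-up lists that are short enough to inspect by eye.

```python
import random

# Tiny made-up lists so the resulting shapes are easy to see.
file_lists = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]

nested = []
flat = []
for flist in file_lists:
    picked = random.sample(flist, 2)
    nested.append(picked)  # adds the list itself: one new element
    flat.extend(picked)    # adds each name: two new elements

print(len(nested))  # 3
print(len(flat))    # 6
```

nested ends up as a list of 3 lists, while flat is a single list of 6 strings.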
If you need to filter the contents of data to create a final list of files, the easiest way is to add an if condition to the list comprehension.
Let's say we have a function that tests whether a file name is one we want to keep:
    def keep_file(fname):
        ...  # placeholder: return True to keep fname, otherwise False
Then we can do
data = [fname for flist in file_lists for fname in random.sample(flist, 5000) if keep_file(fname)]
and data will only contain the names of the files that pass the keep_file test.
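As a concrete illustration, here is a sketch with a hypothetical keep_file that keeps only names ending in ".txt". The rule and the file names are invented for the example; your real test will be different.

```python
import random

def keep_file(fname):
    # Hypothetical rule: keep only names ending in ".txt".
    return fname.endswith(".txt")

# Made-up file names.
file_lists = [
    ["a.txt", "b.log", "c.txt", "d.log"],
    ["e.txt", "f.txt", "g.log", "h.log"],
]

data = [fname
        for flist in file_lists
        for fname in random.sample(flist, 3)
        if keep_file(fname)]

print(data)
```

Because the test is applied after sampling, data can hold fewer names than were sampled.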
Another way to do this is to produce the file names with a generator expression instead of a list comprehension, and then pass that to the built-in filter function:
data_gen = filter(keep_file, (fname for flist in file_lists for fname in random.sample(flist, 5000)))
data_gen itself is an iterator. You can create a list from it as follows:
data_final = list(data_gen)
Or, if you really do not need all the names in a collection, and you can just process them one by one, you can put it in a for loop, for example:
    for fname in data_gen:
        print(fname)
This uses less RAM, but the downside is that the loop "consumes" the file names, so after the for loop has finished, data_gen will be empty.
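A quick sketch of what "consumed" means here, using a made-up list of three names:

```python
names = ["a.txt", "b.txt", "c.txt"]          # made-up file names
data_gen = (fname for fname in names)        # generator expression

first_pass = [fname for fname in data_gen]   # consumes every item
second_pass = [fname for fname in data_gen]  # nothing left to yield

print(first_pass)   # ['a.txt', 'b.txt', 'c.txt']
print(second_pass)  # []
```

If you need to go over the names more than once, build a list from the generator first.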
Suppose you wrote a function that retrieves the necessary data from each file:
    def age_and_text(fname):
        ...  # placeholder: extract the data and return (fname, age, text)
You can create a list of these tuples (filename, age, text) as follows:
    data_gen = (fname for flist in file_lists for fname in random.sample(flist, 5000) if keep_file(fname))
    final_data = [age_and_text(fname) for fname in data_gen]
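I don't know what your files look like, so the following is only a sketch under an invented assumption: each file holds the age on its first line and the text on the remaining lines. The temporary files exist purely to make the example runnable.

```python
import os
import tempfile

def age_and_text(fname):
    # Assumed layout (hypothetical): first line is the age, the rest is the text.
    with open(fname, encoding="utf-8") as f:
        age = int(f.readline())
        text = f.read()
    return fname, age, text

with tempfile.TemporaryDirectory() as tmpdir:
    # Write three small made-up files to read back.
    paths = []
    for i, age in enumerate([15, 24, 33]):
        path = os.path.join(tmpdir, f"post_{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(f"{age}\nsome text {i}\n")
        paths.append(path)

    final_data = [age_and_text(path) for path in paths]

for fname, age, text in final_data:
    print(os.path.basename(fname), age, repr(text))
```

Adapt the body of age_and_text to however the age and text are actually stored in your files.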
Notice the slice in my first snippet: flist[:5000]. It takes the first 5000 elements of flist, those with indices 0 to 4999 inclusive. Your version had teens[:5001], which is an off-by-one error. Slices work just like ranges: range(5000) yields the 5000 numbers from 0 to 4999. This works because Python (like most modern programming languages) uses zero-based indexing.
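You can check the slice lengths quickly with a made-up list of 6000 names:

```python
flist = [f"file_{i}.txt" for i in range(6000)]  # made-up file names

print(len(flist[:5000]))  # 5000 (indices 0 to 4999)
print(len(flist[:5001]))  # 5001 (indices 0 to 5000): one too many
print(flist[:5000][-1])   # file_4999.txt
print(len(range(5000)))   # 5000
```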