What is the fastest-performing data structure for large datasets in Python?

Right now, I'm basically looping through an Excel sheet.

I have about 20 names, and about 50k values that each correspond to one of those 20 names, so the Excel sheet is 50,000 rows long: column B holds some value, and column A holds one of the 20 names.

I am trying to build, for each name, a string that lists all of its values.

Name A: 123,244,123,523,123,5523,12505,142... etc etc.
Name B: 123,244,123,523,123,5523,12505,142... etc etc.

Currently, I have created a dictionary and I loop through the Excel sheet, check whether the name is already in the dictionary, and if it is, I do

 strA = strA + "," + foundValue 

It then puts strA back into the dictionary for that particular name. If the name does not exist, it creates the dictionary key and then adds the value to it.

It was working well at first... but it has now been running for about 15 or 20 minutes, only about 5k of the values have been added to the dictionary so far, and it keeps getting slower the longer it runs.

I wonder if there is a better or faster way to do this. I thought about creating a new dictionary every 1,000 values and then combining them all at the end... but that would be 50 dictionaries, and it sounds complicated... although maybe not... I'm not sure it would even work better.

I need a string that shows each value with a comma between them. That's why I'm doing the string manipulation right now.

0
python dictionary
5 answers

There are several things that can cause your program to run slowly.

String concatenation in Python can be extremely inefficient when used with large strings.

Strings in Python are immutable. This fact often sneaks up and bites novice Python programmers on the rump. Immutability confers some advantages and disadvantages. In the plus column, strings can be used as keys in dictionaries, and individual copies can be shared among multiple variable bindings. (Python automatically interns one- and two-character strings.) In the minus column, you can't say something like "change all the 'a's to 'b's" in a given string. Instead, you have to create a new string with the desired properties. This continual copying can lead to significant inefficiency in Python programs.

Given that each string in your example could contain thousands of characters, every time you perform a concatenation Python has to copy that giant string over in memory to create a new object.

This will be much more efficient:

    strings = []
    strings.append('string')
    strings.append('other_string')
    ...
    ','.join(strings)

In your case, instead of having each dictionary key store a massive string, it should store a list; you just append each match to the list, and only at the very end do you perform the string concatenation with str.join.
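A minimal sketch of that approach, assuming the sheet has already been read into (name, value) pairs (the variable name rows below is a placeholder, not something from the original post):

    results = {}
    for name, found_value in rows:       # rows: iterable of (name, value) pairs
        if name not in results:
            results[name] = []           # one list per name instead of one growing string
        results[name].append(str(found_value))

    # join once per name, at the very end
    final_strings = {name: ",".join(values) for name, values in results.items()}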

In addition, printing to stdout is also notoriously slow. If you print to stdout on each iteration of your massive 50,000-element loop, each iteration is held up by an unbuffered write to stdout. Consider printing only every nth iteration, or perhaps writing to a file instead (file writes are usually buffered) and tailing the file from another terminal.
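For example, a small sketch of throttled progress output (purely illustrative; i and rows are assumed to come from your own loop):

    for i, row in enumerate(rows):
        # ... process the row ...
        if i % 1000 == 0:                # report progress only every 1000th iteration
            print("processed", i, "rows")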

+2

This answer is based on the OP's reply to my comment. I asked what he was going to do with the dict, suggesting that perhaps he didn't need to build it in the first place. @simon replied:

I will add it to the excel sheet, so I take the KEY, which is the name, and put it in A1, then I take the VALUE, which is 1345,345,135,346,3451,35.. etc. etc., and put it in A2. Then I do the rest of my programming with this information...... but I need these values separated by commas and accessible inside that sheet like this!

So it seems the dict doesn't really need to be built after all. Here is an alternative: for each name, create a file, and keep those file objects in a dict:

    files = {}
    name = 'John'  # for example
    if name not in files:
        files[name] = open(name, 'w')

Then, as you loop over the 50k Excel rows, you do something like this (pseudocode):

    for row in rows_50k:
        name, value_string = row.split()  # or however you parse the row
        file = files[name]
        file.write(value_string + ',')    # if it already ends with ',', no need to add one

Since your value_string is already comma-separated, your file will be CSV-like without any extra work on your part (except that you may want to strip the final trailing comma when you're done). Then, when you need the values for, say, John, it's just value = open('John').read().
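If you go this route, remember to close the files once the loop is finished; a small sketch (the trailing-comma cleanup is optional, and the file name 'John' just follows the example above):

    # after processing all 50k rows
    for f in files.values():
        f.close()

    # read back one name's values and drop the trailing comma
    values = open('John').read().rstrip(',')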

Now, I have never worked with Excel sheets of 50k rows, but I would be very surprised if this weren't much faster than what you currently have. Having persistent data is also (well, maybe) a plus.


EDIT:

The above is a memory-oriented solution. Writing to files is much slower than appending to lists (but probably still faster than re-creating many large strings). But if the lists are huge (which seems likely) and you run into a memory problem (not saying you will), you can try the file approach.

An alternative, similar in performance to lists (at least for the toy test I tried), is to use StringIO:

    from io import StringIO  # Python 2: from StringIO import StringIO

    string_ios = {'John': StringIO()}  # a dict to store StringIO objects
    for value in ['ab', 'cd', 'ef']:
        string_ios['John'].write(value + ',')
    print(string_ios['John'].getvalue())

This will output 'ab,cd,ef,'

+1

Instead of building a string that looks like a list, use an actual list and create the string representation you want once you're done.
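In other words, a minimal sketch (values_for_name is a stand-in for wherever the values come from):

    values = []
    for v in values_for_name:
        values.append(str(v))      # collect into a real list
    result = ",".join(values)      # build the comma-separated string once, at the end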

0

It depends on how you read the Excel file, but let's say the rows are read in as tuples of (name, value) or something similar:

    d = {}
    for name, foundValue in line_tuples:
        try:
            d[name].append(foundValue)
        except KeyError:
            d[name] = [foundValue]
    d = {k: ",".join(v) for k, v in d.items()}

Alternatively, using pandas:

    import pandas as pd

    df = pd.read_excel("some_excel_file.xlsx")
    # astype(str) in case column B holds numbers rather than strings
    d = df.groupby("A")["B"].apply(lambda s: ",".join(s.astype(str))).to_dict()
0

The right way is to collect the pieces in lists and join them at the end, but if for some reason you want to stick with strings, you can speed up extending them: pull the string out of the dict first so that there is only one reference to it, which allows CPython's in-place concatenation optimization to kick in.

Demo:

    >>> timeit('s = d.pop(k); s = s + "y"; d[k] = s', 'k = "x"; d = {k: ""}')
    0.8417842664330237
    >>> timeit('s = d[k]; s = s + "y"; d[k] = s', 'k = "x"; d = {k: ""}')
    294.2475278390723
0
