Python - A hack to process a long list of data faster
I learned a hack from a colleague for processing a long list of data faster. It's just great. Here is the case:
Assume I have a list with thousands of users' info, and I need to send each user an email and do some other stuff with those users.
1. First, while looping through the users, translate each username to a number with:
def str_to_num(input_str):
    return_num = 0
    for ch in input_str:
        return_num += ord(ch)
    return return_num
2. Take that number modulo 10, which always gives a result from 0 to 9:
num = translated_username % 10
3. If the result is one of the condition numbers passed in when I run the script, do the real work for that user (email,...):
if num in condition_number:
    process()
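To make the three steps concrete, here is a minimal sketch that applies them to a few usernames. The usernames, the helper name `bucket_of`, and the chosen condition numbers are assumptions made up for illustration; they are not part of the original script.

```python
def str_to_num(input_str):
    """Sum the code points of every character in the string."""
    return sum(ord(ch) for ch in input_str)


def bucket_of(username):
    """Map a username to one of ten buckets, 0 through 9 (hypothetical helper)."""
    return str_to_num(username) % 10


# Hypothetical usernames, not taken from any real data set.
usernames = ["alice", "bob", "carol"]
condition_number = {0, 1}  # the buckets this particular run is responsible for

for name in usernames:
    bucket = bucket_of(name)
    # Only users whose bucket is in condition_number would be processed.
    print(name, bucket, bucket in condition_number)
```

A given username always lands in the same bucket, so runs that are started with different condition numbers never pick the same user.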
The main function:
import csv
import sys

...

def get_dict_data_from_csv_file(csv_file_path):
    """Read the CSV file and return its rows as a list of dicts."""
    # newline='' lets the csv module handle line endings itself (Python 3).
    with open(csv_file_path, newline='') as csv_file:
        # Guess the delimiter from the first 10000 characters, then rewind.
        sniffdialect = csv.Sniffer().sniff(csv_file.read(10000), delimiters='\t,;')
        csv_file.seek(0)
        dict_reader = csv.DictReader(csv_file, dialect=sniffdialect)
        return list(dict_reader)


def main_function(src_data, condition_number):
    for dat in src_data:
        # Number from 0 to 9 derived from the username.
        num = str_to_num(dat['username']) % 10
        if num in condition_number:
            process()

...

if __name__ == '__main__':
    src_data = get_dict_data_from_csv_file(sys.argv[1])
    # A set, so the membership test works for every user
    # (in Python 3 a bare map() would be used up after the first test).
    condition_number = set(map(int, sys.argv[2].split(',')))
    main_function(src_data, condition_number)
Now I can split the whole list of data and process the parts simultaneously, because the input condition can be any group of numbers from 0 to 9:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9
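How even the split is depends on the usernames, since the number is only a sum of character codes. A quick sketch like the one below can count how many users fall on each of the ten numbers before the work is divided. The function name `bucket_counts` and the generated sample usernames are assumptions for illustration only.

```python
from collections import Counter


def str_to_num(input_str):
    """Sum the code points of every character in the string."""
    return sum(ord(ch) for ch in input_str)


def bucket_counts(usernames):
    """Return a list of ten counts: how many usernames map to 0, 1, ..., 9."""
    counts = Counter(str_to_num(name) % 10 for name in usernames)
    return [counts[bucket] for bucket in range(10)]


# Hypothetical sample; in practice the names would come from the CSV file.
sample = ["user%d" % i for i in range(100)]

for bucket, count in enumerate(bucket_counts(sample)):
    print(bucket, count)
```

If some numbers turn out to be much more crowded than others, the groups of condition numbers can be chosen so that each run gets a similar share.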
For example, I can run the script in 5 different terminal windows, each with a different group of condition numbers:
Terminal #0:
$ python myscript /home/trinh/Documents/data.csv 0,1
Terminal #1:
$ python myscript /home/trinh/Documents/data.csv 2,3
Terminal #2:
$ python myscript /home/trinh/Documents/data.csv 4,5
Terminal #3:
$ python myscript /home/trinh/Documents/data.csv 6,7
Terminal #4:
$ python myscript /home/trinh/Documents/data.csv 8,9
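Each run handles only its own share of the users, so the job can finish up to about 5 times faster than a single run, as long as the usernames spread fairly evenly over the ten numbers and the machine (and the mail server) can keep up with five processes at once.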
Awesome!
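Opening five terminals by hand is not required. As a rough sketch, a small launcher could start the same five runs with the standard `subprocess` module and wait for them. The function names `build_commands` and `run_all` and the placeholder paths are assumptions; the script name and the five groups come from the commands shown earlier.

```python
import subprocess
import sys


def build_commands(script_path, csv_path, groups):
    """Build one command line per group of condition numbers."""
    commands = []
    for group in groups:
        condition = ",".join(str(n) for n in group)
        # sys.executable is the Python interpreter running this launcher.
        commands.append([sys.executable, script_path, csv_path, condition])
    return commands


def run_all(commands):
    """Start every command at once, then wait for all of them to finish."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]


# The same five groups used in the terminal windows.
GROUPS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

# Example call (the paths are placeholders):
# exit_codes = run_all(build_commands("myscript", "data.csv", GROUPS))
```

`run_all` returns the exit code of each run, so a non-zero value shows which group needs to be run again.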