Why is appending to a list slower when using multiprocessing?


I want to append to a list where each element is a large dataframe.
I tried to use the multiprocessing module to speed up the appends. My code is as follows:

import pandas as pd
import numpy as np
import time
import multiprocessing

from multiprocessing import Manager

def generate_df(size):
    df = pd.DataFrame()
    for x in list('abcdefghi'):
        df[x] = np.random.normal(size=size)
    return df

def do_something(df_list,size,k):
    df = generate_df(size)
    df_list[k] = df

if __name__ == '__main__':
    size = 200000
    num_df = 30
    start = time.perf_counter()
    with Manager() as manager:
        df_list = manager.list(range(num_df))

        processes = []
        for k in range(num_df):
            p = multiprocessing.Process(target=do_something, args=(df_list, size, k))
            p.start()
            processes.append(p)

        for process in processes:
            process.join()

        final_df = pd.concat(df_list)

    print(final_df.head())
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start,2)} second(s)')
    print(len(final_df))

The elapsed time is 7 secs.

I also tried building the list without multiprocessing.

df_list = []
for _ in range(num_df):
    df_list.append(generate_df(size))

final_df = pd.concat(df_list)

But this time the elapsed time is 2 secs! Why is appending to the list with multiprocessing slower than without it?


Solution

When you use manager.list, you’re not using a normal Python list. You’re using a special list proxy object that has a whole lot of other stuff going on. Every operation on that list will involve locking and interprocess communication so that every process with access to the list will see the same data in it at all times. It’s slow because it’s a non-trivial problem to keep everything consistent in that way.
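A quick way to see that cost for yourself (a minimal sketch I'm adding for illustration, not part of the original post) is to time the same sequence of appends against a plain list and a `manager.list` proxy. Each proxy operation is a round-trip to the manager process, so even trivial writes are far slower:

```python
import time
from multiprocessing import Manager

def time_writes(lst, n=1000):
    """Append n small integers to lst and return the elapsed time."""
    start = time.perf_counter()
    for i in range(n):
        lst.append(i)
    return time.perf_counter() - start

if __name__ == '__main__':
    # Plain list: appends happen in-process, no locking or IPC.
    plain_time = time_writes([])

    # manager.list proxy: every append is forwarded to the manager process.
    with Manager() as manager:
        proxy_time = time_writes(manager.list())

    print(f'plain list:   {plain_time:.6f}s')
    print(f'manager.list: {proxy_time:.6f}s')
```

On a typical machine the proxy comes out orders of magnitude slower, and that gap is pure synchronization and communication overhead, before any real work is done.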

You probably don’t need all of that synchronization; it’s just slowing you down. A much more natural way to do what you’re attempting is to use a process pool and its map method. The pool will handle creating and shutting down the processes, and map will call a target function with each argument from an iterable.

Try something like this, which will use a number of worker processes equal to the number of CPUs your system has:

if __name__ == '__main__':
    size = 200000
    num_df = 30
    start = time.perf_counter()

    with multiprocessing.Pool() as pool:
        df_list = pool.map(generate_df, [size]*num_df)

    final_df = pd.concat(df_list)
    print(final_df.head())
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start,2)} second(s)')
    print(len(final_df))

This will still have some overhead, since the interprocess communication used to pass the dataframes back to the main process is not free. It may still be slower than running everything in a single process.
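To get a feel for where that overhead comes from (a rough sketch, assuming the transfer cost is dominated by serialization), you can measure how large one of these dataframes is once pickled, which is roughly what has to cross the process boundary for each result:

```python
import pickle

import numpy as np
import pandas as pd

def generate_df(size):
    """Same generator as in the question: 9 columns of random floats."""
    df = pd.DataFrame()
    for x in list('abcdefghi'):
        df[x] = np.random.normal(size=size)
    return df

df = generate_df(200000)
payload = pickle.dumps(df)

# 9 float64 columns * 200000 rows = ~14.4 MB of raw data per dataframe,
# so each worker result costs a double-digit-megabyte transfer.
print(f'pickled size: {len(payload) / 1e6:.1f} MB')
```

With 30 such dataframes, the pool moves hundreds of megabytes between processes, which is why generating the data cheaply in a single process can still win.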
