I am trying to load a large number of parquet files with pandas in Python, and I noticed a notable performance difference between two approaches. Specifically,
pd.read_parquet("/path/to/directory/")
is more than twice as fast as something like:
import glob
import pandas as pd

filelist = glob.glob("/path/to/directory/*")
pd.concat([pd.read_parquet(f) for f in filelist])
The reasons I want to use the second approach include pre-filtering which parquet files get loaded, and loading from multiple directories that contain parquet files with the same schema.
Any tips or guidance appreciated - basically I am looking to understand how to make the second approach as performant as the first, and/or what kind of magic makes the first approach faster.