I have a directory full of CSV files, each with 1 million rows or more. Every file has an identical structure, including a "Date" column whose values are in no particular order.
I want to read the files in, split the rows by year-month combination, and write each group out again.
import polars as pl

# Lazily scan every matching file (tab-separated, going by the .tsv extension)
lazy_df = pl.scan_csv("data/file_*.tsv", separator="\t")
lazy_df = lazy_df.with_columns(pl.col("Date").str.to_datetime("%Y-%m-%d").alias("Date"))
lazy_df = lazy_df.with_columns(pl.col("Date").dt.strftime("%Y-%m").alias("year_month"))
year_months = lazy_df.select(["year_month"]).unique().collect()
for year_month in year_months["year_month"]:
    df_month = lazy_df.filter(pl.col("year_month") == year_month).collect()
    # Write this dataframe to a new TSV file
    output_filename = f"data_{year_month}.tsv"
    df_month.write_csv(output_filename, separator="\t")
My Jupyter kernel keeps crashing, so I'm not sure whether I'm using Polars the right way.
The code I tried is above.