I have tried the below approach in pyspark:
from pyspark.sql.functions import concat, lpad, regexp_extract, length, when
dilip_df = spark.createDataFrame(
[("1:15:00",), ("07:45:00",), ("0 2:23:00",), ("0 6:50:00",)],
[ travel_time ]
)
dilip_df.select(
concat(
when(
length(dilip_df[ travel_time ]) == 1,
concat(
lpad(
regexp_extract(dilip_df[ travel_time ], r d+ , 0),
2, 0
),
regexp_extract(dilip_df[ travel_time ], r :d+:d+ , 0)
)
).otherwise(
regexp_extract(dilip_df[ travel_time ], r d:?d:d+ , 0)
)
).alias( travel_time )
).show()
+-----------+
|travel_time|
+-----------+
| 15:00|
| 07:45|
| 23:00|
| 50:00|
+-----------+
以上代码:travel_time栏
If the length of travel_time is 1, it means there is only a single digit so
Extract the numeric part from the travel_time column using regexp_extract
.
Left pad
the extracted numeric part with zeros to make it a two-digit number using lpad.
Concatenate the padded numeric part with the rest of the original time string obtained by using regexp_extract
.
If the length is not 1, it means there is more than one digit so
Extract the time part using regexp_extract
.
The final result is an alias column named travel_time containing the transformed values.
采用代号:
from pyspark.sql.functions import expr
dilip_df = spark.createDataFrame(
[("1:15:00",), ("07:45:00",), ("02:23:00",), ("06:50:00",)],
[ travel_time ]
)
dilip_df.createOrReplaceTempView("dilip_table")
query = """
SELECT
travel_time,
CASE
WHEN REGEXP_EXTRACT(travel_time, ^\d{1,2}:\d{2}:\d{2}$ ) IS NOT NULL THEN HH:MM:SS
WHEN REGEXP_EXTRACT(travel_time, ^0\d{1}:\d{2}:\d{2}$ ) IS NOT NULL THEN 0H:MM:SS
WHEN REGEXP_EXTRACT(travel_time, ^0\d{1} \d{1,2}:\d{2}:\d{2}$ ) IS NOT NULL THEN 0H HM:SS
ELSE NULL
END AS format
FROM dilip_table
"""
result = spark.sql(query)
result.show()
In spark.sql code It selects the travel_time column with transformations applied using the CONCAT
and CASE
WHEN
expressions based on the length of the travel_time column.
If the length is 1, it extracts and manipulates the numeric part of the travel_time using regexp_extract
, lpad
, and concat
.
If the length is not 1, it extracts the time part using regexp_extract
.
<>Results>
+-----------+--------+
|travel_time| format|
+-----------+--------+
| 1:15:00|HH:MM:SS|
| 07:45:00|HH:MM:SS|
| 02:23:00|HH:MM:SS|
| 06:50:00|HH:MM:SS|
+-----------+--------+