What happens when we delete Spark managed tables?

I recently started learning about Spark, and I was studying Spark managed tables. As per the docs, "Spark manages both the data and the metadata". Assume that I have a CSV file in S3 and I read it into a DataFrame like below.

df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("s3a://databricks-learning-s333/temp/flights.csv"))

Now I created a Spark managed table in Databricks as below:

spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")

spark.sql("CREATE TABLE managed_us_delay_flights_tbl (date STRING, delay INT,  
  distance INT, origin STRING, destination STRING)")

df.write.saveAsTable("managed_us_delay_flights_tbl")
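
To double-check that this really is a managed table, I believe listing the catalog entries should show its type as MANAGED (a rough sketch, I have not captured the exact output):

# Sketch: list the tables in learn_spark_db and print each table's type;
# a Spark managed table should report tableType = 'MANAGED'
for t in spark.catalog.listTables("learn_spark_db"):
    print(t.name, t.tableType)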

Now it is a Spark managed table, so Spark manages both the data and the metadata.

As per the docs, if we drop a managed table, Spark deletes both the metadata and the actual data (docs).
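
As far as I understand, the storage location that Spark manages for this table can be inspected with DESCRIBE TABLE EXTENDED (a sketch, not verified on my cluster):

# Sketch: show the table's metadata, including the Location and Provider rows,
# to see where the managed table's files actually live
spark.sql("DESCRIBE TABLE EXTENDED managed_us_delay_flights_tbl").show(truncate=False)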

Here are my questions:

  1. The code below drops the Spark managed table. Does that mean it will delete my original data in S3, or what exactly does it mean that Spark deletes both the data and the metadata? (See the sketch after this list for the check I have in mind.)

    spark.sql("DROP TABLE managed_us_delay_flights_tbl")
    
  2. I read here that when we create managed tables, Spark uses the Delta format. My original data is in CSV format in S3, so does that mean Spark will convert the CSV to Delta format in place, or will it duplicate the data and write it somewhere else in Delta format?

  3. If I create Spark managed tables, do they use the same underlying storage or a new location? Please explain in detail.
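
For question 1, this is roughly the check I have in mind (a sketch reusing the same table name and S3 path as above; I have not run it end to end):

# Sketch: drop the managed table, then try re-reading the original CSV
# to see whether the file in S3 still exists or was deleted with the table
spark.sql("DROP TABLE IF EXISTS managed_us_delay_flights_tbl")

check_df = (spark.read
    .format("csv")
    .option("header", "true")
    .load("s3a://databricks-learning-s333/temp/flights.csv"))
print(check_df.count())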
