Hi, I am learning PySpark. Right now my code works for CSV data, but when I convert the same data to JSON, I get the following error:
*Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query. For example,
val df = spark.read.schema(schema).json(file).cache()
and then
df.filter($"_corrupt_record".isNotNull).count()*
The sample JSON data is:
[
{
"student_id": 1,
"name": "John Doe",
"age": 18,
"grade": "A"
},
{
"student_id": 2,
"name": "Jane Smith",
"age": 17,
"grade": "B"
},
{
"student_id": 3,
"name": "Bob Johnson",
"age": 19,
"grade": "C"
},
{
"student_id": 4,
"name": "Alice Williams",
"age": 18,
"grade": "A"
},
{
"student_id": 5,
"name": "Charlie Brown",
"age": 17,
"grade": "B"
},
{
"student_id": 6,
"name": "Emma Davis",
"age": 19,
"grade": "C"
},
{
"student_id": 7,
"name": "James Miller",
"age": 18,
"grade": "A"
},
{
"student_id": 8,
"name": "Sophie Taylor",
"age": 17,
"grade": "B"
},
{
"student_id": 9,
"name": "David White",
"age": 19,
"grade": "C"
}
]
And here is the code I have used so far:
# Default JSON reader: expects JSON Lines, i.e. one JSON object per line
mydata = spark.read.json("/original.csv")
mydata.show()
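I suspect the error happens because my file is a pretty-printed JSON array, while spark.read.json by default expects JSON Lines (one object per line), so every record ends up in _corrupt_record. Would enabling multiLine be the right fix? A minimal sketch, assuming the file contains exactly the array shown above ("/original.json" is a placeholder for my actual path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine tells Spark to parse JSON records that span multiple lines,
# e.g. a pretty-printed top-level array like the sample data above.
mydata = spark.read.option("multiLine", True).json("/original.json")
mydata.show()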