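You can select the titles with a single class selector (.ipc-title__text) and then do some post-processing. The year, duration and rating are harder, because the rating span is not always present. So you can select the elements with another class selector (.sc-b0691f29-7.hrgukm.cli-title-metadata) and, for each one, take the first span (year), the second span (duration) and the third span (rating), if it exists: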
library(rvest)
library(purrr)
library(tibble)
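titles <- read_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250") %>%
  html_elements(".ipc-title__text") %>%
  html_text() %>%
  # keep only the entries that start with a rank number, e.g. "1. ..."
  `[`(grepl("^\\d", .)) %>%
  # strip the leading rank ("1. ") from each title
  sub("^\\d+\\. ", "", .)
# one metadata block per film, holding the year, duration and rating spans
ylr <- read_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250") %>%
  html_elements(".sc-b0691f29-7.hrgukm.cli-title-metadata")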
years <- map_chr(ylr, ~ html_elements(., "span")[1] %>% html_text())
durations <- map_chr(ylr, ~ html_elements(., "span")[2] %>% html_text())
ratings <- ylr %>%
  map_chr(
    ~ ifelse(
      length(html_elements(., "span")) == 3,
      html_elements(., "span")[3] %>% html_text(),
      NA_character_
    )
  )
tibble(
  title = titles,
  year = years,
  duration = durations,
  rating = ratings
)
#> # A tibble: 250 × 4
#>    title                                             year  duration rating
#>    <chr>                                             <chr> <chr>    <chr>
#>  1 The Shawshank Redemption                          1994  2h 22m   R
#>  2 The Godfather                                     1972  2h 55m   R
#>  3 The Dark Knight                                   2008  2h 32m   PG-13
#>  4 The Godfather Part II                             1974  3h 22m   R
#>  5 12 Angry Men                                      1957  1h 36m   Approved
#>  6 Schindler's List                                  1993  3h 15m   R
#>  7 The Lord of the Rings: The Return of the King     2003  3h 21m   PG-13
#>  8 Pulp Fiction                                      1994  2h 34m   R
#>  9 The Lord of the Rings: The Fellowship of the Ring 2001  2h 58m   PG-13
#> 10 The Good, the Bad and the Ugly                    1966  2h 58m   Approved
#> # ℹ 240 more rows
Created on 2024-03-26 with reprex v2.1.0
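
To check the missing-rating handling without hitting the site, here is a minimal sketch on an inline HTML fragment. The markup, the use of the bare `.cli-title-metadata` class, and the helper `nth_span()` are all made up for illustration and are not taken from the real page:

```r
library(rvest)
library(purrr)

# Hypothetical markup: the second entry has no rating span
page <- minimal_html('
  <div class="cli-title-metadata"><span>1994</span><span>2h 22m</span><span>R</span></div>
  <div class="cli-title-metadata"><span>1931</span><span>1h 27m</span></div>
')

blocks <- html_elements(page, ".cli-title-metadata")

# Hypothetical helper: text of the i-th span in a block,
# or NA if the block has fewer than i spans
nth_span <- function(block, i) {
  spans <- html_elements(block, "span")
  if (length(spans) >= i) html_text(spans[[i]]) else NA_character_
}

years     <- map_chr(blocks, nth_span, i = 1)
durations <- map_chr(blocks, nth_span, i = 2)
ratings   <- map_chr(blocks, nth_span, i = 3)  # NA where the rating is absent
```

Checking `length(spans) >= i` rather than `== 3` is a small design choice: the same helper then works for every column, and a block with fewer spans than expected gives `NA` instead of an error from `map_chr()`.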