Question

I'm doing web scraping with R and trying to extract data for IMDb's top 250 movies. My code so far is very short:

library(tidyverse)
library(rvest)

page = read_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250")

base = html_elements(page, "li")
base %>% html_elements("h3") %>% html_text2() %>% str_remove("^[0-9]+\\. ")
base %>% html_element(".sc-b0691f29-7 hrgukm cli-title-metadata")

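Whenever I try to use `html_element`, I seem to get only NAs. That is the case for the last line, which should extract each movie's year, duration, and age rating, but returns only NAs.

The same happens with the second-to-last line, which tries to extract the h3 contents, in this case the movie titles. If I use `html_element`, I get a list of NAs; if I use `html_elements`, I get the expected result (this alternative did not work for the last line). What am I doing wrong?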

Answer 1

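You can use a class selector (`.ipc-title__text`) for the titles and then do some post-processing. The year, duration, and rating are harder, however, because the rating span is not always present. So you can select on another class selector (`.sc-b0691f29-7.hrgukm.cli-title-metadata`) and take the first span (year), the second span (duration), and the third span (rating), if it exists: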

library(rvest)
library(purrr)
library(tibble)

titles <- read_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250") %>% 
  html_elements(".ipc-title__text") %>% 
  html_text() %>% 
  `[`(grepl("^\\d", .)) %>% 
  sub("^\\d+\\. ", "", .)

ylr <- read_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250") %>% 
  html_elements(".sc-b0691f29-7.hrgukm.cli-title-metadata")

years <- map_chr(ylr, ~ html_elements(., "span")[1] %>% html_text())

durations <- map_chr(ylr, ~ html_elements(., "span")[2] %>% html_text())

ratings <- ylr %>% 
  map_chr(
    ~ ifelse(
      length(html_elements(., "span")) == 3, 
      html_elements(., "span")[3] %>% html_text(), 
      NA_character_
    )
  )

tibble(
  title = titles,
  year = years,
  duration = durations,
  rating = ratings
)
#> # A tibble: 250 × 4
#>    title                                             year  duration rating  
#>    <chr>                                             <chr> <chr>    <chr>   
#>  1 The Shawshank Redemption                          1994  2h 22m   R       
#>  2 The Godfather                                     1972  2h 55m   R       
#>  3 The Dark Knight                                   2008  2h 32m   PG-13   
#>  4 The Godfather Part II                             1974  3h 22m   R       
#>  5 12 Angry Men                                      1957  1h 36m   Approved
#>  6 Schindler's List                                  1993  3h 15m   R       
#>  7 The Lord of the Rings: The Return of the King     2003  3h 21m   PG-13   
#>  8 Pulp Fiction                                      1994  2h 34m   R       
#>  9 The Lord of the Rings: The Fellowship of the Ring 2001  2h 58m   PG-13   
#> 10 The Good, the Bad and the Ugly                    1966  2h 58m   Approved
#> # ℹ 240 more rows

Created on 2024-03-26 with reprex v2.1.0
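
As for why the original calls returned NAs, here is a minimal sketch on a toy document. The markup and the class names `meta-a` and `meta-b` are made up for illustration and are not IMDb's actual HTML. It shows two things: `html_element` returns exactly one result per input node (NA where nothing matches, which is what happens for every `li` that is not a movie), while `html_elements` returns only the matches; and a space inside a CSS selector is the descendant combinator, so classes on the same element have to be chained with dots instead.

```r
library(rvest)

# Toy page (hypothetical markup): two movie items, the second without a
# rating span, plus one unrelated <li> such as a menu entry.
page <- minimal_html('
  <ul>
    <li><h3>1. Movie A</h3>
      <div class="meta-a meta-b"><span>1994</span><span>2h 22m</span><span>R</span></div></li>
    <li><h3>2. Movie B</h3>
      <div class="meta-a meta-b"><span>1957</span><span>1h 36m</span></div></li>
    <li>Menu entry</li>
  </ul>')

items <- html_elements(page, "li")

# html_element(): one result per <li>, NA for the <li> without an <h3>.
titles_one_per_li <- items %>% html_element("h3") %>% html_text()

# html_elements(): only the matches, so the unrelated <li> drops out.
titles_matches_only <- items %>% html_elements("h3") %>% html_text()

# ".meta-a meta-b" means "a <meta-b> tag inside .meta-a": no match.
wrong <- html_elements(page, ".meta-a meta-b")

# ".meta-a.meta-b" means "one element carrying both classes": two matches.
meta <- html_elements(page, ".meta-a.meta-b")

# Once the input nodes are the right ones, html_element() keeps rows aligned
# and fills the missing rating with NA.
ratings <- meta %>% html_element("span:nth-child(3)") %>% html_text()
```

Under these assumptions, `length(wrong)` is 0 and `length(meta)` is 2, and `ratings` has one entry per movie with NA for the movie without a rating span. The same idea could replace the `ifelse()` check above, provided the spans are direct children of the metadata element on the real page.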
