html_element returning NA and I can't understand why

I'm web scraping with R and trying to extract data for IMDB's top 250 movies. My code so far is very short:

library(tidyverse)
library(rvest)

page = read_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250")

base = html_elements(page, "li")
base %>% html_elements("h3") %>% html_text2() %>% str_remove("^[0-9]+\\. ")
base %>% html_element(".sc-b0691f29-7 hrgukm cli-title-metadata")

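Whenever I try to use html_element, I seem to get only NAs. That is the case with the last line, which should extract each movie's year, duration, and age rating, but returns only NAs.

The same happens with the second-to-last line, which tries to extract the h3 content, which in this case holds the movie title. If I use html_element, I get a list of NAs; if I use html_elements, I get the expected result (this alternative did not work for the last line). What am I doing wrong?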

Answers

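You can use a class selector (.ipc-title__text) for the titles and then do some post-processing. The year, duration, and rating are harder, however, because the span for the rating is not always present. So you can select with another class selector (.sc-b0691f29-7.hrgukm.cli-title-metadata) and take the first span (year), the second span (duration), and the third span (rating), if it exists: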

library(rvest)
library(purrr)
library(tibble)

titles <- read_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250") %>%
  html_elements(".ipc-title__text") %>%
  html_text() %>%
  # keep only entries that start with a rank number, then strip the "1. " prefix
  `[`(grepl("^\\d", .)) %>%
  sub("^\\d+\\. ", "", .)

ylr <- read_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250") %>%
  html_elements(".sc-b0691f29-7.hrgukm.cli-title-metadata")

years <- map_chr(ylr, ~ html_elements(., "span")[1] %>% html_text())

durations <- map_chr(ylr, ~ html_elements(., "span")[2] %>% html_text())

ratings <- ylr %>% 
  map_chr(
    ~ ifelse(
      length(html_elements(., "span")) == 3, 
      html_elements(., "span")[3] %>% html_text(), 
      NA_character_
    )
  )

tibble(
  title = titles,
  year = years,
  duration = durations,
  rating = ratings
)
#> # A tibble: 250 × 4
#>    title                                             year  duration rating  
#>    <chr>                                             <chr> <chr>    <chr>   
#>  1 The Shawshank Redemption                          1994  2h 22m   R       
#>  2 The Godfather                                     1972  2h 55m   R       
#>  3 The Dark Knight                                   2008  2h 32m   PG-13   
#>  4 The Godfather Part II                             1974  3h 22m   R       
#>  5 12 Angry Men                                      1957  1h 36m   Approved
#>  6 Schindler's List                                  1993  3h 15m   R       
#>  7 The Lord of the Rings: The Return of the King     2003  3h 21m   PG-13   
#>  8 Pulp Fiction                                      1994  2h 34m   R       
#>  9 The Lord of the Rings: The Fellowship of the Ring 2001  2h 58m   PG-13   
#> 10 The Good, the Bad and the Ugly                    1966  2h 58m   Approved
#> # ℹ 240 more rows

Created on 2024-03-26 with reprex v2.1.0
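
As for why the original last line gave only NAs: in a CSS selector a space means "descendant", so ".sc-b0691f29-7 hrgukm cli-title-metadata" looks for a <cli-title-metadata> tag inside a <hrgukm> tag, and neither exists. Classes on the same element are chained with dots instead. Below is a minimal offline sketch of that difference. The markup is a made-up stand-in that only reuses the class names from the question, not IMDb's real HTML, and the nth-child approach is one possible alternative to indexing the spans, not the method used above. It also relies on html_element returning exactly one result per input node, with NA where nothing matches.

```r
library(rvest)

# Hypothetical toy markup (NOT IMDb's real page): two list items,
# the second of which has no third (rating) span.
page <- minimal_html('
  <ul>
    <li><h3>1. Film A</h3>
      <div class="sc-b0691f29-7 hrgukm cli-title-metadata"><span>1994</span><span>2h 22m</span><span>R</span></div>
    </li>
    <li><h3>2. Film B</h3>
      <div class="sc-b0691f29-7 hrgukm cli-title-metadata"><span>1957</span><span>1h 36m</span></div>
    </li>
  </ul>
')

items <- html_elements(page, "li")

# Space-separated: interpreted as nested tags <hrgukm> and <cli-title-metadata>,
# which do not exist, so every item yields a missing node and the text is NA.
wrong <- items %>%
  html_element(".sc-b0691f29-7 hrgukm cli-title-metadata") %>%
  html_text()

# Dot-chained: matches a single element that carries all three classes.
meta <- items %>%
  html_element(".sc-b0691f29-7.hrgukm.cli-title-metadata")

# One result per item; a missing span becomes NA instead of
# shortening the vector, so the columns stay aligned.
years     <- meta %>% html_element("span:nth-child(1)") %>% html_text()
durations <- meta %>% html_element("span:nth-child(2)") %>% html_text()
ratings   <- meta %>% html_element("span:nth-child(3)") %>% html_text()
```

Here wrong is NA for both items, while ratings holds a value for the first item and NA for the second, so the three vectors can go straight into a tibble without the length check.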




