Question

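tl;dr: I am looking for a way to find entries in our database that are missing information, fetch that information from a website, and add it to the database entry.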

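We have a media management program that uses a MySQL table to store its information. When employees download media (video files, images, audio files) and import it into the media manager, they are supposed to also copy the media's description (from the source website) and add it to the description in the media manager. However, this has not been done for thousands of files.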

The file name (e.g. file123.mov) is unique, and the detail page for that file can be reached by going to a URL on the source website:

website.com/content/file123

The information we want to scrape from that page is in an element whose ID is always the same.

In my mind, the process would be:

Connect to database and Load table

Filter: "format" is "Still Image (JPEG)"

Filter: "description" is "NULL"

Get first result

Get "FILENAME" (without extension)

Load the URL: website.com/content/FILENAME

Copy contents of the element "description" (on website)

Paste contents into the "description" (SQL entry)

Get 2nd result

Rinse and repeat until last result is reached

My questions are:

Is there software that could perform such a task or is this something that would need to be scripted?
If scripted, what would be the best type of script (eg could I achieve this using AppleScript or would it need to be made in java or php etc.)

Answer 1

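I am also not aware of any existing software package that will do everything you are looking for. However, Python can connect to your database, make web requests easily, and handle dirty HTML. Assuming you already have Python installed, you will need three packages: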

MySQLdb for connecting to the database.
Requests for easily making http web requests.
BeautifulSoup for robust parsing of html.

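You can install these packages with pip or with the Windows installers. Each site has appropriate instructions. The whole process should not take more than 10 minutes.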

import MySQLdb as db
import os.path
import requests
from bs4 import BeautifulSoup

# Connect to the database. Fill in these fields as necessary.

con = db.connect(host="hostname", user="username", passwd="password",
                 db="dbname")

# Create and execute our SELECT sql statement.

select = con.cursor()
select.execute("""SELECT filename FROM table_name
                  WHERE format = %s AND description IS NULL""",
               ("Still Image (JPEG)",))

while True:
    # Fetch a row from the result of the SELECT statement.

    row = select.fetchone()
    if row is None: break

    # Use Python's built-in os.path.splitext to split the extension
    # and get the url_name.

    filename = row[0]
    url_name = os.path.splitext(filename)[0]
    url = "http://www.website.com/content/" + url_name

    # Make the web request. You may want to rate-limit your requests
    # so that the website doesn't get angry. You can slow down the
    # rate by inserting a pause with:
    #               
    # import time   # You can put this at the top with other imports
    # time.sleep(1) # This will wait 1 second.

    response = requests.get(url)
    if response.status_code != 200:

        # Don't worry about skipped urls. Just re-run this script
        # on spurious or network-related errors.

        print("Error accessing:", url, "SKIPPING")
        continue

    # Parse the result. BeautifulSoup does a great job handling
    # mal-formed input.

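    # get_text() returns the element's text as one string, which is what
    # the UPDATE below needs (.contents would give a list of nodes).
    soup = BeautifulSoup(response.content, "html.parser")
    description = soup.find("div", {"id": "description"}).get_text()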

    # And finally, update the database with another query.

    update = con.cursor()
    update.execute("""UPDATE table_name SET description = %s
                      WHERE filename = %s""",
                   (description, filename))
    con.commit()  # Make the change permanent.

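I will add the caveat that I made a good effort to make this code "look right", but I have not actually tested it. You will need to fill in the private details.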
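Two details are easy to get wrong in a script like this: in SQL a comparison with `= NULL` never matches (it has to be `IS NULL`), and the placeholder syntax depends on the driver (MySQLdb uses `%s`, while the sqlite3 module used below uses `?`). Here is a minimal sketch of just the database side that you can run without a MySQL server. It assumes an in-memory SQLite table as a stand-in; the table name `media`, the sample rows, and the helper names are all made up for illustration, and the web request is replaced by a fixed string.

```python
import os.path
import sqlite3

BASE_URL = "http://www.website.com/content/"  # placeholder host from the question


def build_url(filename):
    """Drop the extension and append the rest to the base URL."""
    return BASE_URL + os.path.splitext(filename)[0]


def pending_filenames(con):
    """Return filenames of still images whose description is missing."""
    rows = con.execute(
        "SELECT filename FROM media "
        "WHERE format = ? AND description IS NULL",
        ("Still Image (JPEG)",),
    )
    return [row[0] for row in rows]


def save_description(con, filename, description):
    """Store a description for one file and commit the change."""
    con.execute(
        "UPDATE media SET description = ? WHERE filename = ?",
        (description, filename),
    )
    con.commit()


def demo():
    # Hypothetical stand-in table; the real one lives in MySQL.
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE media (filename TEXT, format TEXT, description TEXT)"
    )
    con.executemany(
        "INSERT INTO media VALUES (?, ?, ?)",
        [
            ("file123.jpg", "Still Image (JPEG)", None),
            ("file456.jpg", "Still Image (JPEG)", "already described"),
            ("file789.mov", "Video", None),
        ],
    )
    todo = pending_filenames(con)
    for name in todo:
        # A real run would download build_url(name) and parse it here.
        save_description(con, name, "fetched from " + build_url(name))
    return todo, pending_filenames(con)
```

Calling `demo()` returns the filenames that needed a description before the run and the ones still missing afterwards, which makes it a cheap way to check the queries before pointing the script at the real table.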

Answer 2

Is there software that could perform such a task or is this something that would need to be scripted?

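I am not aware of anything that will do what you want out of the box (and even if there were, the configuration required would not be much less work than scripting your own solution).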

If scripted, what would be the best type of script (eg could I achieve this using AppleScript or would it need to be made in java or php etc.)

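AppleScript cannot connect to a database, so you will definitely need to throw something else into the mix. If the choice is between Java and PHP (and you are equally familiar with both), I would definitely recommend PHP for this, as considerably less code is involved.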

Your PHP script would look something like this:

$BASEURL = 'http://website.com/content/';

// connect to the database
$dbh = new PDO($DSN, $USERNAME, $PASSWORD);

// query for files without descriptions
$qry = $dbh->query("
  SELECT FILENAME FROM mytable
  WHERE format = 'Still Image (JPEG)' AND description IS NULL
");

// prepare an update statement
$update = $dbh->prepare('
  UPDATE mytable SET description = :d WHERE FILENAME = :f
');

$update->bindParam(':d', $DESCRIPTION);
$update->bindParam(':f', $FILENAME);

// loop over the files
while ($FILENAME = $qry->fetchColumn()) {
  // construct URL
  $i = strrpos($FILENAME, '.');
  $url = $BASEURL . (($i === false) ? $FILENAME : substr($FILENAME, 0, $i));

  // fetch the document
  $doc = new DOMDocument();
  $doc->loadHTMLFile($url);

  // get the description
  $DESCRIPTION = $doc->getElementById('description')->nodeValue;

  // update the database
  $update->execute();
}
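For comparison with the Python answer, the "get the text of the element whose id is description" step can also be sketched with nothing but Python's standard library. This is only an illustration, not a replacement for a real HTML library: the class name, the helper `text_by_id`, and the sample markup are made up, and the simple depth counter assumes that tags nested inside the target element are properly closed.

```python
from html.parser import HTMLParser


class ElementTextById(HTMLParser):
    """Collect the text inside the first element that has a given id."""

    # Tags that never have a closing tag, so they must not affect depth.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0      # > 0 while the parser is inside the target
        self.found = False  # True once the target element was opened
        self.done = False   # True once the target element was closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_by_id(html, target_id):
    """Return the stripped text of the element, or None if it is absent."""
    parser = ElementTextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() if parser.found else None


SAMPLE = (
    '<html><body><div id="title">T</div>'
    '<div id="description">A <b>red</b> car.<br>Shot in 2010.</div>'
    "<p>footer</p></body></html>"
)
```

`text_by_id(SAMPLE, "description")` gives the text of that div with the inner tags removed, and an id that does not occur on the page gives `None`, which is the case the update loop should skip rather than write to the database.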

Answer 3

PHP is a good scraper. I made a class that wraps PHP's cURL port here:

http://semlabs.co.uk/journal/object-fool-curl-level-with-mul-threading

You will probably need to use some of the options:

http://www.php.net/manual/en/function.curl-setopt.php

For scraping HTML I usually use regular expressions, but here is a class I made that should be able to query HTML without problems:

http://pastebin.com/Jm9jKjAU

Usage is:

$h = new HTMLQuery();
$h->load( $string_containing_html );
$h->getElements( 'p', 'id' ); // Returns all p tags with an id attribute

The best option for scraping is XPath, but it can't handle dirty HTML. You can use it to do things like:

//div[@class = 'itm']/p[last() and text() = 'Hello World'] <- selects the last p in div elements that has the innerHTML 'Hello World'

You can use this in PHP with the DOM classes (built in).
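To illustrate both halves of that point (XPath is concise, but it needs well-formed input), here is a small sketch in Python using the standard library's `xml.etree.ElementTree`. Note that ElementTree only supports a limited subset of XPath, so the expression is simplified to "the last p inside each div with class itm"; the sample markup and function names are made up for the example.

```python
import xml.etree.ElementTree as ET

WELL_FORMED = """
<html><body>
  <div class="itm"><p>First</p><p>Hello World</p></div>
  <div class="other"><p>Ignored</p></div>
</body></html>
""".strip()

# Unquoted attribute value and an unclosed <p>: typical "dirty" HTML.
DIRTY = "<div class=itm><p>Hello World</div>"


def last_paragraphs(markup):
    """Return the text of the last <p> inside every <div class="itm">."""
    root = ET.fromstring(markup)
    return [p.text for p in root.findall(".//div[@class='itm']/p[last()]")]


def parses_cleanly(markup):
    """Report whether the strict XML parser accepts the markup at all."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

`last_paragraphs(WELL_FORMED)` picks out the "Hello World" paragraph, while `parses_cleanly(DIRTY)` is false because the strict parser rejects the markup before any XPath query can run. That is why a tolerant HTML parser is usually put in front of XPath when scraping real pages.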
