Question

I built a site a long time ago and now I want to place the data into a database without copying and pasting the 400+ pages that it has grown to so that I can make the site database driven.

My site has meta tags like this (each page different):

<meta name="clan_name" content="Dark Mage" />

So what I'm doing is using cURL to place the entire HTML page in a variable as a string. I can also do it with fopen etc., but I don't think it matters.

I need to sift through the string to find Dark Mage and store it in a variable (so I can put it into SQL).

Any ideas on the best way to find Dark Mage to store in a variable? I was trying to use substr and then just subtracting the number of characters from the e in clan_name, but that was a bust.

Answer 1

Just parse the page using the PHP DOM functions, specifically loadHTML(). You can then walk the tree or use xpath to find the nodes you are looking for.

<?php
$doc = new DOMDocument;
$doc->loadHTML($html);
// Collect every <meta> element in the page
$meta = $doc->getElementsByTagName('meta');
foreach ($meta as $data) {
  // Read attributes from the current element, not from the node list
  $name = $data->getAttribute('name');
  if ($name == 'clan_name') {
    $content = $data->getAttribute('content');
    // TODO handle content for clan_name
  }
}
?>
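
The XPath route mentioned above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the sample $html string is a made-up stand-in for a fetched page, and the libxml_use_internal_errors() call is an assumption that real pages may contain markup the parser warns about.

```php
<?php
// Hypothetical sample page; in practice $html would hold the fetched page.
$html = '<html><head><meta name="clan_name" content="Dark Mage" /></head><body></body></html>';

// Collect parser warnings internally instead of printing them,
// since real-world HTML is often not perfectly valid.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select the content attribute of the meta tag whose name is clan_name.
$nodes = $xpath->query('//meta[@name="clan_name"]/@content');

// Fall back to null when the page has no such meta tag.
$clan_name = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;

echo $clan_name, "\n"; // prints "Dark Mage"
```

Compared with the loop, the XPath query does the filtering by attribute in one expression, which may be convenient if several different meta names need to be pulled out.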

EDIT If you want to remove certain tags (such as <script>) before you load your HTML string into the DOM, try using the strip_tags() function. Something like this strips every tag except the meta tags (the text between the removed tags is left in place):

<?php
  $html = strip_tags($html, '<meta>');
?>

Answer 2

Use a regular expression like the following, with PHP's preg_match():

/<meta name="clan_name" content="([^"]+)"/

If you're not familiar with regular expressions, read on.

The forward-slashes at the beginning and end delimit the regular expression. The stuff inside the delimiters is pretty straightforward except toward the end.

The square-brackets delimit a character class, and the caret at the beginning of the character-class is a negation-operator; taken together, then, this character class:

[^"]

means "match any character that is not a double-quote".

The + is a quantifier which requires that the preceding item occur at least once, and matches as many of the preceding item as appear adjacent to the first. So this:

[^"]+

means "match one or more characters that are not double-quotes".

Finally, the parentheses cause the regular-expression engine to store anything between them in a subpattern. So this:

([^"]+)

means "match one or more characters that are not double-quotes and store them as a matched subpattern".

In PHP, preg_match() stores matches in an array that you pass by reference. The text matched by the full pattern is stored in the first element of the array, the text matched by the first subpattern in the second element, and so forth if there are additional subpatterns.

So, assuming your HTML page is in the variable "$page", the following code:

$matches = array();
$found = preg_match('/<meta name="clan_name" content="([^"]+)"/', $page, $matches);

if ($found) {
    // $matches[0] is the whole matched tag; $matches[1] is the captured content value
    $clan_name = $matches[1];
}

should get you what you want.

Answer 3

Use preg_match. A possible regular expression pattern is /clan_name.+content="([^"]+)"/
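
As a rough illustration of this looser pattern, the sketch below runs it against a made-up $page string. The second meta tag named "rank" is hypothetical and is there only to show one caveat: because .+ is greedy, the pattern can skip ahead to a later content attribute on the same line, so a lazy .+? variant is shown alongside it as one possible adjustment.

```php
<?php
// Hypothetical page fragment with two meta tags on the same line.
$page = '<meta name="clan_name" content="Dark Mage" /><meta name="rank" content="Elder" />';

// Pattern from the answer: greedy .+ runs to the LAST content="..." on the line.
preg_match('/clan_name.+content="([^"]+)"/', $page, $greedy);

// Lazy variant (an assumption, not part of the original answer):
// .+? stops at the FIRST content="..." after clan_name.
preg_match('/clan_name.+?content="([^"]+)"/', $page, $lazy);

echo $greedy[1], "\n"; // prints "Elder"
echo $lazy[1], "\n";   // prints "Dark Mage"
```

If every page keeps each meta tag on its own line, the two patterns should behave the same, since . does not match a newline by default.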
