Question

I have a URL that can be in any of the following formats:

http://example.com
https://example.com
http://example.com/foo
http://example.com/foo/bar
www.example.com
example.com
foo.example.com
www.foo.example.com
foo.bar.example.com
http://foo.bar.example.com/foo/bar
example.net/foo/bar

Essentially, I need to be able to match any normal URL. How can I extract example.com (or example.net, or whatever the TLD happens to be; this needs to work with any TLD) from all of these with a single regex?

Answer 1

Well you can use parse_url to get the host:

$info = parse_url($url);
$host = $info['host'];

Then you can do some fancy stuff to get only the TLD and the host:

$host_names = explode(".", $host);
$bottom_host_name = $host_names[count($host_names)-2] . "." . $host_names[count($host_names)-1];

Not very elegant, but should work.

If you want an explanation, here it goes:

First we grab everything between the scheme (http://, etc.) and the path by using parse_url's capabilities to... well... parse URLs. :)

Then we take the host name, and separate it into an array based on where the periods fall, so test.world.hello.myname would become:

array("test", "world", "hello", "myname");

After that, we take the number of elements in the array (4).

Then, we subtract 2 from it to get the second to last string (the hostname, or example, in your example)

Then, we subtract 1 from it to get the last string (because array keys start at 0), also known as the TLD

Then we combine those two parts with a period, and you have your base host name.

Answer 2

It is not possible to get the domain name without comparing against a TLD list, because there exist many cases with exactly the same structure and length:

 nas.db.de (Subdomain)
 bbc.co.uk (Top-Level-Domain)

 www.uk.com (Subdomain)
 big.uk.com (Second-Level-Domain)

Mozilla's public suffix list should be the best option, as it is used by all major browsers:
https://publicsuffix.org/list/public_suffix_list.dat
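
To illustrate why a suffix list is needed, here is a minimal sketch that uses a tiny hand-written subset of suffixes. The registrable_domain() name and the $suffixes array are assumptions made up for this illustration; they are not the real list. The idea is to find the longest known suffix at the end of the host and keep exactly one label in front of it.

```php
<?php
// Hypothetical helper (illustration only): return the registrable domain
// of a bare host name, given a list of known public suffixes.
function registrable_domain(string $host, array $suffixes): string
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $count  = count($labels);
    // Walk from the longest candidate suffix (the whole host) to the shortest.
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            // If the whole host is a suffix there is no registrable domain;
            // otherwise keep the single label in front of the suffix.
            return $i === 0 ? '' : implode('.', array_slice($labels, $i - 1));
        }
    }
    return ''; // no known suffix found
}

// Assumed sample data: a tiny subset, NOT the real public suffix list.
$suffixes = ['com', 'de', 'uk', 'co.uk', 'uk.com'];

echo registrable_domain('nas.db.de', $suffixes), "\n";      // db.de
echo registrable_domain('bbc.co.uk', $suffixes), "\n";      // bbc.co.uk
echo registrable_domain('www.big.uk.com', $suffixes), "\n"; // big.uk.com
```

Without the 'co.uk' and 'uk.com' entries, the last two hosts would be cut down to co.uk and uk.com, which is exactly the ambiguity described above.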

Feel free to use my function:

function tld_list($cache_dir=null) {
    // we use "/tmp" if $cache_dir is not set
    $cache_dir = isset($cache_dir) ? $cache_dir : sys_get_temp_dir();
    $lock_dir = $cache_dir . '/public_suffix_list_lock/';
    $list_dir = $cache_dir . '/public_suffix_list/';
    // refresh the list every 30 days
    if (file_exists($list_dir) && @filemtime($list_dir) + 2592000 > time()) {
        return $list_dir;
    }
    // use exclusive lock to avoid race conditions
    if (!file_exists($lock_dir) && @mkdir($lock_dir)) {
        // read from source
        $list = @fopen('https://publicsuffix.org/list/public_suffix_list.dat', 'r');
        if ($list) {
            // the list is older than 30 days so delete everything first
            if (file_exists($list_dir)) {
                foreach (glob($list_dir . '*') as $filename) {
                    unlink($filename);
                }
                rmdir($list_dir);
            }
            // now set list directory with new timestamp
            mkdir($list_dir);
            // read line-by-line to avoid high memory usage
            while (($line = fgets($list)) !== false) {
                // strip the linebreak and surrounding whitespace
                $line = trim($line);
                // skip comments and empty lines
                if ($line === '' || $line[0] === '/') {
                    continue;
                }
                // remove wildcard
                if (substr($line, 0, 2) === '*.') {
                    $line = substr($line, 2);
                }
                // remove exclamation mark
                if ($line[0] === '!') {
                    $line = substr($line, 1);
                }
                // reverse TLD
                $line = implode('.', array_reverse(explode('.', $line)));
                // we split the TLD list to reduce memory usage
                touch($list_dir . $line);
            }
            fclose($list);
        }
        @rmdir($lock_dir);
    }
    // repair locks (should never happen)
    if (file_exists($lock_dir) && mt_rand(0, 100) == 0 && @filemtime($lock_dir) + 86400 < time()) {
        @rmdir($lock_dir);
    }
    return $list_dir;
}
function get_domain($url=null) {
    // obtain location of public suffix list
    $tld_dir = tld_list();
    // no url = our own host
    $url = isset($url) ? $url : $_SERVER['SERVER_NAME'];
    // add missing scheme      ftp://            http:// ftps://   https://
    $url = !isset($url[5]) || ($url[3] != ':' && $url[4] != ':' && $url[5] != ':') ? 'http://' . $url : $url;
    // remove "/path/file.html", "/:80", etc.
    $url = parse_url($url, PHP_URL_HOST);
    // replace absolute domain name by relative (http://www.dns-sd.org/TrailingDotsInDomainNames.html)
    $url = trim($url, '.');
    // check if TLD exists
    $url = explode('.', $url);
    $parts = array_reverse($url);
    foreach ($parts as $key => $part) {
        $tld = implode('.', $parts);
        if (file_exists($tld_dir . $tld)) {
            return !$key ? '' : implode('.', array_slice($url, $key - 1));
        }
        // remove last part
        array_pop($parts);
    }
    return '';
}

What makes it special:

It accepts any input (URLs, hostnames, or domains), with or without a scheme.
The list is downloaded row by row to avoid high memory usage.
It creates one file per TLD in a cache folder, so get_domain() only needs a file_exists() check and does not have to load a huge database on every request, as TLDExtract does.
The list is updated automatically every 30 days.

Test:

$urls = array(
    'http://www.example.com', // example.com
    'http://subdomain.example.com', // example.com
    'http://www.example.uk.com', // example.uk.com
    'http://www.example.co.uk', // example.co.uk
    'http://www.example.com.ac', // example.com.ac
    'http://example.com.ac', // example.com.ac
    'http://www.example.accident-prevention.aero', // example.accident-prevention.aero
    'http://www.example.sub.ar', // sub.ar
    'http://www.congresodelalengua3.ar', // congresodelalengua3.ar
    'http://congresodelalengua3.ar', // congresodelalengua3.ar
    'http://www.example.pvt.k12.ma.us', // example.pvt.k12.ma.us
    'http://www.example.lib.wy.us', // example.lib.wy.us
    'com', // empty
    '.com', // empty
    'http://big.uk.com', // big.uk.com
    'uk.com', // empty
    'www.uk.com', // www.uk.com
    '.uk.com', // empty
    'stackoverflow.com', // stackoverflow.com
    '.foobarfoo', // empty
    '', // empty
    false, // empty
    ' ', // empty
    1, // empty
    'a', // empty
);

Recent version with explanations (German):
http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm

Answer 3

My solution is at https://gist.github.com/pocesar/5366899

and the tests are at http://codepad.viper-7.com/GAh1tP

It works with any TLD and with hideous subdomain patterns (up to 3 subdomains).

There's a test included with many domain names.

I won't paste the function here because of the awkward indentation required for code on Stack Overflow (it could use fenced code blocks like GitHub).

Answer 4

echo getDomainOnly("http://example.com/foo/bar");

function getDomainOnly($host){
    $host = strtolower(trim($host));
    // strip the scheme and a leading "www." (ltrim() would strip characters, not the prefix)
    $host = preg_replace('#^https?://#', '', $host);
    $host = preg_replace('/^www\./', '', $host);
    $count = substr_count($host, '.');
    if($count === 2){
        if(strlen(explode('.', $host)[1]) > 3) $host = explode('.', $host, 2)[1];
    } else if($count > 2){
        $host = getDomainOnly(explode('.', $host, 2)[1]);
    }
    $host = explode('/',$host);
    return $host[0];
}

Answer 5

I recommend using the TLDExtract library for all operations on domain names.

Answer 6

I think the best way to handle this problem is:

$second_level_domains_regex = '/\.asn\.au$|\.com\.au$|\.net\.au$|\.id\.au$|\.org\.au$|\.edu\.au$|\.gov\.au$|\.csiro\.au$|\.act\.au$|\.nsw\.au$|\.nt\.au$|\.qld\.au$|\.sa\.au$|\.tas\.au$|\.vic\.au$|\.wa\.au$|\.co\.at$|\.or\.at$|\.priv\.at$|\.ac\.at$|\.avocat\.fr$|\.aeroport\.fr$|\.veterinaire\.fr$|\.co\.hu$|\.film\.hu$|\.lakas\.hu$|\.ingatlan\.hu$|\.sport\.hu$|\.hotel\.hu$|\.ac\.nz$|\.co\.nz$|\.geek\.nz$|\.gen\.nz$|\.kiwi\.nz$|\.maori\.nz$|\.net\.nz$|\.org\.nz$|\.school\.nz$|\.cri\.nz$|\.govt\.nz$|\.health\.nz$|\.iwi\.nz$|\.mil\.nz$|\.parliament\.nz$|\.ac\.za$|\.gov\.za$|\.law\.za$|\.mil\.za$|\.nom\.za$|\.school\.za$|\.net\.za$|\.co\.uk$|\.org\.uk$|\.me\.uk$|\.ltd\.uk$|\.plc\.uk$|\.net\.uk$|\.sch\.uk$|\.ac\.uk$|\.gov\.uk$|\.mod\.uk$|\.mil\.uk$|\.nhs\.uk$|\.police\.uk$/';
$domain = $_SERVER['HTTP_HOST'];
$domain = explode('.', $domain);
$domain = array_reverse($domain);
if (preg_match($second_level_domains_regex, $_SERVER['HTTP_HOST'])) {
    $domain = "$domain[2].$domain[1].$domain[0]";
} else {
    $domain = "$domain[1].$domain[0]";
}

Answer 7

$onlyHostName = implode('.', array_slice(explode('.', parse_url($link, PHP_URL_HOST)), -2));

Using https://subdomain.domain.com/some/path as an example:

parse_url($link, PHP_URL_HOST) returns subdomain.domain.com

explode('.', parse_url($link, PHP_URL_HOST)) then breaks subdomain.domain.com into an array:

array(3) {
  [0]=>
  string(9) "subdomain"
  [1]=>
  string(6) "domain"
  [2]=>
  string(3) "com"
}

array_slice then slices the array so only the last 2 values are in the array (signified by the -2):

array(2) {
  [0]=>
  string(6) "domain"
  [1]=>
  string(3) "com"
}

implode then combines those two array values back together, ultimately giving you the result of domain.com

Note: this only works when the domain you're expecting has exactly one . in it, like something.domain.com or else.something.domain.net

It will not work for something.domain.co.uk where you would expect domain.co.uk
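
A quick way to see the limitation is to run the same expression on both kinds of host. This is only a sketch; the last_two_labels() wrapper is a hypothetical name introduced for the demo.

```php
<?php
// Hypothetical wrapper around the one-liner above, for demonstration only.
function last_two_labels(string $link): string
{
    $host = parse_url($link, PHP_URL_HOST);
    // Keep only the last two dot-separated labels of the host.
    return implode('.', array_slice(explode('.', (string) $host), -2));
}

echo last_two_labels('https://subdomain.domain.com/some/path'), "\n"; // domain.com
echo last_two_labels('https://something.domain.co.uk/'), "\n";        // co.uk
```

The second call returns co.uk, which is a public suffix rather than the registered domain.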

Answer 8

There are two ways to extract the subdomain from a host:

The first method, which is more accurate, is to use a database of TLDs (like public_suffix_list.dat) and match the domain against it. This can be a little heavy in some cases. There are PHP classes for this, such as php-domain-parser and TLDExtract.

The second way is not as accurate as the first, but it is very fast and gives the correct answer in many cases. I wrote this function for it:

function get_domaininfo($url) {
    // regex can be replaced with parse_url
    preg_match("/^(https|http|ftp):\/\/(.*?)\//", "$url/" , $matches);
    $parts = explode(".", $matches[2]);
    $tld = array_pop($parts);
    $host = array_pop($parts);
    if ( strlen($tld) == 2 && strlen($host) <= 3 ) {
        $tld = "$host.$tld";
        $host = array_pop($parts);
    }

    return array(
        'protocol' => $matches[1],
        'subdomain' => implode(".", $parts),
        'domain' => "$host.$tld",
        'host' => $host, 'tld' => $tld
    );
}

Example:

print_r(get_domaininfo('http://mysubdomain.domain.co.uk/index.php'));

Returns:

Array
(
    [protocol] => http
    [subdomain] => mysubdomain
    [domain] => domain.co.uk
    [host] => domain
    [tld] => co.uk
)

Answer 9

Here's a function I wrote to grab the domain without subdomain(s), regardless of whether the domain uses a ccTLD, a new-style long TLD, etc. There is no lookup or huge array of known TLDs, and there's no regex. It could be a lot shorter using the ternary operator and nesting, but I expanded it for readability.

// Per Wikipedia: "All ASCII ccTLD identifiers are two letters long, 
// and all two-letter top-level domains are ccTLDs."

function topDomainFromURL($url) {
  $url_parts = parse_url($url);
  $domain_parts = explode('.', $url_parts['host']);
  if (strlen(end($domain_parts)) == 2 ) { 
    // ccTLD here, get last three parts
    $top_domain_parts = array_slice($domain_parts, -3);
  } else {
    $top_domain_parts = array_slice($domain_parts, -2);
  }
  $top_domain = implode('.', $top_domain_parts);
  return $top_domain;
}

Answer 10

function getDomain($url){
    $pieces = parse_url($url);
    $domain = isset($pieces['host']) ? $pieces['host'] : '';
    if(preg_match('/(?P<domain>[a-z0-9][a-z0-9-]{1,63}\.[a-z.]{2,6})$/i', $domain, $regs)){
        return $regs['domain'];
    }
    return FALSE;
}

echo getDomain("http://example.com"); // outputs 'example.com'
echo getDomain("http://www.example.com"); // outputs 'example.com'
echo getDomain("http://mail.example.co.uk"); // outputs 'example.co.uk'

Answer 11

I had problems with the solution provided by pocesar. For instance, subdomain.domain.nl would not return domain.nl; it would return subdomain.domain.nl instead. Another problem was that domain.com.br would return com.br.

I am not sure, but I fixed these issues with the following code (I hope it helps someone; if so, I am a happy man):

function get_domain($domain, $debug = false){
    $original = $domain = strtolower($domain);
    if (filter_var($domain, FILTER_VALIDATE_IP)) {
        return $domain;
    }
    $debug ? print('<strong style="color:green">&raquo;</strong> Parsing: ' . $original) : false;
    $arr = array_slice(array_filter(explode('.', $domain, 4), function($value){
        return $value !== 'www';
    }), 0); //rebuild array indexes
    if (count($arr) > 2){
        $count = count($arr);
        $_sub = explode('.', $count === 4 ? $arr[3] : $arr[2]);
        $debug ? print(" (parts count: {$count})") : false;
        if (count($_sub) === 2){ // two level TLD
            $removed = array_shift($arr);
            if ($count === 4){ // got a subdomain acting as a domain
                $removed = array_shift($arr);
            }
            $debug ? print("<br>\n" . '[*] Two level TLD: <strong>' . join('.', $_sub) . '</strong> ') : false;
        }elseif (count($_sub) === 1){ // one level TLD
            $removed = array_shift($arr); //remove the subdomain
            if (strlen($arr[0]) === 2 && $count === 3){ // TLD domain must be 2 letters
                array_unshift($arr, $removed);
            }elseif(strlen($arr[0]) === 3 && $count === 3){
                array_unshift($arr, $removed);
            }else{
                // non country TLD according to IANA
                $tlds = array(
                    'aero',
                    'arpa',
                    'asia',
                    'biz',
                    'cat',
                    'com',
                    'coop',
                    'edu',
                    'gov',
                    'info',
                    'jobs',
                    'mil',
                    'mobi',
                    'museum',
                    'name',
                    'net',
                    'org',
                    'post',
                    'pro',
                    'tel',
                    'travel',
                    'xxx',
                );
                if (count($arr) > 2 && in_array($_sub[0], $tlds) !== false){ // special TLDs don't have a country
                    array_shift($arr);
                }
            }
            $debug ? print("<br>\n" . '[*] One level TLD: <strong>' . join('.', $_sub) . '</strong> ') : false;
        }else{ // more than 3 levels, something is wrong
            for ($i = count($_sub); $i > 1; $i--){
                $removed = array_shift($arr);
            }
            $debug ? print("<br>\n" . '[*] Three level TLD: <strong>' . join('.', $_sub) . '</strong> ') : false;
        }
    }elseif (count($arr) === 2){
        $arr0 = array_shift($arr);
        if (strpos(join('.', $arr), '.') === false && in_array($arr[0], array('localhost','test','invalid')) === false){ // not a reserved domain
            $debug ? print("<br>\n" . 'Seems invalid domain: <strong>' . join('.', $arr) . '</strong> re-adding: <strong>' . $arr0 . '</strong> ') : false;
            // seems invalid domain, restore it
            array_unshift($arr, $arr0);
        }
    }
    $debug ? print("<br>\n" . '<strong style="color:gray">&laquo;</strong> Done parsing: <span style="color:red">' . $original . '</span> as <span style="color:blue">' . join('.', $arr) . "</span><br>\n") : false;
    return join('.', $arr);
}

Answer 12

Here's one that works for all domains, including those with second-level domains like "co.uk"

function strip_subdomains($url){

    # credits to gavingmiller for maintaining this list
    $second_level_domains = file_get_contents("https://raw.githubusercontent.com/gavingmiller/second-level-domains/master/SLDs.csv");

    # presume sld first ...
    $possible_sld = implode('.', array_slice(explode('.', $url), -2));

    # and then verify it (strpos() returns 0 for a match at the start, so compare against false)
    if (strpos($second_level_domains, $possible_sld) !== false){
        return  implode('.', array_slice(explode('.', $url), -3));
    } else {
        return  implode('.', array_slice(explode('.', $url), -2));
    }
}

Looks like there's a duplicate question here: delete-subdomain-from-url-string-if-subdomain-is-found

Answer 13

Very late, but I see that you tagged the question with regex. My function works like a charm; so far I haven't found a URL that fails:

function get_domain_regex($url){
  $pieces = parse_url($url);
  $domain = isset($pieces['host']) ? $pieces['host'] : '';
  if (preg_match('/(?P<domain>[a-z0-9][a-z0-9-]{1,63}\.[a-z.]{2,6})$/i', $domain, $regs)) {
    return $regs['domain'];
  }else{
    return false;
  }
}

If you want one without regex, I have this one, which I am sure I also took from this post:

function get_domain($url){
  $parseUrl = parse_url($url);
  $host = $parseUrl['host'];
  $host_array = explode(".", $host);
  $domain = $host_array[count($host_array)-2] . "." . $host_array[count($host_array)-1];
  return $domain;
}

They both work amazingly, BUT (this took me a while to realize) if the URL doesn't start with http:// or https:// they will fail, so make sure the URL string starts with the protocol.
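
One way to guard against the missing-protocol case is to prepend a scheme before calling parse_url(). The sketch below is only an illustration: host_from_any_url() is a hypothetical helper name, and defaulting to http:// is an assumption.

```php
<?php
// Hypothetical helper: make sure parse_url() sees a scheme, so that it
// fills in the host component instead of treating the input as a path.
function host_from_any_url(string $url): string
{
    $url = trim($url);
    // Prepend a default scheme when the input has none (assumption: http).
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    // parse_url() returns false or null when no host can be extracted.
    return is_string($host) ? $host : '';
}

echo host_from_any_url('www.example.com'), "\n";      // www.example.com
echo host_from_any_url('example.net/foo/bar'), "\n";  // example.net
echo host_from_any_url('https://example.com'), "\n";  // example.com
```

The returned host can then be passed on to either of the functions above in place of their own parse_url() call.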

Answer 14

Simply try this:

   preg_match('/(www\.)?([^.]+\.[^.]+)$/', $yourHost, $matches);

   // group 2 holds the domain without the optional "www." prefix
   echo "domain name is: {$matches[2]}\n";

This works for the majority of domains.

Answer 15

This function returns the domain name without the extension for any given URL, even if you pass a URL without the http:// or https://

You can extend this code

(?:\.co)?(?:\.com)?(?:\.gov)?(?:\.net)?(?:\.org)?(?:\.id)?

with more extensions if you want to handle more second-level domain names.

    function get_domain_name($url){
      $pieces = parse_url($url);
      $domain = isset($pieces['host']) ? $pieces['host'] : $url;
      $domain = strtolower($domain);
      $domain = preg_replace('/\.international$/', '.com', $domain);
      if (preg_match('/(?P<domain>[a-z0-9][a-z0-9-]{1,90}\.[a-z.]{2,6})$/i', $domain, $regs)) {
          if (preg_match('/(.*?)((?:\.co)?(?:\.com)?(?:\.gov)?(?:\.net)?(?:\.org)?(?:\.id)?(?:\.asn)?\.[a-z]{2,6})$/i', $regs['domain'], $matches)) {
              return $matches[1];
          }else  return $regs['domain'];
      }else{
        return $url;
      }
    }

Answer 16

I'm using this to achieve the same goal and it has always worked for me; I hope it helps others.

$url          = 'https://use.fontawesome.com/releases/v5.11.2/css/all.css?ver=2.7.5';
$handle       = pathinfo( parse_url( $url )['host'] )['filename'];
$final_handle = substr( $handle , strpos( $handle , '.' ) + 1 );

print_r($final_handle); // fontawesome

Answer 17

Simplest solution

@preg_replace('#/(.)*#', '', @preg_replace('#^https?://(www\.)?#', '', $url))

Answer 18

Simply try this:

<?php
  $host = $_SERVER['HTTP_HOST'];
  preg_match("/[^.\/]+\.[^.\/]+$/", $host, $matches);
  echo "domain name is: {$matches[0]}\n";
?>
