It is not possible to get the domain name without a TLD list to compare against, because there are many cases with exactly the same structure and length:
nas.db.de (Subdomain)
bbc.co.uk (Top-Level-Domain)
www.uk.com (Subdomain)
big.uk.com (Second-Level-Domain)
Mozilla's Public Suffix List should be the best option as it is used by all major browsers:
https://publicsuffix.org/list/public_suffix_list.dat
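To see why the structure alone is not enough, here is a minimal sketch of a naive approach that simply keeps the last two labels. The name naive_domain() is hypothetical and only for illustration; it is not part of the function below.

```php
<?php
// Hypothetical helper for illustration only: keep the last two labels.
function naive_domain($host) {
    $labels = explode('.', $host);
    return implode('.', array_slice($labels, -2));
}

echo naive_domain('nas.db.de'), "\n"; // db.de (correct)
echo naive_domain('bbc.co.uk'), "\n"; // co.uk (wrong, this is a public suffix)
```

Both hosts have three labels of similar length, so only a lookup against the list can tell them apart.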
Feel free to use my function:
function tld_list($cache_dir = null) {
    // we use the system temp directory (e.g. "/tmp") if $cache_dir is not set
    $cache_dir = isset($cache_dir) ? $cache_dir : sys_get_temp_dir();
    $lock_dir = $cache_dir . '/public_suffix_list_lock/';
    $list_dir = $cache_dir . '/public_suffix_list/';
    // refresh the list every 30 days
    if (file_exists($list_dir) && @filemtime($list_dir) + 2592000 > time()) {
        return $list_dir;
    }
    // use mkdir() as an exclusive lock to avoid race conditions
    if (!file_exists($lock_dir) && @mkdir($lock_dir)) {
        // read from source
        $list = @fopen('https://publicsuffix.org/list/public_suffix_list.dat', 'r');
        if ($list) {
            // the list is older than 30 days so delete everything first
            if (file_exists($list_dir)) {
                foreach (glob($list_dir . '*') as $filename) {
                    unlink($filename);
                }
                rmdir($list_dir);
            }
            // now set list directory with new timestamp
            mkdir($list_dir);
            // read line-by-line to avoid high memory usage
            while (($line = fgets($list)) !== false) {
                // remove linebreak and surrounding whitespace
                $line = trim($line);
                // skip comments and empty lines
                if ($line === '' || $line[0] == '/') {
                    continue;
                }
                // remove wildcard
                if (substr($line, 0, 2) == '*.') {
                    $line = substr($line, 2);
                }
                // remove exclamation mark
                if ($line[0] == '!') {
                    $line = substr($line, 1);
                }
                // reverse TLD
                $line = implode('.', array_reverse(explode('.', $line)));
                // we split the TLD list to reduce memory usage
                touch($list_dir . $line);
            }
            fclose($list);
        }
        @rmdir($lock_dir);
    }
    // repair stale locks (should never happen)
    if (file_exists($lock_dir) && mt_rand(0, 100) == 0 && @filemtime($lock_dir) + 86400 < time()) {
        @rmdir($lock_dir);
    }
    return $list_dir;
}
function get_domain($url = null) {
    // obtain location of public suffix list
    $tld_dir = tld_list();
    // no url = our own host
    $url = isset($url) ? $url : $_SERVER['SERVER_NAME'];
    // add missing scheme ftp:// http:// ftps:// https://
    $url = !isset($url[5]) || ($url[3] != ':' && $url[4] != ':' && $url[5] != ':') ? 'http://' . $url : $url;
    // remove "/path/file.html", "/:80", etc. (cast because parse_url() may return null or false)
    $url = (string) parse_url($url, PHP_URL_HOST);
    // replace absolute domain name by relative (http://www.dns-sd.org/TrailingDotsInDomainNames.html)
    $url = trim($url, '.');
    // check if TLD exists
    $url = explode('.', $url);
    $parts = array_reverse($url);
    foreach ($parts as $key => $part) {
        $tld = implode('.', $parts);
        if (file_exists($tld_dir . $tld)) {
            // $key == 0 means the whole input is itself a TLD
            return !$key ? '' : implode('.', array_slice($url, $key - 1));
        }
        // remove last part
        array_pop($parts);
    }
    return '';
}
What makes it special:
- it accepts any input such as URLs, hostnames or domains, with or without a scheme
- the list is read line by line to avoid high memory usage
- it creates a new file per TLD in a cache folder, so get_domain() only needs to check through file_exists() whether the TLD exists; it does not need to load a huge database on every request like TLDExtract does
- the list will be automatically updated every 30 days
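
The file-per-TLD lookup can be tried without downloading anything. The following is a minimal sketch under stated assumptions: the cache directory is built by hand with only three suffixes, and demo_lookup() and $demo_dir are hypothetical names used for illustration. It mirrors the idea of get_domain() (longest candidate first, reversed labels as file names) but is not a replacement for it.

```php
<?php
// Build a tiny demo cache by hand (assumption: only these three suffixes).
// File names are the suffixes with their labels reversed, e.g. "uk.co" for co.uk.
$demo_dir = sys_get_temp_dir() . '/psl_demo_' . uniqid() . '/';
mkdir($demo_dir);
foreach (array('com', 'uk', 'uk.co') as $reversed_suffix) {
    touch($demo_dir . $reversed_suffix);
}

// Hypothetical helper: try the longest candidate first, then drop one label at a time.
function demo_lookup($host, $dir) {
    $labels = explode('.', $host);
    $parts = array_reverse($labels);
    for ($key = 0, $n = count($labels); $key < $n; $key++) {
        if (file_exists($dir . implode('.', $parts))) {
            // $key == 0 means the host itself is a suffix, so there is no domain
            return $key ? implode('.', array_slice($labels, $key - 1)) : '';
        }
        array_pop($parts);
    }
    return '';
}

echo demo_lookup('www.bbc.co.uk', $demo_dir), "\n"; // bbc.co.uk
echo demo_lookup('co.uk', $demo_dir), "\n";         // empty string
```

Each lookup costs at most one file_exists() call per label, and no list has to be loaded into memory. The demo directory is left in the temp folder and can be deleted afterwards.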
Test:
$urls = array(
    'http://www.example.com',// example.com
    'http://subdomain.example.com',// example.com
    'http://www.example.uk.com',// example.uk.com
    'http://www.example.co.uk',// example.co.uk
    'http://www.example.com.ac',// example.com.ac
    'http://example.com.ac',// example.com.ac
    'http://www.example.accident-prevention.aero',// example.accident-prevention.aero
    'http://www.example.sub.ar',// sub.ar
    'http://www.congresodelalengua3.ar',// congresodelalengua3.ar
    'http://congresodelalengua3.ar',// congresodelalengua3.ar
    'http://www.example.pvt.k12.ma.us',// example.pvt.k12.ma.us
    'http://www.example.lib.wy.us',// example.lib.wy.us
    'com',// empty
    '.com',// empty
    'http://big.uk.com',// big.uk.com
    'uk.com',// empty
    'www.uk.com',// www.uk.com
    '.uk.com',// empty
    'stackoverflow.com',// stackoverflow.com
    '.foobarfoo',// empty
    '',// empty
    false,// empty
    ' ',// empty
    1,// empty
    'a',// empty
);
Recent version with explanations (German):
http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm