Get the second-level domain of a URL (Java)

I am wondering whether Java has a parser or library for extracting the second-level domain (SLD) from a URL, or, failing that, whether an algorithm or regex exists for this. For example:

    URI uri = new URI("http://www.mydomain.ltd.uk/blah/some/page.html");
    String host = uri.getHost();
    System.out.println(host);

which prints:

    www.mydomain.ltd.uk

Now what I would like to do is reliably identify the SLD component ("ltd.uk"). Any ideas?

Edit: I'm ideally looking for a general solution, so I would match ".uk" in "police.uk", ".co.uk" in "bbc.co.uk" and ".com" in "amazon.com".

Thanks.

+14
java url
Dec 17 '09 at 18:54
11 answers

I don't know your purpose, but the second-level domain may not mean much to you. You probably need to find the public suffix; the domain right below it is what you are looking for.

Apache HttpComponents (HttpClient 4) comes with classes to deal with this:

    org.apache.http.impl.cookie.PublicSuffixFilter
    org.apache.http.impl.cookie.PublicSuffixListParser

You need to download the public suffix list from:

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
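To make the "public suffix plus one label" idea concrete, here is a minimal sketch in plain Java. It does not use the HttpClient classes named above. The class name, the method name and the five hard-coded rules are assumptions for illustration only; a real program would load the full public suffix list, and this sketch ignores wildcard ("*.") and exception ("!") rules entirely.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class RegistrableDomainSketch {
    // Tiny illustrative subset of public suffix rules (assumption, not the real list).
    private static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("com", "uk", "co.uk", "ltd.uk", "com.au"));

    /** Returns the longest matching public suffix plus one label, or null if there is none. */
    static String registrableDomain(String host) {
        String candidate = host.toLowerCase(Locale.ROOT);
        String previous = null; // the candidate with one more label on the left
        while (true) {
            if (SUFFIXES.contains(candidate)) {
                return previous; // null when the host itself is a public suffix
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null; // ran out of labels without matching any rule
            }
            previous = candidate;
            candidate = candidate.substring(dot + 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(registrableDomain("www.mydomain.ltd.uk"));
        System.out.println(registrableDomain("bbc.co.uk"));
        System.out.println(registrableDomain("www.amazon.com"));
    }
}
```

Because labels are stripped from the left, the first rule that matches is also the longest one, which is why "ltd.uk" wins over "uk" for the host in the question.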

+14
Dec 17 '09 at 19:24

Summing up everything mentioned here, the correct solution should be (with Guava):

    InternetDomainName.from(uriHost).topPrivateDomain().toString();

when using Guava to get a private domain name

+11
Sep 19 '12 at 11:51

Having looked at these answers and not being satisfied with them, I used the com.google.common.net.InternetDomainName class to subtract the public parts of the domain name from all of its parts:

    import com.google.common.net.InternetDomainName;
    import java.util.HashSet;
    import java.util.Set;

    // Returns the labels of the host that are not part of its public suffix.
    Set<String> nonePublicDomainParts(String uriHost) {
        InternetDomainName fullDomainName = InternetDomainName.from(uriHost);
        InternetDomainName publicDomainName = fullDomainName.publicSuffix();
        Set<String> nonePublicParts = new HashSet<String>(fullDomainName.parts());
        nonePublicParts.removeAll(publicDomainName.parts());
        return nonePublicParts;
    }

This class is available from Maven in the Guava library:

    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>10.0.1</version>
        <scope>compile</scope>
    </dependency>

Internally this class uses TldPatterns.class, which is package-private and has the list of top-level domains baked into it.

Interestingly, if you look at the source of these classes via the link below, it explicitly lists "police.uk" as a private domain name. This is correct, as police.uk is a private domain controlled by the police; otherwise criminals.police.uk could email you asking for your credit card details in connection with their ongoing investigations into card fraud ;)

http://code.google.com

+10
Dec 24 '11 at 21:28

The selected answer is the best approach. For those of you who don't want to code it yourselves, here is how I did it.

Firstly, either I do not understand org.apache.http.impl.cookie.PublicSuffixFilter, or there is a bug in it.

Basically, if you pass in google.com, it correctly returns false. If you pass in google.com.au, it incorrectly returns true. The bug is in the code that applies the wildcard patterns, e.g. *.au.

Here is the checker code, based on org.apache.http.impl.cookie.PublicSuffixFilter:

    import java.util.Collection;
    import java.util.HashSet;
    import java.util.Set;

    public class TopLevelDomainChecker {
        private Set<String> exceptions;
        private Set<String> suffixes;

        public void setPublicSuffixes(Collection<String> suffixes) {
            this.suffixes = new HashSet<String>(suffixes);
        }

        public void setExceptions(Collection<String> exceptions) {
            this.exceptions = new HashSet<String>(exceptions);
        }

        /**
         * Checks if the domain is a TLD.
         * @param domain the domain to check, with or without a leading dot
         * @return true if the domain matches a suffix rule and no exception rule
         */
        public boolean isTLD(String domain) {
            if (domain.startsWith(".")) domain = domain.substring(1);

            // An exception rule takes priority over any other matching rule.
            // Exceptions are ones that are not a TLD, but would match a pattern rule
            // eg bl.uk is not a TLD, but the rule *.uk means it is. Hence there is an exception rule
            // stating that bl.uk is not a TLD.
            if (this.exceptions != null && this.exceptions.contains(domain)) return false;

            if (this.suffixes == null) return false;
            if (this.suffixes.contains(domain)) return true;

            // Try patterns. ie *.jp means that boo.jp is a TLD
            int nextdot = domain.indexOf('.');
            if (nextdot == -1) return false;
            domain = "*" + domain.substring(nextdot);
            if (this.suffixes.contains(domain)) return true;
            return false;
        }

        public String extractSLD(String domain) {
            String last = domain;
            boolean anySLD = false;
            do {
                if (isTLD(domain)) {
                    if (anySLD) return last;
                    else return "";
                }
                anySLD = true;
                last = domain;
                int nextDot = domain.indexOf(".");
                if (nextDot == -1) return "";
                domain = domain.substring(nextDot + 1);
            } while (domain.length() > 0);
            return "";
        }
    }

And the parser, which I renamed:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.Collection;

    /**
     * Parses the list from <a href="http://publicsuffix.org/">publicsuffix.org</a>.
     * Copied from http://svn.apache.org/repos/asf/httpcomponents/httpclient/trunk/httpclient/src/main/java/org/apache/http/impl/cookie/PublicSuffixListParser.java
     */
    public class TopLevelDomainParser {
        private static final int MAX_LINE_LEN = 256;
        private final TopLevelDomainChecker filter;

        TopLevelDomainParser(TopLevelDomainChecker filter) {
            this.filter = filter;
        }

        public void parse(Reader list) throws IOException {
            Collection<String> rules = new ArrayList<String>();
            Collection<String> exceptions = new ArrayList<String>();
            BufferedReader r = new BufferedReader(list);
            StringBuilder sb = new StringBuilder(256);
            boolean more = true;
            while (more) {
                more = readLine(r, sb);
                String line = sb.toString();
                if (line.length() == 0) continue;
                if (line.startsWith("//")) continue; // entire lines can also be commented using //
                if (line.startsWith(".")) line = line.substring(1); // A leading dot is optional
                // An exclamation mark (!) at the start of a rule marks an exception to a previous wildcard rule
                boolean isException = line.startsWith("!");
                if (isException) line = line.substring(1);

                if (isException) {
                    exceptions.add(line);
                } else {
                    rules.add(line);
                }
            }

            filter.setPublicSuffixes(rules);
            filter.setExceptions(exceptions);
        }

        private boolean readLine(Reader r, StringBuilder sb) throws IOException {
            sb.setLength(0);
            int b;
            boolean hitWhitespace = false;
            while ((b = r.read()) != -1) {
                char c = (char) b;
                if (c == '\n') break;
                // Each line is only read up to the first whitespace
                if (Character.isWhitespace(c)) hitWhitespace = true;
                if (!hitWhitespace) sb.append(c);
                if (sb.length() > MAX_LINE_LEN) throw new IOException("Line too long"); // prevent excess memory usage
            }
            return (b != -1);
        }
    }

And finally, how to use it:

    FileReader fr = new FileReader("effective_tld_names.dat.txt");
    TopLevelDomainChecker checker = new TopLevelDomainChecker();
    TopLevelDomainParser parser = new TopLevelDomainParser(checker);
    parser.parse(fr);

    boolean result;
    result = checker.isTLD("com");             // true
    result = checker.isTLD("com.au");          // true
    result = checker.isTLD("ltd.uk");          // true
    result = checker.isTLD("google.com");      // false
    result = checker.isTLD("google.com.au");   // false
    result = checker.isTLD("metro.tokyo.jp");  // false

    String sld;
    sld = checker.extractSLD("com");                      // ""
    sld = checker.extractSLD("com.au");                   // ""
    sld = checker.extractSLD("google.com");               // "google.com"
    sld = checker.extractSLD("google.com.au");            // "google.com.au"
    sld = checker.extractSLD("www.google.com.au");        // "google.com.au"
    sld = checker.extractSLD("www.google.com");           // "google.com"
    sld = checker.extractSLD("foo.bar.hokkaido.jp");      // "foo.bar.hokkaido.jp"
    sld = checker.extractSLD("moo.foo.bar.hokkaido.jp");  // "foo.bar.hokkaido.jp"

+2
Jan 28 2018-11-11T00:
  • The list mentioned above, plus reading up on Wikipedia for updates, gives you a roughly 98% correct TLD list.
  • Going through http://www.iana.org/domains/root/db/ and clicking on each entry, plus following recent news, gives you the other 2% (e.g. .com.aq and .gov.an).
  • Unfortunately, the big "free web space" providers are another thing to consider, for example the countless *.blogspot.com domains. If you download the Alexa top 100,000 (a free CSV file), you can at least get a good overview of the most common ones, which should cover a certain percentage of these domains (for example, when comparing Alexa rankings with StumbleUpon page views and Delicious bookmarks). Alexa sometimes only ranks the top domain, while Delicious MD5-hashes each URL, so one Alexa entry → several Delicious MD5 hashes.
  • Besides that, sometimes, as in the case of Twitter, what comes after the / also matters if you are looking for uniqueness in order to rank something.

Here is a list from the Alexa top 40,000 with the real TLDs filtered out, to give you a feel for it (this means that Alexa does NOT compute the ranking per domain for the following):

bp.blogspot.com --- espn.go.com --- files.wordpress.com --- abcnews.go.com --- disney.go.com --- troktiko.blogspot.com --- en.wordpress.com --- api.ning.com --- abc.go.com --- 220.181.38.82 --- 213.174.154.20 --- abclocal.go.com --- feedproxy.google.com/~r --- forums.wordpress.com --- googleblog.blogspot.com --- 1.cnm999.com/user/10008 --- 213.174.143.196 --- 92.42.51.201 --- googlewebmastercentral.blogspot.com --- myespn.go.com --- 213.174.143.197 --- 61.132.221.146 --- support.wordpress.com --- dashboard.wordpress.com --- sethgodin.typepad.com --- paygo.17zhifu.com/user/10005 --- go2.wordpress.com --- 1.1.1.1 --- movies.go.com --- home.comcast.net --- googlesystem.blogspot.com --- abcfamily.go.com --- home.spaces.live.com --- 196.1.237.210 --- kaixin001.com/~record --- xhamster.com/user/video --- gold-oil-commodity.blogspot.com --- journeyplanner.tfl.gov.uk/user/XSLT_TRIP_REQUEST2 --- 206.108.48.238 --- blog.wordpress.com --- 67.220.92.21 --- 183.101.80.130 --- 211.94.190.80 --- youtube-global.blogspot.com --- uta-net.com/user/phplib --- cinema3satu.blogspot.com --- 119.147.41.16 --- sites.google.com/site/sites --- kk.iij4u.or.jp/~dyo --- 220.181.6.19 --- toontown.go.com --- signup.wordpress.com --- thesartorialist.blogspot.com --- analytics.blogspot.com --- ss.iij4u.or.jp/~ceh2 --- 67.220.92.23 --- gmailblog.blogspot.com --- 183.99.121.86 --- vgorode.ru/user/create --- 61.132.216.243 --- 217.175.53.72 --- labnol.blogspot.com --- adsense.blogspot.com --- subscribe.wordpress.com --- fimotro.blogspot.com ---
creators.ning.com --- sarkari-naukri.blogspot.com --- search.wordpress.com --- orange-hiyoko.blogspot.com --- cashewmaniakpop.wordpress.com --- pixiehollow.go.com --- adwords.blogspot.com --- 202.53.226.102 --- lorelle.wordpress.com --- homestead.com/~site --- multiply.com/user/signout --- 221.231.148.249 --- 183.101.80.77 --- windowsliveintro.spaces.live.com --- 124.228.254.234 --- streaming-web.blogspot.com --- id.tianya.cn/user/message --- familyfun.go.com --- tro-ma-ktiko.blogspot.com --- about.ning.com --- paygo.17zhifu.com/user/10020 --- tututina.blogspot.com --- toolserver.org/~geohack --- superjob.ru/user/resume --- ejobs.ro/user/locuri-de-Munch --- gnula.blogspot.com --- alles.or.jp/~uir --- chiark.greenend.org.uk/~sgtatham --- woork.blogspot.com --- 88.208.32.218 --- webstreamingmania.blogspot.com --- spaces.live.com --- youtube.com/user/RayWilliamJohnson --- cloob.com/user/login --- asstr.org/~Kristen --- getclicky.com/user/login --- guesshermuff.blogspot.com --- 211.98.70.195 --- 222.73.105.196 --- pp.iij4u.or.jp/~taakii --- unsoloclic.blogspot.com --- photoshopdisasters.blogspot.com --- 218.83.161.253 --- 217.16.18.163 --- 217.16.18.207 --- 217.16.28.104 --- 222.73.105.210 --- youtube.com/user/OldSpice --- hubpages.com/user/new --- pelisdvdripdd.blogspot.com --- 95.143.193.60 --- es.wordpress.com --- 217.16.18.206 --- 61.147.116.146 --- damncoolpics.blogspot.com --- family.go.com --- 81.176.235.162 --- gutteruncensorednewsr.blogspot.com --- terselubung.blogspot.com --- faisalardhy.blogspot.com --- 67.220.92.14 --- goodreads.com/user/show --- 116.228.55.34 --- profile.typepad.com --- kaixin001.com/~truth --- linkbuildersassociated.ning.com --- nicotto.jp/user/mypage --- ritemail.blogspot.com --- hyperboleandahalf.blogspot.com --- carscoop.blogspot.com --- tubemogul.com/user/dash --- press-gr.blogspot.com --- 81.176.235.164 --- soapnet.go.com ---
208.98.30.69 --- trelokouneli.blogspot.com --- help.ning.com --- id.tianya.cn/user/register --- slovari.yandex.ru/~%D0%BA%D0%BD%D0%B8%D0%B3%D0%B8 --- printable-coupons.blogspot.com --- unic77.blogspot.com --- globaleconomicanalysis.blogspot.com --- 183.101.80.68 --- 221.194.33.60 --- doujin-games88.blogspot.com --- magaseek.com/user/SearchProducts --- files.posterous.com --- wwwnew.splinder.com --- kolom-tutorial.blogspot.com --- strobist.blogspot.com --- 67.21.91.73 --- needanarticle.com/user/activity --- forum.moe.gov.om/~moeoman --- milasdaydreams.blogspot.com --- 88.208.17.189 --- 67.220.92.22 --- 115.238.100.211 --- nonews-news.blogspot.com --- testosterona.blog.br --- nn.iij4u.or.jp/~has --- cs.tut.fi/~jkorpela --- youtube.com/user/oldspice --- 67.159.53.25 --- taxalia.blogspot.com --- 208.98.30.70 --- filmesporno.blog.br --- alles-schallundrauch.blogspot.com --- vatera.hu/user/account --- 78.140.136.182 --- us.my.alibaba.com/user/join --- stores.homestead.com --- pes2008editing.blogspot.com --- ocn.ne.jp/~matrix --- adweek.blogs.com --- 115.238.55.94 --- markjaquith.wordpress.com --- k3.dion.ne.jp/~dreamlov --- 38.99.186.222 --- film.tv.it --- android-developers.blogspot.com --- 217.218.110.147 --- kadokado.com/user/login --- bollyvideolinks4u.blogspot.com --- sookyeong.wordpress.com --- 87.101.230.11 --- livecodes.blogspot.com --- 67.220.91.19 --- homepage2.nifty.com/bustered --- pp.iij4u.or.jp/~manga100 --- 110.173.49.202 --- erogamescape.dyndns.org/~ap2 --- cs.berkeley.edu/~lorch --- cakewrecks.blogspot.com --- 59.106.117.185 --- 119.75.213.61 --- id.wordpress.com ---
de.wordpress.com --- telefilmdblink.blogspot.com --- 61.139.105.138 --- multiply.com/user/join --- programseo.blogspot.com --- collectivebias.ning.com --- bablorub.blogspot.com --- thinkexist.com/user/personalAccount --- us.my.alibaba.com/user/sign --- 66.70.56.90 --- getsarkari-naukri.blogspot.com --- 59.106.117.183 --- productreviewplace.ning.com --- support.weebly.com --- kaixin001.com/~lucky --- football-russia.blogspot.com --- magaseek.com/user/ItemDetail --- polprav.blogspot.com --- atlasshrugs2000.typepad.com --- jpn-manga.blogspot.com --- 88.208.32.219 --- google-latlong.blogspot.com --- 59.106.117.188 --- erogamescape.ddo.jp/~ap2 --- 218.87.32.245 --- watchhorrormovies.blogspot.com --- sarotiko.blogspot.com --- googlewebmastercentral-de.blogspot.com --- colmeia.blog.br --- us.my.alibaba.com/user/webatm --- 220.170.79.109 --- darkville.blogspot.com --- youtube.com/user/PiMPDailyDose --- disneymovierewards.go.com --- fukuoka.lg.jp --- 61.147.115.16 --- iisc.ernet.in --- youtube.com/user/HuskyStarcraft --- 202.108.212.211 --- homepage3.nifty.com/otakarando --- 94.77.215.37 --- pitchit.ning.com --- 59.106.117.186 --- thestar.blogs.com --- 1.254.254.254 --- piratesonline.go.com --- animedblink.blogspot.com --- 137.32.44.152 --- eurus.dti.ne.jp/~yoneyama --- state.la.us --- lastminute.is.it --- bangpai.taobao.com/user/groups --- csse.monash.edu.au/~jwb --- jquery-howto.blogspot.com --- sakura.ne.jp/~moesino --- users.skynet.be/mgueury --- saitama.lg.jp --- portaldasfinancas.gov.pt --- bnonline.fi.cr --- 135.125.60.11 --- zhuhai.gd.cn --- kuna.net.kw --- 59.175.213.77 --- 58.218.199.7 --- multiply.com/user/signin --- youtube.com/user/HDstarcraft --- blinklist.com/user/join --- us.my.alibaba.com/user/company --- jptwitterhelp.blogspot.com ---
67.220.92.017 --- 88.208.17.51 --- youtube.com/user/GoogleWebmasterHelp --- 208.53.156.229 --- filmdblink.blogspot.com --- blinklist.com/user/signup --- 3arbtop.blogspot.com --- attivissimo.blogspot.com --- onlinemovie12.blogspot.com --- 98.126.189.86 --- mytvsource.blogspot.com --- blinklist.com/user/login --- googlejapan.blogspot.com --- 76.73.65.166 --- gutteruncensorednewsb.blogspot.com --- issuu.com/user/download --- 86.51.174.18 --- 88.208.17.120 --- profile.china.alibaba.com/user/admin --- jntuworldportal.blogspot.com --- sz.js.cn --- disneymovieclub.go.com --- a1.com.mk --- dd.iij4u.or.jp/~madonna --- rr.iij4u.or.jp/~plasma --- mlmlaunchformula.ning.com-1-1.78.7.151 --- blogdelatele.blogspot.com --- googlemobile.blogspot.com --- 78.109.199.240 --- wsu.edu/~brians --- internapoli-city.blogspot.com --- hh.iij4u.or.jp/~DMT --- kaixin001.com/~house --- 61.155.11.14 --- youtube.com/user/SHAYTARDS --- turbobit.net/user/files --- qjy168.com/user/do --- hubpages.com/user/finished --- upload2.dyndns.org --- f32.aaa.livedoor.jp/~azusa --- naruto-spoilers.blogspot.com --- 205.209.140.195 --- 193.227.20.21 --- adsenseforfeeds.blogspot.com --- group.ameba.jp/user/groups

+1
Aug 14 '10 at 23:15

I have no answer for your specific case, and Jonathan's comment indicates that you should probably rephrase your question.

However, I suggest taking a look at the Reference class of Restlet. It has a ton of useful methods. And since Restlet is open source, you will not need to use the entire library: you can download the source and add just that one class to your project.

0
Dec 17 '09 at 19:02

This is the library you want: publicSuffix

0
Aug 18 2018-11-21T00:

1.

The nonePublicDomainParts method from simbo1905's answer needs to be fixed because of TLDs that contain ".", for example "com.ac":

input: "com.abc.com.ac"

output: "abc"

the correct output is "com.abc".

To get the SLD, you can cut the TLD off the given domain using the publicSuffix() method.

2.

You cannot use a set because of domains that contain the same parts, for example:

input: part1.part2.part1.TLD

output: part1, part2

correct output: part1, part2, part1 or in the form of part1.part2.part1

So instead of Set<String>, use List<String>.
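Both points can be sketched together in plain Java. This is only an illustration: the class and method names are hypothetical, and the two label lists are passed in directly so the example runs without Guava. With Guava they would presumably come from parts() on the full name and on publicSuffix(), as in the answer being corrected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NonPublicPartsSketch {
    /**
     * Returns the labels that precede the public suffix, in their original order.
     * Only the trailing labels are cut off, so repeated labels survive.
     */
    static List<String> nonPublicParts(List<String> hostParts, List<String> suffixParts) {
        int keep = hostParts.size() - suffixParts.size();
        if (keep < 0 || !hostParts.subList(keep, hostParts.size()).equals(suffixParts)) {
            throw new IllegalArgumentException("host does not end with the given suffix");
        }
        return new ArrayList<>(hostParts.subList(0, keep));
    }

    public static void main(String[] args) {
        List<String> host = Arrays.asList("com", "abc", "com", "ac");
        List<String> suffix = Arrays.asList("com", "ac");
        // Joins the remaining labels back into a dotted name.
        System.out.println(String.join(".", nonPublicParts(host, suffix)));
    }
}
```

Cutting by position rather than by value is what keeps the leading "com" in "com.abc.com.ac" and the second "part1" in "part1.part2.part1.TLD".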

0
Apr 03 2018-12-12T00:
    import com.google.common.net.InternetDomainName;
    import java.util.Iterator;

    // Despite its name, this returns the top private domain (public suffix plus one label).
    public static String getTopLevelDomain(String uri) {
        InternetDomainName fullDomainName = InternetDomainName.from(uri);
        InternetDomainName publicDomainName = fullDomainName.topPrivateDomain();
        String topDomain = "";
        Iterator<String> it = publicDomainName.parts().iterator();
        while (it.hasNext()) {
            String part = it.next();
            if (!topDomain.isEmpty()) topDomain += ".";
            topDomain += part;
        }
        return topDomain;
    }

Just pass in a domain and you will get its top private domain back. Download the JAR file from http://code.google.com/p/guava-libraries/

0
Jun 27 '13 at 9:59

Dnspy is another flexible alternative to publicsuffix lib.

0
Apr 15 '14 at 20:35

If you need the second-level domain, you can split the string on "." and take the last two parts. Of course, this assumes that you always have a second-level domain that is not site-specific (since that seems to be what you want).
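A minimal sketch of this split-and-take-two approach, using only the standard library (the class and method names are made up for illustration). It also shows the limitation the question is about: for hosts under multi-label suffixes such as "co.uk" the result is the suffix itself.

```java
final class LastTwoLabelsSketch {
    /** Naive approach: keep only the last two dot-separated labels of the host. */
    static String lastTwoLabels(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host; // nothing to strip
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(lastTwoLabels("www.amazon.com")); // fine for single-label suffixes
        System.out.println(lastTwoLabels("www.bbc.co.uk"));  // yields the suffix, not the site
    }
}
```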

-1
Dec 17 '09 at 19:00


