What is a regex to get a URL token?

Say I have lines like this:

  bunch of other html <a href = "http://domain.com/133742/The_Token_I_Want.zip" more html and stuff
 bunch of other html <a href = "http://domain.com/12345/another_token.zip" more html and stuff
 bunch of other html <a href = "http://domain.com/0981723/YET_ANOTHER_TOKEN.zip" more html and stuff 

Which regular expression matches The_Token_I_Want, another_token, and YET_ANOTHER_TOKEN?

+4
7 answers

Appendix B of RFC 2396 provides a rather elaborate regular expression for splitting a URI into its components, and we can adapt it for your case.

 ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?
                                      #######

This leaves The_Token_I_Want in $6, the subexpression marked with hashes above. (The hashes are not part of the pattern.) See it in action:

 #! /usr/bin/perl

 $_ = "http://domain.com/133742/The_Token_I_Want.zip";
 if (m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?!) {
   print "$6\n";
 }
 else {
   print "no match\n";
 }

Output:

  $ ./prog.pl
 The_Token_I_Want 

UPDATE: I see in the comments that you are using boost::regex, so be sure to escape the backslash in your C++ program.

 #include <boost/foreach.hpp>
 #include <boost/regex.hpp>
 #include <iostream>
 #include <string>

 int main()
 {
   boost::regex token("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*"
                      "/([^.]+)"
                  //    ####### I CAN HAZ HASHDERLINE PLZ
                      "[^?#]*)(\\?([^#]*))?(#(.*))?");

   const char * const urls[] = {
     "http://domain.com/133742/The_Token_I_Want.zip",
     "http://domain.com/12345/another_token.zip",
     "http://domain.com/0981723/YET_ANOTHER_TOKEN.zip",
   };

   BOOST_FOREACH(const char *url, urls) {
     std::cout << url << ":\n";

     std::string t;
     boost::cmatch m;
     if (boost::regex_match(url, m, token))
       t = m[6];            // sixth group holds the token
     else
       t = "<no match>";

     // Print t rather than m[6] so the fallback text is used on failure.
     std::cout << " - " << t << '\n';
   }

   return 0;
 }

Output:

  http://domain.com/133742/The_Token_I_Want.zip:
   - The_Token_I_Want
 http://domain.com/12345/another_token.zip:
   - another_token
 http://domain.com/0981723/YET_ANOTHER_TOKEN.zip:
   - YET_ANOTHER_TOKEN 
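
The program above matches bare URLs, while the lines in the question carry surrounding HTML. Here is a minimal sketch of one way to bridge that gap, assuming std::regex (C++11) in place of boost::regex. The helper name extract_token and the href pattern are my own assumptions, not part of the original program: it first pulls the quoted href value out of the line, then applies the same URI pattern to it.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (an assumption, not from the original answer):
// returns the token from the first href in `line`, or "" if nothing matches.
std::string extract_token(const std::string& line) {
    // Step 1: isolate the quoted href value, allowing spaces around '='.
    static const std::regex href("href\\s*=\\s*\"([^\"]*)\"");
    std::smatch h;
    if (!std::regex_search(line, h, href))
        return "";

    // Step 2: the URI pattern adapted from RFC 2396; group 6 is the token.
    static const std::regex uri(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)"
        "(\\?([^#]*))?(#(.*))?");
    const std::string url = h[1].str();
    std::smatch m;
    if (!std::regex_match(url, m, uri))
        return "";
    return m[6].str();
}
```

Calling extract_token on each input line should yield the three tokens from the question. A real HTML parser is still the more robust choice for step 1.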
+3
source

Try the following:

 /(?:f|ht)tps?:\/{2}(?:www\.)?domain[^\/]+\/([^\/]+)\/([^\/]+)/i

or

 /\w{3,5}:\/{2}(?:w{3}\.)?domain[^\/]+\/([^\/]+)\/([^\/]+)/i

+1
source
 /a href\s*=\s*"http:\/\/domain\.com\/[0-9]+\/([a-zA-Z_]+)\.zip"/ 

You may want to add more characters to the class [a-zA-Z_].

+1
source

You can use:

 (http|ftp)://[[:alnum:]./_]+/([[:alnum:]_-]+)\.[[:alnum:]_-]+ 

( [[:alnum:]_-]+ ) is the capturing group for the token, and in your example it will match The_Token_I_Want . To access this group, use \2 or $2, because ( http|ftp ) is the first group and ( [[:alnum:]_-]+ ) is the second group of the pattern.
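
To illustrate the group numbering, here is a minimal sketch assuming std::regex (C++11), whose default grammar accepts POSIX classes such as [[:alnum:]] inside brackets. The helper name scheme_and_token is hypothetical. The pattern used here escapes the dot before the extension and keeps dots out of the token class, so the extension is not swallowed by group 2.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {group 1, group 2}, i.e. {scheme, token},
// or two empty strings when the pattern does not match anywhere in `text`.
std::pair<std::string, std::string> scheme_and_token(const std::string& text) {
    static const std::regex re(
        "(http|ftp)://[[:alnum:]./_]+/([[:alnum:]_-]+)\\.[[:alnum:]_-]+");
    std::smatch m;
    if (!std::regex_search(text, m, re))
        return {"", ""};
    // m[0] is the whole match; numbered groups start at 1.
    return {m[1].str(), m[2].str()};
}
```

Because regex_search is used rather than regex_match, the surrounding HTML on each line does not need to be stripped first.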

+1
source

First, use an HTML parser and get a DOM. Then grab the anchor elements and loop over them, looking for the hrefs. Do not try to grab the token directly from the string.

Then, the glib answer would be:

 /(The_Token_I_Want.zip)/ 

You may want to give a more precise specification than a single example.

I assume that you are really looking for:

 /([^/]+)$/ 
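
A minimal sketch of this last pattern, assuming std::regex (C++11) and a bare URL as input. Note that the capture is the whole last path segment, extension included, so the hypothetical helper below (last_segment_stem is my own name) strips the extension afterwards.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: last path segment of `url` without its extension,
// or "" when the URL ends in '/' and there is no last segment.
std::string last_segment_stem(const std::string& url) {
    static const std::regex tail("([^/]+)$");   // everything after the last '/'
    std::smatch m;
    if (!std::regex_search(url, m, tail))
        return "";
    std::string name = m[1].str();              // segment including extension
    const std::string::size_type dot = name.rfind('.');
    if (dot != std::string::npos)
        name.erase(dot);                        // drop the extension
    return name;
}
```

This only works once the URL has been isolated, for example from an href attribute obtained through an HTML parser as suggested above.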
0
source
 m/The_Token_I_Want/ 

You need to be more specific about what kind of token it is. A number? A string? Does it repeat? Does it have a particular form or pattern?

0
source

It's probably best to use something smarter than a regex. For example, if you are using C#, you can use the System.Uri class to parse the URL for you.

0
source
