What is a regex to get a URL token?

Question

Say I have lines like this:

 bunch of other html <a href = "http://domain.com/133742/The_Token_I_Want.zip" more html and stuff
 bunch of other html <a href = "http://domain.com/12345/another_token.zip" more html and stuff
 bunch of other html <a href = "http://domain.com/0981723/YET_ANOTHER_TOKEN.zip" more html and stuff

Which regular expression matches The_Token_I_Want, another_token, and YET_ANOTHER_TOKEN?

+4

c++ boost regex

Nick strupat Aug 15 '10 at 20:31

7 answers

Try the following:

 /(?:f|ht)tps?:\/{2}(?:www\.)?domain[^\/]+\/([^\/]+)\/([^\/]+)/

or

 /\w{3,5}:\/{2}(?:w{3}\.)?domain[^\/]+\/([^\/]+)\/([^\/]+)/

+1

Jet Aug 15 '10 at 20:45

 /a href\s*=\s*"http:\/\/domain\.com\/[0-9]+\/([a-zA-Z_]+)\.zip"/

You may want to add more characters to the class [a-zA-Z_]+
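
A minimal C++ sketch of this approach, assuming the standard library's std::regex in place of boost::regex and a hypothetical helper named extract_token. The pattern escapes the dots and allows optional whitespace around `=`, because the sample lines in the question write `href = "`. Adding digits and a hyphen to the character class is only an illustration of "more characters".

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the file name without ".zip" from the first
// matching anchor in `line`, or an empty string when nothing matches.
std::string extract_token(const std::string& line) {
    // Raw string literal, so the backslashes need no doubling.
    static const std::regex pattern(
        R"re(<a\s+href\s*=\s*"http://domain\.com/[0-9]+/([A-Za-z0-9_-]+)\.zip")re");
    std::smatch m;
    if (!std::regex_search(line, m, pattern)) {
        return "";
    }
    assert(m.size() == 2);  // whole match plus one capture group
    return m[1].str();
}
```

regex_search is used rather than regex_match because the anchor sits in the middle of a longer HTML line; regex_match would require the pattern to cover the whole line.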

+1

Thomas Aug 15 '10 at 20:46

You can use:

 (http|ftp)://[[:alnum:]./_]+/([[:alnum:]_-]+)\.[[:alnum:]_-]+

([[:alnum:]_-]+) is the capturing group for the part you want; in your example it will hold The_Token_I_Want. To access this group, use \2 or $2, because (http|ftp) is the first group and ([[:alnum:]_-]+) is the second group of the pattern.
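
To illustrate the group numbering, here is a small C++ sketch. It assumes std::regex rather than boost::regex (its default ECMAScript grammar accepts POSIX classes such as [[:alnum:]] inside brackets) and a hypothetical helper named token_from_url. The variant used here escapes the dot before the extension and keeps the dot out of the token class, so the capture stops at the end of the file name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: sub-match 1 is the scheme, sub-match 2 is the token.
// Returns an empty string when no URL of this shape is found in `text`.
std::string token_from_url(const std::string& text) {
    static const std::regex pattern(
        R"((http|ftp)://[[:alnum:]./_]+/([[:alnum:]_-]+)\.[[:alnum:]_-]+)");
    std::smatch m;
    if (!std::regex_search(text, m, pattern)) {
        return "";
    }
    assert(m.size() == 3);  // m[0] whole match, m[1] scheme, m[2] token
    return m[2].str();
}
```

Writing the scheme as a non-capturing group, (?:http|ftp), would make the token sub-match 1 instead of 2.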

+1

M. Sadeq HE Aug 15 '10 at 20:49

First, use an HTML parser to get a DOM. Then grab the anchor elements and loop over them, reading each href. Don't try to grab the token directly from the raw HTML string.

Then:

The glib answer would be:

 /(The_Token_I_Want.zip)/

You might want to be more specific than a single example.

I assume that what you are really looking for is:

 /([^/]+)$/

0

Quentin Aug 15 '10 at 20:33

 m/The_Token_I_Want/

You need to be more specific about what the token is. A number? A string? Does it repeat? Does it have a fixed form or pattern?

0

Shaggy frog Aug 15 '10 at 20:34

It's probably best to use something smarter than a regex. For example, if you are using C#, you can use the System.Uri class to parse the URL.
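
System.Uri is specific to .NET. For the C++ setting of the question, a comparable non-regex idea is to split the URL with plain string operations. The sketch below is only an illustration under that assumption: last_segment_stem is a hypothetical helper, and it expects a bare URL that has already been pulled out of the href attribute, not a full HTML line.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the last path segment of `url` without its
// extension, e.g. ".../133742/The_Token_I_Want.zip" -> "The_Token_I_Want".
std::string last_segment_stem(const std::string& url) {
    // Drop any query string or fragment first (npos keeps the whole string).
    const std::string path = url.substr(0, url.find_first_of("?#"));

    // Keep what follows the final '/'.
    const std::string::size_type slash = path.rfind('/');
    const std::string name =
        (slash == std::string::npos) ? path : path.substr(slash + 1);
    assert(name.find('/') == std::string::npos);  // a single segment remains

    // Strip the extension, if there is one, at the last '.'.
    return name.substr(0, name.rfind('.'));
}
```

This avoids regex escaping issues entirely, at the cost of not validating that the input is a well-formed URL.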

0

Jesse collins Aug 15 '10 at 20:36

Greg bacon · Accepted Answer · 2010-08-15T20:41:13+0000

Appendix B of RFC 2396 gives a fairly involved regular expression for splitting a URI into its components, and we can adapt it for your case.

 ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?
                                      #######

This leaves The_Token_I_Want in $6, which is the "hashderlined" subexpression above, that is, the one underlined with hashes. (Note that the hashes are not part of the pattern.) See it in action:

 #! /usr/bin/perl

 $_ = "http://domain.com/133742/The_Token_I_Want.zip";
 if (m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?!) {
   print "$6\n";
 }
 else {
   print "no match\n";
 }

Output:

 $ ./prog.pl
 The_Token_I_Want

UPDATE: I see in the comments that you are using boost::regex, so remember to escape the backslash in your C++ program.

 #include <boost/foreach.hpp>
 #include <boost/regex.hpp>
 #include <iostream>
 #include <string>

 int main()
 {
   // Same pattern as above; the backslash before '?' is doubled for C++.
   boost::regex token("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*"
                      "/([^.]+)"
                   //   ####### I CAN HAZ HASHDERLINE PLZ
                      "[^?#]*)(\\?([^#]*))?(#(.*))?");

   const char * const urls[] = {
     "http://domain.com/133742/The_Token_I_Want.zip",
     "http://domain.com/12345/another_token.zip",
     "http://domain.com/0981723/YET_ANOTHER_TOKEN.zip",
   };

   BOOST_FOREACH(const char *url, urls) {
     std::cout << url << ":\n";

     // regex_match requires the pattern to cover the whole URL.
     std::string t;
     boost::cmatch m;
     if (boost::regex_match(url, m, token))
       t = m[6].str();
     else
       t = "<no match>";

     // Print t, so a failed match reports "<no match>".
     std::cout << "  - " << t << '\n';
   }

   return 0;
 }

Output:

 http://domain.com/133742/The_Token_I_Want.zip:
   - The_Token_I_Want
 http://domain.com/12345/another_token.zip:
   - another_token
 http://domain.com/0981723/YET_ANOTHER_TOKEN.zip:
   - YET_ANOTHER_TOKEN
