Regex matches resource path from URL

So that everyone shares the same vocabulary, the general structure of a URL is as follows:

http     :// www.a.com   / path/to/resource.html ? query=value # fragment
{scheme} :// {authority} / {path}                ? {query}     # {fragment}

The path consists of a directory path and a resource: in path/to/resource.html, the path is path/to/ and the resource is resource.html.

Poor, nasty and brutish:
HTML, as it is found in the wild, can be poor, nasty and brutish, although quite often far from short. In this poor, nasty and brutish world there are live links that can themselves be poor, nasty and brutish, even though URLs are supposed to adhere to standards. With that in mind, I present to you the problem ...

Problem:

I am trying to create a regex that removes the resource from a URL path, which is necessary when a link on a web page is a relative path. For instance:

  • I visit www.domain.com/path/to/page1.html .
  • There is a relative link to /page2.html
  • Remove /page1.html from the URL
  • Add /page2.html to www.domain.com/path/to

Result: www.domain.com/path/to/page2.html

I'm stuck in step 3!

I have isolated the path and resource together, but now I want to separate them. The regular expression that I tried is: \z([^\/]\.[^\/])

In C#, the same regular expression is: "\\z([^/]\\.[^/])"

Translated into English, the regular expression is meant to say: match, at the end of the string, all characters separated by a period, as long as those characters are not slashes.

I tried this regex, but it fails. What is the correct expression to achieve the specified result?

Here are some examples:

/path/to/resource.html => /path/to/ and resource.html
/pa.th/to/resource.html => /pa.th/to/ and resource.html
/path/to/resource.html/ => /path/to/resource.html/
/*I#$>/78zxdc.78&(!~ => /*I#$>/ and 78zxdc.78&(!~

Thank you for your help!

5 answers

System.Uri

    var uri = new Uri("http://www.domain.com/path/to/page1.html?query=value#fragment");
    Console.WriteLine(uri.Scheme);        // http
    Console.WriteLine(uri.Host);          // www.domain.com
    Console.WriteLine(uri.AbsolutePath);  // /path/to/page1.html
    Console.WriteLine(uri.PathAndQuery);  // /path/to/page1.html?query=value
    Console.WriteLine(uri.Query);         // ?query=value
    Console.WriteLine(uri.Fragment);      // #fragment
    Console.WriteLine(uri.Segments[uri.Segments.Length - 1]);  // page1.html

    for (var i = 0; i < uri.Segments.Length; i++)
    {
        Console.WriteLine("{0}: {1}", i, uri.Segments[i]);
    }
    /* Output
    0: /
    1: path/
    2: to/
    3: page1.html
    */
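For the underlying goal (resolving a link against the page it appears on), the two-argument Uri constructor may already be enough. The following is a minimal sketch written as a C# 9+ top-level program; the helper name ResolveLink is made up for illustration. Note that under standard URL resolution a link with a leading slash, such as /page2.html, resolves against the site root, so the first call uses page2.html without the slash.

```csharp
using System;

// Hypothetical helper: resolve a link found on a page against that page's URL.
static string ResolveLink(string pageUrl, string link)
{
    var baseUri = new Uri(pageUrl, UriKind.Absolute);
    // Uri(Uri, string) applies standard relative-reference resolution,
    // which replaces the last path segment (the resource) of the base.
    return new Uri(baseUri, link).AbsoluteUri;
}

// Relative to the current directory:
Console.WriteLine(ResolveLink("http://www.domain.com/path/to/page1.html", "page2.html"));
// http://www.domain.com/path/to/page2.html

// A leading slash means "relative to the site root":
Console.WriteLine(ResolveLink("http://www.domain.com/path/to/page1.html", "/page2.html"));
// http://www.domain.com/page2.html
```

With this approach there is no need to strip page1.html by hand; the resolution step does it.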

I can't imagine why you would want to use regular expressions for this when the Uri class already does almost all of the work for you. And to get the final part (i.e., to separate the resource from the path), you can simply use String.LastIndexOf and String.Substring. For instance:

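    Uri myUri;
    if (!Uri.TryCreate(linkString, UriKind.RelativeOrAbsolute, out myUri) || !myUri.IsAbsoluteUri)
    {
        // Some kind of error: the link could not be parsed, or it is relative
        // (AbsolutePath throws on a relative Uri). Handle it and stop here.
    }
    // Everything after the last '/' in the path is the resource.
    int pos = myUri.AbsolutePath.LastIndexOf('/');
    ++pos;
    string resource = myUri.AbsolutePath.Substring(pos);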

I have little doubt that you could do this with regex. I doubt, though, that it would be a win. As you said, the URLs that you find when you crawl the Web can be pretty bad. My crawler spends significant effort normalizing some really wild URLs. I regularly come across things like http://example.com/dir/subdir/subsubdir/../../dir///moretrash/resource.html . And you wouldn't believe (or perhaps you would, if you crawl the Web) the strange escaping that I see. The Uri class is good at parsing a URL and then normalizing it. Unescaping is something you just can't do with regex.

My experience is that the time to create a Uri instance is dwarfed by the time taken to normalize the URLs: unescaping, trimming fragments and session identifiers, identifying and avoiding proxies and crawler traps, removing extraneous slashes, resolving path navigation (i.e. /./ and /../), etc. I just can't see where using a regex, even if it were faster than Uri.TryCreate, would improve the running time. And I seriously doubt that a regex can do as good a job as Uri.TryCreate at parsing the URLs that I find in the wild.
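As a small illustration of the kind of normalization described here, the following sketch (a C# 9+ top-level program) feeds the messy URL from above to Uri and unescapes a made-up percent-encoded path. It only demonstrates the /../ handling and unescaping; the other clean-up steps the answer lists (session identifiers, crawler traps, extra slashes) are not covered.

```csharp
using System;

// The messy URL quoted in the answer above.
var messy = new Uri("http://example.com/dir/subdir/subsubdir/../../dir///moretrash/resource.html");

// Uri resolves the /../ segments while parsing, so none remain in the path.
Console.WriteLine(messy.AbsolutePath);

// The resource is still simply the last segment.
Console.WriteLine(messy.Segments[messy.Segments.Length - 1]);  // resource.html

// Percent-decoding (the sample path is made up for illustration).
Console.WriteLine(Uri.UnescapeDataString("/my%20file%2Ehtml"));  // /my file.html
```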


To extract the resource portion of a URI, you can use:

    ^                # matches start of string
    .*               # greedy match up to the last '/'
    \/               # literal '/'
    (                # start capture of resource part
      [^\/\?\#]*     # zero or more chars except '/', '?', and '#'
    )                # end capture
    (?:              # start of optional group - query part
      \?             # literal '?' for optional query
      .+?            # non-greedy match for any chars
    )?               # end of optional group
    (?:              # start of optional group - fragment part
      \#             # literal '#' for optional fragment
      .+?            # non-greedy match for any chars
    )?               # end of optional group
    $                # matches end of string
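To try this commented pattern from C#, one option is RegexOptions.IgnorePatternWhitespace, which lets the layout and the # comments stay inside the pattern string. The sketch below is a C# 9+ top-level program; the names resourcePattern and ExtractResource are made up for illustration, and the helper returns an empty string when nothing matches.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as above, kept readable via IgnorePatternWhitespace.
var resourcePattern = new Regex(@"
    ^               # start of string
    .*              # greedy match up to the last '/'
    \/              # literal '/'
    ([^\/\?\#]*)    # capture: resource part (no '/', '?' or '#')
    (?: \? .+? )?   # optional query part
    (?: \# .+? )?   # optional fragment part
    $               # end of string",
    RegexOptions.IgnorePatternWhitespace);

// Hypothetical helper: returns the captured resource, or "" if there is no match.
string ExtractResource(string url)
{
    Match m = resourcePattern.Match(url);
    return m.Success ? m.Groups[1].Value : string.Empty;
}

Console.WriteLine(ExtractResource("http://www.domain.com/path/to/page1.html?query=value#fragment"));
// page1.html
```

One caveat worth checking against your data: because .* is greedy, a '/' inside the query or fragment (for example ?next=/home) would move the split point into the query.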

I think you might be better off splitting the string on '/' rather than getting stuck on a regex. Also, have you looked at this: http://msdn.microsoft.com/en-us/library/ms952653.aspx


Does your regex engine support lookaheads? If so, you can use one to look ahead at (and thereby exclude from the match) the non-slash characters at the end:

 .*/(?=[^/]*$) 

Alternatively, use capture groups; the path will be group 1 and the resource group 2:

 (.*/)([^/]*$) 
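Both patterns can be tried in .NET as follows. This is a minimal sketch written as a C# 9+ top-level program; the helper name SplitWithGroups is made up for illustration, and the inputs are the examples from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: group 1 is the path, group 2 the resource.
static (string Dir, string Resource) SplitWithGroups(string urlPath)
{
    Match m = Regex.Match(urlPath, @"(.*/)([^/]*$)");
    // Without any slash there is no path part; treat the whole input as the resource.
    return m.Success ? (m.Groups[1].Value, m.Groups[2].Value) : ("", urlPath);
}

foreach (var p in new[] { "/path/to/resource.html", "/pa.th/to/resource.html", "/path/to/resource.html/" })
{
    var (dir, resource) = SplitWithGroups(p);
    Console.WriteLine($"{p} => '{dir}' and '{resource}'");
}

// The lookahead variant matches only the path part.
Console.WriteLine(Regex.Match("/path/to/resource.html", @".*/(?=[^/]*$)").Value);  // /path/to/
```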

A non-regex algorithm would be:

  • Save pos, the index of the last slash
  • Take the substring from 0 with length pos+1
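The two steps above can be sketched in C# as follows (a C# 9+ top-level program; the helper name SplitAtLastSlash is made up for illustration). It also returns the remainder after the slash as the resource.

```csharp
using System;

// Hypothetical helper implementing the two steps above.
static (string Dir, string Resource) SplitAtLastSlash(string urlPath)
{
    int pos = urlPath.LastIndexOf('/');            // -1 when there is no slash
    string dir = urlPath.Substring(0, pos + 1);    // from 0, length pos+1 (keeps the slash)
    string resource = urlPath.Substring(pos + 1);  // whatever follows the last slash
    return (dir, resource);
}

var (dir, resource) = SplitAtLastSlash("/pa.th/to/resource.html");
Console.WriteLine($"'{dir}' and '{resource}'");  // '/pa.th/to/' and 'resource.html'
```

When the input contains no slash, pos is -1, so the path part comes out empty and the whole input is treated as the resource.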

Note that I deliberately ignored the dots here. What purpose do they serve? In HTML, if you have a path that does not end with a slash, relative links are resolved relative to the parent of the last segment. So, for the purposes of this discussion, a last segment without a dot is basically a resource without an extension.
