How to split on a regular expression, but keep the separator string?

Question

I have the following URL pattern:

http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en

I would like to get everything up to and including /watch/\d+/.

So far I have:

>>> re.split(r'watch/\d+/', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'supernatural-dub-hollywood-babylon/en']

But this does not include the separator string (the part matched by the pattern, which sits between the domain and the rest of the path). The final result I want is:

http://www.hulu.jp/watch/589851

+4

python regex

David542 May 31, '15 at 20:39

4 answers

As mentioned in another answer, you need to use a group to capture the "glue" between the split strings.

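I do wonder, though, whether you want split() or search() here. Do you know (for example) that the URLs you are given will always contain exactly one /watch/XXX/ part, where XXX is 1 or more digits? If not, split() cuts at every match and you can get more pieces than you expect. For example: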

re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/', 'watch/2342/', 'fdsaafsdf']

If you only care about the first match, you can use search() instead:

result = re.search(r'(watch/\d+/)(.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groups() if result else []

This gives:

('watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')

You can also use named groups, like this:

result = re.search(r'(?P<watchId>watch/\d+/)(?P<path>.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groupdict() if result else {}

This gives:

{'path': 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', 'watchId': 'watch/589851/'}

If you do want to stay with split(), you can pass maxsplit to limit the number of splits:

re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', maxsplit=1)

This gives:

['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf']

Personally, for picking apart URLs I find search() with named groups easier to read, because groupdict() gives you each piece by name.
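To get exactly the string the question asks for, one option is to let search() capture the prefix as well. This is only a minimal sketch: the helper name url_up_to_watch and the group name prefix are illustrative and not part of the answer above.

```python
import re

# Hypothetical helper, not from the original answer.
# The lazy .*? stops at the FIRST /watch/<digits> segment; the trailing
# (?:/|$) makes sure the digits form a complete path segment.
WATCH_RE = re.compile(r'(?P<prefix>.*?/watch/\d+)(?:/|$)')

def url_up_to_watch(url):
    """Return the URL up to and including /watch/<digits>, or None."""
    match = WATCH_RE.search(url)
    return match.group('prefix') if match else None

print(url_up_to_watch(
    'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en'))
```

The same function returns None for a URL without a /watch/<digits> segment, so the caller can decide what to do in that case.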

+4

Adam Parkin 31 '15 22:45

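You know that famous Stack Overflow don't-parse-HTML-with-regex post, right?

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.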

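Well, URLs are not as bad as HTML, but they are still harder to match correctly than they look.

Here is a regex for validating URLs: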

^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$ (+ )

Would you want to write or maintain that yourself? No!

Don't parse URLs with regex... mostly.

There is one simple rule you can rely on:

A path-relative URL must be zero or more path segments separated from each other by "/".

So once a real URL parser has extracted the path, splitting it into segments is as simple as path.split("/").

from urllib.parse import urlparse, urlunparse

myurl = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"

# Run a parser over it
parts = urlparse(myurl)

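# Keep the first two path segments. The path starts with "/", so split("/")
# yields a leading empty string and the slice has to be [:3], not [:2].
new_path = "/".join(parts.path.split("/")[:3])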

# Unparse
urlunparse(parts._replace(path=new_path))
#>>> 'http://www.hulu.jp/watch/589851'
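The fixed slice above assumes that "watch" is always the first path segment. A sketch that looks the segment up instead could look like the following. The function name truncate_after_watch is hypothetical, and it assumes the ID is the all-digit segment immediately after the first "watch".

```python
from urllib.parse import urlparse, urlunparse

def truncate_after_watch(url):
    """Hypothetical helper: cut the URL right after /watch/<digits>."""
    parts = urlparse(url)
    segments = parts.path.split("/")
    try:
        i = segments.index("watch")  # position of the first "watch" segment
    except ValueError:
        return None  # no "watch" segment at all
    # The segment after "watch" must exist and consist of digits only
    if i + 1 >= len(segments) or not segments[i + 1].isdigit():
        return None
    new_path = "/".join(segments[:i + 2])
    # Assumption: anything after the ID (params, query, fragment) is dropped too
    return urlunparse(parts._replace(path=new_path, params="", query="", fragment=""))

print(truncate_after_watch(
    "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"))
```

Dropping the query string and fragment is a design choice made here because the question asks for everything up to the ID; keep them in the _replace() call if they matter to you.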

0

Veedrac Jun 01 '15 at 9:42

source share

You can try the following regex

.*\/watch\/\d+
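In Python, this pattern could be applied with re.match roughly as follows. This is a sketch with illustrative variable names. Note that "/" does not need escaping in a Python regex, and that the greedy .* extends to the last /watch/<digits> in the string, whereas the lazy .*? stops at the first.

```python
import re

url = 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en'

# re.match anchors at the start of the string; group(0) is the whole match
match = re.match(r'.*/watch/\d+', url)
prefix = match.group(0) if match else None
print(prefix)  # http://www.hulu.jp/watch/589851

# With two /watch/<digits>/ segments the greedy and lazy forms differ
two = 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf'
greedy = re.match(r'.*/watch/\d+', two).group(0)
lazy = re.match(r'.*?/watch/\d+', two).group(0)
print(greedy)  # ends in /watch/2342
print(lazy)    # ends in /watch/589851
```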

-1

apgp88 May 31, '15 at 20:41

Kasramvd · Accepted Answer · 2015-05-31T20:41:46+0000

You need to use a capture group:

>>> re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/en']
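To turn that list into the URL the question asks for, the first two pieces can be joined back together. This is a minimal sketch with illustrative variable names; maxsplit=1 is added as a guard in case the pattern occurs more than once.

```python
import re

url = 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en'

# With a capture group, re.split keeps the matched separator in the result:
# pieces[0] is the text before it, pieces[1] is the separator itself.
pieces = re.split(r'(watch/\d+/)', url, maxsplit=1)

if len(pieces) >= 2:
    # Re-attach the separator and drop the trailing slash
    result = (pieces[0] + pieces[1]).rstrip('/')
else:
    # No match: re.split returned the original string as a single item
    result = None

print(result)  # http://www.hulu.jp/watch/589851
```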
