Python regex to convert YouTube URLs into embedded YouTube videos

I am creating a regular expression so that I can find YouTube links (maybe several) in a piece of HTML text posted by the user.

I am currently using the following regular expression to change "http://www.youtube.com/watch?v=-JyZLS2IhkQ" to display the corresponding YouTube video:

return re.compile('(http(s|):\/\/|)(www.|)youtube.(com|nl)\/watch\?v\=([a-zA-Z0-9-_=]+)').sub(tag, value)

(where the variable "tag" is the HTML snippet that embeds the video and "value" is the text posted by the user)

Now this works ... until the URL looks like this:

"http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature..."

I hope you can help me figure out how to also match the "&feature..." part so that it disappears.

HTML example:

No replies to this post..

Youtube vid:

http://www.youtube.com/watch?v=-JyZLS2IhkQ

More blabla

Thanks for your thoughts, much appreciated

Stephen


The following function checks a YouTube URL and captures the 11-character video ID in group 6:

import re

def youtube_url_validation(url):
    # Group 6 captures the 11-character video ID.
    youtube_regex = (
        r'(https?://)?(www\.)?'
        r'(youtube|youtu|youtube-nocookie)\.(com|be)/'
        r'(watch\?v=|embed/|v/|.+\?v=)?([^&=%\?]{11})')

    # Return the match object (or None) so the caller can read any group.
    return re.match(youtube_regex, url)

A quick test against several URL forms:

youtube_urls_test = [
    'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
    'http://youtu.be/5Y6HSHwhVlY', 
    'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
    'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&hl=en_US',
    'http://www.youtube.com/',
    'http://www.youtube.com/?feature=ytca']


for url in youtube_urls_test:
    m = youtube_url_validation(url)
    if m:
        print('OK {}'.format(url))
        print(m.groups())
        print(m.group(6))
    else:
        print('FAIL {}'.format(url))

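You should write your regular expressions as raw strings.

You don't have to escape every character that looks special, only the ones that actually are.

Instead of an empty alternative such as ((foo|)) to make something optional, you can use ?.

If you want to include - in a character set, you have to escape it or place it at the start or end of the set.

You can use special character classes such as \w (equals [a-zA-Z0-9_] for ASCII text) to shorten the regex.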

r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([-\w]+)'
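As a quick check, here is a minimal sketch that runs this simplified pattern over the question's URL with an extra parameter appended. The feature=related tail is a made-up example value. The video ID is captured, but the trailing parameter is not consumed, so it would be left behind in the text after a substitution.

```python
import re

pattern = re.compile(r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([-\w]+)')

# The feature=related tail is a made-up example parameter.
url = 'http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature=related'
m = pattern.match(url)

video_id = m.group(4)      # the v argument
leftover = url[m.end():]   # whatever the pattern did not consume
print(video_id)
print(leftover)
```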

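Now, to match the whole URL, you have to think about what can and cannot follow it in the input. Put that into a lookahead group (you don't want to consume it).

In this example, anything other than -, =, %, & and alphanumeric characters ends the URL (I was too lazy to think about it any harder).

Everything between the v argument and the end of the URL is consumed non-greedily by .*?.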

r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)(&.*?)?(?=[^-\w&=%])'

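Still, I would not put too much faith in this general solution. User input is notoriously hard to parse robustly.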
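A minimal sketch of how the lookahead pattern might be used with sub on a piece of posted HTML. Two things here are assumptions rather than part of the answer: the embed_tag helper and its iframe markup are hypothetical, and "|$" has been added to the lookahead so that a URL at the very end of the input also matches (the original lookahead needs at least one following character).

```python
import re

# Pattern from above, with "|$" added to the lookahead (an assumption of this
# sketch) so that a URL at the very end of the input is matched as well.
YOUTUBE_RE = re.compile(
    r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)(&.*?)?(?=[^-\w&=%]|$)')

def embed_tag(match):
    # Hypothetical embed markup; group 4 is the video ID.
    return '<iframe src="https://www.youtube.com/embed/{}"></iframe>'.format(match.group(4))

# Made-up sample post; the feature=related parameter is illustrative.
text = ('<p>Youtube vid:</p>'
        '<p>http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature=related</p>'
        '<p>More blabla</p>')

result = YOUTUBE_RE.sub(embed_tag, text)
print(result)
```

Because the optional (&.*?)? group swallows the extra parameters, the whole URL is replaced and no "&feature" remnant is left in the output.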


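What if you used the urlparse module to take apart the YouTube address you find and put it back together in the format you want? You could then simplify your regex so that it only finds the whole URL, and let urlparse do the heavy lifting of taking it apart.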

# Python 3 imports; in Python 2 these names live in the urlparse and urllib modules.
from urllib.parse import urlparse, parse_qs, urlunparse, urlencode

youtube_url = urlparse('http://www.youtube.com/watch?v=aFNzk7TVUeY&feature=grec_index')
params = parse_qs(youtube_url.query)
# Keep only the v parameter and drop everything else (such as feature).
new_params = {'v': params['v'][0]}

cleaned_youtube_url = urlunparse((youtube_url.scheme,
                                  youtube_url.netloc,
                                  youtube_url.path,
                                  '',  # no path parameters
                                  urlencode(new_params),
                                  youtube_url.fragment))

This is a bit more code, but it saves you from regex madness.

And as hop said, you should use raw strings for regular expressions.
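To tie this back to the question, here is a rough sketch of combining the two ideas: a deliberately loose regex only locates candidate URLs in the text, and the URL-parsing functions do the cleanup. The URL_RE pattern and the clean_url helper are assumptions made for illustration, not part of the answer above.

```python
import re
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

# Deliberately loose pattern (an assumption of this sketch): grab anything that
# looks like a YouTube watch URL up to the next whitespace, quote, or angle bracket.
URL_RE = re.compile(r'https?://(?:www\.)?youtube\.com/watch\?[^\s"<>]+')

def clean_url(match):
    parts = urlparse(match.group(0))
    params = parse_qs(parts.query)
    if 'v' not in params:
        return match.group(0)  # not a video link; leave it untouched
    query = urlencode({'v': params['v'][0]})
    return urlunparse((parts.scheme, parts.netloc, parts.path, '', query, parts.fragment))

text = 'See http://www.youtube.com/watch?v=aFNzk7TVUeY&feature=grec_index for details'
cleaned = URL_RE.sub(clean_url, text)
print(cleaned)
```

The regex no longer needs to know anything about query parameters; it only has to decide where a URL starts and stops.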


This is how I implemented it in my script:

import re

string = "Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg"

# findall returns one tuple of capture groups per link found.
youtube = re.findall(r'(https?://)?(www\.)?((youtube\.(com))/watch\?v=([-\w]+)|youtu\.be/([-\w]+))', string)

if youtube:
    print(youtube)

This outputs:

[('https://', 'www.', 'youtube.com/watch?v=bS5P_LAqiVg', 'youtube.com', 'com', 'bS5P_LAqiVg', '')]

If you just wanted to capture the video id, for example, you would do:

video_id = [c for c in youtube[0] if c]  # Drop the empty groups
video_id = video_id[-1]  # The last remaining item is the video id
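As an alternative sketch, the same two URL shapes can be matched with non-capturing groups and one named group per alternative, which avoids filtering out empty strings and handles several links in one text. The group names "long" and "short" and the second sample link are assumptions made for this example.

```python
import re

# Same two URL shapes as above, but with non-capturing groups and a named
# group per alternative, so the ID can be read without filtering empty strings.
YOUTUBE_ID_RE = re.compile(
    r'(?:https?://)?(?:www\.)?'
    r'(?:youtube\.com/watch\?v=(?P<long>[-\w]+)|youtu\.be/(?P<short>[-\w]+))')

text = ('Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg '
        'and this one: http://youtu.be/5Y6HSHwhVlY')

# Exactly one of the two named groups is set for each match.
video_ids = [m.group('long') or m.group('short') for m in YOUTUBE_ID_RE.finditer(text)]
print(video_ids)
```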