Reorder copyright with regular expressions

I need to move the copyright year to the beginning of the line. Here are some of the possible inputs:

(c) 2012 10 DC Comics
2012 DC Comics
10 DC Comics. 2012
10 DC Comics , (c) 2012.
10 DC Comics, Copyright 2012
Warner Bros, 2011
Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.
...etc...

From these inputs I always need to produce output in the same format:

2012. 10 DC Comics.
2011. Warner Bros.
2011. Stanford and Sons, Ltd. Inc. All Rights Reserved
etc...

How can I do this with a combination of string formatting and regular expressions?

This needs to be cleaned up, but this is what I am doing now:

### copyright
copyright = value_from_key(sd_wb, 'COPYRIGHT', n).strip()
m = re.search('[0-2][0-9][0-9][0-9]', copyright)
try:
    year = m.group(0)
except AttributeError:
    copyright=''
else:
    copyright = year + ". " + copyright.replace(year,'')
    copyright = copyright.rstrip('.').strip() + '.'

if copyright:
    copyright=copyright.replace('\xc2\xa9 ','').replace('&amp;', '&').replace('(c)','').replace('(C)','').replace('Copyright', '')
    if not copyright.endswith('.'):
        copyright = copyright + '.'
    copyright = copyright.replace('  ', ' ')

This program:

from __future__ import print_function
import re

tests = (
    '(c) 2012 DC Comics',
    'DC Comics. 2012',
    'DC Comics, (c) 2012.',
    'DC Comics, Copyright 2012',
    '(c) 2012 10 DC Comics',
    '10 DC Comics. 2012',
    '10 DC Comics , (c) 2012.',
    '10 DC Comics, Copyright 2012',
    'Warner Bros, 2011',
    'Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.',
)

for line in tests:
    print("<", line)
    output = re.sub(r'''
            (?P<lead> (?: \S .*? \S )?? )
            [\s.,]*
            (?: (?: \( c \) | copyright ) \s+ )?
            (?P<year> (?:19|20)\d\d )
            [\s.,]?
        ''', r"\g<year>. \g<lead>", line, count=1, flags=re.I | re.X)
    print(">", output, "\n")

When run under Python 2.7 or 3.2, it produces this output:

< (c) 2012 DC Comics
> 2012. DC Comics 

< DC Comics. 2012
> 2012. DC Comics 

< DC Comics, (c) 2012.
> 2012. DC Comics 

< DC Comics, Copyright 2012
> 2012. DC Comics 

< (c) 2012 10 DC Comics
> 2012. 10 DC Comics 

< 10 DC Comics. 2012
> 2012. 10 DC Comics 

< 10 DC Comics , (c) 2012.
> 2012. 10 DC Comics 

< 10 DC Comics, Copyright 2012
> 2012. 10 DC Comics 

< Warner Bros, 2011
> 2011. Warner Bros 

< Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.
> 2011. Stanford and Sons, Ltd. Inc All Rights Reserved. 

Most likely, this is what you were looking for.
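
The output above has no trailing period, while the format requested in the question does. As a rough sketch (the helper name `year_first` is made up for illustration), the same substitution could be wrapped in a function that appends the period and returns an empty string when no year is found, which is what the question's code does in that case:

```python
import re

# Same pattern as in the program above, compiled once.
YEAR_FIRST = re.compile(r'''
        (?P<lead> (?: \S .*? \S )?? )
        [\s.,]*
        (?: (?: \( c \) | copyright ) \s+ )?
        (?P<year> (?:19|20)\d\d )
        [\s.,]?
    ''', re.I | re.X)


def year_first(text):
    # subn() returns the new string and the number of substitutions made.
    output, count = YEAR_FIRST.subn(r"\g<year>. \g<lead>", text.strip(), count=1)
    if count == 0:
        # No year found: mirror the question's code and return ''.
        return ''
    # Normalize the ending to exactly one period.
    return output.rstrip(' .,') + '.'
```

Note that the period after "Inc" in the Stanford example is still consumed by the pattern, so this only fixes the end of the line.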


What about an answer that doesn't use regex?

tests = (
    '(c) 2012 DC Comics',
    'DC Comics. 2012',
    'DC Comics, (c) 2012.',
    'DC Comics, Copyright 2012',
    '(c) 2012 10 DC Comics',
    '10 DC Comics. 2012',
    '10 DC Comics , (c) 2012.',
    '10 DC Comics, Copyright 2012',
    'Warner Bros, 2011',
    'Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.',
    )

def reorder_copyright(text):
    words = text.split()
    # Look for a "(c)" or "Copyright" marker; the year is the word after it.
    for i, word in enumerate(words):
        if word.lower() in ('(c)', 'copyright'):
            year = words[i+1]
            company = ' '.join(words[:i] + words[i+2:])
            break
    else:
        # No marker found: assume the year is the last word.
        year = words[-1]
        company = ' '.join(words[:-1])
    # Trim separators left over around the year and the company name.
    year = year.strip(' ,.')
    company = company.strip(' ,.')
    return "%s. %s." % (year, company)

if __name__ == '__main__':
    for line in tests:
        print(reorder_copyright(line))
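
One limitation of the function above: when there is no "(c)" or "Copyright" marker it assumes the year is the last word, so an input such as '2012 DC Comics' from the question comes out as 'Comics. 2012 DC.'. A possible variant, still without regex, is to look for the first four-digit token instead. This is only a sketch; the names `MARKERS` and `reorder_by_year_token` are invented here, and it would misfire on a four-digit number that is not a year.

```python
MARKERS = ('(c)', 'copyright')


def reorder_by_year_token(text):
    # Drop the "(c)" / "Copyright" markers up front.
    words = [w for w in text.split() if w.lower() not in MARKERS]
    for i, word in enumerate(words):
        token = word.strip(' ,.')
        # Assumption: the first all-digit, four-character token is the year.
        if len(token) == 4 and token.isdigit():
            company = ' '.join(words[:i] + words[i+1:]).strip(' ,.')
            return "%s. %s." % (token, company)
    # No year-like token found.
    return ''


if __name__ == '__main__':
    print(reorder_by_year_token('2012 DC Comics'))
    print(reorder_by_year_token('10 DC Comics , (c) 2012.'))
```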

^\(c\)\s+(?P<year>\d{4})\s+(?P<digits>\d{2}).*$|^(?P<digits>\d{2}).*(?P<year>\d{4})\.?

\g<year>. \g<digits> DC Comics.

This captures the year (2012) and the two-digit number (10) as named groups.

Edit: OP , , . , .


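One point to keep in mind when building the pattern:

  • With |, the alternatives are tried in the order they are written (left to right), so the more specific form has to come first: "(c) 2012" before a bare "2012".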

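The pattern uses three named groups: before, year and after. Here before and after hold the text on either side of the year, and either of them may be empty.

Marking the groups as b, y and a, the inputs break down like this: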

(c) 2012 10 DC Comics
    yyyy aaaaaaaaaaaa

2012 DC Comics
yyyy aaaaaaaaa

10 DC Comics , (c) 2012.
bbbbbbbbbbbb       yyyy

Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.
bbbbbbbbbbbbbbbbbbbbbbbbbbbb     yyyy  aaaaaaaaaaaaaaaaaaaa

(Note that "(c)" and the like are not marked, because they are discarded.)

Putting that together gives something like:

(?i)(?:(?P<before>.*)\s*Copyright\s*(?P<year>\d{4})(?P<after>.*)|
       (?P<before>.*)\s*\(c\)\s*(?P<year>\d{4})(?P<after>.*)|
       (?P<before>.*)\s*(?P<year>\d{4})(?P<after>.*))

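This tries "Copyright" first, then "(c)", and finally a bare year such as "2012" (the (?i) makes the match case-insensitive). You can then build the result with something like: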

d = match.groupdict()
d['year'] + ' ' + d.get('before', '') + ' ' + d.get('after', '')

or, using .sub(), something like:

re.sub(..., r'\g<year> \g<before> \g<after>', ...)

Finally, you will probably need one more pass to clean up stray punctuation (remove any comma followed immediately by a period, collapse runs of spaces into one, and so on).
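
For Python specifically, note that the `re` module raises an error when the same group name is defined more than once in a pattern, so the three-alternative regex above will not compile there as written. A minimal sketch of the same before/year/after idea, assuming a single pattern with an optional marker instead of three alternatives (the names `PATTERN` and `normalize_copyright` are invented for this example):

```python
import re

PATTERN = re.compile(
    r"""
    (?P<before>.*?)                          # text before the year (may be empty)
    (?:(?:copyright|\(c\))\s*)?              # optional marker, discarded
    (?<!\d)(?P<year>(?:19|20)\d{2})(?!\d)    # a standalone four-digit year
    (?P<after>.*)                            # text after the year (may be empty)
    """,
    re.IGNORECASE | re.VERBOSE,
)


def normalize_copyright(text):
    m = PATTERN.match(text.strip())
    if m is None:
        return ''  # no year found
    # Cleanup pass: trim separators next to the year, keep inner punctuation.
    parts = [m.group('before').strip(' ,'), m.group('after').strip(' ,.')]
    rest = ' '.join(p for p in parts if p).rstrip('.')
    if not rest:
        return m.group('year') + '.'
    return '{}. {}.'.format(m.group('year'), rest)
```

Because `before` is lazy, the marker is matched whenever it sits directly in front of the year, so it never ends up in `before`. The year is restricted to 19xx and 20xx here, which is an assumption borrowed from the first answer rather than from the `\d{4}` used above.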
