Some basic Python questions

I am a complete Python noob, so please bear with me. I want Python to look at an HTML page and replace instances of Microsoft Word characters with something compatible with UTF-8.

My question is: how do you do this in Python (I have Googled it, but found no clear answer so far)? I want to dip my toes in the water with Python, so I figure something simple like this is a good place to start. It seems I need to:

  • load the text pasted from MS Word into a variable
  • run some sort of replace function on the contents
  • output it

In PHP, I would do it like this:

$test = $_POST['pasted_from_Word']; //for example "Going Mobile"

function defangWord($string) 
{
    $search = array(
        (chr(0xe2) . chr(0x80) . chr(0x98)),
        (chr(0xe2) . chr(0x80) . chr(0x99)),
        (chr(0xe2) . chr(0x80) . chr(0x9c)), 
        (chr(0xe2) . chr(0x80) . chr(0x9d)), 
        (chr(0xe2) . chr(0x80) . chr(0x93)),
        (chr(0xe2) . chr(0x80) . chr(0x94)), 
        (chr(0x2d))
    ); 

    $replace = array(
        "&lsquo;",
        "&rsquo;",
        "&ldquo;",
        "&rdquo;",
        "&ndash;",
        "&mdash;",
        "&ndash;"
    );

    return str_replace($search, $replace, $string); 
} 

echo defangWord($test); 

How do you do this in Python?

EDIT: , UTF-8 . , MS Word. , , . PHP, , , , . , , , (0xe2, 0x80 ..). HTML. , , , UTF-8, , MS Word, ?

EDIT2: , Python , . , , . UTF-8, , , - UTF-8, , , - UTF-8... Word . . Python...

+5

First of all, those aren't Microsoft Word characters - they are UTF-8. You are converting them to HTML entities.

The Python equivalent of writing something like:

chr(0xe2) . chr(0x80) . chr(0x98)

would be:

'\xe2\x80\x98'
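As a quick check that these three bytes are just the UTF-8 encoding of a single character, here is a small sketch. It assumes Python 3, where bytes and text are separate types, rather than the Python 2.5 this answer was written for:

```python
# Python 3 sketch: the three bytes from the PHP snippet are the UTF-8
# encoding of one character, U+2018 LEFT SINGLE QUOTATION MARK.
raw = b'\xe2\x80\x98'
char = raw.decode('utf-8')

print(len(raw), len(char))  # 3 1  (three bytes, one character)
print(hex(ord(char)))       # 0x2018
```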

But Python already has built-in functionality for the kind of conversion you want to do:

def defang(string):
    return string.decode('utf-8').encode('ascii', 'xmlcharrefreplace')

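This replaces the UTF-8 byte sequences for characters such as “ with numeric character references such as &#8220;.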
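The one-liner above relies on Python 2 byte strings having a decode() method. A rough Python 3 equivalent might look like the sketch below; it assumes the input arrives as UTF-8 bytes, and the type hints are only there for illustration:

```python
def defang(data: bytes) -> str:
    """Decode UTF-8 bytes, then replace every non-ASCII character with a
    numeric character reference such as &#8220;."""
    text = data.decode('utf-8')
    # 'xmlcharrefreplace' is a standard codec error handler; the final
    # decode turns the ASCII bytes back into a str.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


print(defang(b'\xe2\x80\x9cHello, world!\xe2\x80\x9d'))
# &#8220;Hello, world!&#8221;
```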

If you want to replace those numeric references with named entities where possible:

import re
from htmlentitydefs import codepoint2name

def convert_match_to_named(match):
    num = int(match.group(1))
    if num in codepoint2name:
        return "&%s;" % codepoint2name[num]
    else:
        return match.group(0)

def defang_named(string):
    return re.sub('&#(\d+);', convert_match_to_named, defang(string))

You can use it like this:

>>> defang_named('\xe2\x80\x9cHello, world!\xe2\x80\x9d')
'&ldquo;Hello, world!&rdquo;'
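In Python 3 the htmlentitydefs module was renamed to html.entities, so the same idea could be sketched as follows. This assumes UTF-8 bytes as input; the helper name to_named is made up for the example:

```python
import re
from html.entities import codepoint2name  # Python 3 name for htmlentitydefs


def to_named(match):
    """Swap a numeric reference for a named entity when one exists."""
    name = codepoint2name.get(int(match.group(1)))
    return "&%s;" % name if name else match.group(0)


def defang_named(data: bytes) -> str:
    numeric = data.decode('utf-8').encode('ascii', 'xmlcharrefreplace').decode('ascii')
    return re.sub(r'&#(\d+);', to_named, numeric)


print(defang_named(b'\xe2\x80\x9cHello, world!\xe2\x80\x9d'))
# &ldquo;Hello, world!&rdquo;
```

Characters with no named entity keep their numeric reference.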

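To complete the answer, the equivalent of your example for processing a file would look something like this:

# in Python, it is common to process a file one line at a time instead of
# reading the entire thing into memory

my_file = open("test100.html")
for line in my_file:
    # each line already ends with a newline, so the trailing comma
    # stops print from adding a second one
    print defang_named(line),
my_file.close()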
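For comparison, a Python 3 take on the same loop might look like this. It is only a sketch: it assumes the file is UTF-8 encoded, and defang_file is a made-up name:

```python
def defang_file(path):
    """Yield each line of a UTF-8 text file with non-ASCII characters
    replaced by numeric character references."""
    # The with block closes the file even if an error occurs midway.
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            yield line.encode('ascii', 'xmlcharrefreplace').decode('ascii')


# Usage (the path is a placeholder):
# for line in defang_file("test100.html"):
#     print(line, end='')
```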

Note that this answer is for Python 2.5; Unicode handling is different in Python 3+.

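I also agree with bobince's comment: if you can keep the text in UTF-8 and serve it with the correct content type and charset, do that; if it has to be ASCII, stick with the numeric references - there is no real need for the named ones.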

+20

Python .

Don't try to map PHP-isms onto Python-isms.

Look at the file object: file.read() returns the contents as a string. Then look at the string methods.

+3

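Word HTML is best cleaned up with HTML Tidy, which has an option for exactly that. If you need to do it from Python, there are wrappers for it.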

+2

., Python - / .

That said, Python has no file_get_contents(), so you would do something like this instead:

sample = ''.join(open(test, 'r').readlines())

Or, more simply: sample = file(test).read()

The equivalent of str_replace():

sample = sample.replace(search, replace)
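Since str.replace() takes one search string at a time, PHP's array form of str_replace() needs a loop over the pairs. A minimal Python 3 sketch is below; the REPLACEMENTS table is an assumption modelled on the arrays in the question, working on decoded text rather than raw bytes and leaving out the plain-hyphen entry:

```python
# Hypothetical lookup table mirroring the PHP $search/$replace arrays.
REPLACEMENTS = {
    '\u2018': '&lsquo;',  # left single quote
    '\u2019': '&rsquo;',  # right single quote
    '\u201c': '&ldquo;',  # left double quote
    '\u201d': '&rdquo;',  # right double quote
    '\u2013': '&ndash;',  # en dash
    '\u2014': '&mdash;',  # em dash
}


def defang_word(text):
    """Apply each search/replace pair in turn, like str_replace() with arrays."""
    for search, replace in REPLACEMENTS.items():
        text = text.replace(search, replace)
    return text


print(defang_word('\u201cGoing Mobile\u201d'))
# &ldquo;Going Mobile&rdquo;
```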

And output is as simple as the print statement:

print defang_word(sample)

So, as you can see, the two versions look almost identical.

+1