How to filter only printable characters in a file in Bash (linux) or Python?

I have a file that contains non-printable characters, and I want to reduce it to printable characters only. I think the problem is related to ASCII control sequences, but I could not find a solution, and I also do not understand the meaning of .[16D (an ASCII control sequence?) in the following file.

HEXDUMP INPUT FILE:

 00000000: 4845 4c4c 4f20 5448 4953 2049 5320 5448  HELLO THIS IS TH
 00000010: 4520 5445 5354 1b5b 3136 4420 2020 2020  E TEST.[16D     
 00000020: 2020 2020 2020 2020 2020 201b 5b31 3644             .[16D
 00000030: 2020                                       

When I cat this file in bash, I just get "HELLO". I think this is because cat, by default, interprets the two .[16D control sequences.

Why do the two .[16D sequences make cat FILE print only "HELLO"? And how can I make this file contain only printable characters, i.e. "HELLO"?

4 answers

The hexdump shows that the dot in .[16D is actually an escape character, \x1b.
Esc[ n D is the ANSI escape code that moves the cursor back (left) by n columns; it does not delete anything by itself. Here Esc[16D moves the cursor back 16 columns, and the 16 spaces that follow overwrite everything after "HELLO ", which explains the output of cat.
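
To make the effect of Esc[16D concrete, here is a minimal Python 3 sketch that renders the question's data onto a single line. It is an illustration only: the function name render_line is made up for this example, it handles just the Esc[ n D (cursor back) sequence, and it ignores everything else a real terminal does (line wrapping, other escape codes, and so on).

```python
import re

# Matches only the "cursor back" sequence ESC [ n D.
CURSOR_BACK = re.compile(r"\x1b\[(\d*)D")

def render_line(data, width=80):
    """Write data onto one line of `width` cells, honouring ESC[nD only."""
    cells = [" "] * width
    col = 0
    pos = 0
    while pos < len(data):
        match = CURSOR_BACK.match(data, pos)
        if match:
            # Move the cursor left by n columns (default 1), never past column 0.
            n = int(match.group(1) or 1)
            col = max(0, col - n)
            pos = match.end()
        else:
            # Ordinary character: overwrite the cell under the cursor.
            if col < width:
                cells[col] = data[pos]
                col += 1
            pos += 1
    return "".join(cells).rstrip()

# The bytes from the hexdump: 22 characters of text, ESC[16D, 16 spaces,
# ESC[16D again, then 2 spaces.
data = "HELLO THIS IS THE TEST\x1b[16D" + " " * 16 + "\x1b[16D" + " " * 2
print(repr(render_line(data)))  # 'HELLO'
```

The text ends at column 22, the first Esc[16D puts the cursor on column 6 (the T of THIS), and the 16 spaces then blank out columns 6 to 21.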

There are various ways to remove ANSI escape codes from a file using Bash commands (for example, using sed , as in Anubhava's answer) or Python.
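
As a rough sketch of the Python route, the following Python 3 snippet strips CSI escape sequences (Esc[ followed by parameter, intermediate, and final bytes) with a regular expression. The names CSI_RE and strip_csi are made up for this example, and the pattern covers only CSI sequences, not every kind of escape code.

```python
import re

# CSI sequence: ESC [ , parameter bytes 0x30-0x3F, intermediate bytes
# 0x20-0x2F, and one final byte 0x40-0x7E.
CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")

def strip_csi(text):
    """Return text with all CSI escape sequences removed."""
    return CSI_RE.sub("", text)

# Same data as in the question's hexdump.
data = "HELLO THIS IS THE TEST\x1b[16D" + " " * 16 + "\x1b[16D" + " " * 2
print(repr(strip_csi(data).rstrip()))  # 'HELLO THIS IS THE TEST'
```

Note that stripping the sequences leaves the full text "HELLO THIS IS THE TEST" behind, which is not what the terminal displays.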

However, in cases like this it may be better to run the file through a terminal emulator, which interprets any editing control sequences in the file, so that you get the result the file's author intended once those sequences have been applied.

One way to do this in Python is to use pyte, a Python module that implements a simple VTxxx-compatible terminal emulator. You can install it with pip, and its documentation is on readthedocs.

Here is a simple demo program that interprets the data given in the question. It is written for Python 2, but easily adapts to Python 3. pyte is Unicode-aware, and its standard class Stream expects Unicode strings, but this example uses ByteStream, so I can pass it a regular byte string.

 #!/usr/bin/env python

 ''' pyte VTxxx terminal emulator demo

     Interpret a byte string containing text and ANSI / VTxxx control sequences

     Code adapted from the demo script in the pyte tutorial at
     http://pyte.readthedocs.org/en/latest/tutorial.html#tutorial

     Posted to http://stackoverflow.com/a/30571342/4014959

     Written by PM 2Ring 2015.06.02
 '''

 import pyte

 #hex dump of data
 #00000000  48 45 4c 4c 4f 20 54 48  49 53 20 49 53 20 54 48  |HELLO THIS IS TH|
 #00000010  45 20 54 45 53 54 1b 5b  31 36 44 20 20 20 20 20  |E TEST.[16D     |
 #00000020  20 20 20 20 20 20 20 20  20 20 20 1b 5b 31 36 44  |           .[16D|
 #00000030  20 20                                             |  |

 #Text, cursor back 16, 16 spaces, cursor back 16, 2 spaces (as in the hex dump)
 data = 'HELLO THIS IS THE TEST\x1b[16D' + ' ' * 16 + '\x1b[16D' + ' ' * 2

 #Create a default sized screen that tracks changed lines
 screen = pyte.DiffScreen(80, 24)
 screen.dirty.clear()
 stream = pyte.ByteStream()
 stream.attach(screen)
 stream.feed(data)

 #Get index of last line containing text
 last = max(screen.dirty)

 #Gather lines, stripping trailing whitespace
 lines = [screen.display[i].rstrip() for i in range(last + 1)]

 print '\n'.join(lines)

Output

 HELLO 

Hex dump of the output

 00000000 48 45 4c 4c 4f 0a |HELLO.| 

You can try this sed command to remove all non-printable characters from a file:

 sed -i.bak 's/[^[:print:]]//g' file 
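
For comparison, here is a rough Python 3 approximation of what that substitution does, assuming the C/POSIX locale, where [:print:] covers the bytes 0x20 to 0x7E (other locales may treat more characters as printable). The function name drop_nonprintable is made up for this example. Newlines are kept because sed works line by line and never sees them.

```python
def drop_nonprintable(data):
    """Keep printable ASCII bytes (0x20-0x7E) and newlines; drop the rest."""
    return bytes(b for b in data if 0x20 <= b <= 0x7E or b == 0x0A)

# Same bytes as in the question's hexdump.
data = b"HELLO THIS IS THE TEST\x1b[16D" + b" " * 16 + b"\x1b[16D" + b" " * 2
print(drop_nonprintable(data))
```

On the question's data only the two Esc bytes are removed, so the printable remainder of each sequence, [16D, stays in the output.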

A minimalist solution comes to mind:

 import string
 # TODO: substitute your own string for your_string
 printable_string = ''.join(filter(lambda x: x in string.printable, your_string))

The ''.join(...) is needed in Python 3, where filter returns an iterator rather than a string. If this still doesn't help, the curses.ascii module (for example, curses.ascii.isprint) may also be worth a look.


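See the built-in string module. string.printable is a plain string of characters, not a function, so test membership in it:

 import string
 printable_str = ''.join(ch for ch in text if ch in string.printable)  # text is your input string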
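
Since the question is about a file rather than a string, here is a minimal Python 3 sketch that applies the same string.printable filter to a file's bytes. The function name filter_file and the file names input.bin and output.txt are hypothetical, and the demo writes to a temporary directory so that it is self-contained.

```python
import os
import string
import tempfile

# Byte values of every character in string.printable (ASCII letters, digits,
# punctuation and whitespace, including tab, newline, \r, \x0b and \x0c).
PRINTABLE_BYTES = frozenset(string.printable.encode("ascii"))

def filter_file(src_path, dst_path):
    """Copy src_path to dst_path, dropping bytes not in string.printable."""
    with open(src_path, "rb") as src:
        raw = src.read()
    cleaned = bytes(b for b in raw if b in PRINTABLE_BYTES)
    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
    return cleaned

# Demo on a temporary copy of the question's data.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.bin")
    dst_path = os.path.join(tmp, "output.txt")
    with open(src_path, "wb") as f:
        f.write(b"HELLO THIS IS THE TEST\x1b[16D" + b" " * 16 + b"\x1b[16D" + b" " * 2)
    result = filter_file(src_path, dst_path)

print(result)
```

Like any filter that only drops non-printable characters, this removes the Esc bytes but does not interpret the sequences they introduce.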