Read an MSWord file into R

Can I read an MSWord 2010 file into R? I am using Windows 7 on a Dell computer.

I use the line:

my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx') 

to try reading the MSWord file containing the following text:

 A 20 1000 AA
 B 30 1001 BB
 C 10 1500 CC

A warning message appears:

 Warning message:
 In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") :
   incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'

and my.data appears to be gibberish:

 # [1] "PK\003\004\024" "Β€l" "ÈFÃË‹ÁtΓ­" 

I know that with this simple example I could easily convert the MSWord file to another format. However, my actual data files consist of complex tables that were typed decades ago and later scanned into PDF documents. Because of the age of the original paper documents, and possibly flaws in the paper, the typing and/or the scanning process, some letters and numbers are not very clear. So far, converting the PDF files to MSWord seems to be the most successful way of getting the tables translated correctly. Converting the MSWord files to Excel, rich text, etc. has not been very successful. Even after conversion to MSWord, the resulting files are very complex and contain numerous errors. I thought that if I could read the MSWord files into R, that might be the most efficient way to edit and fix them.

I am aware of the tm package, which I think can read MSWord files into R, but I am a little worried about using it because it seems to require installing third-party software.

Thanks for any suggestions.

+7
3 answers

Firstly, readLines() is not the right solution, because a Word file is not a plain-text (i.e. ASCII) file.

The Word-related function in the tm package is called readDOC(), but both it and the third-party tool it requires (Antiword) are intended for old Word files (up to Word 2003) and will not work with the newer .docx files.
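To illustrate why the output starts with "PK": a .docx file is a ZIP archive, and the document text normally sits in an XML member named word/document.xml. The following is only a rough base-R sketch under that assumption, not a full parser. The helper names `extract_paragraphs` and `read_docx_text` are made up for this example, the regular expressions ignore tables-as-structure, text boxes and footnotes, and XML entities such as `&amp;` are left undecoded.

```r
# Pull the visible text out of WordprocessingML (the XML inside a .docx).
# Hypothetical helper: one character string per <w:p> paragraph.
extract_paragraphs <- function(xml) {
  # Paragraphs are <w:p> elements ("<w:p>" or "<w:p attr=...>", not "<w:pPr>").
  paras <- regmatches(xml, gregexpr("(?s)<w:p[ >].*?</w:p>", xml, perl = TRUE))[[1]]
  vapply(paras, function(p) {
    # Text runs are <w:t> elements, optionally with attributes.
    runs <- regmatches(p, gregexpr("<w:t(?: [^>]*)?>[^<]*</w:t>", p, perl = TRUE))[[1]]
    # Strip the tags and join the runs of one paragraph.
    paste(gsub("<[^>]+>", "", runs), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical wrapper: read word/document.xml straight out of the archive.
read_docx_text <- function(path) {
  con <- unz(path, "word/document.xml")
  on.exit(close(con))
  xml <- paste(readLines(con, warn = FALSE, encoding = "UTF-8"), collapse = "")
  extract_paragraphs(xml)
}
```

With the file from the question, the call would be `read_docx_text('c:/users/mark w miller/simple R programs/test_for_r.docx')`. This avoids third-party software, but for documents with complex tables a real XML parser would be the safer choice.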

The best I can offer is to try readPDF(), also found in the tm package. Note that this requires the pdftotext tool to be installed on your system. That is easy on Linux; I don't know about Windows. Alternatively, find a Windows tool that converts PDFs to plain ASCII text files (not Word files). These should open and display correctly in Notepad on Windows, and you can then try readLines() again. However, given that your PDFs are old and come from a scanner, the conversion to text may be difficult.
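If pdftotext is installed, it can also be called directly from R and the result passed to readLines(). This is a minimal sketch that assumes the pdftotext executable is on the PATH and that the PDF already has a text layer (a pure scan would need OCR first). The function names `pdftotext_args` and `pdf_to_lines` are invented for this example.

```r
# Build the argument vector for pdftotext.
# -layout asks pdftotext to preserve the physical layout, which helps with tables.
pdftotext_args <- function(pdf, txt) {
  c("-layout", shQuote(pdf), shQuote(txt))
}

# Hypothetical wrapper: convert a PDF to text and return its lines.
pdf_to_lines <- function(pdf, txt = tempfile(fileext = ".txt")) {
  exe <- unname(Sys.which("pdftotext"))
  if (!nzchar(exe)) stop("pdftotext was not found on the PATH")
  status <- system2(exe, pdftotext_args(pdf, txt))
  if (status != 0) stop("pdftotext failed with exit status ", status)
  readLines(txt, warn = FALSE)
}
```

The quoting matters here because the paths in the question contain spaces.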

Finally: I understand that you did not make the initial decision in this case, but for anyone else - Word and PDF are not suitable formats for storing the data that you want to analyze.

+6

I did not figure out how to read the MSWord file into R, but I did get the contents into a format that R can read.

  1. I converted the PDF to MSWord with Acrobat X Pro.

  2. The source tables had solid vertical lines separating the columns. It turned out that these vertical lines corrupted the data layout when I converted the MSWord file to a text file, but I was able to delete the lines from the MSWord file before creating the text file.

  3. I converted the MSWord file to a text file after deleting the vertical lines in step 2.

  4. The resulting text files still require extensive editing, but at least the data are mostly in a format that R can read, and I don't have to re-enter all the data from the PDF files manually, which saves many hours of work.
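Once the text file exists, R can help locate the rows that still need editing. The sketch below uses the three sample rows from the question plus one deliberately broken row. The column names and the helper `find_bad_rows` are made up for illustration, and it assumes whitespace-separated fields with no blank lines.

```r
# Hypothetical helper: indices of lines whose field count differs from `expected`.
find_bad_rows <- function(lines, expected = 4) {
  con <- textConnection(lines)
  on.exit(close(con))
  which(count.fields(con) != expected)
}

# Sample rows from the question; the third row is missing a field on purpose.
lines <- c("A 20 1000 AA", "B 30 1001 BB", "X 99", "C 10 1500 CC")

bad <- find_bad_rows(lines)                 # rows to fix by hand
good <- if (length(bad)) lines[-bad] else lines

# Column names are assumptions; the question does not name the columns.
my.data <- read.table(text = good, header = FALSE,
                      col.names = c("id", "n", "value", "code"),
                      stringsAsFactors = FALSE)
```

For a real file, `lines` would come from `readLines()` on the converted text file.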

+1

You can do this quite easily with RDCOMClient. That said, some characters may not be read correctly.

 require(RDCOMClient)

 # Create the connection to Word
 wordApp <- COMCreate("Word.Application")

 # Set Visible to TRUE so you can see it run
 wordApp[["Visible"]] <- TRUE

 # Define the file we want to open
 wordFileName <- "c:/path/to/word/doc.docx"

 # Open the file
 doc <- wordApp[["Documents"]]$Open(wordFileName)

 # Print the text
 print(doc$range()$text())
0
