Can I read an MS Word 2010 file into R? I have Windows 7 and a Dell computer.
I used the line:
my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx')
to try to read an MS Word file containing the following text:
A 20 1000 AA
B 30 1001 BB
C 10 1500 CC
A warning message appears:
Warning message: In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") : incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'
and my.data appears to be gibberish:
# [1] "PK\003\004\024" "Β€l" "ΓFΓΓβΉΓtΓ"
I know that with this simple example I could easily convert the MS Word file to another format. However, my actual data files consist of complex tables that were printed decades ago and later scanned into PDF documents. Because of the age of the original paper documents, and possibly flaws in the original paper, typing, and/or scanning process, some letters and numbers are not very clear. So far, converting the PDF files to MS Word appears to be the most successful approach for translating the tables correctly. Converting the MS Word files to Excel, rich text, etc. has not been very successful. Even after conversion to MS Word, the resulting files are very complex and contain numerous errors. I thought that if I could read the MS Word files into R, that might be the most efficient way to edit and fix them.
I am aware of the tm package, which I believe can read MS Word files into R, but I am a little wary of using it because it seems to require installing third-party software.
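For reference, here is a minimal sketch of a base-R route that needs no extra software. It assumes the .docx is a standard Office Open XML file, that is, a zip archive (which would explain the "PK" at the start of the gibberish) whose main text is stored in word/document.xml. The helper names docx_xml_to_text and read_docx_text are hypothetical, and the tag stripping is deliberately crude rather than a full XML parse.

```r
# Hypothetical helper: turn the raw XML of word/document.xml into one
# string per Word paragraph. Assumes standard <w:p> paragraph markup.
docx_xml_to_text <- function(xml) {
  xml <- paste(xml, collapse = "")
  # Each paragraph is a <w:p> element, so split on its closing tag.
  paragraphs <- strsplit(xml, "</w:p>", fixed = TRUE)[[1]]
  # Replace tab elements with a space, then drop all remaining tags.
  paragraphs <- gsub("<w:tab/>", " ", paragraphs, fixed = TRUE)
  paragraphs <- gsub("<[^>]+>", "", paragraphs)
  # Decode the predefined XML entities (ampersand last).
  paragraphs <- gsub("&lt;", "<", paragraphs, fixed = TRUE)
  paragraphs <- gsub("&gt;", ">", paragraphs, fixed = TRUE)
  paragraphs <- gsub("&quot;", "\"", paragraphs, fixed = TRUE)
  paragraphs <- gsub("&apos;", "'", paragraphs, fixed = TRUE)
  paragraphs <- gsub("&amp;", "&", paragraphs, fixed = TRUE)
  # Trim whitespace and keep only paragraphs that contain text.
  paragraphs <- trimws(paragraphs)
  paragraphs[nzchar(paragraphs)]
}

# Hypothetical wrapper: read the XML straight out of the zip archive
# with base R's unz() connection, then convert it to text.
read_docx_text <- function(path) {
  con <- unz(path, "word/document.xml")
  on.exit(close(con))
  xml <- readLines(con, warn = FALSE, encoding = "UTF-8")
  docx_xml_to_text(xml)
}
```

It would be called as my.data <- read_docx_text('c:/users/mark w miller/simple R programs/test_for_r.docx'). One caveat: in a Word table each cell holds its own paragraphs, so every cell would come back as a separate string and the rows would still need to be reassembled.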
Thanks for any suggestions.
Mark Miller