Retrieving web page content and processing (printing or saving to a file)

I am a real estate appraiser with limited experience in VB and .NET. I have a task that requires me to go to the county appraiser's website and print a copy (to a BMP or JPG image, or directly to the default printer) of the current public record information, anywhere from a few pages to 1000-plus records at a time.

I do not really get paid for this part of the work, so nobody cares whether it takes several minutes or several hours. I figured there should be a way to automate the process, so last week I started searching for and testing code snippets.

What I have today opens an instance of IE, navigates to the relevant page, finds the form element for AcctNo, fills it out, and submits the form. The returned page is formatted for screen presentation and is not suitable for sending to a printer. However, there is a link that, when clicked, returns a page formatted for printing; a Print dialog box then pops up and needs to be handled. I have managed to press either the Print button or the Cancel button in a couple of ways, which leaves me with a document that either goes to the printer or sits on the screen.

Questions:

  • Is there a way to do this without displaying the Print dialog box? Maybe HttpRequest or HttpWebRequest, since I don't need to see the screens; I just need the last page.

  • The resulting page is usually a few lines longer than letter size and wants to print on two pages. It would be nice to shrink the page to fit, and the amount of shrinking would usually be the same.

  • If I am stuck with the Print dialog and clicking the Print or Cancel button, how can I intercept the document and, based on options set in the program, decide whether to send it to the printer or save it as an image?


I am sure I am working too hard at this, and figured someone out there could answer in a second what I have spent the better part of 3 days trying to work out.

I like to figure things out myself, so a pointer to a class or a website would be much appreciated, but sample code is the most useful, because I am not an experienced programmer and mostly take examples and adapt them to my needs.

thanks

4 answers

What you are trying to do is called web scraping. Although I'm not a VB guy (sorry!), as a rule I break up web scraping programs like this -

  • Download the HTML file from the URL using GET or POST.
  • Extract information from this file.
  • Format and return this information, or perhaps follow links found in the HTML and repeat.
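In Python's standard library, those three steps might look roughly like this. The sample HTML and the regex are placeholders I made up for illustration; a real scraper would target the county site's actual markup:

```python
import re
import urllib.request

def download(url):
    # Step 1: download the HTML from a URL with a plain GET request.
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def extract_links(html):
    # Step 2: extract information -- here, every href target on the page.
    return re.findall(r'href="([^"]+)"', html)

# Step 3: format and return the information, or follow the extracted links.
sample = '<a href="detail.html">Record</a> <a href="print.html">Print</a>'
print(extract_links(sample))
```

A regex is fine for a quick one-off like this; for messier pages an HTML parser is usually more robust.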

A Google search for "vb web scraping" turned up several different methods, but I'm not sure which you would be comfortable with. Ideally, a more Internet-friendly language might be a good idea. I do most of my scraping in Python; although I used to do it by hand, I recently started experimenting with a library called mechanize, which makes my life easier.

This Python snippet goes to the Google homepage, follows the "About" link, and saves the HTML to a file.

import re
import mechanize

browser = mechanize.Browser()
browser.open("http://google.com")
# find and follow a link with the text "About" in it
about_page = browser.follow_link(text_regex=re.compile("About"))
# open a local html file and save the page to it
output_file = open("about.html", "w")
output_file.write(about_page.read())
output_file.close()

I know that you do not know Python, but it is one of the easiest languages to learn and seems better suited to this task than VB. On top of that, plenty of people on Stack Overflow know it - compare the ~14k python-tagged questions to ~5k for VB.


This looks like something that could be solved easily (more easily than a programmatic approach) using GUI-based macro recording software such as AutoHotkey. The only difficulties I see are finding the correct form elements and the print link.


What you could try, instead of programmatically clicking the link to the print page, is grabbing its URL instead. If I remember correctly, you can get at the page via the Document property of your WebBrowser control, find the link using the DOM tools, and then take its HREF attribute.

Once you have the link's URL, use HttpClient to download that URL into a file (or a memory stream). Load the file into memory (unless you used a memory stream, in which case it is already there) and remove the scripts that send the page to the printer - or just disable all scripts by replacing <script with <!--<script and </script> with </script>-->. You may need some extra logic, since some script tags have HTML comment tags inside them.
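As a rough sketch of that find-and-replace idea (shown in Python for brevity; the page content here is made up, and as noted above, scripts that contain HTML comments may need smarter handling than a plain text substitution):

```python
def disable_scripts(html):
    # Wrap every <script> block in an HTML comment so the browser
    # control never runs it (e.g. a window.print() call on page load).
    html = html.replace("<script", "<!--<script")
    return html.replace("</script>", "</script>-->")

page = '<html><script>window.print();</script><p>Record</p></html>'
print(disable_scripts(page))
```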

After you have processed everything, save it to disk as a temporary file and point the browser control at that file.

If there are images or links that do not load or work, try adding a <base> tag to the file when you process it. That should fix the relative URLs.

Hope this helps!


Most likely there is some kind of pattern between the URLs for the regular (screen) and print-formatted documents. For example, they may use the same document ID in both URLs.

Once you know the pattern, you can compute the print URL directly, and then just save the result of downloading that URL to a file.
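For example (a Python sketch with an entirely hypothetical URL pattern; the real mapping has to be read off the county site's actual screen and print URLs):

```python
# Hypothetical: suppose the screen page is record.aspx and the print page
# is printrecord.aspx, sharing the same query string. Then the print URL
# can be computed without ever clicking the link.
def to_print_url(screen_url):
    return screen_url.replace("record.aspx", "printrecord.aspx")

print(to_print_url("http://appraiser.example.com/record.aspx?id=12345"))
```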

Just make sure you check for "page not found" or other errors (hack the URL once to find out what their error page looks like), so that if they change the format of the print URL you will be warned instead of blindly saving error pages :)
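That check can be as simple as scanning the downloaded HTML before saving it (Python sketch; the marker strings below are placeholders - find the real ones by hitting a deliberately bad URL and reading the site's actual error page):

```python
def looks_like_error(html):
    # Marker phrases are placeholders taken from a hypothetical error page.
    markers = ("page not found", "no record found")
    text = html.lower()
    return any(marker in text for marker in markers)

print(looks_like_error("<h1>Page Not Found</h1>"))   # True
print(looks_like_error("<h1>Parcel 12345</h1>"))     # False
```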

