Fetch API: get title, keywords and body text from HTTP response

Question

I want to know the best way to get the title, the keywords and the content visible to the user from the response text when using the Fetch API (related: Is there a way to not send cookies when creating XMLHttpRequest on the same origin?).

I am currently using regular expressions to get the title from the response text, for example:

// Backslashes are doubled because "\s" inside a string literal is just "s".
var re_title = new RegExp("<title>\\s*(.*?)\\s*</title>", "i");
var title = re_title.exec(responseText);
if (title) title = title[1];

Getting the content of the keywords meta tag would take more regular expressions of the same kind.

For the content visible to the user, I do not want tags such as script, div, etc., and I do not want the text between script tags either. I only need the meaningful words in the body of the response.

I think (as several Stack Overflow posts say) that regular expressions are not the right approach for this. What would be an alternative?

+5

javascript web-scraping

jack Aug 25 '15 at 17:24

1 answer

rphv · Accepted Answer · 2015-08-25T21:44:30+0000

As zzzzBov mentioned, you can use the browser's implementation of the DOMParser API to do this by parsing the response.text() of a fetch request. Here is an example page that sends such a request for itself and extracts the title, keywords and body text:

<!DOCTYPE html>
<html>
<head>
  <title>This is the page title</title>
  <meta charset="UTF-8">
  <meta name="description" content="Free Web Help">
  <meta name="keywords" content="HTML,CSS,XML,JavaScript">
  <script>
    // Fetch this same page, parse the response as HTML, and show the results.
    fetch("https://dl.dropboxusercontent.com/u/76726218/so.html")
      .then(function(response) {
        return response.text();
      })
      .then(function(responseText) {
        var parsedResponse = (new window.DOMParser()).parseFromString(responseText, "text/html");
        document.getElementById("title").innerHTML = "Title: " + parsedResponse.title;
        document.getElementById("keywords").innerHTML = "Keywords: " + parsedResponse.getElementsByName("keywords")[0].getAttribute("content");
        document.getElementById("visibleText").innerHTML = "Visible Text: " + parsedResponse.getElementsByTagName("body")[0].textContent;
      });
  </script>
</head>
<body>
  <div>This text is visible to the user.</div>
  <div>So <i>is</i> <b>this</b>.</div>
  <hr>
  <b>Results:</b>
  <ul id="results">
    <li id="title"></li>
    <li id="keywords"></li>
    <li id="visibleText"></li>
  </ul>
</body>
</html>
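
The question also asks to leave out the text between script tags and mentions not sending cookies. Both can be handled on top of the same DOMParser approach. What follows is only a rough sketch, not a definitive implementation: the helper names (splitKeywords, collapseWhitespace, extractPageInfo, fetchPageInfo) are made up for illustration, and the list of elements to drop (script, style, noscript, template) is an assumption about what counts as "not visible". Note that textContent does not consult CSS, so text hidden with display: none is still included.

```javascript
// Hypothetical helper: turn a keywords attribute value into a clean array.
function splitKeywords(content) {
  return (content || "")
    .split(",")
    .map(function (word) { return word.trim(); })
    .filter(function (word) { return word.length > 0; });
}

// Hypothetical helper: collapse runs of whitespace left behind by the markup.
function collapseWhitespace(text) {
  return (text || "").replace(/\s+/g, " ").trim();
}

// Hypothetical helper: takes an already parsed Document and returns the pieces.
function extractPageInfo(doc) {
  var meta = doc.querySelector('meta[name="keywords"]');
  // Work on a copy of <body> so the parsed document itself is not modified.
  var body = doc.body ? doc.body.cloneNode(true) : null;
  if (body) {
    // Assumption: these elements never contribute user-visible text.
    body.querySelectorAll("script, style, noscript, template").forEach(function (el) {
      el.remove();
    });
  }
  return {
    title: doc.title,
    keywords: splitKeywords(meta ? meta.getAttribute("content") : ""),
    text: collapseWhitespace(body ? body.textContent : "")
  };
}

// Browser-only part: fetch without cookies, then parse and extract.
function fetchPageInfo(url) {
  return fetch(url, { credentials: "omit" }) // "omit" keeps cookies out of the request
    .then(function (response) {
      if (!response.ok) throw new Error("HTTP " + response.status);
      return response.text();
    })
    .then(function (responseText) {
      var doc = new DOMParser().parseFromString(responseText, "text/html");
      return extractPageInfo(doc);
    });
}
```

In a page, this could be called as fetchPageInfo(location.href).then(function (info) { console.log(info.title, info.keywords, info.text); }). Keeping the extraction in a function that takes a Document means it can be reused with a document obtained some other way.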

I found the Mozilla documentation on the Fetch API, Using Fetch and Fetch basic concepts useful.
