How to get absolute paths for <img src="..."> in Node from a response.body object

I want to use request-promise to fetch the body of a page. Once I have the page, I want to collect all the img tags and build an array of their src values. Suppose the src attributes on the page contain both relative and absolute paths; I want an array of absolute paths for the images on the page. I know I could use some string manipulation and the npm path module to build absolute paths, but I am looking for a better way to do this.

    var rp = require('request-promise'),
        cheerio = require('cheerio');

    var options = {
        uri: 'http://www.google.com',
        method: 'GET',
        resolveWithFullResponse: true
    };

    rp(options)
        .then(function (response) {
            $ = cheerio.load(response.body);
            var relativeLinks = $("img");
            relativeLinks.each(function () {
                var link = $(this).attr('src');
                console.log(link);
                if (link.startsWith('http')) {
                    console.log('abs');
                } else {
                    console.log('rel');
                }
            });
        });

Result:

    /logos/doodles/2016/phoebe-snetsingers-85th-birthday-5179281716019200-hp.gif
    rel
3 answers

To get an array of image links in your script, you can use url.resolve to resolve each img tag's src attribute against the request URL, which yields an absolute URL. The array is passed to the final then; you can do something other than console.log with it if needed.

    var rp = require('request-promise'),
        cheerio = require('cheerio'),
        url = require('url'),
        base = 'http://www.google.com';

    var options = {
        uri: base,
        method: 'GET',
        resolveWithFullResponse: true
    };

    rp(options)
        .then(function (response) {
            var $ = cheerio.load(response.body);
            return $('img').map(function () {
                return url.resolve(base, $(this).attr('src'));
            }).toArray();
        })
        .then(console.log);

url.resolve works for both absolute and relative URLs: resolving a relative path against your request URL returns the combined absolute URL, while resolving an absolute URL against it simply returns that absolute URL unchanged. For example, if the page had img tags with /logos/cat.gif and https://test.com/dog.gif as src attributes, the output would be:

 [ 'http://www.google.com/logos/cat.gif', 'https://test.com/dog.gif' ] 
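The same resolution also covers a few src forms that the startsWith('http') check in the question would misclassify, such as protocol-relative URLs. Here is a minimal sketch, assuming the src values have already been pulled out of the page; resolveSources, pageUrl, and the sample values are made up for illustration and are not part of the original answer:

```javascript
// Sketch only: resolve a list of src values against the page URL.
// `resolveSources` and the sample inputs below are hypothetical.
const url = require('url');

function resolveSources(pageUrl, sources) {
  return sources
    // cheerio's .attr('src') returns undefined for an img without src;
    // skip those rather than pass undefined to url.resolve
    .filter(function (src) { return typeof src === 'string' && src.length > 0; })
    .map(function (src) { return url.resolve(pageUrl, src); });
}

const pageUrl = 'http://www.google.com/images/index.html';
const resolved = resolveSources(pageUrl, [
  '/logos/cat.gif',            // root-relative
  'thumbs/cat.gif',            // relative to the page's directory
  '//cdn.example.com/dog.gif', // protocol-relative
  'https://test.com/dog.gif',  // already absolute
  undefined                    // img tag without a src attribute
]);

console.log(resolved);
```

Note that a path without a leading slash is resolved relative to the page's directory, so it matters that the base is the full page URL and not just the host.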

Save the URL of your page in a variable and use url.resolve to join it with each src fragment. In the Node REPL, this works for both relative and absolute paths (hence the "resolve"):

    $:~/Projects/test$ node
    > var base = "https://www.google.com";
    undefined
    > var imageSrc = "/logos/doodles/2016/phoebe-snetsingers-85th-birthday-5179281716019200-hp.gif";
    undefined
    > var url = require('url');
    undefined
    > url.resolve(base, imageSrc);
    'https://www.google.com/logos/doodles/2016/phoebe-snetsingers-85th-birthday-5179281716019200-hp.gif'
    > imageSrc = base + imageSrc;
    'https://www.google.com/logos/doodles/2016/phoebe-snetsingers-85th-birthday-5179281716019200-hp.gif'
    > url.resolve(base, imageSrc);
    'https://www.google.com/logos/doodles/2016/phoebe-snetsingers-85th-birthday-5179281716019200-hp.gif'

Your code will change to something like:

    var rp = require('request-promise'),
        cheerio = require('cheerio'),
        url = require('url'),
        base = 'http://www.google.com';

    var options = {
        uri: base,
        method: 'GET',
        resolveWithFullResponse: true
    };

    rp(options)
        .then(function (response) {
            // declare $ locally instead of leaking an implicit global
            var $ = cheerio.load(response.body);
            var relativeLinks = $("img");
            relativeLinks.each(function () {
                var link = $(this).attr('src');
                var fullImagePath = url.resolve(base, link); // should be absolute
                console.log(link);
                if (link.startsWith('http')) {
                    console.log('abs');
                } else {
                    console.log('rel');
                }
            });
        });
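As an aside, the Node documentation marks url.resolve as a legacy API and points to the WHATWG URL class instead. A rough equivalent might look like the sketch below; toAbsoluteUrl is a hypothetical helper name, and returning null for unparsable input is just one possible choice, not something the answers above prescribe:

```javascript
// Sketch only: the same resolution using the WHATWG URL class.
// `toAbsoluteUrl` is a hypothetical helper, not part of any library.
const { URL } = require('url');

function toAbsoluteUrl(src, base) {
  try {
    // new URL(input, base) resolves input against base;
    // an input that is already absolute ignores the base
    return new URL(src, base).href;
  } catch (err) {
    // the URL constructor throws a TypeError when the result is not a valid URL
    return null;
  }
}

const base = 'http://www.google.com';
console.log(toAbsoluteUrl('/logos/cat.gif', base));
console.log(toAbsoluteUrl('https://test.com/dog.gif', base));
```

Unlike url.resolve, the constructor throws on invalid input, which is why the sketch wraps it in try/catch.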

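It looks like you are using jQuery, so you can do:

    $('img').each(function(i, e) { console.log(e.src) });

If you read the src property instead of the attribute, relative paths come back as absolute URLs. This relies on the DOM src property in a browser; cheerio elements in Node do not expose it.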
