How to send bulk GET requests using Node.js?

I wrote a web crawler in Node.js that sends GET requests to about 300 URLs. Here is the main loop:

for (let i = 1; i <= 300; i++) { 
    let page= `https://xxxxxxxxx/forum-103-${i}.html`
    await getPage(page,(arr)=>{
        console.log(`page ${i}`)
    })
}

Here is the function getPage(url,callback):

export default async function getPage(url, callback) {
    await https.get(url, (res) => {
        let html = ""
        res.on("data", data => {
            html += data
        })
        res.on("end", () => {
            const $ = cheerio.load(html)
            let obj = {}
            let arr = []
            obj = $("#threadlisttableid tbody")
            for (let i in obj) {
                if (obj[i].attribs?.id?.substr(0, 6) === 'normal') {
                    arr.push(`https://xxxxxxx/${obj[i].attribs.id.substr(6).split("_").join("-")}-1-1.html`)
                }
            }
            callback(arr)
            console.log("success!")
        })
    })
        .on('error', (e) => {
            console.log(`Got error: ${e.message}`);
        })
}

I use cheerio to parse the HTML and collect the information I need in a variable named 'arr'. The program runs normally for a while and then starts reporting errors like this:

...
success!
page 121
success!
page 113
success!
page 115
success!
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443

I have two questions:

1. What is the reason for the error? Is it because I am sending too many GET requests? How can I limit the request frequency?

2. As you can see, the order in which the pages are accessed is chaotic. How can I control it?

I have tried using other modules to send the GET requests (such as Axios), but it didn't work.

about 3 years ago · Juan Pablo Isaza
2 answers

As you can see, the order in which the pages are accessed is chaotic. How can I control it?

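await is meaningless unless you put a promise on the right-hand side. https.get does not deal in promises: it returns a ClientRequest object immediately and delivers the response through a callback, so the loop moves on without waiting.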
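
To make the difference concrete, here is a minimal sketch. fakeGet is a hypothetical stand-in for a callback-style API like https.get, built on setTimeout so it runs without a network; none of these names come from the question's code.

```javascript
// fakeGet is a hypothetical stand-in for a callback-style API such as https.get:
// it returns immediately and invokes the callback later.
function fakeGet(label, delayMs, callback) {
    setTimeout(() => callback(label), delayMs)
    return undefined // not a promise, so there is nothing for await to wait on
}

// Wraps fakeGet in a promise that resolves when the callback fires.
function fakeGetAsync(label, delayMs) {
    return new Promise((resolve) => fakeGet(label, delayMs, resolve))
}

// Mirrors the question's loop: await is applied to a non-promise.
async function withoutPromise(log) {
    await fakeGet("slow", 30, (label) => log.push(`done ${label}`))
    await fakeGet("fast", 5, (label) => log.push(`done ${label}`))
    log.push("loop finished") // reached before either callback has run
}

// Same calls, but each await now has a promise to wait on.
async function withPromise(log) {
    log.push(`done ${await fakeGetAsync("slow", 30)}`)
    log.push(`done ${await fakeGetAsync("fast", 5)}`)
    log.push("loop finished") // reached only after both calls have completed
}
```

In withoutPromise the callbacks fire after the function has already finished, and in order of their delays rather than in call order, which is the same "chaotic" ordering seen in the question's output.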

You could wrap it in a promise, but it would be easier to use an API which supports them natively, such as node-fetch, axios, or Node.js's native fetch. (They all have APIs that are, IMO, easier to use than https.get in general, not just with regard to flow control.)
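
As a rough sketch of the promise-based route, assuming Node.js 18 or later where fetch is available as a global: getHtml and its fetchImpl parameter are hypothetical names introduced here, and the parameter exists only so that the function can be exercised without a network.

```javascript
// Fetches a page and resolves with its HTML as a string.
// fetchImpl defaults to the global fetch (assumed: Node.js 18+); it is a
// parameter only so that a stub can be passed in for testing.
async function getHtml(url, fetchImpl = globalThis.fetch) {
    const res = await fetchImpl(url)
    if (!res.ok) {
        // Treat non-2xx responses as failures instead of parsing an error page
        throw new Error(`Request to ${url} failed with status ${res.status}`)
    }
    return res.text()
}
```

Inside the loop this would be used as const html = await getHtml(page), followed by cheerio.load(html) as before. Because getHtml returns a promise, await really does pause the loop until the page has arrived.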

What is the reason for the error?

It isn't clear.

Is it because I am sending too many get requests?

That is a likely hypothesis.

How can I limit the request frequency?

Once you have your for loop working with promises so the requests are sent in serial instead of parallel, you can insert a sleep between each request.
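
A minimal sketch of that idea, assuming getPage already returns a promise: sleep and crawlSerially are hypothetical helper names, and the delay you pass is something to tune for the site being crawled.

```javascript
// Resolves after ms milliseconds.
function sleep(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms))
}

// Calls getPage for each URL in order, one at a time, pausing delayMs
// between consecutive requests. getPage must return a promise.
async function crawlSerially(urls, getPage, delayMs) {
    const results = []
    for (let i = 0; i < urls.length; i++) {
        results.push(await getPage(urls[i])) // wait for this page before continuing
        if (i < urls.length - 1) {
            await sleep(delayMs) // pause before sending the next request
        }
    }
    return results
}
```

With the question's URLs this could be called as await crawlSerially(urls, getPage, 1000), where urls is the list of 300 page addresses and 1000 ms is an arbitrary starting value, not a recommendation.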

about 3 years ago · Juan Pablo Isaza

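The HTTP requests are fired almost simultaneously because the loop does not wait for the previous request to finish: await is applied to https.get, which does not return a promise. Once the loop actually waits for each request, only one request is in flight at a time, which also limits the request frequency.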


for (let i = 1; i <= 300; i++) {
    const page = `https://xxxxxxxxx/forum-103-${i}.html`
    // getPage returns a promise, so await pauses the loop until the page is done
    const arr = await getPage(page)
    // use arr in the way you want
    console.log(`page ${i}`)
}

import https from "https"
import * as cheerio from "cheerio"

export default function getPage(url) {
    // Wrap the callback-based https.get in a promise so that callers can await it.
    return new Promise((resolve, reject) => {
        https.get(url, (res) => {
            res.setEncoding("utf8") // decode chunks as text so multi-byte characters are not split
            let html = ""
            res.on("data", (data) => {
                html += data
            })
            res.on("end", () => {
                const $ = cheerio.load(html)
                const arr = []
                // Collect a thread link for every tbody whose id starts with "normal"
                for (const el of $("#threadlisttableid tbody").toArray()) {
                    const id = el.attribs?.id
                    if (id?.startsWith("normal")) {
                        arr.push(`https://xxxxxxx/${id.slice(6).split("_").join("-")}-1-1.html`)
                    }
                }
                console.log("success!")
                resolve(arr) // Resolve with arr
            })
        })
        .on("error", (e) => {
            console.log(`Got error: ${e.message}`)
            // Reject instead of throwing: a throw inside this event handler
            // would not reject the promise, so the caller would never see it.
            reject(e)
        })
    })
}
about 3 years ago · Juan Pablo Isaza