Why is JSoup timing out at random places in my code?
I am using JSoup in Java to scrape retrosheet.org for a baseball coding project.
My code makes multiple JSoup connections, and some of them run inside a loop, so in total the program makes hundreds of connections to scrape the data it needs.
The program works for ~5 seconds and then hangs on a connection (a different one each time). When I then try to open the website separately in my browser, it will not load. What could be causing this? Is there an issue with making too many connections?
Here is an example of a connection I am making (all connections follow this same format):
doc = Jsoup.connect("https://www.retrosheet.org/boxesetc/index.html").maxBodySize(0).userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15").get();
This is the error I am getting
java web-scraping connection timeout jsoup
asked Mar 8 at 23:33, edited Mar 8 at 23:43
Jacob Snyder
1 Answer
This is most definitely load protection on the target website's side: it detects too many requests from the same IP and either blocks that IP for a while or throttles the number of connections/requests from it. That is why you can't open the website in your browser either. It is not about JSoup or Java at all; connections/requests from your IP to the target website are being blocked or throttled.
Is there a way around this? Thank you for the answer.
– Jacob Snyder
Mar 9 at 0:00
Well, you could throttle your requests, e.g. insert delays in the code that makes them. You could also implement retries (optionally with a delay between retries as well). The number of connections you create might also be a problem: JSoup will probably not reuse connections, but Commons HTTPClient with a pooling connection manager will. You could retrieve the HTML via Commons HTTPClient and then use JSoup for parsing only (not using its HTTP client capabilities). Best of all, do all of this (delays + retries + Commons HTTPClient for retrieval).
– mvmn
Mar 9 at 0:04
Here's the method to parse a String as HTML via JSoup (the base URL parameter is there to let JSoup produce absolute URLs from relative ones, BTW): jsoup.org/apidocs/org/jsoup/…
– mvmn
Mar 9 at 0:06
P.S. If my answer properly addresses your problem - would you mind upvoting it and/or marking it as a correct answer? Thanks!
– mvmn
Mar 9 at 11:41
answered Mar 8 at 23:50
mvmn