Scrapy Avoid External Links Missing Protocol

I'm new to Scrapy and have run into a problem. Several pages on the site I'm scraping contain external links in the following format:



<a href="www.externalsite.com/somepage">www.externalsite.com/somepage.</a>


Because the protocol is missing from the link, Scrapy treats the href as a relative URL and resolves it against the URL of the current page, resulting in a link like so:



https://www.basedomain.com/page1/www.externalsite.com/somepage


This is perfectly reasonable, as it's the same thing a browser does when you click a link that is missing the protocol. The problem is that in Scrapy this creates a spider trap, with the crawler following ever-longer links like these:



https://www.basedomain.com/page1/www.externalsite.com/somepage/www.externalsite.com/somepage
https://www.basedomain.com/page1/www.externalsite.com/somepage/www.externalsite.com/somepage/www.externalsite.com/somepage


Eventually the URL is so long the server returns 500 and the loop stops.
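To make the mechanism concrete, here is a small sketch using only the standard library's `urljoin`, which applies the same relative-URL resolution rules. The helper names `resolve` and `simulate_trap` are made up for illustration, and the loop assumes (my assumption, not something I've confirmed about the site) that the server answers each nested URL at a trailing-slash address with a page containing the same href:

```python
from urllib.parse import urljoin

# Scheme-less href taken from the example above.
HREF = "www.externalsite.com/somepage"


def resolve(page_url, href=HREF):
    """Resolve href against page_url the way a relative link is resolved."""
    return urljoin(page_url, href)


def simulate_trap(start_url, hops=3, href=HREF):
    """Return the links a crawler would extract over `hops` successive pages."""
    links = []
    page = start_url
    for _ in range(hops):
        link = resolve(page, href)
        links.append(link)
        # Assumption: the next page is served at the link plus a trailing slash
        # and contains the same scheme-less href again.
        page = link + "/"
    return links


if __name__ == "__main__":
    for link in simulate_trap("https://www.basedomain.com/page1/"):
        print(link)
```

Each hop appends another copy of the href to the path, which is the growth pattern shown in the URLs above.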



I know there must be a way to avoid this issue with the LinkExtractor, but I just don't know how to do it. I would also prefer not to hard-code a case for this site, and instead find a solution that handles this scenario on any site. Any information would be greatly appreciated.
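One possible direction, offered as a sketch rather than a confirmed fix: Scrapy's `LinkExtractor` accepts a `process_value` callable that receives each extracted attribute value and may return `None` to ignore that link. The filter below is a hypothetical helper (`skip_schemeless_www` is a name I made up), and the heuristic is an assumption: it only catches hrefs that begin with `www.` followed by something that looks like a domain, so scheme-less links without a `www.` prefix would still slip through.

```python
import re

# Heuristic (an assumption, not a Scrapy feature): an href starting with
# "www.", then a dotted name ending in a TLD-like suffix, followed by
# "/", "?", "#" or the end of the string, is a host missing its scheme.
_WWW_HOST = re.compile(r"^www\.[^/\s]+\.[a-z]{2,}(?:[/?#]|$)", re.IGNORECASE)


def skip_schemeless_www(value):
    """Return None for hrefs like 'www.example.com/page', else the value unchanged."""
    if _WWW_HOST.match(value.strip()):
        return None  # tells the link extractor to drop this link
    return value
```

It would be wired in as `LinkExtractor(process_value=skip_schemeless_www)` inside the spider's rules. If the external links should be repaired rather than dropped, returning `"http://" + value` instead of `None` is another option, although whether those requests are then followed depends on settings such as `allowed_domains`.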










      python scrapy
















      asked Mar 6 at 19:50 by LandonC





















