Extract both href and text on same line using Xidel, specific links only


I am trying to extract the link (href) and the text inside the <a> tag for a number of links in an HTML page.



I only want specific links, which I match by a substring.



Example of my HTML:



<a href="/this/dir/1234/">This should be 1234</a> some other html
<a href="/this/dir/1236/">This should be 1236</a> some other html
<a href="/about_us/">Not important link</a> some other html


I am using Xidel, which allows me to avoid regular expressions. It seems to be the simplest tool for the job.



What I have so far:



xidel myhtml.htm -e "//a/(@href[contains(.,'/this/dir')],text())"


It basically works, but two issues remain:



  • I get the data separated by linefeeds. I would like to have each href and its text on the same line.

  • Every link text is returned, so I get the text "Not important link" as well.

What is the recommended way to get output like this?



/this/dir/1234 ; This should be 1234
/this/dir/1236 ; This should be 1236


Appreciate any feedback / tips.



Edit:



The solution provided by Martin was 99% there. Newlines were not output, so I am using awk to replace a dummy separator with newlines.



Note: I am on Windows.



xidel myhtml.htm -e "string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), 'XXX')" | awk -F "XXX" "$1=$1" "OFS=\n"









Tags: xquery, xidel

asked Mar 7 at 7:32 by MyICQ
edited Mar 12 at 20:10 by Mr Lister

1 Answer

You can move the condition into a predicate, e.g. //a[contains(@href, '/this/dir')]!(@href, string()). As for the result format, what happens if you delegate it all to XQuery with

string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), '&#10;')
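
For readers who want to see the filter-and-join logic outside Xidel, the following is a minimal Python sketch using only the standard library. It is an analogy, not how Xidel evaluates the expression; the names LinkCollector and matching_links and the SAMPLE markup (copied from the question) are illustrative assumptions. It keeps only the anchors whose href contains the substring, then pairs each href with its link text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # A missing or valueless href is treated as an empty string.
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def matching_links(html, needle):
    """Return 'href ; text' strings for anchors whose href contains needle."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    # The filter applies to the whole anchor, like the predicate on //a,
    # so the text of a non-matching link is dropped together with its href.
    return [f"{href} ; {text}" for href, text in parser.links if needle in href]


SAMPLE = """
<a href="/this/dir/1234/">This should be 1234</a> some other html
<a href="/this/dir/1236/">This should be 1236</a> some other html
<a href="/about_us/">Not important link</a> some other html
"""

if __name__ == "__main__":
    print("\n".join(matching_links(SAMPLE, "/this/dir")))
```

The point of the sketch is the placement of the filter: testing the anchor as a whole removes the unwanted link text, whereas the expression in the question only filtered the href attribute and still returned every text node.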






answered Mar 7 at 10:55 by Martin Honnen (edited Mar 7 at 11:28)

• Thank you Martin! That was 99% correct. See my edit to original question. I did not know about the predicate.

  – MyICQ
  Mar 7 at 14:22

• The use of '&#10;' is XQuery syntax, so if Xidel has any option to make sure the expression you pass in is evaluated as XQuery and not plain XPath, then try that. Or use codepoints-to-string(10) instead, e.g. string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), codepoints-to-string(10)); that should go through as XPath.

  – Martin Honnen
  Mar 7 at 14:50
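
As a rough analogy in Python (not Xidel), building the separator from its code point avoids any escape or character-reference syntax, which is the idea behind codepoints-to-string(10). The helper name join_lines is hypothetical and the sample strings are taken from the question.

```python
def join_lines(items):
    """Join items with a line feed built from code point 10."""
    newline = chr(10)  # the same character as "\n", created without escape syntax
    return newline.join(items)


pairs = [
    "/this/dir/1234/ ; This should be 1234",
    "/this/dir/1236/ ; This should be 1236",
]

if __name__ == "__main__":
    print(join_lines(pairs))
```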











• The codepoints-to-string(10) worked. You are brilliant. Thank you!

  – MyICQ
  Mar 7 at 15:05

• @MartinHonnen, by putting the entire query inside string-join() you can expect the entire output to be on a single line. MyICQ likes to have every @href on a separate line, so instead //a[contains(@href,'/this/dir')]/join((@href,.),' ; '), or //a[contains(@href,'/this/dir')]/concat(@href,' ; ',.) would be better.

  – Reino
  Mar 8 at 14:29












• @Reino, can you cite anything from the XQuery spec or XQuery functions spec that supports your claim that using string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), '&#10;') as I have done puts the entire output on a single line? Not sure where your expectations come from; I certainly don't share them. And I don't see why //a[contains(@href,'/this/dir')]/concat(@href,' ; ',.) ensures output on separate lines: you construct a sequence of strings without defining any separator between them.

  – Martin Honnen
  Mar 8 at 15:12
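
The distinction drawn in this last comment can be illustrated with a small Python sketch. It is an analogy only: a list stands in for an XQuery sequence of strings, and render is a hypothetical stand-in for whatever tool prints the query result. A sequence carries no separator of its own, so the printing tool decides how the items are delimited, while a joined string fixes the separator inside the value itself.

```python
def render(result, item_separator):
    """Mimic a tool printing a query result (hypothetical stand-in).

    A list models a sequence of strings: the tool chooses the separator.
    A str models one already-joined string: it is printed unchanged.
    """
    if isinstance(result, str):
        return result
    return item_separator.join(result)


sequence = ["/a ; A", "/b ; B"]    # like .../concat(@href, ' ; ', .)
joined = chr(10).join(sequence)    # like string-join(..., codepoints-to-string(10))

if __name__ == "__main__":
    # Same sequence, two imagined tools: the layout depends on the tool.
    print(render(sequence, "\n"))
    print(render(sequence, " "))
    # The joined string keeps its line breaks whatever the tool prefers.
    print(render(joined, " "))
```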
















