Scrape main content using php
I am building an import tool similar to the medium.com story import tool. So far I have used this code:

    include('includes/import/simple_html_dom.php');

    // get DOM from URL or file
    $html = file_get_html('https://neilpatel.com/blog/starting-over/');

    // find all links
    foreach($html->find('a') as $e)
        echo $e->href . '<br>';

    // find all images
    foreach($html->find('img') as $e)
        echo $e->src . '<br>';

    // find all images with full tag
    foreach($html->find('img') as $e)
        echo $e->outertext . '<br>';

    // find all div tags with id=gbar
    foreach($html->find('div#gbar') as $e)
        echo $e->innertext . '<br>';

    // find all span tags with class=gb1
    foreach($html->find('span.gb1') as $e)
        echo $e->outertext . '<br>';

    // find all td tags with attribute align=center
    foreach($html->find('td[align=center]') as $e)
        echo $e->innertext . '<br>';

    // extract text from table
    echo $html->find('td[align="center"]', 1)->plaintext.'<br><hr>';

    // extract text from HTML
    echo $html->plaintext;

But this scrapes the whole page. Is it possible to find and scrape only the main content, the way the Medium import tool does for any link?
How can I achieve this kind of result?
javascript php jquery html regex
Please tell us what you have tried so far to solve the problem.
– Arikael
Mar 8 at 13:39
The main issue is probably how you recognise the main content. If you can define how to identify it, that would help.
– Nigel Ren
Mar 8 at 13:41
I have tried the above code and got the whole page. I just want the main content, from where the main article starts to where it ends.
– donm
Mar 8 at 13:41
@NigelRen Yes, you are right, but we want to create a general tool for every URL. How do I identify where the main article starts and ends, i.e. only the text content of the article?
– donm
Mar 8 at 13:43
@NigelRen I hope you got my point: every URL's content and tags are different, so how can I identify where the article content starts and ends?
– donm
Mar 8 at 13:44
asked Mar 8 at 13:32 by donm (edited Mar 8 at 14:12)
1 Answer
I'm not completely sure what you are asking or trying to do, but I'll give it a try.
You are trying to identify the main content area, so that you scrape only the needed information without any garbage or unneeded content.
My approach is to rely on the common structures and good practices of well-formatted HTML pages. Consider this:
- The main article is usually wrapped in a single ARTICLE tag on the page.
- The H1 tag inside the article is its header.
- Some IDs are used again and again (main_content, main_article, etc.).
Summarize those rules for your targets and build a list of identifiers sorted by priority. Then try each identifier against the target page until one is found, which indicates that you have identified the main content area.
Here is an example using the URL you provided:
    $search_logic = [
        "#main_content",
        "#main_article",
        "#main",
        "article",
    ];

    // Fetch the page and parse it. Real-world HTML is rarely valid, so
    // collect libxml warnings internally instead of printing them.
    $html = file_get_contents('https://neilpatel.com/blog/starting-over/');
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($search_logic as $logic) {
        // Search by ID or by tag name:
        if ($logic[0] === "#") {
            // getElementById() returns a DOMElement or null.
            $article = $dom->getElementById(ltrim($logic, '#'));
        } else {
            // getElementsByTagName() returns a DOMNodeList, which is never
            // "empty"; item(0) is null when nothing matched.
            $article = $dom->getElementsByTagName($logic)->item(0);
        }

        // Do we have a result?
        if ($article !== null) {
            echo "> Found main part identified by: " . $logic . "\n";
            // Parse the container, for example to get the title:
            echo " - Example get the title:\n";
            $h1 = $article->getElementsByTagName("h1")->item(0);
            echo "\t" . ($h1 !== null ? $h1->textContent : "(no h1 found)") . "\n\n";
            // You can stop the iteration here:
            // break;
        } else {
            echo "> Nothing on the page containing: " . $logic . "\n\n";
        }
    }
As you can see, the first two IDs were not found, so we keep going down the list until we hit the result we want. A good set of those tag names and IDs will be good enough.
Here is the result:

    > Nothing on the page containing: #main_content
    > Nothing on the page containing: #main_article
    > Found main part identified by: #main
     - Example get the title:
        If I Had to Start All Over Again, I Would…
    > Found main part identified by: article
     - Example get the title:
        If I Had to Start All Over Again, I Would…
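The lookup above only prints the title. As a rough sketch of how the same priority-list idea could return the container itself, so that its text or markup can be imported, the loop can be wrapped in a function that works on an HTML string. The function name findMainContent() and the sample markup are made up for illustration; only standard DOM extension calls are used.

```php
<?php
/**
 * Return the first element matching a prioritized list of identifiers.
 * "#name" entries are looked up by ID, anything else by tag name.
 * findMainContent() is a hypothetical helper name, not a library function.
 */
function findMainContent(string $html, array $identifiers): ?DOMElement
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($identifiers as $identifier) {
        $node = $identifier[0] === '#'
            ? $dom->getElementById(substr($identifier, 1))
            : $dom->getElementsByTagName($identifier)->item(0);
        if ($node instanceof DOMElement) {
            return $node;   // stop at the highest-priority match
        }
    }
    return null;            // none of the identifiers exist on this page
}

// Sample markup (an assumption, standing in for a fetched page).
$page = '<html><body><div id="sidebar"><p>Ads</p></div>'
      . '<article><h1>Title</h1><p>Story text.</p></article></body></html>';

$main = findMainContent($page, ['#main_content', '#main_article', '#main', 'article']);
if ($main !== null) {
    echo trim($main->textContent), "\n";               // plain text of the container
    echo $main->ownerDocument->saveHTML($main), "\n";  // its HTML markup
}
```

Returning on the first hit plays the role of the commented-out break in the loop above: the order of the list decides which container wins when several identifiers match.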
Hope I helped.
Thanks for the help. We can go with this option, but what if the page does not contain any of the tags mentioned above? Is there any other way to do this, maybe in jQuery or JavaScript?
– donm
Mar 8 at 15:21
Have you ever used the medium.com story import tool?
– donm
Mar 8 at 15:23
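For pages that contain none of the listed identifiers, one possible fallback is a crude text-density guess: treat the element whose direct paragraph children hold the most text as the main container. This is not from the answer and says nothing about how the Medium import tool works; the function name guessMainContainer(), the scoring rule, and the sample markup are all assumptions made for illustration.

```php
<?php
/**
 * Guess the main container as the element whose direct <p> children
 * hold the most text. guessMainContainer() is a hypothetical name and
 * the scoring rule is an assumption, not an established algorithm.
 */
function guessMainContainer(string $html): ?DOMElement
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate invalid markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $scores = [];   // node path => total length of paragraph text
    $nodes  = [];   // node path => parent element
    foreach ($dom->getElementsByTagName('p') as $p) {
        $parent = $p->parentNode;
        if (!$parent instanceof DOMElement) {
            continue;
        }
        $key = $parent->getNodePath();   // unique path for each node
        $scores[$key] = ($scores[$key] ?? 0) + strlen(trim($p->textContent));
        $nodes[$key]  = $parent;
    }

    // Pick the parent with the highest score, or null if no text was found.
    $best = null;
    $bestScore = 0;
    foreach ($scores as $key => $score) {
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $nodes[$key];
        }
    }
    return $best;
}

// Sample markup (an assumption): a short nav block and a longer post body.
$page = '<html><body>'
      . '<div id="nav"><p>Home</p><p>About</p></div>'
      . '<div id="post"><p>First paragraph of the story, long enough to count.</p>'
      . '<p>Second paragraph with more text.</p></div>'
      . '</body></html>';

$main = guessMainContainer($page);
echo $main !== null ? $main->getAttribute('id') : '(nothing found)', "\n";
```

A guess like this can be wrong on pages with long comment threads or footers made of paragraphs, so it is probably best tried only after the identifier list has found nothing.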
answered Mar 8 at 15:09 by Shlomi Hassid