Issue
I've been searching for hours (there shouldn't be any duplicate) and trying many different approaches, using both RegEx (regular expressions) and DOMDocument, without success.
Non-Standard HTML Code:
<a class="SOMECLASS" href="javascript:__FUNCTION(SOME_HREF_INSIDE)" onclick="SOME_JS_FUNCTION();" id="SOME_ID" style="SOME_STYLE">
<a href="SOME_URL_3">SOME TEXT</a>
</a>
The problem is that I'm trying to get the URL "SOME_URL_3", and whether I parse with RegEx or DOMDocument, the parsing stops as soon as it encounters the first href. Since the second "a" tag is nested inside the first one, the parser only sees them as one.
I observed that browsers seem to automatically separate the tags when parsing, as follows.
Before:
<a href="SOME_URL">
<a href="SOME_URL_2">
</a>
</a>
After:
<a href="SOME_URL">
</a>
<a href="SOME_URL_2">
</a>
I've not been able to replicate this browser behavior using PHP.
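One way to see what DOMDocument itself does with such markup is to serialize the parsed tree and count the anchors that still sit inside another anchor. The following is only a minimal sketch: the sample string is a hypothetical one-line copy of the "Before" markup, and the result depends on the libxml version behind DOMDocument, so no particular output is assumed here.

```php
<?php
// Hypothetical sample mirroring the "Before" markup above.
$html = '<a href="SOME_URL"><a href="SOME_URL_2"></a></a>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Serialize the tree as the parser built it, to inspect the anchor structure.
$serialized = $dom->saveHTML();
echo htmlspecialchars($serialized), PHP_EOL;

// Count the <a> elements that have another <a> as an ancestor.
$xpath = new DOMXPath($dom);
$nested = $xpath->query('//a[ancestor::a]')->length;
echo "Anchors nested inside another anchor: $nested", PHP_EOL;
```

If the count is 0, the parser has already split the anchors into siblings, much like the "After" markup.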
Previous Attempt:
$dom = new DOMDocument();
@$dom->loadHTML($result);
foreach ($dom->getElementsByTagName('a') as $link) {
    $href_count = 0;
    $attrs = array();
    for ($i = 0; $i < $link->attributes->length; ++$i) {
        $node = $link->attributes->item($i);
        if ($node->nodeName == "href") {
            $attrs[$node->nodeName][$href_count] = $node->nodeValue;
            $href_count++;
            if ($href_count >= 2) {
                echo "A second href has been found";
            }
        }
    }
    echo "<pre>";
    var_dump($attrs);
    echo "</pre>";
}
As you may expect, it unfortunately doesn't work, otherwise I wouldn't be here asking for help...
Please do not hesitate to share your knowledge, any help or suggestion will be greatly appreciated!
Update:
I forgot to specify in my initial question that the answer should still capture the href from standard (non-nested) "a" tags. My goal is to extend my existing HTML parser so that it retrieves the URL from every href attribute. My initial code used only RegEx, and I wasn't able to capture an additional href from within nested "a" tags. The solution I'm looking for would capture the href from both nested and non-nested "a" tags. Brandon White's solution is great for nested hrefs only. However, using two different RegEx (nested and non-nested) would mean parsing the entire HTML content twice, which is resource-consuming. An ideal solution would be a single RegEx that captures both at the same time, if this is possible.
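As a rough illustration of what such a single-pass expression could look like, here is a minimal sketch. The function name extractAnchorHrefs and the sample markup are hypothetical. The pattern assumes that href values are quoted and that no attribute value before the href contains ">". It only looks at opening "a" tags, so nesting does not matter to it.

```php
<?php
// Hypothetical helper: a single regex pass over the raw HTML.
function extractAnchorHrefs(string $html): array
{
    // <a, whitespace, optionally other attributes, then a quoted href value.
    // Group 1 is the quote character, group 2 is the href value.
    $pattern = '/<a\s+(?:[^>]*?\s)?href\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);
    return $matches[2];
}

// Hypothetical sample: one nested pair and one standard link.
$sample = <<<'HTML'
<a class="SOMECLASS" href="javascript:__FUNCTION(SOME_HREF_INSIDE)" id="SOME_ID">
<a href="SOME_URL_3">SOME TEXT</a>
</a>
<a href="SOME_URL_5">plain link</a>
HTML;

print_r(extractAnchorHrefs($sample));
```

A regular expression like this is brittle on unusual markup (unquoted values, or ">" inside an attribute), so it is best treated as a starting point rather than a full HTML parser.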
Solution
The following code extracts the href values of all <a> tags.
$result = <<<HTML
<a href="SOME_URL">
<a href="SOME_URL_2">
</a>
</a>
<a href="SOME_URL3">
<a href="SOME_URL_4">
</a>
</a>
<a href="SOME_URL_5">
</a>
<a href="SOME_URL_6">
</a>
HTML;
$output = array(); // Collected href values
$dom = new DOMDocument();
@$dom->loadHTML($result); // @ silences warnings about the malformed markup
foreach ($dom->getElementsByTagName('a') as $link) {
    $tag_html = $dom->saveHTML($link); // Serialized HTML of this tag, including its children
    if (substr_count($tag_html, "href") > 1) { // The tag's HTML contains more than one href, so it is nested
        preg_match_all('/href="([^"]*)"/is', $tag_html, $link_output, PREG_SET_ORDER);
        $output[] = $link_output[1][1]; // Second match, first capture group: the inner href
    } else { // Not a nested tag
        $output[] = $link->getAttribute('href'); // The tag's own href
    }
}
echo "<pre>";
print_r($output);
echo "</pre>";
Output:
<pre>Array
(
[0] => SOME_URL
[1] => SOME_URL_2
[2] => SOME_URL3
[3] => SOME_URL_4
[4] => SOME_URL_5
[5] => SOME_URL_6
)
</pre>
Answered By - ElGatito
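A possible simplification, which is not part of the original answer: getElementsByTagName('a') returns every "a" element in document order, whether the parser kept the anchors nested or split them apart, so reading each element's own href may be enough. The sketch below assumes that behavior. The helper name collectAnchorHrefs and the shortened sample are hypothetical.

```php
<?php
// Hypothetical helper, not from the original answer.
function collectAnchorHrefs(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep warnings about malformed markup quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    // Every <a> element is visited, including one placed inside another <a>.
    foreach ($dom->getElementsByTagName('a') as $link) {
        if ($link->hasAttribute('href')) {
            $hrefs[] = $link->getAttribute('href');
        }
    }
    return $hrefs;
}

// Shortened version of the sample used in the answer above.
$result = <<<'HTML'
<a href="SOME_URL">
<a href="SOME_URL_2">
</a>
</a>
<a href="SOME_URL_5">
</a>
HTML;

print_r(collectAnchorHrefs($result));
```

This avoids running a regular expression over the serialized tag and also skips anchors that have no href at all.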