Thursday

Get href values from all <a> tags which are not children of another <a> tag

5:42 AM domdocument, href, html, php, regex

Issue

I've been searching for hours (there shouldn't be any duplicate) and have tried many different approaches, using both RegEx (regular expressions) and DOMDocument, without success.

Non-Standard HTML Code:

<a class="SOMECLASS" href="javascript:__FUNCTION(SOME_HREF_INSIDE)" onclick="SOME_JS_FUNCTION();" id="SOME_ID" style="SOME_STYLE">
    <a href="SOME_URL_3">SOME TEXT</a>
</a>

The problem is that I'm trying to get the URL "SOME_URL_3", but whether I parse with RegEx or DOMDocument, the parsing stops as soon as it encounters the first href. Since the second "a" tag sits inside the first one, the parser only sees them as one.

I observed that browsers seem to automatically separate the tags when parsing, as follows.

Before:

<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

After:

<a href="SOME_URL">
</a>
<a href="SOME_URL_2">
</a>

I've not been able to replicate this browser behavior in PHP.

Previous Attempt:

$dom = new DOMDocument();
@$dom->loadHTML($result);

foreach($dom->getElementsByTagName('a') as $link) { 
    $href_count = 0;
    $attrs = array();

    for ($i = 0; $i < $link->attributes->length; ++$i) {
        $node = $link->attributes->item($i);
        if ($node->nodeName == "href") {
            $attrs[$node->nodeName][$href_count] = $node->nodeValue;
            $href_count++;
            if ($href_count >= 2) {
                echo "A second href has been found";
            }
        }
    }

    echo "<pre>";
    var_dump($attrs);
    echo "</pre>";
}

As you may expect, it unfortunately doesn't work, otherwise I wouldn't be here asking for help...

Please do not hesitate to share your knowledge, any help or suggestion will be greatly appreciated!

Update:

I had forgotten to specify in my initial question that the answer should still capture the href from standard/non-nested "a" tags. My goal is to extend/improve my existing HTML parser so that it retrieves the URL from every href attribute. My initial code used only RegEx, and I wasn't able to capture an additional href from within nested "a" tags. The solution I'm looking for would capture the href from both nested and standard/non-nested "a" tags. Brandon White's solution is great for nested hrefs only. However, it would be resource-consuming to use two different RegEx (nested/non-nested) and parse the entire HTML content twice. An ideal solution would be a single RegEx that captures both at the same time, if this is possible.
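
The single-pass requirement described in this update can be illustrated with a minimal sketch. It is not part of the original question or answer: the function name `extract_hrefs_regex` and the sample markup are assumptions made for illustration. The idea is to match only opening "a" tags, so it makes no difference whether a tag is nested inside another one.

```php
<?php
// Hypothetical helper (an illustration, not code from the original post).
// Scans the raw markup once and returns the href of every opening <a> tag.
function extract_hrefs_regex(string $html): array
{
    // <a\b        : an opening "a" tag (does not match </a>, <abbr>, <area>)
    // [^>]*?\s    : any other attributes, then whitespace before "href"
    // (["\'])     : the opening quote, reused as \1 to find the closing quote
    // (.*?)       : the href value itself
    $pattern = '/<a\b[^>]*?\shref\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);

    return $matches[2];
}

$html = <<<HTML
<a class="SOMECLASS" href="javascript:__FUNCTION(SOME_HREF_INSIDE)" id="SOME_ID">
    <a href="SOME_URL_3">SOME TEXT</a>
</a>
<a href="SOME_URL_5"></a>
HTML;

print_r(extract_hrefs_regex($html));
```

For this sample the function returns the three href values in document order: the javascript: value, SOME_URL_3 and SOME_URL_5. The sketch assumes quoted attribute values and no ">" character inside any attribute before the href; markup that breaks those assumptions is likely better handled with a DOM parser.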

Solution

The following code extracts the href values of all <a> tags, whether nested or not.

$result = <<<HTML
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

<a href="SOME_URL3">
    <a href="SOME_URL_4">
    </a>
</a>

<a href="SOME_URL_5">
</a>
<a href="SOME_URL_6">
</a>

HTML;

$dom = new DOMDocument();
@$dom->loadHTML($result); //Suppress warnings caused by the non-standard markup
$output = array(); //Collected href values
foreach($dom->getElementsByTagName('a') as $link) {
    $tag_html = $dom->saveHTML($link); //Get the tag's outer HTML, including any nested tags
    
    if (substr_count($tag_html, "href") > 1) { //If the tag's HTML contains more than one href
        preg_match_all('/href="([^"]*)"/is', $tag_html, $link_output, PREG_SET_ORDER);
        $output[] = $link_output[1][1]; //Output second href (second match, first capture group)
    } else { //Not nested tag
        $output[] = $link->getAttribute('href'); //Output first href
    }
}

echo "<pre>";
print_r($output);
echo "</pre>";

Output:

<pre>Array
(
    [0] => SOME_URL
    [1] => SOME_URL_2
    [2] => SOME_URL3
    [3] => SOME_URL_4
    [4] => SOME_URL_5
    [5] => SOME_URL_6
)
</pre>
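
A possible simplification, sketched here as an assumption rather than as the answerer's method: getElementsByTagName() returns every descendant <a> element at any depth, so reading the href attribute of each element may be enough without inspecting the tag's HTML. The function name `extract_hrefs_dom` is hypothetical, and libxml_use_internal_errors() is used in place of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper (an illustration, not code from the original answer).
// Returns the href of every <a> element in document order.
function extract_hrefs_dom(string $html): array
{
    // Collect libxml warnings about the non-standard markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    // Each <a> element is visited once, whether or not it sits inside another <a>.
    foreach ($dom->getElementsByTagName('a') as $link) {
        if ($link->hasAttribute('href')) {
            $hrefs[] = $link->getAttribute('href');
        }
    }

    return $hrefs;
}

$result = <<<HTML
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

<a href="SOME_URL3">
    <a href="SOME_URL_4">
    </a>
</a>

<a href="SOME_URL_5">
</a>
<a href="SOME_URL_6">
</a>
HTML;

print_r(extract_hrefs_dom($result));
```

On the same input as above, this should list the same six values in the same order. Anchors without an href attribute are skipped rather than reported as empty strings.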

Answered By - ElGatito

This Answer collected from stackoverflow and tested by AngularFix community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
