Issue
I need to fetch a remote page, modify some elements (using the 'PHP Simple HTML DOM Parser' library) and output the modified content.
The problem is with remote pages that use relative URLs in their source: when I output the page from my own server, CSS files and images are not loaded. That doesn't stop me from modifying elements, but the output looks bad.
For example, https://www.raspberrypi.org/downloads/ renders correctly when opened directly in a browser.
However, if you fetch and echo it with this code:
$html = file_get_html('http://www.raspberrypi.org/downloads');
echo $html;
the page looks bad. I tried a simple hack, inserting a <base> tag before </head>, but it helps only a little:
$html = file_get_html('http://www.raspberrypi.org/downloads');
$html=str_ireplace("</head>", "<base href='http://www.raspberrypi.org'></head>", $html);
echo $html;
Is there any way to instruct the script to resolve all links in the $html variable against 'http://www.raspberrypi.org'? In other words, how can I make raspberrypi.org the "main" source for all images and CSS files fetched?
Solution
Since only Nikolay Ganovski offered a solution, I wrote code that completes partial pages by finding css/img/form tags with incomplete URLs and making those URLs absolute. In case someone needs it, the code is below:
// Finalizes a remote page by completing incomplete css/img/form URLs (path/file.css becomes http://somedomain.com/path/file.css, etc.)
// Note: every relative URL is treated as relative to $root_url, not to the directory of the page it came from.
function finalize_remote_page($content, $root_url)
{
    $root_url = rtrim($root_url, '/'); // avoid a double slash when joining
    $content_object = str_get_html($content);
    if (!is_object($content_object))
    {
        return $content; // parsing failed, return the input unchanged
    }
    // selector => attribute that holds the URL
    $targets = array('link[rel=stylesheet]' => 'href', 'img' => 'src', 'form' => 'action');
    foreach ($targets as $selector => $attribute)
    {
        foreach ($content_object->find($selector) as $entry)
        {
            $url = $entry->$attribute;
            // skip missing or empty attributes, protocol-relative URLs (//somedomain.com/...)
            // and URLs that already carry a scheme (http:, https:, data:, ...)
            if (!is_string($url) || $url === '' || substr($url, 0, 2) == '//' || preg_match('/^[a-z][a-z0-9+.\-]*:/i', $url))
            {
                continue;
            }
            $entry->$attribute = $root_url.'/'.ltrim($url, '/');
        }
    }
    return $content_object;
}
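The function above prefixes every relative URL with the site root, which is enough when the page only uses root-relative paths. For pages that also use document-relative paths such as ../img/logo.png, a resolver along the following lines might help. This is only a sketch under stated assumptions: absolutize_url and its $page_url parameter are hypothetical names that are not part of the original code or of the Simple HTML DOM library, and it relies only on PHP's built-in parse_url. It is not a complete implementation of URL resolution.

```php
<?php
// Hypothetical helper (not part of the original answer): resolve $url against
// the URL of the page it was found on, using only built-in PHP functions.
function absolutize_url($url, $page_url)
{
    $base = parse_url($page_url);
    if ($base === false || !isset($base['scheme'], $base['host'])) {
        return $url; // the page URL is not absolute, so there is nothing to resolve against
    }
    $origin = $base['scheme'].'://'.$base['host'].(isset($base['port']) ? ':'.$base['port'] : '');
    $path = isset($base['path']) ? $base['path'] : '/';

    // Leave empty values, fragment-only links and URLs with a scheme untouched.
    if ($url === '' || $url[0] === '#' || preg_match('/^[a-z][a-z0-9+.\-]*:/i', $url)) {
        return $url;
    }
    // Protocol-relative URL: borrow the scheme of the page.
    if (substr($url, 0, 2) === '//') {
        return $base['scheme'].':'.$url;
    }
    // Split off the query string and fragment so they are not treated as path segments.
    $suffix = '';
    $cut = strcspn($url, '?#');
    if ($cut < strlen($url)) {
        $suffix = substr($url, $cut);
        $url = substr($url, 0, $cut);
    }
    if ($url === '') {
        return $origin.$path.$suffix; // query-only reference such as "?page=2"
    }
    // Root-relative paths start from "/", document-relative ones from the page's directory.
    $joined = ($url[0] === '/') ? $url : substr($path, 0, strrpos($path, '/') + 1).$url;

    // Collapse "." and ".." segments.
    $segments = array();
    foreach (explode('/', $joined) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }
    $resolved = '/'.implode('/', $segments);
    if (substr($joined, -1) === '/' && $resolved !== '/') {
        $resolved .= '/'; // keep a trailing slash on directory-style URLs
    }
    return $origin.$resolved.$suffix;
}

echo absolutize_url('../img/logo.png', 'http://www.raspberrypi.org/downloads/'), "\n";
```

A call like this could stand in for the plain $root_url concatenation inside finalize_remote_page, provided the full URL of the fetched page is passed in rather than only the site root.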
Answered By - Mindaugas Li