1

In short, I'm using preg_replace to find stylesheets and essentially proxy them for viewers on my website: I take the external domain and prepend it to the current href. The stylesheet link starts like so:

<link rel="stylesheet" type="text/css" href="/assets/css/base.css">

I take the href and prepend the domain, so that it becomes:

<link rel="stylesheet" type="text/css" href="http://www.website.com/assets/css/base.css">

My issue arises when I encounter a site whose URLs are protocol-relative, i.e. they omit the http:/https: scheme:

<link rel="stylesheet" type="text/css" href="//cdn.website.com/assets/css/base.css">

In that case my current preg_replace misfires and rewrites the stylesheet link to the following:

<link rel="stylesheet" type="text/css" href="http://www.website.com//cdn.website.com/assets/css/base.css">

Is it possible to build some sort of if/then into preg_replace so that it leaves the "//" hrefs alone and only rewrites the ones that have no domain of their own?

Current preg_replace being used:

$html = file_get_contents($website_url);
$domain = 'website.com';
$html = preg_replace("/(href|src)\=\"([^(http)])(\/)?/", "$1=\"$domain$2", $html);
echo $html;

Simple: don't use regexes. Use a DOM parser; once you've got the href attribute's contents, it's a simple string-replace operation. Commented Jun 13, 2014 at 22:22

3 Answers

2

Regex does support if/then/else conditionals, but they aren't really necessary for this to work:

(?!(href|src)=)(\")\/(\\w+.+)(\">)

Code:

$html = file_get_contents($website_url);
$domain = 'http://website.com';
$result = preg_replace("/(?!(href|src)=)(\")\/(\\w+.+)(\">)/u", "$2$domain/$3$4", $html);
echo $result;

Output:

<link rel="stylesheet" type="text/css" href="http://website.com/assets/css/base.css">

Example:

http://regex101.com/r/kU7pF1
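
The pattern above only matches when the attribute value is immediately followed by `">`, and its greedy `.+` can run past the first closing quote on a line. As an alternative, here is a minimal sketch that keys on the attribute name and uses a negative lookahead to skip `//` values. The helper name `prefixRootRelative` and the sample URLs are hypothetical, and it assumes double-quoted attributes and root-relative paths only (a value such as `href="style.css"` is left untouched).

```php
<?php
// Hypothetical helper: prefix root-relative href/src values with $base.
function prefixRootRelative(string $html, string $base): string
{
    // href=" or src=" followed by one "/" that is NOT followed by a second "/".
    $pattern = '~\b(href|src)="/(?!/)~i';

    // A callback keeps any "$" or "\" in $base from being read as a backreference.
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($base): string {
            return $m[1] . '="' . rtrim($base, '/') . '/';
        },
        $html
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

$html = '<link rel="stylesheet" href="/assets/css/base.css">' . "\n"
      . '<link rel="stylesheet" href="//cdn.website.com/assets/css/base.css">' . "\n"
      . '<link rel="stylesheet" href="http://other.example/x.css">' . "\n";

// Only the first link is rewritten; the other two are passed through unchanged.
echo prefixRootRelative($html, 'http://www.website.com');
```

Absolute URLs are skipped automatically here because they do not start with a slash, so the lookahead only has to rule out the protocol-relative case.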

1

[^(http)] is not a negation of the string "http". It's still a character class: it matches any single character other than (, h, t, p or ).

You are looking for a (?!...) negative lookahead:

 ~  (href|src) =\" (?!http:)  \/?  ~x

While I dispute the SO meme and overgeneralization of firing up a DOM traversal for every trivial task, it should be noted that regex is usually only appropriate for normalized, well-known HTML input, not for a task like proxying arbitrary websites.
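
To make the difference concrete, the sketch below runs the question's character class and a lookahead against a few sample attributes. The function name `matchResults` and the samples are made up for illustration, and the lookahead `(?!https?:|//)` is an assumed extension that also covers https: and protocol-relative URLs; other schemes such as mailto: or data: are not handled.

```php
<?php
// Hypothetical helper: report whether each pattern matches the given attribute.
function matchResults(string $attr): array
{
    return [
        // Character class: ONE character that is not "(", "h", "t", "p" or ")".
        'charClass' => preg_match('~href="[^(http)]~', $attr),
        // Negative lookahead: inspects the upcoming text without consuming it.
        'lookahead' => preg_match('~href="(?!https?:|//)~', $attr),
    ];
}

$samples = [
    'href="/assets/css/base.css"',        // root-relative: should be rewritten
    'href="//cdn.website.com/base.css"',  // protocol-relative: should be left alone
    'href="http://website.com/base.css"', // absolute: should be left alone
    'href="themes/base.css"',             // relative path starting with "t"
];

foreach ($samples as $attr) {
    $r = matchResults($attr);
    printf("%-38s charClass=%d lookahead=%d\n", $attr, $r['charClass'], $r['lookahead']);
}
```

The character class wrongly matches the `//cdn...` value (a slash is not one of the excluded characters) and wrongly skips `themes/base.css` (it starts with an excluded `t`), whereas the lookahead gets both cases right.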

0
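// $mungeHref is a callable you supply: it receives the original href
// and returns the rewritten one.
function alterLinks($html, callable $mungeHref) {

  $dom = new DOMDocument();
  // Real-world HTML is rarely valid; keep libxml warnings out of the output.
  libxml_use_internal_errors(true);
  $dom->loadHTML($html);
  libxml_clear_errors();

  $links = $dom->getElementsByTagName('a');

  foreach ($links as $alink) {
    $href = $alink->getAttribute('href');
    $alink->setAttribute('href', $mungeHref($href));
  }

  return $dom->saveHTML();
}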
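
The code above leaves the href-rewriting step undefined and only visits `a` elements. For the question's case, one possible sketch follows. The names `absolutizeUrl` and `rewriteUrls` and the tag/attribute map are assumptions for illustration, and every path that is not already absolute is treated as relative to the site root, which is a simplification rather than full URL resolution.

```php
<?php
// Hypothetical stand-in for the rewriting step: make $url absolute against $base.
function absolutizeUrl(string $url, string $base): string
{
    // Leave empty, fragment-only, protocol-relative (//host) and scheme-qualified values alone.
    if ($url === '' || $url[0] === '#' || strpos($url, '//') === 0
        || preg_match('~^[a-z][a-z0-9+.-]*:~i', $url)) {
        return $url;
    }
    // Simplification: treat everything else as relative to the site root.
    return rtrim($base, '/') . '/' . ltrim($url, '/');
}

function rewriteUrls(string $html, string $base): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Tag name => attribute that carries the URL.
    $targets = ['link' => 'href', 'a' => 'href', 'script' => 'src', 'img' => 'src'];

    foreach ($targets as $tag => $attr) {
        foreach ($dom->getElementsByTagName($tag) as $node) {
            if ($node->hasAttribute($attr)) {
                $node->setAttribute($attr, absolutizeUrl($node->getAttribute($attr), $base));
            }
        }
    }

    return $dom->saveHTML();
}

$html = '<html><head>'
      . '<link rel="stylesheet" type="text/css" href="/assets/css/base.css">'
      . '<link rel="stylesheet" type="text/css" href="//cdn.website.com/assets/css/base.css">'
      . '</head><body><img src="img/logo.png"></body></html>';

echo rewriteUrls($html, 'http://www.website.com');
```

Because the check runs on the attribute value itself rather than on the raw markup, quoting style and attribute order in the source HTML no longer matter.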

3 Comments

Welcome to StackOverflow. While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. Consider editing your answer to add that context.
Some of the comments in this thread involved regular expressions. I recently had a "change hrefs" problem while writing a plugin for a dynamic CMS, so that it could optionally output static HTML instead. I tried but failed to get preg_replace and regular expressions to work. The code above is clean and simple, and it worked for me. I didn't write the mungeHref($href) function above because my needs were different from yours. That's the easy part anyway.
FWIW, I used almost identical code to rework the "src" attributes of all images in a dynamic HTML page, so that it could then be written out as static HTML. But that's a different topic.
