How is it possible to get the full link from a string like this:
<a href="https://www.google.com/setprefdomain?prefdom=DE&prev=https://www.google.de/&sig=K_DtcF1dnV7Xn6g9Ir_3SUs6a6TiA%3D">
I want to isolate the string starting after 'href="' and ending with 'A%3D', but only if this string contains the string domain.
I don't really know how to check whether the string 'domain' is included.
My regex so far is: /(?<=href=")(.*)(?=")/gi
I love regex, but I prefer not to parse valid HTML with it, as a matter of stability.
Use a legitimate parsing technique to isolate the href attribute. This ensures that you never accidentally match data-href or any other attribute that shares the consecutive letters href. It also removes the burden of handling both single-quoted and double-quoted values.
After the href attribute is isolated, use includes() or indexOf() to check if domain is anywhere in the string value. If you need to tighten up the accuracy of matching domain, you might then use regex with word boundaries or other checks on surrounding substrings (such as checking if domain occurs before the first ?).
const str = '<a href="https://www.google.com/setprefdomain?prefdom=DE&prev=https://www.google.de/&sig=K_DtcF1dnV7Xn6g9Ir_3SUs6a6TiA%3D">',
url = new DOMParser()
.parseFromString(str, 'text/html')
.documentElement.querySelector('a')
.href;
console.log(url.includes('domain') ? url : null);
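The tighter checks mentioned above could be sketched roughly as follows. The helper names hasDomainBeforeQuery and hasDomainAsWord are made up for illustration, and the sketch assumes the href has already been isolated as a plain string.

```javascript
// Hypothetical helpers for illustration; not part of any library.

// True if 'domain' appears anywhere before the first '?' (outside the query string).
function hasDomainBeforeQuery(href) {
  const beforeQuery = href.split('?', 1)[0];
  return beforeQuery.includes('domain');
}

// True only if 'domain' stands alone as a word (so not inside 'setprefdomain').
function hasDomainAsWord(href) {
  return /\bdomain\b/i.test(href);
}

const href = 'https://www.google.com/setprefdomain?prefdom=DE&prev=https://www.google.de/&sig=K_DtcF1dnV7Xn6g9Ir_3SUs6a6TiA%3D';
console.log(hasDomainBeforeQuery(href)); // true
console.log(hasDomainAsWord(href));      // false
```

Which check is appropriate depends on what "contains domain" should mean for your data; the word-boundary version rejects the URL from the question because domain is only part of a longer word there.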
If you think that parsing a valid anchor tag is too much work for a reliably constructed string, you can use regex as a shortcut (but I probably wouldn't in a professional application).
Use a literal space before href to ensure that you are targeting the correct attribute and not making a partial match on a larger attribute. (A word boundary, \b, is not enough on its own, because it also matches right after the hyphen in data-href.) I am going to presume that the href value is guaranteed to be wrapped in double quotes, so match the string between the double quotes. Within the double quotes, match one or more non-double-quote characters (greedily), then the sought word domain, then one or more non-double-quote characters (greedily). This will return the isolated url if it qualifies and weed out a few fringe cases that could damage the result.
let str = `<a class="domain" data-dummy-href="example.com" href="https://www.google.com/setprefdomain?prefdom=DE&prev=https://www.google.de/&sig=K_DtcF1dnV7Xn6g9Ir_3SUs6a6TiA%3D" style="background-image: url('http://www.example.com/domain/123.png')">`;
// match() returns null when nothing matches, so guard before reading the capture group
console.log(str.match(/ href="([^"]+domain[^"]+)"/i)?.[1] ?? 'Not valid');
If domain may occur at the start or at the end of the href value, then change the first or the second + to *, respectively, to change the quantifier from "one or more" to "zero or more".
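As a rough sketch of that change, the variant below uses * on both sides so that domain may sit at the very start or the very end of the value. The function name extractDomainHref and the sample tags are hypothetical, and double-quoted attributes are still assumed.

```javascript
// Hypothetical helper: returns the href value if it contains 'domain', else null.
// Assumes double-quoted attribute values, as in the regex above.
function extractDomainHref(tag) {
  const match = tag.match(/ href="([^"]*domain[^"]*)"/i);
  return match ? match[1] : null;
}

console.log(extractDomainHref('<a href="domain.example/path">'));
// 'domain.example/path'
console.log(extractDomainHref('<a href="https://example.com/domain">'));
// 'https://example.com/domain'
console.log(extractDomainHref('<a class="domain" href="https://example.com/">'));
// null
```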
I think @jscrip's answer may be the most straightforward way. Alternatively, you could check whether the string includes 'domain' before matching the regex. For example:
let str = '<a href="https://www.google.com/setprefdomain?prefdom=DE&prev=https://www.google.de/&sig=K_DtcF1dnV7Xn6g9Ir_3SUs6a6TiA%3D">'
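// Lazy .*? stops at the first closing quote; a null match falls back to 'Not valid'
let match = str.includes('domain') ? str.match(/(?<=href=").*?(?=")/) : null
let href = match ? match[0] : 'Not valid'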
console.log(href)
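One thing to watch with this approach: str.includes('domain') inspects the whole tag, so an attribute such as class="domain" would pass the check even when the URL itself does not contain the word. A possible variation, sketched here with a hypothetical helper name and sample inputs, and still assuming a double-quoted href, extracts the value first and tests only that:

```javascript
// Hypothetical helper: isolate the href value, then test only that value.
function hrefIfContains(tag, needle) {
  // \s before href avoids matching inside attributes such as data-href.
  const match = tag.match(/\shref="([^"]*)"/i);
  if (!match) return 'Not valid';
  return match[1].includes(needle) ? match[1] : 'Not valid';
}

console.log(hrefIfContains('<a class="domain" href="https://example.com/">', 'domain'));
// 'Not valid'
console.log(hrefIfContains('<a href="https://www.google.com/setprefdomain?prefdom=DE">', 'domain'));
// 'https://www.google.com/setprefdomain?prefdom=DE'
```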