Print lines between two patterns where first pattern appears more than once before second pattern

Question

I am trying to find a way to use awk to print the lines between two patterns, Virtual and Redirect, but only if the two patterns are consecutive, exclusive of any other strings between them.

Example pre-formatted file:

<VirtualHost ${APACHE_IP}: 80>
    ServerName domain1.com: 
    ServerAlias www.domain1.com: 
    RedirectMatch permanent "^(.*)$" https: //www.otherdomain.com/
<VirtualHost ${APACHE_IP}: 80>
    ServerName dev1-www-domain3.domain.com: 
<VirtualHost ${APACHE}: 80>
    ServerName domain3.com: 
    RedirectPermanent / https: //www.domain3.com/
<VirtualHost ${APACHE_IP}: 80>
    ServerName www.domain4.com: 
<VirtualHost 10.0.0.1: 80>
    ServerName web1.domain.com: 
    RedirectPermanent / https: //web1.domain.com/
<VirtualHost 10.0.0.1: 80>
    ServerName web1.www_site_com.domain.com: 
<VirtualHost 10.0.0.1: 80>
    ServerName web1.dev_site_com.domain.com: 
<VirtualHost 10.0.0.1: 443>
    ServerName web1.domain.com

I need to get all the domains that correspond to a redirect. So the output should be (whether the output includes the Virtual or Redirect lines is irrelevant to me as I can always grep them out, but I do not need them):

    ServerName domain1.com: 
    ServerAlias www.domain1.com:
    ServerName domain3.com: 
    ServerName web1.domain.com:

I have been trying a number of different methods, and the last one I tried was this:

awk 'f { if (/Redirect/) {
             printf "%s", buf; f = 0; buf = ""
         } else buf = buf $0 ORS 
       }
     /Virtual/ { f = 1}' file

But it will start at the first instance of 'Virtual' and continue grabbing lines until it gets to a 'Redirect', regardless of any other 'Virtual' strings in between. I only want it to return strings between a matched set of 'Virtual' and 'Redirect' and ignore any blocks that are between two 'Virtual' strings.

This is what the output ends up as:

    ServerName domain1.com: 
    ServerAlias www.domain1.com: 
    ServerName dev1-www-domain3.domain.com: 
<VirtualHost ${APACHE}: 80>
    ServerName domain3.com: 
    ServerName www.domain4.com: 
<VirtualHost 10.0.0.1: 80>
    ServerName web1.domain.com:

All other examples I have been finding are decidedly not my particular use case, as close as they might be. I have spent several hours looking through Stackexchange/overflow and Reddit with no luck so far.

This is not very clear. (1) Please say, early in the question, what two patterns you are looking for. (OK, on reading the question for the third time, I see that you talk about “the Virtual or Redirect lines”, so I guess that the patterns you are looking for are “Virtual” and “Redirect”.) (2) I was going to say that “but only if the two patterns are consecutive, exclusive of any other strings between them” was also unclear, but I see that you eventually got around to explaining it. (3) “ServerName web1.domain.com” appears twice in your input data. This is confusing.
– G-Man Says 'Reinstate Monica', Commented Aug 20, 2024 at 4:32
What does the not preformatted file look like? IMHO, giving you a way to work with the original data would help in various ways: a known parser of the data format could be used (if the data is in a well-known format), and you would end up with less handcrafted code to maintain.
– Kusalananda ♦, Commented Aug 20, 2024 at 5:50
Based on what I know about Apache configs, I'd expect the original file has some lines that aren't too relevant here. There would be end tags for the virtual hosts, but I don't think they're too useful either since the starting tags are enough to know where one ends and the next one starts.
– ilkkachu, Commented Aug 20, 2024 at 9:23
@G-ManSays'ReinstateMonica' ServerName web1.domain.com does in fact appear twice in the file, which is why I included it. This file is an amalgamation of several Apache conf files because I need to combine them all and isolate redirected domains regardless of whether they appear in more than one conf and may not be redirected in all confs.
– JPaul, Commented Aug 20, 2024 at 17:44
@Kusalananda The source files are standard Apache VHost configurations. However, I need the list created from several combined files where the domain may be configured (i.e., separate Vhost and SSLhost confs) and from across multiple clusters.
– JPaul, Commented Aug 20, 2024 at 17:49

Chris Davies · Accepted Answer · 2024-08-20 12:05:01Z

Yet another awk solution. This one sets three variables:

block - pattern match that starts a block
need - pattern match we need to trigger any output
result - pattern match for lines to be output

Code:

awk -v block='^<VirtualHost.*>' -v need='Redirect' -v result='ServerName|ServerAlias' '
    # Reset the count of saved lines at the beginning of each block
    $0 ~ block { i=0 }

    # Save any lines matching the required result pattern
    $0 ~ result { output[i++]=$0 }

    # Output all saved lines when we hit the trigger
    $0 ~ need && i { for(x=0; x<i; x++) { print output[x] } }
' file

You can crunch this into a single line (remove the comments first), but IMO it's more readable and maintainable as a multi-line block of commented code.

Things get more complicated if the trigger can occur before all the required lines in a block have been found. In this case you'd want to output the saved lines either on entry to a new block, or at the END of the input data file.
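
A minimal sketch of that deferred-output variant, assuming a block might list a ServerAlias after its Redirect line. The `seen` flag, the `emit` helper, and the sample input are illustrative names and data, not part of the original command:

```shell
# Hypothetical sample: the ServerAlias comes after the Redirect line
sample='<VirtualHost 10.0.0.1:80>
    ServerName a.example.com
    RedirectPermanent / https://a.example.com/
    ServerAlias www.a.example.com
<VirtualHost 10.0.0.1:80>
    ServerName b.example.com'

deferred=$(printf '%s\n' "$sample" |
awk -v block='^<VirtualHost.*>' -v need='Redirect' -v result='ServerName|ServerAlias' '
    # Print the saved lines only if the trigger was seen in the block just finished
    function emit(   x) {
        if (seen) for (x = 0; x < i; x++) print output[x]
        i = 0; seen = 0
    }
    $0 ~ block  { emit() }              # entry to a new block: settle the previous one
    $0 ~ result { output[i++] = $0 }    # save candidate lines
    $0 ~ need   { seen = 1 }            # remember that the trigger occurred
    END         { emit() }              # settle the last block
')
printf '%s\n' "$deferred"
```

With this sample, the two a.example.com lines are printed and the b.example.com block is skipped, because the decision to print is made only once the whole block has been read.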

Be careful with awk's -v flag: escape sequences such as \n and \v in the value are expanded, so if one of the strings contained \n, the variable would hold a newline character rather than the two characters \ and n. If necessary, you can get around this by setting environment variables and accessing them through the ENVIRON[] array.
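
To illustrate the difference, the sketch below passes the same string both ways and compares the lengths awk sees. The names `pat` and `P` are arbitrary choices for the example:

```shell
# A four-character string: a, backslash, t, b
pat='a\tb'

# Passed with -v, the \t is expanded to a single tab character
via_v=$(awk -v p="$pat" 'BEGIN { print length(p) }')

# Passed through the environment, the string arrives unchanged
via_env=$(P="$pat" awk 'BEGIN { print length(ENVIRON["P"]) }')

printf '%s %s\n' "$via_v" "$via_env"
```

The -v copy should report a length of 3 (a, tab, b), while the ENVIRON copy keeps all four characters.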

jubilatious1 · Accepted Answer · 2024-08-20 04:42:16Z

Using Raku (formerly known as Perl_6)

First, split the input into records and keep only those that contain a Redirect:

~$ raku -e 'my @a = slurp.split(/^^ <?before \<VirtualHost > /, :skip-empty ); 
            .put for @a.grep(/ \s+ Redirect /);'  file

Sample Output:

<VirtualHost ${APACHE_IP}: 80>
    ServerName domain1.com: 
    ServerAlias www.domain1.com: 
    RedirectMatch permanent "^(.*)$" https: //www.otherdomain.com/

<VirtualHost ${APACHE}: 80>
    ServerName domain3.com: 
    RedirectPermanent / https: //www.domain3.com/

<VirtualHost 10.0.0.1: 80>
    ServerName web1.domain.com: 
    RedirectPermanent / https: //web1.domain.com/

Second, reduce each matching record to only the desired lines:

~$ raku -e 'my @a = slurp.split(/^^ <?before \<VirtualHost > /, :skip-empty ); 
            .lines.[1..*-2].join("\n").put for @a.grep(/ \s+ Redirect /);'  file

Sample Output:

ServerName domain1.com: 
ServerAlias www.domain1.com: 
ServerName domain3.com: 
ServerName web1.domain.com:

Briefly, the code above slurps the file into memory all at once. Text is broken before ^^ start-of-line \<VirtualHost with the positive lookahead <?before \<VirtualHost >, and saved into the @a array. Then the @a array is grepped through for the desired records.

In the second (final) example, .put becomes .lines.[1..*-2].join("\n").put. This ensures that only the desired lines are returned by breaking each record into .lines, using the index .[1..*-2] to skip the first/last line of each record, re-joining on newline, and then outputting.

Note: the output above can be further cleaned up by inserting a call to .trim-leading (to remove leading whitespace).

https://docs.raku.org
https://raku.org

lihao · Accepted Answer · 2024-08-20 05:20:51Z

If you are looking to retrieve all domain names/aliases under a VirtualHost which has a Redirect directive, you can read each VirtualHost block as a record (RS='<VirtualHost '), use newline as the field separator (-F'\n'), and then iterate over the fields, matching each one:

awk -v RS='<VirtualHost ' -F'\n' '
    /Redirect/{
       for(i=1; i<=NF; i++)
         if($i ~ /Server(Name|Alias)/) print $i
}' file

You should mention that requires GNU awk for multi-char RS.
– Ed Morton, Commented Aug 20, 2024 at 12:08

ilkkachu · Accepted Answer · 2024-08-20 10:18:00Z

If we reduce this to printing each block if it contains a Redirect, the straightforward solution is to collect lines to a variable and check what we got at the end (end of file or start of next block). Then print if there's a match. This is somewhat like what you did there, except that the buffer should be emptied at the start of a block, and I would collect the whole block, just in case the lines are in a different order. (I.e. a ServerAlias after the RedirectMatch. I'm not sure if Apache requires some order, but it's more general that way.)

% awk '/Virtual/ {
           if (buf ~ /Redirect/) print buf;
           buf=""
       }
       1 { buf = buf $0 ORS }
       END { if (buf ~ /Redirect/) print buf }
    ' file
<VirtualHost ${APACHE_IP}: 80>
    ServerName domain1.com:
    ServerAlias www.domain1.com:
    RedirectMatch permanent "^(.*)$" https: //www.otherdomain.com/

<VirtualHost ${APACHE}: 80>
    ServerName domain3.com:
    RedirectPermanent / https: //www.domain3.com/

<VirtualHost 10.0.0.1: 80>
    ServerName web1.domain.com:
    RedirectPermanent / https: //web1.domain.com/

Or just collect the ServerName/ServerAlias lines and keep a flag to record if Redirect was seen:

% awk '/Virtual/ { 
           if (found) print buf;
           buf=""; found=0
       }
       /Server(Name|Alias)/ { buf = buf $0 ORS}
       /Redirect/ { found=1 }
       END { if (found) print buf }
   ' file
    ServerName domain1.com:
    ServerAlias www.domain1.com:

    ServerName domain3.com:

    ServerName web1.domain.com:

If the input had empty lines between blocks, we could use AWK's paragraph mode, which gives us each block as an input record instead of individual lines.

Though stacking seds and awks is a bit silly, this should be relatively straightforward:

% sed -e $'/^<VirtualHost/i\\\n' file | awk -v RS="" '/Redirect/' | grep -Ee 'Server(Name|Alias)'
    ServerName domain1.com:
    ServerAlias www.domain1.com:
    ServerName domain3.com:
    ServerName web1.domain.com:

(sed's i command requires a backslash and the line to insert on a separate line, so it's easiest to use $'...' to embed a newline in the command line arg.)

Ed Morton · Accepted Answer · 2024-08-20 12:06:55Z

Using any awk:

$ awk '/^</{s=""} /Server/{s=s $0 ORS} /Redirect/{printf "%s", s}' file
    ServerName domain1.com:
    ServerAlias www.domain1.com:
    ServerName domain3.com:
    ServerName web1.domain.com:

Chris Davies · Accepted Answer · 2024-08-20 15:09:32Z

For fun here's an alternative solution. Source data is in file:

# Temporary directory
t=$(mktemp --directory 'vhnames.XXXXXXXXXX')

# Split the source data file into parts, each starting with a line matching "<VirtualHost.*>"
csplit -f "$t"/xx -n 2 -k -s file '/<VirtualHost.*>/' '{*}'

# Identify parts containing "Redirect" and then output the associated ServerName, ServerAlias lines
grep -E 'ServerName|ServerAlias' $(grep -l Redirect "$t"/*)

# Tidy up
rm -rf "$t"

The code relies on there being no spaces in the temporary directory and file names. This is a reasonable assumption because they are directly under program control. If you don't have GNU (or GNU-like) utilities you can change the t= assignment to something like t=vhnames.$$.tmp; mkdir -m 700 "$t", and replace csplit's {*} with {99} (assuming no more than 99 parts)
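
Put together, the non-GNU variant described above might look like the sketch below. The sample input, the `names` variable, reading from standard input with `-`, and discarding csplit's error output are assumptions added for illustration:

```shell
# Hypothetical sample input, fed to csplit on standard input
sample='<VirtualHost 10.0.0.1:80>
    ServerName a.example.com
    RedirectPermanent / https://a.example.com/
<VirtualHost 10.0.0.1:80>
    ServerName b.example.com
<VirtualHost 10.0.0.1:80>
    ServerName c.example.com
    ServerAlias www.c.example.com
    RedirectPermanent / https://c.example.com/'

# Temporary directory without mktemp
t=vhnames.$$.tmp
mkdir -m 700 "$t"

# {99} stands in for {*}. csplit may report an error once the matches
# run out, so stderr is discarded and the exit status ignored; -k keeps
# the parts already written.
printf '%s\n' "$sample" |
    csplit -f "$t"/xx -n 2 -k -s - '/<VirtualHost.*>/' '{99}' 2>/dev/null || true

# Parts containing "Redirect", then their ServerName, ServerAlias lines
names=$(grep -E 'ServerName|ServerAlias' $(grep -l Redirect "$t"/*))
printf '%s\n' "$names"

# Tidy up
rm -rf "$t"
```

When more than one part matches, grep prefixes each line with the name of the part it came from, so the output may need a final cut or sed to strip that prefix.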
