2

I have a .toc (table of contents file) from my .tex document.

It contains a lot of lines and some of them have the form

\contentsline {part}{Some title here\hfil }{5}
\contentsline {chapter}{\numberline {}Person name here}{5}

I know how to grep for part and for chapter. But I'd like to filter for those lines and have the output in a csv file like this:

{Some title here},{Person name here},{5}

or with no braces

Some title here,Person name here,5

1. The page number in the last pair of braces is the same on both lines, so it is enough to take it from the second one.

2. Note that an empty pair {} can occur, and a pair can itself contain another pair {}. For example, it could be

\contentsline {part}{Title with math $\frac{a}{b}$\hfil }{15}

which should be filtered as

Title with math $\frac{a}{b}$

edit 1: I was able to obtain the numbers without braces at end of line using

grep '{part}' file.toc | awk -F '[{}]' '{print $(NF-1)}'

edit 2: I was able to filter the chapter lines and remove the garbage with

grep '{chapter}' file.toc | sed 's/\\numberline//' | sed 's/\\contentsline//' | sed 's/{chapter}//' | sed 's/{}//' | sed 's/^ {/{/'

and the output without blank spaces was

    {Person name here}{5}

edit 3: I was able to filter for part and clean the output with

    \contentsline {chapter}{\numberline {}Person name here}{5}

which returns

{Title with math $\frac{a}{b}$}{15}
  • This is extremely confusing. The first example proposes a GROUP BY operation on the page numbers, but all the edits only talk about grepping braces. Are you after a group-by on page numbers? Commented Oct 6, 2016 at 22:06
  • @grochmal, thanks for your attention and sorry for the confusion. I'd like to find the lines containing part and chapter and then filter the data to collect the names and page numbers so that the final csv file looks like {Some title here},{Person name here},{5} (with commas and with/without braces). I don't know how to put all 3 pieces of info together on a single line of a csv file. Commented Oct 6, 2016 at 22:14
  • Does it have to be awk? I'd probably use the Text::Balanced perl module as that has an extract_bracketed call, or there might be other modules that know how to parse TeX. Commented Oct 6, 2016 at 22:30
  • OK, my knowledge of latex is enough to understand that a part has several chapters but there is one thing that bugs me. What does the {5} do at the end? Is it always a 5? I always saw \contentsline{chapter}{title}{}, i.e. that last argument was always empty. Commented Oct 6, 2016 at 22:35
  • @grochmal, that number means the page number when the part begins. You are right when you say that a part (could) contain(s) a lot of chapters. In my case, it contains only one. So just after a part line below comes a chapter line. Commented Oct 6, 2016 at 23:10

3 Answers

1

This uses GNU awk; doing it in POSIX awk would be very troublesome because it lacks gensub, which I use more than once.

#!/usr/bin/gawk -f

function join(array, result, i, last)
{
    # result, i and last are extra parameters, which is how awk declares locals
    result = array[0];
    last = length(array) - 1;
    for (i = 1; i <= last; i++)
        result = result "," array[i];
    return result;
}
function push(arr, elem)
{
    arr[length(arr)] = elem;
}

# split("", arr) is a horribly unreadable way to clear an array
BEGIN { split("", arr); }

/{part}|{chapter}/ {
    l = gensub(".*{(.+)}{(.+)}{([0-9]+)}$", "\\1,\\3,\\2", "g");
    # awk strings are indexed from 1, so the first four characters start at 1
    if ("part" == substr(l, 1, 4)) {
        if (length(arr) > 0) { print join(arr); }
        split("", arr);
        push(arr, gensub("^(.*),(.*),(.*)$", "\\2,\\3","g", l));
    } else {
        push(arr, gensub("^(.*),(.*),(.*)$", "\\3","g", l));
    }
}

END { print join(arr); }

This relies on regexes being greedy, so each match consumes the full line. It took more effort than I thought at first.
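To see what greedy matching does with the two kinds of line from the question, here is a small Python sketch. It is only an illustration: Python's re engine is not gawk's, the braces are escaped for Python, and the name split_greedy is made up for this example. With those caveats, it suggests that the pattern above splits the plain line as intended but picks the wrong groups when the title itself contains braces.

```python
import re

# Same shape as the gensub pattern above, with the braces escaped for Python.
PATTERN = re.compile(r".*\{(.+)\}\{(.+)\}\{([0-9]+)\}$")


def split_greedy(line):
    """Return the three captured groups, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None


simple = r"\contentsline {part}{Some title here\hfil }{5}"
nested = r"\contentsline {part}{Title with math $\frac{a}{b}$\hfil }{15}"

print(split_greedy(simple))  # ('part', 'Some title here\\hfil ', '5')
print(split_greedy(nested))  # ('a', 'b}$\\hfil ', '15')
```

The leading .* grabs as much as it can, so on the second line the first captured group ends up being the inner {a} rather than {part}. Titles with nested braces therefore need a brace-counting approach rather than this regex.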

With the following input:

\contentsline {part}{Some title here\hfil }{5}
\contentsline {chapter}{\numberline {}Person name here}{5}
blah blah
\contentsline {chapter}{\numberline {}Person name here}{5}
blah blah
blah blah
\contentsline {chapter}{\numberline {}Person name here}{5}
\contentsline {chapter}{\numberline {}Person name here}{5}
blah blah
blah blah
\contentsline {chapter}{\numberline {}Person name here}{5}
\contentsline {chapter}{\numberline {}Person name here}{5}
\contentsline {part}{Some title here\hfil }{7}
\contentsline {chapter}{\numberline {}Person name here}{7}
blah blah
blah blah
\contentsline {chapter}{\numberline {}Person name here}{7}
blah blah
\contentsline {part}{Some title here\hfil }{9}
blah blah
blah blah
\contentsline {chapter}{\numberline {}Person name here}{9}

Running gawk -f the_above_script.awk input on this produces:

5,Some title here\hfil ,\numberline {}Person name here,\numberline {}Person name here,\numberline {}Person name here,\numberline {}Person name here,\numberline {}Person name here,\numberline {}Person name here
7,Some title here\hfil ,\numberline {}Person name here,\numberline {}Person name here
9,Some title here\hfil ,\numberline {}Person name here

The page number is taken from the {part} line, and every {chapter} line that follows that {part} is appended to the same row. This allows for several chapters inside each part of a book.
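The grouping step on its own can be sketched in Python as follows. This is not a port of the awk script: it assumes the lines have already been split into (kind, text, page) tuples, the function name group_chapters_by_part is made up for the example, and unlike the script it silently drops chapters that appear before the first part.

```python
def group_chapters_by_part(entries):
    """Build one row per part: [page, part title, chapter, chapter, ...]."""
    rows = []
    for kind, text, page in entries:
        if kind == "part":
            # A new part starts a new row; the page number comes from the part.
            rows.append([page, text])
        elif kind == "chapter" and rows:
            # Chapters are appended to the most recent part's row.
            rows[-1].append(text)
    return rows


entries = [
    ("part", "First part", "5"),
    ("chapter", "Person A", "5"),
    ("chapter", "Person B", "5"),
    ("part", "Second part", "7"),
    ("chapter", "Person C", "7"),
]
for row in group_chapters_by_part(entries):
    print(",".join(row))
```

If a part only ever has one chapter, as in the question, each row comes out with exactly three fields.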

  • OK, thanks. Some comments: 1. since I don't have more than 1 chapter after part, the output should have only 3 blocks, with 2 commas. 2. How to delete the last comma at end of line? 3. What do you recommend to clean the lines, I mean, remove \hfil and \numberline {} and so on. Could I use after your script some sed? Commented Oct 7, 2016 at 0:38
  • @Sigur - Last comma? As for sed, just remember that this is *nix: add another pipe at the end and use sed to strip the latex parts. sed -e 's/\\[^ ]\+//g' -e 's/{.*}//g' should work for ~90% of latex stuff, but to get to 100% you would need a latex parser (which would involve writing a lot of code). Commented Oct 7, 2016 at 0:45
  • Hum, something is strange, because I now see that your output does not contain a comma at the end of the line. When I run your code on my toc file the output is like 5,A Title\hfil ,\numberline {}Person Name, Observe the trailing ,. Commented Oct 7, 2016 at 0:49
  • @Sigur - oops, forgot a -1. It was in my test command line, but not here (edited) Commented Oct 7, 2016 at 0:52
  • I found this: sed 's/,$//' to delete it. Commented Oct 7, 2016 at 0:53
1

With the Perl Text::Balanced module the top-level {} can have their contents extracted thusly:

#!/usr/bin/env perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# this will of course fail if an entry spans multiple lines, as this
# is only a line-by-line parser of standard input or the filenames
# passed to this script
while ( my $line = readline ) {
    if ( $line =~ m/\\contentsline / ) {
        my @parts = extract_contents($line);
        # emit as CSV (though ideally instead use Text::CSV module)
        print join( ",", @parts ), "\n";
    } else {
        #print "NO MATCH ON $line";
    }
}

sub extract_contents {
    my $line = shift;
    my @parts;
    # while we can get a {} bit out of the input line, anywhere in the
    # input line
    while ( my $part = extract_bracketed( $line, '{}', qr/[^{]*/ ) ) {
        # trim off the delimiters
        $part = substr $part, 1, length($part) - 2;
        push @parts, $part;
    }
    return @parts;
}

With some input:

% < input 
not content line
\contentsline {chapter}{\numberline {}Person name here}{5}
\contentsline {part}{Title with math $\frac{a}{b}$\hfil }{15}
also not content line
% perl parser input
chapter,\numberline {}Person name here,5
part,Title with math $\frac{a}{b}$\hfil ,15
% 
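For readers who do not know Perl, the same idea of pulling out only the top-level {} groups can be sketched in Python with a depth counter. This is a minimal sketch and not a reimplementation of Text::Balanced: the function name top_level_groups is made up for the example, and it assumes braces are never escaped and that each entry sits on a single line.

```python
def top_level_groups(line):
    """Return the contents of each outermost {...} group in the line."""
    groups = []
    depth = 0
    start = None
    for index, char in enumerate(line):
        if char == "{":
            if depth == 0:
                start = index + 1  # contents begin just after the opening brace
            depth += 1
        elif char == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(line[start:index])
    return groups


line = r"\contentsline {part}{Title with math $\frac{a}{b}$\hfil }{15}"
print(",".join(top_level_groups(line)))
```

Because nested braces only change the depth, the {a} and {b} inside the title stay part of the second group instead of being split out.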
  • Thanks for effort. I have no idea about your code. The last code block shows the output, right? If so, the output is not what I wish. I'd like something like {Some title here},{Person name here},{5} where the 1st brace comes from part line and the 2nd and 3rd braces come from the chapter line. Commented Oct 7, 2016 at 0:14
1

In TXR

@(repeat)
\contentsline {part}{@title\hfil }{@page}
@  (trailer)
@  (skip)
\contentsline {chapter}{\numberline {}@author}{@page}
@  (do (put-line `@title,@author,@page`))
@(end)

Sample data:

\lorem{ipsum}
\contentsline {part}{The Art of The Meringue\hfil }{5}
a
b
c
j
\contentsline {chapter}{\numberline {}Doug LeMonjello}{5}


\contentsline {part}{Parachuting Primer\hfil }{16}

\contentsline {chapter}{\numberline {}Hugo Phirst}{16}

\contentsline {part}{Making Sense of $\frac{a}{b}$\hfil }{19}

\contentsline {part}{War and Peace\hfil }{27}

\contentsline {chapter}{\numberline {}D. Vide}{19}

\contentsline {part}{War and Peace\hfil }{19}

Run:

$ txr title-auth.txr data
The Art of The Meringue,Doug LeMonjello,5
Parachuting Primer,Hugo Phirst,16
Making Sense of $\frac{a}{b}$,D. Vide,19

Notes:

  • Because @(trailer) is used, the lines which give the author do not have to strictly follow their part. The data could introduce several \contentsline {part} elements which are then followed by the chapter lines that match on page number.
  • @(skip) implies a search through the entire remaining data. The performance can be improved by limiting the range by adding a numeric argument. If it can be assumed that a matching {chapter} is always found within 50 lines after {part}, we can use @(skip 50).
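The matching logic described in these notes can be approximated in Python as below. This is a sketch of the same idea, not a translation of the TXR query: the names PART, CHAPTER and pair_parts_with_chapters are made up, the optional lookahead argument is only a loose analogue of the numeric argument to @(skip), and the patterns assume the exact \hfil and \numberline {} layout shown in the question.

```python
import re

# Literal line layouts from the question; the greedy (.*) is anchored by the
# fixed text after it, so nested braces in a title are kept intact.
PART = re.compile(r"^\\contentsline \{part\}\{(.*)\\hfil \}\{(\d+)\}$")
CHAPTER = re.compile(r"^\\contentsline \{chapter\}\{\\numberline \{\}(.*)\}\{(\d+)\}$")


def pair_parts_with_chapters(lines, lookahead=None):
    """For each part line, find the first later chapter line with the same page."""
    rows = []
    for index, line in enumerate(lines):
        part = PART.match(line)
        if not part:
            continue
        title, page = part.groups()
        stop = None if lookahead is None else index + 1 + lookahead
        for later in lines[index + 1:stop]:
            chapter = CHAPTER.match(later)
            if chapter and chapter.group(2) == page:
                rows.append((title, chapter.group(1), page))
                break
    return rows


sample = [
    r"\contentsline {part}{Parachuting Primer\hfil }{16}",
    "",
    r"\contentsline {chapter}{\numberline {}Hugo Phirst}{16}",
]
for row in pair_parts_with_chapters(sample):
    print(",".join(row))
```

A part with no later chapter on the same page produces no row. A title or name that itself contains a comma would need proper CSV quoting, for example with Python's csv module.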
  • Interesting. Let me study your code. Thanks. Commented Oct 8, 2016 at 14:12
