3

I would like to count the number of lines in an ASCII text file. I thought the best way to do this would be by counting the newlines in the file:

for (int c = fgetc(fp); c != EOF; c = fgetc(fp)) {  /* Count line endings. */
    if (c == '\n') ++lines;
}

However, I'm not sure whether this accounts for the last line on both MS Windows and Linux. That is, if my text file ends as below, without an explicit newline, is one encoded there anyway, or should I add an extra ++lines; after the for loop?

cat
dog

And what if there is an explicit newline at the end of the file? Or do I just need to test for this case by keeping track of the previously read value?

10
  • 1
    You are right to doubt your approach. Since the EOF overwrites the last read value, you'd need to save that somewhere else. Does a file that contains just a single \n contain one or two lines? Commented May 16, 2015 at 16:36
  • 1
    Well, it depends on whether the newline is actually there. If you create a document with Notepad, it won't add one, but some editors may. You could check whether the last character in the document is a newline and act accordingly. Commented May 16, 2015 at 16:38
  • 1
    (For clarity of this particular question, you may want to rename words to lines. Unless line == word, in which case you may need a different approach.) Commented May 16, 2015 at 16:38
  • 3
    fgets is likely to be faster on some platforms because of the stream locking scheme and function call overhead. But you will still need to scan the buffer for '\n' in case you have a line longer than the buffer size, and handling the last line will be even more complicated because of that. Keep it simple. Commented May 16, 2015 at 16:51
  • 2
    If you define a line to be an optional string delimited by a newline, then according to that definition any trailing content is not a line. If you define a line differently, the answer can be different. In any case, make sure you don't forget to consider corner cases like empty lines, empty files or files without newlines. Commented May 16, 2015 at 18:02
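The fgets idea from the comments above could look roughly like the sketch below. It is only an illustration: count_lines_fgets is a made-up name, the 16-byte buffer is deliberately tiny so that long lines get split across several reads, and it assumes the file contains no NUL bytes (otherwise strlen would misjudge the chunk length). It also assumes you want an unterminated last line to count.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: counts lines with fgets and treats an unterminated
   final line as a line. Assumes the text contains no NUL bytes. */
long count_lines_fgets(FILE *fp)
{
    char buf[16];            /* small on purpose: long lines span several chunks */
    long lines = 0;
    int at_line_start = 1;   /* true when the previous chunk ended in '\n' */

    assert(fp != NULL);
    while (fgets(buf, sizeof buf, fp) != NULL) {
        size_t len = strlen(buf);
        if (len > 0 && buf[len - 1] == '\n') {
            ++lines;             /* this chunk completes a line */
            at_line_start = 1;
        } else {
            at_line_start = 0;   /* line continues in the next chunk, or the file ends here */
        }
    }
    if (!at_line_start)
        ++lines;                 /* final line had no trailing newline */
    return lines;
}
```

Only chunks that end in '\n' bump the counter, so a line longer than the buffer is still counted once. Whether this beats the plain character loop depends on the platform, so it is worth measuring before switching.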

7 Answers

3

If there is no newline, one won't be generated. C tells you exactly what's there.


2 Comments

Almost true: C tells you there is a '\n' whereas the files really contain '\r', '\n' in Windows.
@chqrlie unless you open it in b mode
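To see the difference the two comments above describe, one option is to count raw bytes on a stream opened in binary mode. This is a minimal sketch; count_byte is a hypothetical helper name, not something from the answer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: counts how often the byte `wanted` occurs in a stream.
   Intended for streams opened in binary mode ("rb"), where the C runtime
   performs no newline translation. */
long count_byte(FILE *fp, int wanted)
{
    long n = 0;
    int c;

    assert(fp != NULL);
    while ((c = getc(fp)) != EOF) {
        if (c == wanted)
            ++n;
    }
    return n;
}
```

With a file opened via fopen(name, "rb"), count_byte(fp, '\r') and count_byte(fp, '\n') report what is physically stored. For a CR LF file read in text mode on Windows, the '\r' bytes would not show up at all, while the '\n' count stays the same in both modes, so the line count itself is unaffected.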
3

Text files are always expected to end with a line feed. There's no canonical way of handling files that don't.

Here's how some tools choose to deal with characters after the last line feed:

  • wc doesn't count it as a line (so you have good precedent for that)
  • Vim marks the file as [noeol], and saves the file without a trailing line feed
  • GNU sed treats the file as if it had a last line feed
  • sh's read exits with error, but still returns the data

Since behaviour is pretty much undefined, you can just do whatever's convenient or useful to you.


3

First, there will not be any implicitly encoded newline at the end of the last line. The only way there will be a newline is if the software or person that produced the file put it there. Putting it there is generally considered good practice, however.

The ultimate answer for what you should report as the line count depends on the convention that you need to follow for the software or people that will be using this line count, and probably what you can assume about the behavior of the input source as well.

Most command-line tools will terminate their output with a newline character. In this case, the sensible answer may be to report the number of newline characters as the number of actual lines.

On the other hand, when a text editor is displaying a file, you will see that the line numbering in the margin (if supported) contains a number for the last line whether it is empty or not. This is in part to tell the user that there is a blank line there, but if you want to count the number of lines displayed in the margin, it is one plus the number of newline characters in the file. It is typical for some coders to not terminate their last lines with a newline character (sometimes due to sloppiness), so in this case this convention would actually be the right answer.

I'm not sure any other conventions make much sense. For example, if you choose not to count the last line unless it is non-empty, then what counts as non-empty? The file ending after newline? What if there is whitespace on that line? What if there are several empty lines at the end of the file?


2

If you're going to use this method, you could always keep a separate counter for how many characters you have read on the current line. If the count at the end is greater than 0, then you know there is content on the last line that wasn't counted.

int letters = 0;  /* Characters read since the last newline. */

for (int c = fgetc(fp); c != EOF; c = fgetc(fp)) {  /* Count line endings. */
    letters++; // Increase count on every character

    if (c == '\n')
    {
        ++lines;
        letters = 0; // Set back to 0 after a newline
    }
}

if (letters > 0)  // Unterminated last line
{
    ++lines;
}


2

Your concern is real: the last line in the file may be missing the final end-of-line marker. The end-of-line marker is a single '\n' on Linux and a CR LF pair on Windows, which the C runtime converts automatically into a '\n' when the file is read in text mode.

You can simplify your code and handle the special case of the last line missing a linefeed this way:

int c, last = '\n', lines = 0;

while ((c = getc(fp)) != EOF) {  /* Count line endings. */
    if (c == '\n')
        lines += 1;
    last = c;
}
if (last != '\n')
    lines += 1;

Since you are concerned with speed, using getc instead of fgetc will help on platforms where it is defined as a macro that handles the stream structures directly and calls a function only to refill the buffer, every BUFSIZ characters or so, unless the stream is unbuffered.
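If speed matters even more, a different technique from the one above is to read blocks with fread and search them with memchr. The sketch below is only an illustration of that swap: count_lines_fread and the 4096-byte buffer size are my own choices, and it keeps the same "last character" trick so an unterminated final line still counts.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: block-wise line count using fread and memchr.
   An unterminated final line is counted, an empty file yields 0. */
long count_lines_fread(FILE *fp)
{
    char buf[4096];      /* arbitrary block size chosen for this sketch */
    size_t n;
    long lines = 0;
    char last = '\n';    /* so that an empty file reports 0 lines */

    assert(fp != NULL);
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        const char *p = buf;
        const char *end = buf + n;

        /* Jump from newline to newline inside the current block. */
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            ++lines;
            ++p;         /* resume the search just past this newline */
        }
        last = buf[n - 1];
    }
    if (last != '\n')
        ++lines;         /* final line had no trailing newline */
    return lines;
}
```

Whether this is actually faster than the getc loop depends on the C library, so treat it as something to benchmark rather than a guaranteed win.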


1

How about this:

Keep a flag that records whether any non-'\n' characters have been read since the last '\n', and reset it whenever c == '\n'. After EOF, check whether the flag is set and, if so, increment the count once more.

bool more_chars = false;  /* needs <stdbool.h> */
for (int c = fgetc(fp); c != EOF; c = fgetc(fp)) {  /* Count line endings. */
    if (c == '\n') {
        more_chars = false;
        ++lines;
    } else {
        more_chars = true;
    }
}
if (more_chars) ++lines;  /* Unterminated last line. */


-1

Windows and UNIX/Linux style line breaks make no difference here. On either system a text file may or may not have a newline at the end of the last line.

If you always add 1 to the line count, this effectively counts the empty line at the end of the file when there is a newline at the end (i.e., file "foo\n" will count as having two lines: "foo" and ""). This may be an entirely reasonable solution, depending on how you want to define a line.

Another definition of a "line" is that it always ends in a newline, i.e., the file "foo\nbar" would only have one line ("foo") by this definition. This definition is used by wc.

Of course you could keep track of whether the newline was the last character in file and only add 1 to the count in case it wasn't. Then a "line" would be defined as either ending in a newline or being non-empty at the end of the file, which sounds quite complex to me.
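The three definitions above can be put side by side in a small sketch. The struct and function names here are hypothetical, and it works on an in-memory buffer rather than a FILE to keep the comparison short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: one field per definition of "line". */
struct line_counts {
    long terminated;   /* newline characters only, the wc definition */
    long plus_one;     /* newlines + 1, counts the empty "line" after a final newline */
    long hybrid;       /* newlines, plus 1 only if data follows the last newline */
};

/* Hypothetical helper: evaluates all three definitions on a buffer. */
struct line_counts count_all(const char *data, size_t len)
{
    struct line_counts r = {0, 0, 0};
    size_t i;

    assert(data != NULL);
    for (i = 0; i < len; ++i) {
        if (data[i] == '\n')
            ++r.terminated;
    }
    r.plus_one = r.terminated + 1;
    r.hybrid = r.terminated + ((len > 0 && data[len - 1] != '\n') ? 1 : 0);
    return r;
}
```

For "foo\n" this gives 1, 2 and 1; for "foo\nbar" it gives 1, 2 and 2; for the empty file it gives 0, 1 and 0. The first and third definitions only disagree when the file does not end in a newline.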

8 Comments

This would also generate a line count of 1 on a completely empty file, though.
@DavidHoelzer Yes, "the empty line at the end of the file" would be the only line in that special case. It's kind of a philosophical question whether an empty file still has one line (or if an empty file even counts as a "text file" since there is no text). =)
No philosophy here: a file with a single byte '\n' has a single line, an empty file has no lines. No lines obviously means lines = 0. The philosophical question might be: why is there a plural in no lines instead of no line?
@chqrlie You are just assuming one definition of line as the only correct one. From a programmer's perspective I think the simplest definition of a line is indeed "a line always ends in a linefeed", but that leaves the question of how many lines the files "foo\nbar" (wc says 1) or "foo" (wc says 0) have.
@chqrlie I would generally agree that the empty file has no lines, but on the other hand in many text editors the empty file is shown as having one line when opened. Also, if one were to say that the file "foo" (no linefeed) has 1 line and the file "foo\n\nbar" has 3 lines, then it would seem to follow quite simply that the file "" has 1 line (but as discussed earlier, it's valid to say that these files have 0, 2, and 0 lines, respectively). As for philosophy, can you call an empty field a forest with 0 trees? =)