File | Vector documentation

Requirements

The vector process must have the ability to read the files listed in include and execute any of the parent directories for these files. Please see File permissions for more details.

Configuration

Example configurations

{
  "sources": {
    "my_source_id": {
      "type": "file",
      "acknowledgements": null,
      "ignore_older_secs": 600,
      "include": [
        "/var/log/**/*.log"
      ],
      "read_from": "beginning"
    }
  }
}

[sources.my_source_id]
type = "file"
ignore_older_secs = 600
include = [ "/var/log/**/*.log" ]
read_from = "beginning"

---
sources:
  my_source_id:
    type: file
    acknowledgements: null
    ignore_older_secs: 600
    include:
      - /var/log/**/*.log
    read_from: beginning

{
  "sources": {
    "my_source_id": {
      "type": "file",
      "acknowledgements": null,
      "exclude": [
        "/var/log/binary-file.log"
      ],
      "file_key": "file",
      "glob_minimum_cooldown_ms": 1000,
      "host_key": "host",
      "ignore_not_found": null,
      "ignore_older_secs": 600,
      "include": [
        "/var/log/**/*.log"
      ],
      "line_delimiter": "\n",
      "max_line_bytes": 102400,
      "max_read_bytes": 2048,
      "oldest_first": null,
      "remove_after_secs": null,
      "read_from": "beginning",
      "data_dir": "/var/lib/vector",
      "ignore_checkpoints": null
    }
  }
}

[sources.my_source_id]
type = "file"
exclude = [ "/var/log/binary-file.log" ]
file_key = "file"
glob_minimum_cooldown_ms = 1_000
host_key = "host"
ignore_older_secs = 600
include = [ "/var/log/**/*.log" ]
line_delimiter = """

"""
max_line_bytes = 102_400
max_read_bytes = 2_048
read_from = "beginning"
data_dir = "/var/lib/vector"

---
sources:
  my_source_id:
    type: file
    acknowledgements: null
    exclude:
      - /var/log/binary-file.log
    file_key: file
    fingerprint: null
    glob_minimum_cooldown_ms: 1000
    host_key: host
    ignore_not_found: null
    ignore_older_secs: 600
    include:
      - /var/log/**/*.log
    line_delimiter: "\n"
    max_line_bytes: 102400
    max_read_bytes: 2048
    oldest_first: null
    remove_after_secs: null
    read_from: beginning
    multiline: null
    data_dir: /var/lib/vector
    encoding: null
    ignore_checkpoints: null

acknowledgements

common optional bool

Controls if the source will wait for destination sinks to deliver the events before acknowledging receipt.

default: false

data_dir

optional string file_system_path

The directory used to persist file checkpoint positions. By default, the global data_dir option is used. Please make sure the Vector project has write permissions to this dir.

Examples

"/var/lib/vector"

encoding

optional object

Configures the encoding specific source behavior.

Encoding of the source messages. Takes one of the encoding label strings defined as part of the Encoding Standard. When set, the messages are transcoded from the specified encoding to UTF-8, which is the encoding vector assumes internally for string-like data. Enable this transcoding operation if you need your data to be in UTF-8 for further processing. At the time of transcoding, any malformed sequences (that can’t be mapped to UTF-8) will be replaced with replacement character and warnings will be logged.

exclude

optional [string]

Array of file patterns to exclude. Globbing is supported.Takes precedence over the include option.

Array string literal

Examples

[
 "/var/log/binary-file.log"
]

file_key

optional string literal

The key name added to each event with the full path of the file.

Examples

"file"

default: file

fingerprint

optional object

Configuration for how the file source should identify files.

fingerprint.ignored_header_bytes

optional uint

The number of bytes to skip ahead (or ignore) when generating a unique fingerprint. This is helpful if all files share a common header.

Relevant when: strategy = "checksum"

fingerprint.lines

optional uint

The number of lines to read when generating a unique fingerprint. This is helpful when some files share common first lines. If the file has less than this amount of lines then it won’t be read at all.

Relevant when: strategy = "checksum"

default: 1 (lines)

fingerprint.strategy

optional string enum literal

The strategy used to uniquely identify files. This is important for checkpointing when file rotation is used.

Enum options

Option	Description
`checksum`	Read first N lines of the file, skipping the first `ignored_header_bytes` bytes, to uniquely identify files via a checksum.
`device_and_inode`	Uses the device and inode to unique identify files.

default: checksum

glob_minimum_cooldown_ms

optional uint

Delay between file discovery calls. This controls the interval at which Vector searches for files. Higher value result in greater chances of some short living files being missed between searches, but lower value increases the performance impact of file discovery.

default: 1000 (milliseconds)

host_key

optional string literal

The key name added to each event representing the current host. This can also be globally set via the global host_key option.

default: host

ignore_checkpoints

optional bool

This causes Vector to ignore existing checkpoints when determining where to start reading a file. Checkpoints are still written normally.

default: false

ignore_not_found

optional bool

Ignore missing files when fingerprinting. This may be useful when used with source directories containing dangling symlinks.

default: false

ignore_older_secs

common optional uint

Ignore files with a data modification date older than the specified number of seconds.

Examples

include

required [string]

Array of file patterns to include. Globbing is supported.

Array string literal

Examples

[
 "/var/log/**/*.log"
]

line_delimiter

optional string literal

String sequence used to separate one file line from another

Examples

"\r\n"

default:

max_line_bytes

optional uint

The maximum number of a bytes a line can contain before being discarded. This protects against malformed lines or tailing incorrect files.

default: 102400 (bytes)

max_read_bytes

optional uint

An approximate limit on the amount of data read from a single file at a given time.

Examples

multiline

optional object

Multiline parsing configuration. If not specified, multiline parsing is disabled.

multiline.condition_pattern

required string regex

Condition regex pattern to look for. Exact behavior is configured via mode.

multiline.mode

required string enum literal

Mode of operation, specifies how the condition_pattern is interpreted.

Enum options

Option	Description
`continue_past`	All consecutive lines matching this pattern, plus one additional line, are included in the group. This is useful in cases where a log message ends with a continuation marker, such as a backslash, indicating that the following line is part of the same message.
`continue_through`	All consecutive lines matching this pattern are included in the group. The first line (the line that matched the start pattern) does not need to match the `ContinueThrough` pattern. This is useful in cases such as a Java stack trace, where some indicator in the line (such as leading whitespace) indicates that it is an extension of the preceding line.
`halt_before`	All consecutive lines not matching this pattern are included in the group. This is useful where a log line contains a marker indicating that it begins a new message.
`halt_with`	All consecutive lines, up to and including the first line matching this pattern, are included in the group. This is useful where a log line ends with a termination marker, such as a semicolon.

multiline.start_pattern

required string regex

Start regex pattern to look for as a beginning of the message.

multiline.timeout_ms

required uint

The maximum time to wait for the continuation. Once this timeout is reached, the buffered message is guaranteed to be flushed, even if incomplete.

oldest_first

optional bool

Instead of balancing read capacity fairly across all watched files, prioritize draining the oldest files before moving on to read data from younger files.

default: false

read_from

common optional string literal enum

In the absence of a checkpoint, this setting tells Vector where to start reading files that are present at startup.

Enum options string literal

Option	Description
`beginning`	Read from the beginning of the file.
`end`	Start reading from the current end of the file.

default: beginning

remove_after_secs

optional uint

Timeout from reaching eof after which file will be removed from filesystem, unless new data is written in the meantime. If not specified, files will not be removed.

Examples

Output

Logs

Line

An individual line from a file. Lines can be merged using the multiline options.

Fields

file required string literal

The absolute path of originating file.

Examples

/var/log/apache/access.log

host required string literal

The local hostname, equivalent to the gethostname command.

Examples

my-host.local

message required string literal

The raw line from the file.

Examples

53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] "GET /disintermediate HTTP/2.0" 401 20308

timestamp required timestamp

The exact time the event was ingested into Vector.

Examples

2020-10-10T17:07:36.452332Z

Telemetry

Metrics

link

checkpoint_write_errors_total

counter

The total number of errors writing checkpoints.

host required

The hostname of the system Vector is running on.

pid required

The process ID of the Vector instance.

checkpoints_total

counter

The total number of files checkpointed.

host required

The hostname of the system Vector is running on.

pid required

The process ID of the Vector instance.

checksum_errors_total

counter

The total number of errors identifying files via checksum.

file optional

The file that produced the error

host required

The hostname of the system Vector is running on.

pid required

The process ID of the Vector instance.

events_in_total

counter

The number of events accepted by this component either from tagged origin like file and uri, or cumulatively from other origins.

component_id required

The Vector component ID.

component_kind required

The Vector component kind.

component_name required

Deprecated, use component_id instead. The value is the same as component_id.

component_type required

The Vector component type.

container_name optional

The name of the container from which the event originates.

file optional

The file from which the event originates.

host required

The hostname of the system Vector is running on.

mode optional

The connection mode used by the component.

peer_addr optional

The IP from which the event originates.

peer_path optional

The pathname from which the event originates.

pid required

The process ID of the Vector instance.

pod_name optional

The name of the pod from which the event originates.

uri optional

The sanitized URI from which the event originates.

events_out_total

counter

The total number of events emitted by this component.

component_id required

The Vector component ID.

component_kind required

The Vector component kind.

component_name required

Deprecated, use component_id instead. The value is the same as component_id.

component_type required

The Vector component type.

host required

The hostname of the system Vector is running on.

pid required

The process ID of the Vector instance.

file_delete_errors_total

counter

The total number of failures to delete a file.

file optional

The file that produced the error

host required

The hostname of the system Vector is running on.

pid required

The process ID of the Vector instance.

file_watch_errors_total

counter

The total number of errors encountered when watching files.

file optional

The file that produced the error

host required

The hostname of the system Vector is running on.

pid required

The process ID of the Vector instance.

files_added_total

counter

The total number of files Vector has found to watch.

file optional

The file that produced the error

host required

The hostname of the system Vector is running on.

pid required

The process ID of the Vector instance.

files_deleted_total

counter

The total number of files deleted.

file optional

The file that produced the error

host required

The hostname of the system Vector is running on.

pid required

The process ID of the Vector instance.

files_resumed_total

counter

The total number of times Vector has resumed watching a file.

file optional

The file that produced the error

host required

The hostname of the system Vector is running on.

pid required

The process ID of the Vector instance.

files_unwatched_total

counter

The total number of times Vector has stopped watching a file.

file optional

The file that produced the error

host required

The hostname of the system Vector is running on.

pid required

The process ID of the Vector instance.

fingerprint_read_errors_total

counter

The total number of times Vector failed to read a file for fingerprinting.

file optional

The file that produced the error

host required

The hostname of the system Vector is running on.

pid required

The process ID of the Vector instance.

glob_errors_total

counter

The total number of errors encountered when globbing paths.

host required

The hostname of the system Vector is running on.

path required

The path that produced the error.

pid required

The process ID of the Vector instance.

utilization

gauge

A ratio from 0 to 1 of the load on a component. A value of 0 would indicate a completely idle component that is simply waiting for input. A value of 1 would indicate a that is never idle. This value is updated every 5 seconds.

component_id required

The Vector component ID.

component_kind required

The Vector component kind.

component_name required

Deprecated, use component_id instead. The value is the same as component_id.

component_type required

The Vector component type.

host required

The hostname of the system Vector is running on.

pid required

The process ID of the Vector instance.

Examples

Apache Access Log

Given this event...

53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] "GET /disintermediate HTTP/2.0" 401 20308

...and this configuration...

[sources.my_source_id]
type = "file"
include = [ "/var/log/**/*.log" ]

---
sources:
  my_source_id:
    type: file
    include:
      - /var/log/**/*.log

{
  "sources": {
    "my_source_id": {
      "type": "file",
      "include": [
        "/var/log/**/*.log"
      ]
    }
  }
}

...this Vector event is produced:

{
 "log": {
  "file": "/var/log/apache/access.log",
  "host": "my-host.local",
  "message": "53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] \"GET /disintermediate HTTP/2.0\" 401 20308",
  "timestamp": "2020-10-10T17:07:36.452332Z"
 }
}

How it works

Autodiscovery

Vector will continually look for new files matching any of your include patterns. The frequency is controlled via the glob_minimum_cooldown option. If a new file is added that matches any of the supplied patterns, Vector will begin tailing it. Vector maintains a unique list of files and will not tail a file more than once, even if it matches multiple patterns. You can read more about how we identify files in the Identification section.

Checkpointing

Vector checkpoints the current read position after each successful read. This ensures that Vector resumes where it left off if restarted, preventing data from being read twice. The checkpoint positions are stored in the data directory which is specified via the global data_dir option, but can be overridden via the data_dir option in the file source directly.

Compressed Files

Vector will transparently detect files which have been compressed using Gzip and decompress them for reading. This detection process looks for the unique sequence of bytes in the Gzip header and does not rely on the compressed files adhering to any kind of naming convention.

One caveat with reading compressed files is that Vector is not able to efficiently seek into them. Rather than implement a potentially-expensive full scan as a seek mechanism, Vector currently will not attempt to make further reads from a file for which it has already stored a checkpoint in a previous run. For this reason, users should take care to allow Vector to fully process any compressed files before shutting the process down or moving the files to another location on disk.

Context

By default, the file source augments events with helpful context keys.

File Deletion

When a watched file is deleted, Vector will maintain its open file handle and continue reading until it reaches EOF. When a file is no longer findable in the includes option and the reader has reached EOF, that file’s reader is discarded.

File Read Order

By default, Vector attempts to allocate its read bandwidth fairly across all of the files it’s currently watching. This prevents a single very busy file from starving other independent files from being read. In certain situations, however, this can lead to interleaved reads from files that should be read one after the other.

For example, consider a service that logs to timestamped file, creating a new one at an interval and leaving the old one as-is. Under normal operation, Vector would follow writes as they happen to each file and there would be no interleaving. In an overload situation, however, Vector may pick up and begin tailing newer files before catching up to the latest writes from older files. This would cause writes from a single logical log stream to be interleaved in time and potentially slow down ingestion as a whole, since the fixed total read bandwidth is allocated across an increasing number of files.

To address this type of situation, Vector provides the oldest_first option. When set, Vector will not read from any file younger than the oldest file that it hasn’t yet caught up to. In other words, Vector will continue reading from older files as long as there is more data to read. Only once it hits the end will it then move on to read from younger files.

Whether or not to use the oldest_first flag depends on the organization of the logs you’re configuring Vector to tail. If your include option contains multiple independent logical log streams (e.g. Nginx’s access.log and error.log, or logs from multiple services), you are likely better off with the default behavior. If you’re dealing with a single logical log stream or if you value per-stream ordering over fairness across streams, consider setting the oldest_first option to true.

File Rotation

Vector supports tailing across a number of file rotation strategies. The default behavior of logrotate is simply to move the old log file and create a new one. This requires no special configuration of Vector, as it will maintain its open file handle to the rotated log until it has finished reading and it will find the newly created file normally.

A popular alternative strategy is copytruncate, in which logrotate will copy the old log file to a new location before truncating the original. Vector will also handle this well out of the box, but there are a couple configuration options that will help reduce the very small chance of missed data in some edge cases. We recommend a combination of delaycompress (if applicable) on the logrotate side and including the first rotated file in Vector’s include option. This allows Vector to find the file after rotation, read it uncompressed to identify it, and then ensure it has all of the data, including any written in a gap between Vector’s last read and the actual rotation event.

Fingerprinting

By default, Vector identifies files by running a cyclic redundancy check (CRC) on the first N lines of the file. This serves as a fingerprint that uniquely identifies the file. The number of lines, N, that are read can be set using the fingerprint.lines and fingerprint.ignored_header_bytes options.

This strategy avoids the common pitfalls associated with using device and inode names since inode names can be reused across files. This enables Vector to properly tail files across various rotation strategies.

Globbing

Globbing is supported in all provided file paths, files will be autodiscovered continually at a rate defined by the glob_minimum_cooldown option.

Line Delimiters

Each line is read until a new line delimiter (by default, i.e. the 0xA byte) or EOF is found. If needed, the default line delimiter can be overriden via the line_delimiter option.

Multiline Messages

Sometimes a single log event will appear as multiple log lines. To handle this, Vector provides a set of multiline options. These options were carefully thought through and will allow you to solve the simplest and most complex cases. Let’s look at a few examples:

Example 1: Ruy Exceptions

Ruby exceptions, when logged, consist of multiple lines:

foobar.rb:6:in `/': divided by 0 (ZeroDivisionError)
	from foobar.rb:6:in `bar'
	from foobar.rb:2:in `foo'
	from foobar.rb:9:in `<main>'

To consume these lines as a single event, use the following Vector configuration:

[sources.my_file_source]
	type = "file"
	# ...

	[sources.my_file_source.multiline]
		start_pattern = '^[^\s]'
		mode = "continue_through"
		condition_pattern = '^[\s]+from'
		timeout_ms = 1000

start_pattern, set to ^[^\s], tells Vector that new multi-line events should not start with white-space.
mode, set to continue_through, tells Vector continue aggregating lines until the condition_pattern is no longer valid (excluding the invalid line).
condition_pattern, set to ^[\s]+from, tells Vector to continue aggregating lines if they start with white-space followed by from.

Example 2: Line Continuations

Some programming languages use the backslash (\) character to signal that a line will continue on the next line:

First line\
second line\
third line

To consume these lines as a single event, use the following Vector configuration:

[sources.my_file_source]
	type = "file"
	# ...

	[sources.my_file_source.multiline]
		start_pattern = '\\$'
		mode = "continue_past"
		condition_pattern = '\\$'
		timeout_ms = 1000

start_pattern, set to \\$, tells Vector that new multi-line events start with lines that end in \.
mode, set to continue_past, tells Vector continue aggregating lines, plus one additional line, until condition_pattern is false.
condition_pattern, set to \\$, tells Vector to continue aggregating lines if they end with a \ character.

Example 3: Line Continuations

Activity logs from services such as Elasticsearch typically begin with a timestamp, followed by information on the specific activity, as in this example:

[2015-08-24 11:49:14,389][ INFO ][env                      ] [Letha] using [1] data paths, mounts [[/
(/dev/disk1)]], net usable_space [34.5gb], net total_space [118.9gb], types [hfs]

To consume these lines as a single event, use the following Vector configuration:

[sources.my_file_source]
type = "file"
# ...

[sources.my_file_source.multiline]
start_pattern = '^\[[0-9]{4}-[0-9]{2}-[0-9]{2}'
mode = "halt_before"
condition_pattern = '^\[[0-9]{4}-[0-9]{2}-[0-9]{2}'
timeout_ms = 1000

start_pattern, set to ^\[[0-9]{4}-[0-9]{2}-[0-9]{2}, tells Vector that new multi-line events start with a timestamp sequence.
mode, set to halt_before, tells Vector to continue aggregating lines as long as the condition_pattern does not match.
condition_pattern, set to ^\[[0-9]{4}-[0-9]{2}-[0-9]{2}, tells Vector to continue aggregating up until a line starts with a timestamp sequence.

File permissions

To be able to source events from the files, Vector must be able to read the files and execute their parent directories.

If you have deployed Vector as using one our distributed packages, then you will find Vector running as the vector user. You should ensure this user has read access to the desired files used as include. Strategies for this include:

Create a new unix group, make it the group owner of the target files, with read access, and add vector to that group
Use POSIX ACLs to grant access to the files to the vector user
Grant the CAP_DAC_READ_SEARCH Linux capability. This capability bypasses the file system permissions checks to allow Vector to read any file. This is not recommended as it gives Vector more permissions than it requires, but it is recommended over running Vector as root which would grant it even broader permissions. This can be granted via SystemD by creating an override file using systemctl edit vector and adding:
```
AmbientCapabilities=CAP_DAC_READ_SEARCH
CapabilityBoundingSet=CAP_DAC_READ_SEARCH
```

On Debian-based distributions, the vector user is automatically added to the adm group, if it exists, which has permissions to read /var/log.

Read Position

By default, Vector will read from the beginning of newly discovered files. You can change this behavior by setting the read_from option to "end".

Previously discovered files will be checkpointed, and the read position will resume from the last checkpoint. To disable this behavior, you can set the ignore_checkpoints option to true. This will cause Vector to disregard existing checkpoints when determining the starting read position of a file.

State

This component is stateless, meaning its behavior is consistent across each input.

Aug	SEP	Oct
	21
2020	2021	2022