
When I read data from a big text file block by block, I get the error below:

Unfinished UTF-8 octet sequence (at offset 4096)

Code:

File file = File(path!);
RandomAccessFile _raf = await file.open();
_raf.setPositionSync(skip ?? 0);
var data = _raf.readSync(block);// block = 64*64 
content.value = utf8.decode(data.toList());
  • See github.com/dart-lang/ffi/issues/32 Commented Sep 2, 2021 at 8:31
  • What I am building is an app for reading novels. I want to read files page by page and record the reading position, instead of loading the whole novel at once. So how can I read only the contents of a specified range? Commented Sep 3, 2021 at 1:50

1 Answer


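UTF-8 is a variable-length encoding: each character takes 1 to 4 bytes. The error occurs because the block you read does not start or end on a character boundary. One workaround is to trim the bytes at the left and right ends of the data before calling utf8.decode. This loses the last character (and the first one, if the block starts in the middle of it). Alternatively, you can read a few more bytes so that the last character is complete and the data ends on a UTF-8 boundary.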

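import 'dart:convert';
import 'dart:io';

// True for a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool isDataByte(int i) {
  return (i & 0xc0) == 0x80;
}

Future<void> main(List<String> arguments) async {
  const skip = 0; // byte offset to start reading from (placeholder value)
  var _raf = await File('utf8.txt').open();
  _raf.setPositionSync(skip);
  var data = _raf.readSync(8 * 8);
  await _raf.close();

  var utfData = data.toList();
  int l, r;
  // Skip continuation bytes of a character that started before the block.
  // The bounds check comes first so the index is never out of range.
  for (l = 0; l < utfData.length && isDataByte(utfData[l]); l++) {}

  // Walk back to the lead byte of the last character. That character may be
  // incomplete, so sublist(l, r) leaves it out.
  for (r = utfData.length - 1; r > l && isDataByte(utfData[r]); r--) {}
  if (r < l) r = l; // empty block, or nothing but continuation bytes
  var value = utf8.decode(utfData.sublist(l, r));
  print(value);
}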

Optionally, read 4 extra bytes and extend the range to the right so that it covers the whole last character:


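import 'dart:convert';
import 'dart:io';

// True for a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool isDataByte(int i) {
  return (i & 0xc0) == 0x80;
}

Future<void> main(List<String> arguments) async {
  const skip = 0; // byte offset to start reading from (placeholder value)
  var _raf = await File('utf8.txt').open();
  _raf.setPositionSync(skip);
  var block = 8 * 8;
  // A character has at most 3 continuation bytes, so 4 extra bytes are enough.
  var data = _raf.readSync(block + 4);
  await _raf.close();

  var utfData = data.toList();
  int l, r;
  // Skip continuation bytes of a character that started before the block.
  for (l = 0; l < utfData.length && isDataByte(utfData[l]); l++) {}

  // Start at the nominal end of the block (or the end of the data, if the
  // file is shorter) and move right past the last character's continuation bytes.
  r = block < utfData.length ? block : utfData.length;
  for (; r < utfData.length && isDataByte(utfData[r]); r++) {}

  var value = utf8.decode(utfData.sublist(l, r));
  print(value);
}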
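
For the page-by-page reader described in the question, the same boundary check can also return the offset at which the next page should start, so that consecutive pages neither overlap nor lose characters. The following is only a sketch under assumptions: `Page`, `decodePage` and `readPage` are hypothetical names that do not appear in the original answer, and the file is assumed to contain valid UTF-8.

```dart
import 'dart:convert';
import 'dart:io';

/// True for a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool isDataByte(int b) => (b & 0xc0) == 0x80;

/// Hypothetical result type: the decoded text of one page and the file
/// offset at which the following page begins.
class Page {
  final String text;
  final int nextOffset;
  Page(this.text, this.nextOffset);
}

/// Decodes roughly [block] bytes from [bytes], which were read starting at
/// file offset [offset] and should contain up to block + 3 bytes.
Page decodePage(List<int> bytes, int offset, int block) {
  // Skip continuation bytes if [offset] fell inside a character.
  var l = 0;
  while (l < bytes.length && isDataByte(bytes[l])) {
    l++;
  }
  // Extend the right edge until it sits on the next lead byte.
  var r = block < bytes.length ? block : bytes.length;
  if (r < l) r = l;
  while (r < bytes.length && isDataByte(bytes[r])) {
    r++;
  }
  return Page(utf8.decode(bytes.sublist(l, r)), offset + r);
}

/// Reads one page of about [block] bytes from the file at [path].
Page readPage(String path, int offset, int block) {
  final raf = File(path).openSync();
  try {
    raf.setPositionSync(offset);
    // 3 extra bytes cover the longest possible tail of a 4-byte character.
    return decodePage(raf.readSync(block + 3), offset, block);
  } finally {
    raf.closeSync();
  }
}
```

Storing `nextOffset` as the reading position and passing it as `offset` for the next call means every page starts on the first byte of a character, which is the position the answerer recommends recording.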

5 Comments

Thank you for your help, but what I am building is an app for reading novels. I want to read files page by page and record the reading position, instead of loading the whole novel at once.
I see. The problem is that a UTF-8 character takes 1 to 4 bytes, so you may seek into the middle of a character. If that is the case, there are two ways to solve it: 1) change the encoding to UTF-16, which uses 2 bytes for most characters (characters outside the Basic Multilingual Plane take 4); 2) scan the file from the beginning and index the locations of newlines and page breaks, then seek to an indexed location instead of a block offset.
I edited the answer. I hope this is closer to your question. To record a location, record the position of the first byte of a character, which you can find by checking !isDataByte().
Very useful solution, although there is some duplication in the data obtained. The isDataByte function really helps me.
The for (skip) loop was for testing only: I tried seeking to many locations to make sure the concept works. I forgot to remove it :)
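
The second option from the comments above, scanning the file once and indexing where lines start, could look roughly like the sketch below. It is an illustration only: `readChunks`, `lineStarts` and `readLines` are hypothetical names, the 64 KiB chunk size is an arbitrary choice, and lines are assumed to end with the byte 0x0A.

```dart
import 'dart:convert';
import 'dart:io';

/// Yields the file at [path] in chunks of at most [size] bytes.
Iterable<List<int>> readChunks(String path, {int size = 64 * 1024}) sync* {
  final raf = File(path).openSync();
  try {
    while (true) {
      final chunk = raf.readSync(size);
      if (chunk.isEmpty) break;
      yield chunk;
    }
  } finally {
    raf.closeSync();
  }
}

/// Byte offsets at which each line starts. The byte 0x0A never occurs inside
/// a multi-byte UTF-8 sequence, so every offset is a character boundary.
List<int> lineStarts(Iterable<List<int>> chunks) {
  final starts = <int>[0];
  var pos = 0;
  for (final chunk in chunks) {
    for (final b in chunk) {
      pos++;
      if (b == 0x0a) starts.add(pos);
    }
  }
  return starts;
}

/// Decodes lines [first] (inclusive) to [last] (exclusive) of the file,
/// using an index produced by [lineStarts].
String readLines(String path, List<int> starts, int first, int last) {
  final raf = File(path).openSync();
  try {
    raf.setPositionSync(starts[first]);
    final end = last < starts.length ? starts[last] : raf.lengthSync();
    return utf8.decode(raf.readSync(end - starts[first]));
  } finally {
    raf.closeSync();
  }
}
```

With such an index, a page can be defined as a fixed number of lines and the reading position stored as a line number. The trade-off is one full pass over the file up front, and if the file ends with a newline the last index entry points at an empty final line.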
