Revisions to Parsing a large text file and then writing to separate files the output of each

deleted 4 characters in body

edited Nov 3, 2015 at 5:42

One, hopefully final, comment: it wouldn't be difficult to modify the above script to use the Perl DBI module with a PostgreSQL or MySQL database, storing all of these records in a searchable database with the case name as the indexed title field and the text in a text field.

deleted 4 characters in body

edited Nov 3, 2015 at 5:36

cas

Using the list-of-cases-volume-1.txt (with line numbers stripped and saved as cases2.txt), the output (using the Lambert* cases as examples) is:

$ mkdir -p out/
$ ./hef.pl cases2.txt
$ ls -1 out/Lambert*
out/Lambert v Aeretree 1 Lord Raymond 223, 91 ER 1045.txt
out/Lambert v Atkins and Another 2 Campbell 272, 170 ER 1153.txt
out/Lambert v Cook 1 Lord Raymond 237, 91 ER 1055.txt
out/Lambert v Oakes 1 Lord Raymond 443, 91 ER 1194.txt
out/Lambert v Pack 1 Salkeld 127, 91 ER 120.txt
out/Lambert v Peyton [1860] 7 House of Lords Cases 423, 11 ER 169.txt

Some of the input lines (4176 out of 12861 lines) were for duplicate case names, so I modified the script above to append the extra lines for that case to the existing file, with -=-=-=-=-=-=-=-= as the separator.
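The append behaviour described above could look roughly like the following. This is a minimal sketch, not the script's actual code: the write_case helper, the temporary directory and the sample text are assumptions made for illustration. Only the separator string comes from the description.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Separator string as described above.
my $SEPARATOR = '-=-=-=-=-=-=-=-=';

# Hypothetical helper (not from the original script): write $text to
# "$dir/$name.txt". If the file already exists, a separator line is
# written first and the text is appended instead of overwriting.
sub write_case {
    my ($dir, $name, $text) = @_;
    my $path   = "$dir/$name.txt";
    my $exists = -e $path;

    # '>>' creates the file when it is missing and appends otherwise.
    open(my $fh, '>>', $path) or die "Cannot open $path: $!";
    print {$fh} "$SEPARATOR\n" if $exists;
    print {$fh} $text;
    close($fh) or die "Cannot close $path: $!";
    return $path;
}

# Demo in a temporary directory (an assumption; the answer writes to out/).
my $dir = tempdir(CLEANUP => 1);
write_case($dir, 'Lambert v Cook', "first entry\n");
my $path = write_case($dir, 'Lambert v Cook', "second entry\n");

# Show the resulting file: both entries, separated by the separator line.
open(my $in, '<', $path) or die "Cannot open $path: $!";
print while <$in>;
close($in);
```

Checking for the file before opening it in append mode means the first record for a case is written without a leading separator, and every later duplicate is preceded by one.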

Some of the case titles were too long to be used as a filename, so I used substr($_,0,200) to limit the filename to the first 200 characters. Another alternative, which would result in filenames that are not human-readable, would be to use the md5sum hash of the case name as the filename. The case name would still be in the first line of the file.
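The two naming options can be compared side by side. This is only a sketch under stated assumptions: the helper names truncated_name and hashed_name are hypothetical, and appending ".txt" after truncating is a guess based on the ls output above. The hash variant uses the core Digest::MD5 module rather than calling the md5sum command.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

my $MAX_LEN = 200;    # character limit mentioned above

# Option 1 (hypothetical helper): keep only the first $MAX_LEN characters
# of the case name, then add the .txt suffix.
sub truncated_name {
    my ($case) = @_;
    return substr($case, 0, $MAX_LEN) . '.txt';
}

# Option 2 (hypothetical helper): use the MD5 hex digest of the case name.
# The name is encoded to UTF-8 bytes first because md5_hex works on bytes.
sub hashed_name {
    my ($case) = @_;
    return md5_hex(encode_utf8($case)) . '.txt';
}

my $case = 'Lambert v Peyton [1860] 7 House of Lords Cases 423, 11 ER 169';
print truncated_name($case), "\n";    # unchanged apart from the suffix
print hashed_name($case), "\n";       # 32 hex digits plus the suffix
```

One trade-off worth noting: two case names that share their first 200 characters map to the same truncated filename, whereas the hashed names stay distinct at the cost of readability.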

added 147 characters in body

edited Nov 3, 2015 at 5:30

cas

Some of the case titles were too long to be used as a filename, so I used substr($_,0,200) to limit the filename to the first 200 characters. Another alternative, which would result in filenames that are not human-readable, would be to use the md5sum hash of the case name as the filename. The case name would still be in the first line of the file.

added 937 characters in body

edited Nov 3, 2015 at 5:23

cas

fixed some typos

edited Nov 3, 2015 at 4:46

cas

made script print all text before first filename found to stdout to avoid error message.

edited Nov 3, 2015 at 2:43

cas

updated script to match `^XXX vs YYY` pattern

edited Nov 3, 2015 at 2:38

cas

updated script to match `^XXX vs YYY` pattern

edited Nov 3, 2015 at 2:33

cas

added 16 characters in body

edited Nov 2, 2015 at 3:11

cas

defaulted output to stdout

edited Nov 2, 2015 at 3:03

cas

answered Nov 2, 2015 at 0:23

cas
