How to split Excel table into CSV files in .doc by Bold text?

Question

I have 777 .doc files, each containing one large Excel table, like the one here and in Fig. 1. For now, consider a single .doc file. I want to split the Excel table in that .doc file into CSV files using any Unix programming or scripting language. I have not found a way to convert these Microsoft file formats into CSV files. Pseudocode:

Extract Excel table from .doc file, which is expanded in the thread How to extract many .doc text + tabular elements into CSV by any Unix tool?
Split Excel table (maybe convert here already to CSV) into separate .CSV files by Rule:

A new bold heading starts a new table, i.e. a new CSV file.
Apply the implicit columns Location (bottom/top) and Date (dd.mm.yyyy), taken from the first two lines of the .doc file, to each separate CSV file. Use the Time column (morning/evening/night).

Target files with their columns by Rule

Assistants.csv - Name, Date, Location, Time
Other.Assistants.csv - Name, Date, Location, Time
General.csv - Event, Date, Location, Time
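
The splitting rule above can be sketched in Python. This is only a minimal sketch, and it assumes the table has already been extracted from the .doc file into a list of `(cells, is_bold)` pairs; that input shape, the function name `split_by_bold`, and the sample rows are hypothetical, not something the question specifies. It shows only the "bold row starts a new CSV file" step, with Date and Location appended to every row.

```python
import csv
import io


def split_by_bold(rows, date, location):
    """Return {filename: csv_text}, one entry per bold heading row.

    rows is an iterable of (cells, is_bold) pairs (assumed input shape).
    """
    buffers = {}
    writer = None
    for cells, is_bold in rows:
        if is_bold:
            # A bold row starts a new table; its first cell names the file.
            name = cells[0].strip().replace(" ", ".") + ".csv"
            buffers[name] = io.StringIO()
            writer = csv.writer(buffers[name], lineterminator="\n")
            continue
        if writer is None or not any(c.strip() for c in cells):
            continue  # skip rows before the first heading and empty rows
        value = cells[0].strip()
        time = cells[1].strip() if len(cells) > 1 else ""
        # Column order follows the targets: Name/Event, Date, Location, Time.
        writer.writerow([value, date, location, time])
    return {name: buf.getvalue() for name, buf in buffers.items()}


# Made-up sample rows, for illustration only.
rows = [
    (["Assistants"], True),
    (["Vir", "morning"], False),
    (["Other Assistants"], True),
    (["Name B", "night"], False),
    (["General"], True),
    (["Meeting", "evening"], False),
]
files = split_by_bold(rows, date="23.10.2017", location="top")
for name, text in files.items():
    print(name, repr(text))
```

The sketch writes no header row and keeps the CSV text in memory; writing each entry to disk is left out.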

Fig. 1 Example of Excel Table in .doc file

OS: Linux Debian Stretch 9 and others
Data: .odt file here

I'm not sure this is Linux/UNIX specific, it's more a problem of data parsing and a broad range of non-platform specific tools might work? Stack Overflow? — EightBitTony
– EightBitTony, Commented Oct 23, 2017 at 12:00
Maybe state that in the question then. Otherwise I've got an Excel macro which would probably do it in seconds (I don't, but you get my point). — EightBitTony
– EightBitTony, Commented Oct 23, 2017 at 12:06
My go-to solution for this type of problem is to create a set of tables in a database. I use postgresql, some other people may recommend others. However, if your question is: How can I process an Excel Spreadsheet File via commandline tools in GNU/Linux? ... and since you've tagged Debian, I can point you to the package libspreadsheet-parseexcel-simple-perl — RubberStamp
– RubberStamp, Commented Oct 23, 2017 at 12:34
It would be extremely helpful if you were to suggest what you want the output to look like. — glenn jackman
– glenn jackman, Commented Oct 23, 2017 at 14:13
I can give a few hints for creating databases and tables in postgresql, as well as a few select statements which will perhaps yield something useful. Unfortunately, I don't work with Excel files... I ask for or immediately export any Excel data into CSV and usually perform a lot of manual manipulation and parsing of the CSV to get the desired columns into the desired tables. Once there is proper CSV, then importing into psql is very easy via the \copy ... WITH CSV command. I'm sure one could write a program to automatically parse an Excel file, but that would be very solution specific. — RubberStamp
– RubberStamp, Commented Oct 23, 2017 at 14:25

RubberStamp · Accepted Answer · 2017-10-23 19:20:21Z

OK...

Begin Mini Tutorial

So, here are some hints on generating a postgresql database to import your daily reports.

First, install postgresql if you haven't yet:

$ sudo apt-get install postgresql

Second, if you are not familiar with postgresql: the default installation of postgresql in Debian is set up to allow each user to log in through peer authentication with no password. However, you have to create a database that is owned by the user.

Here is how to do that:

Drop into a privileged shell:

$ sudo -s

Become the postgres superuser:

# su postgres

Create a database for the user to play in:

postgres$ createdb dbname -O user

Then exit twice to get back to userland:

postgres$ exit
# exit
$

You should now be ready to begin using postgresql.

I've generated an SQL file that can be imported to make the tables. You can copy and paste the following into something like tables.sql

CREATE TYPE shifts AS ENUM ('morning','evening','night');
CREATE TYPE titles AS ENUM ('assistant','other_assistant');

CREATE TABLE assistants (id integer, name char(20), title titles);
CREATE TABLE disposition (id integer, name char(20), shift shifts, day date, comments text);
CREATE TABLE schedule (id integer, name1 char(20), name2 char(20), name3 char(20), name4 char(20), name5 char(20), shift shifts, day date);

And then import the tables:

$ psql dbname
dbname=> \i tables.sql

If you parse your daily report into three separate CSV files, each file can be imported directly into each individual table using the \copy command.

Something like this:

\copy assistants FROM '~/assistants.csv' WITH (FORMAT csv);
\copy disposition FROM '~/dispositions.csv' WITH (FORMAT csv);
\copy schedule FROM '~/schedule.csv' WITH (FORMAT csv);
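
For the import to work, each CSV row has to match the column order of its table. As a rough sketch, assuming parsed records arrive as Python dictionaries with the hypothetical keys `name`, `shift`, `date` and `comments`, the following produces rows in the order of the disposition table (id, name, shift, day, comments). It also converts the dd.mm.yyyy date from the report into ISO format and leaves id empty, which COPY in CSV format reads as NULL by default. The function name `disposition_csv` is made up for illustration.

```python
import csv
import io
from datetime import datetime


def disposition_csv(records):
    """Return CSV text with columns id, name, shift, day, comments."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for rec in records:
        # Convert dd.mm.yyyy into ISO yyyy-mm-dd for the date column.
        day = datetime.strptime(rec["date"], "%d.%m.%Y").date().isoformat()
        # The id column is left empty (unquoted), i.e. NULL on import.
        writer.writerow(["", rec["name"], rec["shift"], day, rec["comments"]])
    return buf.getvalue()


text = disposition_csv([
    {"name": "Vir", "shift": "morning", "date": "23.10.2017",
     "comments": "Peaceful"},
])
print(text, end="")
```

Values for shift must be one of 'morning', 'evening' or 'night', otherwise the enum type rejects the row on import.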

This would fill in your tables with data and allow you to perform queries like finding out who made comments today and what those were...

Something like this:

 select * from disposition where day = 'TODAY';

Might produce the following output:

 id |         name         |  shift  |    day     | comments 
----+----------------------+---------+------------+----------
    | Vir                  | morning | 2017-10-23 | Peaceful

End Mini Tutorial

Is any of this helpful? Or am I thinking too deeply or just confusing you?

It is great! I am thinking about the initial step. I have the Excel tables in .doc files. So I am thinking of 1) extracting the Excel tables from the .doc files, 2) splitting each Excel table into 3 separate tables, and 3) your proposal. - - What do you think about (1-2)? I have many such .doc files containing Excel tables. — Léo Léopold Hertz 준영
– Léo Léopold Hertz 준영, Commented Oct 24, 2017 at 13:30
@LéoLéopoldHertz준영 ... If you're willing to wait a few days or so (maybe/probably longer), I can try to figure out how to work directly with the Excel files. I know this was the original question, but I'll have to mull over the libraries... I also found a C library that may work as well. This is better for me, as C is my 'native' language so to speak... My table definitions above are not very 'tidy', and I don't recommend using them for production work. However, it's good to see you liked what I presented so far. Thanks! — RubberStamp
– RubberStamp, Commented Oct 24, 2017 at 16:57
Yes, I am ready to wait. I can write C too, so please use C if you want. Any language is welcome because I would really like to have an open-source solution for the basic problem in Linux. I think there are still some gaps in the logic of how to solve the case, so any language is welcome. - - Please note that I added a basic .doc file containing the Excel tables, which may ease our co-operation. - - I also really, really love PostgreSQL! — Léo Léopold Hertz 준영
– Léo Léopold Hertz 준영, Commented Oct 24, 2017 at 16:59
