I have horribly formatted data stored in 777 .doc files, where each .doc file contains a big Excel table, like the one here and in Fig. 1.
Here, I only consider one .doc file.
I want to divide the Excel table of the .doc file into many CSV files, using any Unix programming language and/or scripting, so that I can do data analysis on the cells systematically.
I cannot find a way to convert Microsoft file formats into CSV files.
Pseudocode:
Extract the Excel table from the .doc file.
Split the Excel table (maybe convert it to CSV here already) into 3 separate tables as CSV files, by the Rule:
new bold text indicates a new table, i.e. a new CSV file.
Apply the implicit columns Location (bottom/top) and Date (dd.mm.yyyy), found in the first two lines of the .doc file, to each separate CSV file. Use a Time column (morning/evening/night).
- I will soon test RubberStamp's proposal with PostgreSQL
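The last pseudocode step could be sketched in Python roughly as follows. This is only a minimal sketch under assumptions: the function names `parse_header` and `add_implicit_columns` are hypothetical, the first line is assumed to contain the date as dd.mm.yyyy, and the second line is assumed to list the locations as words ending in a colon.

```python
import re
from datetime import datetime


def parse_header(first_line, second_line):
    """Read the implicit Date and Location values from the two header lines."""
    match = re.search(r"\b(\d{2}\.\d{2}\.\d{4})\b", first_line)
    if match is None:
        raise ValueError("no dd.mm.yyyy date found in the first line")
    # strptime also rejects impossible dates such as 31.02.2011
    date = datetime.strptime(match.group(1), "%d.%m.%Y").date()
    # Assumption: "bottom: top:" lists the locations in column order
    locations = [word.rstrip(":") for word in second_line.split()]
    return date, locations


def add_implicit_columns(values_with_time, date, location):
    """Turn (value, time) pairs into (value, date, location, time) rows."""
    return [
        (value, date.strftime("%d.%m.%Y"), location, time)
        for value, time in values_with_time
    ]
```

For example, `parse_header("Report, date: 11.11.2011 ", "bottom: top:")` gives the date 11.11.2011 and the locations `["bottom", "top"]`, which can then be attached to every row of each separate table.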
The problem is that I get many such Excel tables every day.
I am wondering whether I can do any analysis on them without a heavy manual workload.
There are many parameters, which I would store in at least 3 tall tables. Grouping of the target files and their columns by the Rule:
by location: bottom and top
by time: morning, evening and night
by workers: assistants and other assistants
by general description of events: General, which is then again grouped by time: morning, evening and night
Examples of data expanded into normalised tables:
location | general | time
bottom | Mainly peaceful. ... | morning
location | time | assistants
bottom | morning | Ilk
location | time | other assistants
bottom | morning | Sat
bottom | morning | Kat
I think the initial step can be conversion to CSV, but I am wondering whether there is any tool that helps to design how many tables the data needs.
- Assistants.csv - Name, Date, Location, Time
- Other.Assistants.csv - Name, Date, Location, Time
- General.csv - Event, Date, Location, Time
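Since new reports arrive every day, one option is to append each report's rows to the three target files instead of rewriting them. The following is a minimal sketch, not a finished tool: the helper name `append_rows` and the output directory are hypothetical, and the rows are assumed to be already normalised into the column order listed above.

```python
import csv
from pathlib import Path

# Target files and their columns, as listed above
TARGETS = {
    "Assistants.csv": ("Name", "Date", "Location", "Time"),
    "Other.Assistants.csv": ("Name", "Date", "Location", "Time"),
    "General.csv": ("Event", "Date", "Location", "Time"),
}


def append_rows(out_dir, filename, rows):
    """Append rows to one target file, writing the header only once."""
    path = Path(out_dir) / filename
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(TARGETS[filename])
        writer.writerows(rows)
    return path
```

Appending keeps each target file as one tall table across all reports, so it can later be loaded in a single step by a database or a data analysis tool.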
Fig. 1 Example of Excel Table in WPS Spreadsheet
.doc file
Copy pasted data from tabular content:
Report, date: 11.11.2011 bottom: top: Assistants morning: Ilk
Vir evening: Adr Ris night: Sai Pir Other assistants: morning: Sat,
Kat Joh, Juh evening: Sam, Mar & Sel Kir, Kar night: Osk Sam
General: morning: Mainly peaceful.
Loudy boys. Peaceful.
2 customers home. evening: Peaceful. Peaceful atmosphere. night:
One customer special help, but mainly peaceful. Extra care for one
customer.
Exported as CSV
"Report, date: 11.11.2011 ",,
"bottom: top:",,
Assistants,,
morning:,Ilk ,Vir
evening:,Adr,Ris
night:,Sai,Pir
Other assistants:,,
morning:,"Sat, Kat","Joh, Juh"
evening: ,"Sam, Mar & Sel","Kir, Kar "
night:,Osk,Sam
General:,,
morning:,Mainly peaceful. ,Peaceful.
,,
,Loudy boys. ,2 customers home.
evening:,Peaceful. ,Peaceful atmosphere.
,,
night:,,Extra care for one customer.
,"One customer special help, but mainly peaceful. ",
,,
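Bold formatting is lost in the CSV export, so the Rule has to be approximated. Below is a minimal Python sketch that assumes the section names (Assistants, Other assistants, General) in the first column mark a new table, that a row with an empty first column continues the previous time slot, that the second and third columns are bottom and top, and that names inside one cell are separated by "," or "&". The function name `split_sections` and the shortened `SAMPLE` are only for illustration; the Date column from the report header is left out here.

```python
import csv
import io
import re

SAMPLE = '''"Report, date: 11.11.2011 ",,
"bottom: top:",,
Assistants,,
morning:,Ilk ,Vir
Other assistants:,,
morning:,"Sat, Kat","Joh, Juh"
evening: ,"Sam, Mar & Sel","Kir, Kar "
General:,,
morning:,Mainly peaceful. ,Peaceful.
,,
,Loudy boys. ,2 customers home.
night:,,Extra care for one customer.
,"One customer special help, but mainly peaceful. ",
'''

SECTIONS = {
    "assistants": "Assistants",
    "other assistants": "Other assistants",
    "general": "General",
}
TIMES = ("morning", "evening", "night")


def split_sections(text, locations=("bottom", "top")):
    """Return {section: [(value, location, time), ...]} from the exported CSV."""
    tables = {name: [] for name in SECTIONS.values()}
    section = time = None
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        label = row[0].strip().rstrip(":").lower()
        if label in SECTIONS:  # stands in for "new bold text": start a new table
            section, time = SECTIONS[label], None
            continue
        if label in TIMES:
            time = label
        if section is None or time is None:
            continue  # header rows before the first section
        for location, cell in zip(locations, row[1:]):
            cell = cell.strip()
            if not cell:
                continue
            if section == "General":
                values = [cell]  # keep event descriptions whole
            else:
                values = [v.strip() for v in re.split(r"[,&]", cell) if v.strip()]
            tables[section].extend((v, location, time) for v in values)
    return tables


tables = split_sections(SAMPLE)
```

With this sample, `tables["Assistants"]` holds one row for Ilk (bottom, morning) and one for Vir (top, morning). Whether the event cells of one time slot should instead be joined into a single row is a design choice that this sketch does not settle.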
Expected output: tall tables of normalised data, split over several tables
OS: Linux Debian Stretch 9.1 and others
Data: Google Drive link here, which contains the .doc file with the Excel table in its original setting; .odt file here