Generate a table schema by inspecting an Excel (CSV) file and import the data

How would I go about creating a MySQL table schema that validates an Excel (or CSV) file? Are there any Python libraries available for this task?

The column headers would be mapped to column names, and each column's data type would be inferred from the contents of that column. Once that is done, the data would be loaded into the table.

I have an Excel file with ~200 columns that I want to start normalizing.

python mysql excel csv import-from-excel
5 answers

For my own reference, I documented below what I did:

  • xlrd is practical; however, I just saved the Excel data as CSV so that I could use LOAD DATA INFILE.
  • I copied the header row and started writing the import and normalization script.
  • The script does CREATE TABLE with all columns as TEXT, except the primary key.
  • It queries MySQL with LOAD DATA LOCAL INFILE to load all the CSV data into the TEXT fields.
  • Based on the output of PROCEDURE ANALYSE, I was able to ALTER TABLE to give the columns the correct types and lengths. PROCEDURE ANALYSE returns ENUM for any column with few distinct values, which is not what I needed, but I found it useful later for normalization. Eyeballing the 200 columns was a breeze with PROCEDURE ANALYSE. The table structure proposed by phpMyAdmin's output was of no use.
  • I wrote some normalization code, mostly using SELECT DISTINCT on columns and INSERTing the results into separate tables. First, I added an FK column to the old table. Immediately after each INSERT, I had its identifier and UPDATEd the FK column. When the loop finished, I dropped the old column, leaving only the FK column. Similarly with multiple dependent columns. It was much faster than I expected. (Sketches of these steps follow this list.)
  • I ran (Django's) python manage.py inspectdb, copied the output to models.py and added all those ForeignKeyFields, since FKs do not exist in MyISAM. I wrote some Python views.py, urls.py, a few templates... TADA.
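
For illustration, here is a minimal sketch of the staging steps above (all-TEXT table, bulk load, PROCEDURE ANALYSE), assuming a hypothetical data.csv with a header row and MySQLdb as the driver; table and connection names are made up, and note that PROCEDURE ANALYSE was removed in MySQL 8.0:

    import csv
    import MySQLdb

    conn = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                           db='staging', local_infile=1)
    cur = conn.cursor()

    # Read the header row and create a staging table with every column as TEXT.
    with open('data.csv') as f:
        header = [h.strip() for h in next(csv.reader(f))]
    cols = ', '.join('`%s` TEXT' % h for h in header)
    cur.execute('CREATE TABLE raw_import '
                '(id INT AUTO_INCREMENT PRIMARY KEY, %s)' % cols)

    # Bulk-load the CSV into the TEXT columns, skipping the header row.
    cur.execute("""
        LOAD DATA LOCAL INFILE 'data.csv' INTO TABLE raw_import
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
        IGNORE 1 LINES
        (%s)""" % ', '.join('`%s`' % h for h in header))
    conn.commit()

    # Let MySQL suggest an optimal type for each column (pre-8.0 only).
    cur.execute('SELECT * FROM raw_import PROCEDURE ANALYSE()')
    for row in cur.fetchall():
        print(row[0], row[-1])   # qualified field name, suggested type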
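
And a sketch of one round of the normalization loop, pulling a repeating column out into its own table and replacing it with an FK; the city column and lookup table are invented for the example:

    # Split the repeating TEXT column `city` out into a lookup table.
    cur.execute('CREATE TABLE city '
                '(id INT AUTO_INCREMENT PRIMARY KEY, name TEXT)')
    cur.execute('ALTER TABLE raw_import ADD COLUMN city_id INT')

    cur.execute('SELECT DISTINCT city FROM raw_import')
    for (value,) in cur.fetchall():
        cur.execute('INSERT INTO city (name) VALUES (%s)', (value,))
        cur.execute('UPDATE raw_import SET city_id = %s WHERE city = %s',
                    (cur.lastrowid, value))   # id of the row just inserted

    # Keep only the FK column.
    cur.execute('ALTER TABLE raw_import DROP COLUMN city')
    conn.commit()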

Use the xlrd module; start here. [Disclaimer: I'm the author.] xlrd classifies cells as text, number, date, boolean, error, blank, or empty. It distinguishes dates from numbers by inspecting the format associated with the cell (for example, "dd/mm/yyyy" versus "0.00").
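
A small illustration of that classification, assuming a hypothetical data.xls (the cell-type constants and date helper are part of xlrd's API):

    import xlrd

    book = xlrd.open_workbook('data.xls')
    sheet = book.sheet_by_index(0)

    for col in range(sheet.ncols):
        cell = sheet.cell(1, col)        # first data row, below the header
        if cell.ctype == xlrd.XL_CELL_DATE:
            # Dates are stored as float serials; decode with the book's datemode.
            print(col, 'date', xlrd.xldate_as_tuple(cell.value, book.datemode))
        elif cell.ctype == xlrd.XL_CELL_NUMBER:
            print(col, 'number', cell.value)
        elif cell.ctype == xlrd.XL_CELL_TEXT:
            print(col, 'text', cell.value)
        else:
            print(col, 'boolean/error/blank/empty')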

The job of writing code that wades through user data to decide which DB data type to use for each column is not something that can be automated easily. You should be able to eyeball the data, assign types such as integer, money, text, date, datetime, time, etc., and write code to check your guesses. Note that you need to be able to cope with things like numeric or date data entered in text fields (they can look OK in the GUI). You need a strategy for handling cells that do not fit the "estimated" data type. You need to validate and clean your data. Make sure you normalize text strings: strip leading/trailing whitespace and replace runs of whitespace with a single space. Excel text is (BMP-only) Unicode; don't bash it into ASCII or "ANSI". Work in Unicode and encode in UTF-8 to put it into your database.
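
For example, a minimal cleaning helper along those lines (the function name is mine, not from the answer):

    def clean_text(value):
        # Strip leading/trailing whitespace and collapse internal runs
        # of whitespace to a single space.
        return ' '.join(value.split())

    # Work in Unicode throughout; encode to UTF-8 only at the database
    # boundary (most modern drivers accept Unicode and do this for you).
    payload = clean_text(u'  two   spaces\tand a tab ').encode('utf-8')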


A quick and dirty workaround with phpMyAdmin:

  • Create a table with the desired number of columns. Make sure the data fits the columns.
  • Import the CSV into the table.
  • Use the "Propose table structure" feature.

As far as I know, there is no tool that automates this process (I would love for someone to prove me wrong, because I have had this exact problem before). When I did this, I came up with two options:
(1) manually create the columns in the DB with the appropriate types and then import, or
(2) write some kind of filter that could "figure out" which data types the columns should use.
I went with the first option, mainly because I didn't think I could write a program to do the type inference.
If you do decide to write a tool for type inference, here are a few problems you will have to deal with:
(1) Excel dates are actually stored as the number of days since December 31, 1899; how do you infer that a column holds dates rather than some other numeric data (e.g. population counts)?
(2) For text fields, do you simply create columns of type varchar(n), where n is the longest entry in that column, or do you make it an unbounded text field if one of the entries is longer than some upper bound? If so, what is a good upper bound?
(3) How do you automatically convert a float to a decimal with the correct precision without wasting space?
Obviously, none of this means it can't be done (I'm a pretty bad programmer). I hope you do write it, because it would be a really useful tool.
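
To make those trade-offs concrete, here is a rough, hypothetical sketch of such a filter; the specific rules (the one date format tried, the 255-character varchar cutoff) are arbitrary illustrations, not recommendations:

    import datetime

    def excel_serial_to_date(serial):
        # Excel's day 0 is effectively 1899-12-30 (because of the fake
        # 1900-02-29); this conversion is only valid for serials >= 61.
        return datetime.date(1899, 12, 30) + datetime.timedelta(days=int(serial))

    def guess_column_type(values):
        non_empty = [v.strip() for v in values if v.strip()]
        if not non_empty:
            return 'TEXT'
        try:
            [int(v) for v in non_empty]
            return 'INT'          # problem (1): these could still be date serials
        except ValueError:
            pass
        try:
            [float(v) for v in non_empty]
            return 'DOUBLE'       # problem (3): picking DECIMAL(m, d) is harder
        except ValueError:
            pass
        try:
            [datetime.datetime.strptime(v, '%Y-%m-%d') for v in non_empty]
            return 'DATE'
        except ValueError:
            pass
        longest = max(len(v) for v in non_empty)  # problem (2): the varchar bound
        return 'VARCHAR(%d)' % longest if longest <= 255 else 'TEXT'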


Pandas can return the schema:

    pandas.read_csv('data.csv').dtypes
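
Going a step further, pandas can also create the table and load the data in one go via to_sql; this needs SQLAlchemy, and the connection string and table name here are placeholders:

    import pandas as pd
    from sqlalchemy import create_engine

    df = pd.read_csv('data.csv')
    print(df.dtypes)                  # the inferred per-column types

    engine = create_engine('mysql://user:secret@localhost/staging')
    df.to_sql('imported', engine, index=False)   # CREATE TABLE + INSERT rows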
