What is a good strategy for transferring data from a spreadsheet to an RDBMS?

This follows on from another question of mine about moving from a spreadsheet to an RDBMS.

Having decided to move from an Excel workbook to an RDBMS, here is what I propose to do.

The existing data is loosely structured across two sheets in a workbook. The first sheet contains the master records; the second sheet holds the related detail data.

My target DBMS is MySQL, but I am open to suggestions.

  • Define an RDBMS schema (a minimal sketch of one possible layout follows this list).
  • Define, say, web services for interacting with the database, so that they can be used by both the migration and the application's user interface.
  • Write a migration script that will:
    • read each master row and its group of child rows from the spreadsheet,
    • apply validation / constraints,
    • write to the RDBMS via the web services.
  • Define macros / functions / modules in the spreadsheet to provide validation where possible. This lets the existing system stay in use while the new one is built, and (I hope) will reduce migration failures when the cut-over finally happens.
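
A minimal sketch of what such a master/detail schema might look like in MySQL. All table and column names here are purely illustrative assumptions, to be replaced once the real spreadsheet columns are known:

    -- Hypothetical master/detail schema; names and types are placeholders.
    CREATE TABLE master_record (
        master_id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
        reference_no  VARCHAR(50)  NOT NULL,      -- natural key from the first sheet, if one exists
        description   VARCHAR(255) NOT NULL,
        created_at    DATE         NULL,
        PRIMARY KEY (master_id),
        UNIQUE KEY uq_master_reference (reference_no)
    ) ENGINE=InnoDB;

    CREATE TABLE detail_record (
        detail_id   INT UNSIGNED  NOT NULL AUTO_INCREMENT,
        master_id   INT UNSIGNED  NOT NULL,       -- parent row from the first sheet
        item_code   VARCHAR(50)   NOT NULL,
        quantity    DECIMAL(10,2) NULL,
        notes       TEXT          NULL,
        PRIMARY KEY (detail_id),
        CONSTRAINT fk_detail_master
            FOREIGN KEY (master_id) REFERENCES master_record (master_id)
    ) ENGINE=InnoDB;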

What strategy would you follow?

+4
database rdbms spreadsheet
4 answers

There are two aspects to this issue.

Data transfer

The first step is to "Define an RDBMS schema", but how far are you going to take that? Spreadsheets are generally not normalized and therefore contain a lot of duplication. You say in your other question that "the data is loosely structured and there are no obvious constraints." If you want to turn that into a rigorously defined schema (at least 3NF), you are going to have to do some cleansing. SQL is the best tool for that kind of data wrangling.

I suggest you create two staging tables, one for each sheet. Define the columns as loosely as possible (mostly large string columns) so the spreadsheet data loads without errors. Once the data is in the staging tables, you can run queries to assess its quality (example DDL and queries follow the list below):

  • How many duplicate primary keys are there?
  • How many different data formats are in use?
  • What are the lookup codes?
  • Do all rows in the second sheet have a parent record in the first?
  • How consistent are code formats, data types, etc.?
  • and so on.
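
A sketch of what the staging tables and a couple of those quality checks might look like in MySQL. The column names are assumptions carried over from the hypothetical schema above:

    -- Loosely-typed staging tables: everything is a string, so the load never fails.
    CREATE TABLE stg_master (
        reference_no  VARCHAR(255),
        description   VARCHAR(1000),
        created_at    VARCHAR(255)
    );

    CREATE TABLE stg_detail (
        reference_no  VARCHAR(255),   -- link back to the master sheet
        item_code     VARCHAR(255),
        quantity      VARCHAR(255),
        notes         VARCHAR(4000)
    );

    -- How many duplicate "primary keys" are there?
    SELECT reference_no, COUNT(*) AS occurrences
    FROM   stg_master
    GROUP  BY reference_no
    HAVING COUNT(*) > 1;

    -- Do all rows in the second sheet have a parent record in the first?
    SELECT d.*
    FROM   stg_detail d
    LEFT   JOIN stg_master m ON m.reference_no = d.reference_no
    WHERE  m.reference_no IS NULL;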

These investigations will give you a good basis for writing the SQL with which you populate your actual schema.
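
For example, once the staging data looks acceptable, populating the real tables is mostly a matter of INSERT ... SELECT statements, with whatever cleansing the earlier checks showed to be necessary (again using the hypothetical table names from above; the date format is an assumption to be adjusted):

    -- Populate the real schema from staging, de-duplicating and trimming as we go.
    INSERT INTO master_record (reference_no, description, created_at)
    SELECT DISTINCT
           TRIM(reference_no),
           TRIM(description),
           STR_TO_DATE(created_at, '%d/%m/%Y')   -- assumes one date format; adjust per the checks
    FROM   stg_master
    WHERE  reference_no IS NOT NULL AND reference_no <> '';

    INSERT INTO detail_record (master_id, item_code, quantity, notes)
    SELECT m.master_id,
           TRIM(d.item_code),
           CAST(NULLIF(TRIM(d.quantity), '') AS DECIMAL(10,2)),
           d.notes
    FROM   stg_detail d
    JOIN   master_record m ON m.reference_no = TRIM(d.reference_no);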

Or it may turn out that the data is so hopeless that you decide to stick with just the two tables. I think that is an unlikely outcome (most applications have some underlying structure; we just have to dig deep enough).

Data loading

It is best to export the spreadsheets to CSV format; Excel has a wizard for this, so use it rather than just doing Save As.... If your spreadsheets contain any free text, you will have cells containing commas, so make sure you choose a genuinely safe delimiter, such as ^^~

Most RDBMS tools have a facility for importing data from CSV files. PostgreSQL and MySQL are the obvious choices for an NGO (I assume cost is a consideration), but both SQL Server and Oracle ship free (if limited) Express editions. SQL Server obviously has the better integration with Excel. Oracle has a neat feature called external tables, which lets us define a table whose data is held in a CSV file, removing the need for staging tables.
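
In MySQL, for instance, the CSV import would be a LOAD DATA statement along these lines (the file path, delimiter, and column list are assumptions matching the sketches above):

    -- Load the exported CSV into the loosely-typed staging table.
    -- Uses the ^^~ delimiter suggested above; adjust the path and columns to suit.
    LOAD DATA LOCAL INFILE '/tmp/master_sheet.csv'
    INTO TABLE stg_master
    FIELDS TERMINATED BY '^^~'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES          -- skip the header row
    (reference_no, description, created_at);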

Another thing to consider is Google App Engine. It uses Bigtable rather than an RDBMS, so it might actually suit your loosely structured data better. I mention it because you listed Google Docs as an alternative solution. GAE is an attractive option because it is free (more or less; they start charging once usage exceeds some very generous thresholds), and it would solve the problem of sharing the application with those other NGOs. Obviously your organisation may have reservations about hosting its data with Google; that depends on the field they work in and the sensitivity of the information.

+1

Obviously you need to create the target database and the necessary table structure. I would skip the web services and write a Groovy script that reads the .xls (using the POI library), validates it, and stores the data in the database.

In my opinion, anything more heavyweight (web services, a GUI, ...) is not justified: these tasks are a natural fit for scripts, because scripts are concise and extremely flexible, while performance, code scalability and similar concerns hardly matter here. Once you have something that works, you can adapt the script to any future document, with whatever data anomalies it contains, in minutes or hours.

All of this assumes that your data is not in perfect order and needs to be filtered and/or cleansed.

Alternatively, if the data and validation rules are not too complicated, you can get good results with a visual data transfer tool like Kettle: you simply define the .xls as the source, the database table as the target, add validation/filtering rules if necessary, and start the load. Pretty painless.

+1

You may be doing more work than you need to. Excel spreadsheets can be saved as CSV or XML files, and many RDBMS clients support importing those files directly into tables.

That might let you skip writing the web service wrappers and migration scripts. During any such import your database constraints will still be enforced. However, if your RDBMS data model or schema differs greatly from your Excel spreadsheets, some translation will of course have to be done via scripts or XSLT.

0

If you would rather use an off-the-shelf tool, check out SeekWell, which lets you write to your database from Google Sheets. After you define your schema, select the tables in the sheet, then edit or insert records and mark them for the appropriate action (e.g. update, insert, etc.). Set a schedule for the updates and you're done. You can read more about it here. Disclaimer: I am a co-founder.

Hope this helps!

0
