Recommended solution for extracting and transforming data from various third-party APIs

We are building new features for one of our financial applications. We have our own SQL Server database, and we will be calling several RESTful APIs that return JSON responses: some return news data, some return stock information, some return financial data, while our SQL Server database holds employee data. Each source therefore has its own data format. The new application we are creating will collect all this data and turn it into a meaningful display on the web, along the lines of mint.com.

  • The web application will display analytical reports based on this data.
  • Reports can be downloaded through various templates.

We are fully open in terms of technology stack for our backend and middle tier. NoSQL is our first thought: MongoDB for storage and Elasticsearch for search and reporting. A web application (working on data saved from or retrieved via the APIs) will be built on top of this data, most likely in ASP.NET MVC.

We need your input, especially if you have experience building this kind of enterprise solution.

Can you share your opinion on the following?

  • What technology stack would you choose for this application?
  • How will it scale, now and in the future, as the APIs' data formats change?
  • Performance is also important, since the data will be displayed in a web interface.
api architecture database-design aggregation-framework middleware
3 answers

We have a similar setup to the one you describe: ASP.NET MVC with Elasticsearch (SQL Server for relational data, periodically synced to ES), aggregating XML/JSON data from several sources, although in our case to improve search and result filtering rather than for reporting. Still, I expect the scenario you are looking at would also be a good fit for Elasticsearch, depending on your specific requirements.

1) Since you are already using SQL Server (and, I assume, are familiar with it), I would suggest combining it with Elasticsearch. An additional MongoDB layer seems unnecessary: it means supporting, and developing against, yet another technology. There is a very good C# library (actually Elasticsearch.Net and NEST used together) that exposes most of the ES functionality.
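To illustrate, here is a minimal sketch of indexing and searching with NEST (NEST 7.x syntax; the StockQuote class, index name, and field values are my own illustrative assumptions, not from the original setup):

    // Minimal NEST sketch: index a document and search it back.
    using System;
    using Nest;

    public class StockQuote
    {
        public string Symbol { get; set; }
        public decimal Price { get; set; }
        public DateTime RetrievedAt { get; set; }
    }

    public static class EsExample
    {
        public static void Main()
        {
            var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
                .DefaultIndex("stock-quotes");
            var client = new ElasticClient(settings);

            // Index a document pulled from one of the third-party APIs.
            client.IndexDocument(new StockQuote
            {
                Symbol = "AAPL",
                Price = 175.20m,
                RetrievedAt = DateTime.UtcNow
            });

            // Search it back with a simple match query on the symbol field.
            var response = client.Search<StockQuote>(s => s
                .Query(q => q
                    .Match(m => m
                        .Field(f => f.Symbol)
                        .Query("AAPL"))));

            Console.WriteLine($"Hits: {response.Documents.Count}");
        }
    }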

2) We chose Elasticsearch for its scalability combined with flexibility and ease of use. One thing you may run into is mapping your C# classes to Elasticsearch documents. It is remarkably easy to get started, but you need to do some planning to index the data the way you want to search and retrieve it. So if you choose ES as your platform, spend some time on the document structure. Dynamic mapping is enabled by default, so you can throw pretty much any JSON into a document; for a production environment, though, it is best to disable this and define one or more explicit mappings so documents can be queried in a consistent way.
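As a sketch of that last point, this is roughly how an explicit mapping with dynamic mapping disabled looks in NEST 7.x (the index and field names are again illustrative; client is the ElasticClient from the previous sketch):

    // Create the index with dynamic mapping disabled and explicit field mappings.
    var createResponse = client.Indices.Create("stock-quotes", c => c
        .Map<StockQuote>(m => m
            .Dynamic(false)  // ignore unmapped JSON fields ("strict" would reject them)
            .Properties(p => p
                .Keyword(k => k.Name(n => n.Symbol))  // exact-match lookups
                .Number(n => n.Name(f => f.Price).Type(NumberType.Double))
                .Date(d => d.Name(f => f.RetrievedAt)))));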

3) Performance was also a key factor for us, so in our research we looked at Lucene-based engines such as Solr and Elasticsearch, as well as NoSQL databases. In most scenarios Elasticsearch outperforms SQL Server by 10 to 1 or higher. Solr vs. Elasticsearch performance depends on the scenario; benchmarks and comparisons are easy to find if you google them. One exception is when many documents must be returned by a single query: ES (or rather Lucene) is not built for that use case, it is optimized for quickly returning a small page of best-matching results (think results-per-page on a search page). If you need 1000 documents per page/result set, a NoSQL database might be the better option.
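That trade-off shows up directly in how you page results; keeping the page size small keeps ES in its sweet spot (sketch, same illustrative index as above):

    // Paged search: ES/Lucene is optimized for small pages of best-matching hits,
    // not for returning thousands of documents in one request.
    var page = client.Search<StockQuote>(s => s
        .From(0)    // offset of the first hit
        .Size(25)   // hits per page; keep this small
        .Query(q => q.MatchAll()));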

Elasticsearch is quick to get up and running: install it on a local development box and try it, and you will get a feel for whether it suits you.


From my experience, MongoDB is the worst choice for reporting, especially for aggregation. It does not have good aggregation functionality, it has data type quirks (such as decimals stored as strings, which you cannot use in its built-in aggregation framework), and you may end up having to maintain map-reduce functions in JavaScript for most scenarios.

If the true nature of your application is just reporting, and the reports do not need to be updated in real time, I would avoid making RPC calls to the external APIs on request. Consider instead copying as much of the data as possible, storing it in whatever schema is most convenient for you, and synchronizing it on scheduled, predictable intervals.
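A minimal sketch of that kind of scheduled sync, sticking with C# as elsewhere in this thread (the URL, interval, and SaveLocally helper are hypothetical):

    // Scheduled sync job: copy data from an external API into a local store
    // at a predictable interval instead of calling the API per request.
    using System;
    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;

    public class NewsFeedSyncJob
    {
        private static readonly HttpClient Http = new HttpClient();

        public static async Task RunAsync(CancellationToken token)
        {
            while (!token.IsCancellationRequested)
            {
                try
                {
                    // Pull the latest snapshot from the third-party API.
                    var json = await Http.GetStringAsync("https://api.example.com/news");

                    // Persist it in a schema convenient for reporting
                    // (e.g. a SQL Server staging table or an Elasticsearch index).
                    SaveLocally(json); // hypothetical helper
                }
                catch (HttpRequestException)
                {
                    // The feed may be down or slow; skip this cycle and retry later.
                }

                await Task.Delay(TimeSpan.FromMinutes(15), token);
            }
        }

        private static void SaveLocally(string json) { /* write to local store */ }
    }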

I would not rush to assume that this data will always be available, or always in the format you expect. You also get optimization benefits from running your own copy, indexed the way you want, rather than trying to work out which of the RPC calls is your bottleneck.

Regarding your questions:

1) If you don't mind using Python, I would choose Django on top of a PostgreSQL database. Django is a full-featured, robust ORM + web framework that is great for this kind of work. If not, just stick with a relational SQL database. I have heard wonders about Cassandra, but have not tried it yet.

2 + 3) As I mentioned earlier, replicate as much of the data as possible for your own benefit. Once everything is "in house", you can aggregate it and tune it easily. Putting a distributed cache (such as Redis) in front of heavy client queries is also a good idea, rather than regenerating these reports on every request.
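For instance, a cache-aside sketch with StackExchange.Redis, again in C# for consistency with the rest of the thread (the key scheme, TTL, and GenerateReport call are hypothetical):

    // Cache a generated report in Redis so heavy report queries are served
    // from cache instead of being regenerated on every request.
    using System;
    using StackExchange.Redis;

    public static class ReportCache
    {
        private static readonly ConnectionMultiplexer Redis =
            ConnectionMultiplexer.Connect("localhost:6379");

        public static string GetMonthlyReport(string month)
        {
            IDatabase db = Redis.GetDatabase();
            string key = $"report:monthly:{month}";

            // Serve from cache when possible.
            RedisValue cached = db.StringGet(key);
            if (cached.HasValue)
                return cached;

            // Otherwise generate once and cache with a TTL.
            string report = GenerateReport(month); // hypothetical, expensive call
            db.StringSet(key, report, TimeSpan.FromMinutes(30));
            return report;
        }

        private static string GenerateReport(string month) => "...";
    }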


I use JasperReports and JasperReports Server to integrate reporting into our web application. Jasper accepts many different types of data sources, including JSON and SQL Server. The community edition is free and produces high-quality HTML and PDF output. The paid version with the server makes it easy to integrate into your web application. The core is Java Spring (partially open source) running on Tomcat/JBoss, and you can interact with it through REST web services or the visualize.js library from your web front end. It uses high-performance charting that can produce great results, and it has options for ad hoc reports and dashboards built from many reports.

See the demo here: http://www.jaspersoft.com/
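As an illustration of the REST integration, here is a minimal sketch that pulls a rendered PDF from JasperReports Server's rest_v2 reports service (the server URL, credentials, and report path are assumptions based on a default sample install):

    // Fetch a report rendered as PDF from JasperReports Server via REST v2.
    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Text;
    using System.Threading.Tasks;

    public static class JasperClient
    {
        public static async Task<byte[]> FetchReportPdfAsync()
        {
            using var http = new HttpClient();

            // Basic authentication against the server (assumed default credentials).
            var credentials = Convert.ToBase64String(
                Encoding.ASCII.GetBytes("jasperadmin:jasperadmin"));
            http.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Basic", credentials);

            // Render an existing report unit to PDF via the reports service.
            var url = "http://localhost:8080/jasperserver/rest_v2/reports"
                    + "/reports/samples/AllAccounts.pdf";
            return await http.GetByteArrayAsync(url);
        }
    }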

The resulting stack would be: your database back ends and data sources, Tomcat with Java Spring, and an HTML/JavaScript web front end.

The tool is used by many large enterprises, including Amazon, so scalability should not be a problem.

If the data format changes, you will need to change the report. Report definitions are an XML format, editable through a WYSIWYG graphical designer.

