How does a site like kayak.com aggregate content?

Hi, I have been toying with an idea for a new project and was wondering if anyone knows how a service like Kayak.com is able to collect data from so many sources so quickly and accurately. In particular, do you think Kayak.com talks to APIs, or do they crawl/scrape the websites of airlines and hotels to fulfill user requests? I know there is no single right answer for this kind of thing, but I am curious what others think would be a good way to go about it. If it helps, pretend you are going to create kayak.com tomorrow ... where does your data come from?

+73
api architecture aggregate screen-scraping
Jan 05 '11 at 17:27
7 answers

I work in the travel industry as a software architect/project lead on exactly the kind of project you describe. In our region we work directly with suppliers, but for outbound travel we connect to several aggregators.

To answer your question ... some data you have, some you get through various means, and some you have to beat and twist until it confesses.

What is your angle?

The questions you need to ask are ... Do you want to sell ads like Kayak, or take a cut like Expedia? Do you search for travel services or sell them? Do you focus on a niche (for example, flights only) or on everything (accommodation, airlines, car rental, additional services such as transfers/excursions/conferences, etc.)? Are you targeting a region (the USA or part of it) or the whole world? How deep do you go: do you just display several sites on one screen, or do you combine different services and package them dynamically?

Data retrieval

If you are going with the Kayak business model, you technically do not need a site's permission ... but many sites have affiliate programs with IFrames or other simple ways to direct customers to their site. On the plus side, you don't have to deal with payments, complaints, or the travelers themselves. On the minus side ... if you want to compare prices yourself and present the cheapest option to the user, you will have to integrate at a deeper level, which means APIs and web scraping.
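A minimal sketch of that deeper integration, assuming each supplier is wrapped in an adapter function that returns offers in one common shape. The `Offer` type, the three `search_provider_*` adapters, the URLs, and the prices are all hypothetical stand-ins; a real adapter would call an API or a scraper.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    provider: str
    price_eur: float
    deep_link: str  # where the user is sent to book


# Hypothetical provider adapters returning canned data for illustration.
def search_provider_a(origin, destination, date):
    return [Offer("provider_a", 129.0, "https://example.com/a/1")]


def search_provider_b(origin, destination, date):
    return [Offer("provider_b", 99.5, "https://example.com/b/7"),
            Offer("provider_b", 140.0, "https://example.com/b/8")]


def search_provider_down(origin, destination, date):
    raise TimeoutError("provider did not answer")


def search_all(providers, origin, destination, date):
    """Query every provider in parallel and return offers sorted by price."""
    offers = []
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = [pool.submit(p, origin, destination, date) for p in providers]
        for future in futures:
            try:
                offers.extend(future.result(timeout=5))
            except Exception:
                # One slow or broken source should not sink the whole search.
                continue
    return sorted(offers, key=lambda o: o.price_eur)


results = search_all(
    [search_provider_a, search_provider_b, search_provider_down],
    "ZAG", "LHR", "2011-02-01",
)
cheapest = results[0]
```

The point of the common `Offer` shape is that the comparison logic never needs to know whether an offer came from an API, a scraper, or a nightly file.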

As for web scraping ... avoid it. It sucks. Really. Just don't do it. Trust me on this one. That said, there are some things, like low-cost carriers, that you cannot get without web scraping. Low-cost airlines live on additional services. If users do not see the airline's own site, the airline cannot sell them the extras, and it earns nothing. Therefore, they do not run affiliate programs, they do not offer an API, and they change their site layout almost constantly. However, there are companies that make a living by scraping low-cost carriers' sites and wrapping them in nice APIs. If you can afford it, you can give your users a price comparison of low-cost and major carriers.
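To illustrate why scraping is so brittle, here is a small sketch using Python's standard `html.parser`. The markup and the class names `price` and `amount` are invented for the example; a real carrier page is far larger and changes without notice.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a fare listing.
PAGE = '<div class="fare"><span class="price">49.99</span></div>'


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(float(data))


def extract_prices(html):
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


prices = extract_prices(PAGE)
# A trivial redesign (renaming one CSS class) silently yields no data at all.
after_redesign = extract_prices(PAGE.replace('class="price"', 'class="amount"'))
```

Note that the failure after the redesign is silent: the scraper returns an empty list rather than raising, so you also need monitoring to notice that a source has gone dark.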

On the other hand, there are "normal" carriers that do offer an API. Getting into airlines is not that big a problem, since they are all integrated under IATA; basically, you buy from IATA, and IATA distributes the money to the carriers. However, you probably do not want to connect directly to the carriers' network. These days they have web services and SOAP, but believe me when I say that some of those SOAP protocols are just insanely thin wrappers around a text prompt through which you talk to a mainframe using a protocol from the 80s (think of a Unix prompt where you are charged per command, and a single search takes about 20 commands). That's why you probably want to connect to someone a bit further down the supply chain, with a better API.

Airlines are therefore at both extremes of the Gaussian curve: on one side there are the individual suppliers, and on the other, centralized systems where you implement one API and can fly anywhere in the world. Accommodation and the rest of the travel products sit in between. There are a few major players that aggregate hotels, and a ton of small suppliers with a multitude of aggregators that each cover only part of the spectrum. For example, you can rent a lighthouse, and it's not even that expensive, but you can't compare the prices of different lighthouses in one place.

If you go with the Kayak business model, you will likely end up scraping websites. If you want to integrate different suppliers, you will mostly work with APIs, some of which are pretty good and most of which are acceptable. I have not worked with RSS, but there is not much difference between RSS and web scraping. There is also a fourth option not mentioned in Jeff's answer: one where you receive your data nightly, e.g. as CSV files over FTP.
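The nightly-file option can be sketched roughly like this. The column names and the sample rows are assumptions made up for the example; in practice the text would come from a file downloaded over FTP, and every supplier has its own layout.

```python
import csv
import io

# Hypothetical nightly dump; the third row has no price, as real feeds often do.
NIGHTLY_FEED = """hotel_id,city,room_type,price_eur,valid_from
H1,Zagreb,double,30.00,2011-02-01
H1,Zagreb,single,20.00,2011-02-01
H2,Split,double,,2011-02-01
"""


def load_rates(text):
    """Parse the feed, setting aside rows that have no usable price."""
    rates, rejected = {}, []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            price = float(row["price_eur"])
        except (TypeError, ValueError):
            rejected.append(row)
            continue
        rates[(row["hotel_id"], row["room_type"])] = price
    return rates, rejected


rates, rejected = load_rates(NIGHTLY_FEED)
```

Keeping the rejected rows instead of dropping them makes it easier to report bad data back to the supplier.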

Life sucks (mini rant)

And then there is the complexity. The more value you want to add, the more complexity you will have to handle. Can you find accommodation that allows pets? A hostel less than 5 km from the city center? Are you combining flights, and can you guarantee that the traveler will have enough time to get from one airport to the other ... Can you sell the transfer in advance? A famous cellist will not part with his precious 18th-century cello; can you sell him an extra seat for the cello (yes, I'm not making this up)?
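As one small example of that complexity, a connection check might look like the sketch below. The minimum connection times are invented for illustration (real values depend on the airports and terminals), and the datetimes are assumed to be in the same time zone.

```python
from datetime import datetime, timedelta

# Hypothetical minimum connection times; real values vary per airport pair.
MIN_CONNECTION = {
    "same_airport": timedelta(minutes=60),
    "airport_change": timedelta(hours=3),
}


def connection_is_feasible(arrival, departure, arrival_airport, departure_airport):
    """Return True if the layover meets the assumed minimum connection time."""
    kind = "same_airport" if arrival_airport == departure_airport else "airport_change"
    return departure - arrival >= MIN_CONNECTION[kind]


arrives = datetime(2011, 2, 1, 10, 0)
departs = datetime(2011, 2, 1, 11, 30)
ok_same = connection_is_feasible(arrives, departs, "LHR", "LHR")
ok_change = connection_is_feasible(arrives, departs, "LHR", "LGW")
```

The same 90-minute layover passes when both flights use one airport and fails when the traveler has to cross town, which is exactly the case where you would also want to sell a transfer.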

Want to compare prices? Sure, the room is 30 euros per night. But you can take one double for 30 and one single for 20, or you can take one double with an extra bed and get a 70% discount for the third person. But only if that person is a child under the age of 12; our extra beds are not suitable for adults. And you will not get the price of the extra bed in the search results, only when calculating the final price.
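A worked version of that comparison, as a sketch only. The 30 and 20 euro rates and the 70% discount come from the paragraph above; treating the discount as applying to half the double rate (a per-person price) is an assumption, since suppliers define this differently.

```python
from decimal import Decimal

DOUBLE_PER_NIGHT = Decimal("30.00")
SINGLE_PER_NIGHT = Decimal("20.00")
EXTRA_BED_DISCOUNT = Decimal("0.70")  # assumed to apply to the per-person rate
EXTRA_BED_MAX_AGE = 11                # "a child under the age of 12"


def double_plus_single(nights):
    return (DOUBLE_PER_NIGHT + SINGLE_PER_NIGHT) * nights


def double_with_extra_bed(nights, third_guest_age):
    """Return the total price, or None if the extra bed cannot be used."""
    if third_guest_age > EXTRA_BED_MAX_AGE:
        return None  # extra beds are not suitable for adults
    per_person = DOUBLE_PER_NIGHT / 2  # assumption: per-person rate is half the double
    extra_bed = per_person * (1 - EXTRA_BED_DISCOUNT)
    return (DOUBLE_PER_NIGHT + extra_bed) * nights


def cheapest_for_three(nights, third_guest_age):
    options = {"double + single": double_plus_single(nights)}
    with_bed = double_with_extra_bed(nights, third_guest_age)
    if with_bed is not None:
        options["double + extra bed"] = with_bed
    name = min(options, key=options.get)
    return name, options[name]


child_option = cheapest_for_three(nights=3, third_guest_age=8)
adult_option = cheapest_for_three(nights=3, third_guest_age=30)
```

Even in this toy form, the cheapest answer depends on the age of the third guest, which is information a plain "price per night" search result does not carry.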

And don't even get me started on dynamic packaging. Want to sell accommodation + car rental? No problem; integrate with two different providers, and off you go ... manually matching the list of in-city locations (from the car rental provider) with the hotels (from the accommodation provider, which gives you only the city for each hotel). Provided, of course, that you have already matched the city lists of the two providers, as there is no international standard for city codes.
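The city-matching chore might be sketched like this, assuming two providers that each use their own codes. All codes, names, and the alias table are invented for the example; real lists need far more manual overrides than normalization can handle.

```python
import unicodedata

# Hypothetical extracts from two providers with different code schemes.
HOTEL_CITIES = {"ZAG": "Zagreb", "MUC": "München", "NYC": "New York", "DBV": "Dubrovnik"}
CAR_CITIES = {"1001": "ZAGREB", "1002": "Munchen", "1003": "New-York", "1004": "Split"}

# Manual overrides for names that normalization alone cannot reconcile.
MANUAL_ALIASES = {"munich": "munchen"}


def normalize(name):
    """Lowercase, strip accents and punctuation so spelling variants collide."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    key = "".join(c for c in ascii_only.lower() if c.isalnum())
    return MANUAL_ALIASES.get(key, key)


def match_cities(hotel_cities, car_cities):
    """Map hotel city codes to car rental city codes by normalized name."""
    by_name = {normalize(name): code for code, name in car_cities.items()}
    matched, unmatched = {}, []
    for code, name in hotel_cities.items():
        car_code = by_name.get(normalize(name))
        if car_code is None:
            unmatched.append(code)
        else:
            matched[code] = car_code
    return matched, unmatched


matched, unmatched = match_cities(HOTEL_CITIES, CAR_CITIES)
```

Whatever lands in `unmatched` is the part someone still has to review by hand.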

Unlike many other industries, which have a lot of products, the travel industry has a lot of very complex products. Amazon has it easy; selling books and selling potatoes is the same thing. You can even ship them in one box. They combine easily and are not assembled from many parts. :)

P.S. Link to an interesting recent thread on Hacker News with some insider information about flights. P.P.S. I recently came across a long, albeit rather old, blog post on the IATA NDC protocol with an overview of how the travel industry is connected, and a history lesson on how it came to be.

+126
Jan 6

They use a software package such as the one from ITA Software, which is one of the companies that Google is in the process of acquiring.

+9
Jan 6 '11 at 3:23

I know of only three ways to get data from websites.

RSS feeds. At my company we often use RSS feeds to integrate existing site data with our applications. It's fast, and most sites already have an RSS feed available. The problem is that not all sites implement the RSS standard correctly, so if you are pulling data from many RSS feeds on many sites, make sure you write your code so that you can easily add exceptions and filters.

APIs. These are good if they are well designed and expose all the information you need, but that is not always the case. Also, if the sites do not share a standard API format, you will have to support several different APIs.

Web scraping. This method is the most unreliable, as well as the most expensive to maintain. But if you are left with nothing else, it can be done.
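The advice about writing feed code so that exceptions and filters are easy to add could look something like the sketch below, using the standard `xml.etree.ElementTree`. The feed content and the `has_title` filter are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; the second item omits <link> and the third omits <title>.
FEED = """<rss version="2.0"><channel>
<item><title>Zagreb - London from 99 EUR</title><link>https://example.com/1</link></item>
<item><title>  Split - Rome from 79 EUR </title></item>
<item><link>https://example.com/3</link></item>
</channel></rss>"""


def parse_items(xml_text, filters=()):
    """Return cleaned items; each filter may veto an item by returning False."""
    items = []
    for node in ET.fromstring(xml_text).iter("item"):
        item = {
            "title": (node.findtext("title") or "").strip(),
            "link": (node.findtext("link") or "").strip() or None,
        }
        if all(keep(item) for keep in filters):
            items.append(item)
    return items


def has_title(item):
    return bool(item["title"])


items = parse_items(FEED, filters=[has_title])
```

Because filters are plain functions passed in as a list, a per-site quirk becomes one more small function rather than another branch inside the parser.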

+6
Jan 05

This article states that Kayak was asked to stop scraping a specific airline's pages. That makes me think they probably scrape sites with which they have no relationship (and no data feed that would come with such a relationship).

+3
Apr 13

Travelport offers a product called the "Universal API", which connects to flights, hotels, and car rental companies, and also deals with packaging and all the difficulties associated with taxes and exchange rates:

https://developer.travelport.com/app/developer-network/resource-centre-uapi

I have only just started using it, and so far it seems perfect. Requests are a bit slow, but then so is every request on every OTA (online travel agent) website.

+2
Nov 03 '13 at 9:02

I recently found two good APIs from flight comparison sites.

There is one from Wego, and one from Skyscanner. Both seem to have a wide range of data from a number of airlines, and good documentation.

Wego pays each time a user clicks through from your application to a booking site, and Skyscanner pays an affiliate share of 50% of the revenue (I assume this means the commission they earn from airlines).

+1
May 27 '14 at 9:23

This is an old post, but I thought I would add to it anyway. I am a data architect at a company that feeds these travel sites with content. The company enters into contracts with many hotel brands, individual hotels, and other content providers. We collect this information and then distribute it to different channels, which in turn re-aggregate it in their own systems. The large GDS systems are content providers as well. Aggregation is done by several methods: matching algorithms (in-house) and keys. As an aggregation service, we need to communicate at the client level.

Hope this helps! Cheers!

-2
Aug 16 '16 at 13:10


