Early High-Level AI Observations for Travel Startups

Larger travel companies have dominated, and will continue to dominate, the initial wave of large language model AI benefits. They have very large datasets full of transactions, touchpoints, preferences, search requests, and usage data to train their AI models.

Where does that leave travel startups? I was listening to a podcast with the former head of Facebook AI and here are my takeaways:

Leveraging Proprietary Datasets Is Critical (in general and for VC funding).

I posit the true winners will be travel companies training ChatGPT (or other LLMs) on their own proprietary data. There’s rapidly diminishing value in leveraging universally available ChatGPT API requests (e.g., give me the top 10 restaurants in Milan). Effective prompts with effective results based on proprietary datasets are the minimum a travel startup requires (for me at least) to qualify for VC funding. Travel companies, including travel startups, will need proprietary, open source and commercially available datasets to train on. Comically, putting my former CTO and travel technology developer hat on, I’ve had a few pitches so far that made only cursory use of the universally available ChatGPT API while trying to qualify for VC funding.

Startups can train on high quality smaller datasets.

This small dataset requirement pleasantly surprised me and challenged my assumption that 500k+ (and potentially many millions of) records were required to fuel effective AI training. My takeaway from the interview with the former head of Facebook AI was that a smaller yet pristine set of training data trumps a large quantity of mediocre records, which is a good early indicator for travel startups. The more quality data the better, that’s a given, and a second set of ‘testing data’ is required to gauge performance and accuracy. As a high-level ballpark, the interviewee mentioned a figure of ~70,000 (high quality) records to effectively start training your AI.
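To make the training versus testing split concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the `split_records` helper, the `record_id` field and the 20% hold-out fraction are hypothetical and were not mentioned in the interview. The one number taken from above is the ~70,000 record ballpark.

```python
import random


def split_records(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and hold out a fraction as testing data.

    test_fraction and seed are illustrative defaults (assumptions), not
    recommendations from the interview.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    test_size = round(len(shuffled) * test_fraction)
    # The first test_size records become testing data, the rest training data.
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical stand-in for ~70,000 high quality records.
records = [{"record_id": i} for i in range(70_000)]
train, test = split_records(records)
print(len(train), len(test))
```

The point of the hold-out set is that the model never sees it during training, so performance measured on it is a fairer gauge of accuracy.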

Unique, reliable, comprehensive and ‘clean’ direct travel data is required.

I’ve been advocating ‘consumer direct’ relationships (and the resulting consumer direct data) for over a decade. I realize this direct relationship isn’t always possible, and lacking one certainly does not disqualify you from compiling a dataset. Data has always been a need, whether for business intelligence, recommendation engines or, now, more sophisticated AI training and testing. I realize how difficult this is from a commercial and a technical point of view; I experienced firsthand the challenge of integrating online and offline purchases when I had my own OTA (in a former life). On a side note, although not foolproof, I remember deploying smart ways to link offline and online travelers: offering unique inbound telephone numbers, leveraging caller IDs, deploying cookies, ‘ask for quote number ____ when calling’ and other ‘scrappy startup hacks’ to capture as close to a 360 degree view of the traveler as possible. It’s considerably easier for travel startups these days to start capturing relevant data, even if you can’t embark on AI training yet; you will someday!

Types of Data

The type of data needed depends on the use case, the complexity of the models to be trained, the training method used, and the diversity of input data required. Raw data is gathered from multiple sources, including IoT devices, social media platforms, websites, and customer feedback (via g2.com). For now, just collect the data (in a compliant and transparent manner, of course). You can even apply AI to unstructured data, so fear not if everything isn’t tagged and labeled. Bringing this to travel, at the very least ensure there’s a universal view of the customer, ensure all traveler touchpoints are captured, ensure data is structured where possible, ensure data integrity and consistency across formats (online, app and telephone data streams), and ensure direct and 3rd party data is captured and centralized. For example, does the data you have on a given traveler accurately reflect that traveler? Are you saving business travel itineraries as personal vacation itineraries, thereby corrupting future training data? Are you linking traveler search data to traveler bookings to the maximum extent possible? Are you making the most of indirect sources of data, like the date of birth from a traveler’s booking confirmation (where applicable)? These are just some of the questions you’ll need to ask to start saving future AI training data. If you don’t know whether to save something or not, just save it.
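As a rough illustration of the ‘link search data to bookings’ question above, here is a small Python sketch. Everything in it is an assumption for illustration only: the field names (`traveler_id`, `destination`, `searched_at`, `booked_at`, `trip_type`), the `link_searches_to_bookings` helper and the 30 day matching window are hypothetical, and a real system would need a far more careful identity match.

```python
from datetime import datetime, timedelta


def link_searches_to_bookings(searches, bookings, window=timedelta(days=30)):
    """Attach to each booking the IDs of the same traveler's earlier searches
    for the same destination made within `window` before the booking."""
    linked = []
    for booking in bookings:
        matches = [
            s for s in searches
            if s["traveler_id"] == booking["traveler_id"]
            and s["destination"] == booking["destination"]
            and timedelta(0) <= booking["booked_at"] - s["searched_at"] <= window
        ]
        # Keep every original booking field (including the business/leisure
        # tag) and add the linked search IDs alongside it.
        linked.append({**booking, "search_ids": [s["search_id"] for s in matches]})
    return linked


# Hypothetical sample data.
searches = [
    {"search_id": "s1", "traveler_id": "t1", "destination": "MIL",
     "searched_at": datetime(2023, 5, 1)},
    {"search_id": "s2", "traveler_id": "t1", "destination": "MIL",
     "searched_at": datetime(2023, 5, 3)},
    {"search_id": "s3", "traveler_id": "t2", "destination": "MIL",
     "searched_at": datetime(2023, 5, 2)},
]
bookings = [
    {"booking_id": "b1", "traveler_id": "t1", "destination": "MIL",
     "trip_type": "business", "booked_at": datetime(2023, 5, 4)},
]

linked = link_searches_to_bookings(searches, bookings)
print(linked[0]["search_ids"])
```

Storing an explicit `trip_type` with each booking is one simple way to avoid mixing business and vacation itineraries in future training data.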

I’ll Keep Everyone Posted

I am diving deeper into ways travel startups (and travel companies more broadly) can effectively leverage large language models (ChatGPT, Bard, others) to gain a competitive advantage, and identifying low-hanging, highly impactful potential benefits for travelers, travel companies and travel startups. Of course, I’m also diving into the countless entirely new VC-fundable business models this technology will spawn. It’s overwhelming in the most exciting of ways.

It’s early days, but I have absolutely no doubt this technology has the potential to revolutionize every aspect of the travel industry, to permeate every aspect of the traveler journey and to benefit every travel product (air, hotel, car, cruise, travel tech, meta search…), to say the very least. I sound like a cliché, but in this case I stand behind such a hyperbolic-sounding statement.

I can’t wait to start investing in travel AI startups and to reap the benefits of travel AI as a traveler.


Don’t forget to sign up for future predictions and observations.