An End-to-End ML Destination Similarity Project using Flyte

October 1, 2022
Demetrios Brinkmann
💡 The development of machine learning projects is complex and full of challenges, from the scope and problem definitions all the way to model governance and the user interface. In this blog post, we detail how we built an end-to-end recommendation system from scratch consuming open source data, orchestrating the entire pipeline of a machine learning model, and delivering it in a web interface.

Who doesn’t like to travel? It’s the best way to discover new places, cultures, and people. Here at Hurb — one of Brazil’s largest online travel agencies — our mission is to optimize travel through technology. Furthermore, given our country’s continental dimension, we see tremendous undiscovered tourist potential.

One of the pain points for travelers is defining their next destination. With plenty of places to go, there’s nothing worse than choosing the wrong destination, right? In this sense, we have combined our passion for travel and data to develop a product that helps our clients explore upcoming travel destinations. This prototype was conceived and built on a hackathon promoted by the MLOps Community in collaboration with Flyte, a new machine learning pipeline orchestration platform, of which we were the happy winners!

In this article, we’re going to talk about our project, all the way from the data extraction to the machine learning model (a Natural Language Processing transformer), resulting in a system that can compute the similarities between places and recommend similar destinations to a specified one.

Let’s go deeper

When we talk about recommendation systems, the most common solution is the collaborative-filtering approach, but that was not possible for us for three main reasons:

  1. As a hackathon project, we couldn’t use internal data from the company.
  2. Traveling is very seasonal (people usually travel for leisure once or twice a year), making it almost impossible to extract patterns without an exorbitant amount of data.
  3. We wanted to be able to recommend atypical places, even those Hurb does not operate yet.

So, given the above constraints, how can we develop a robust recommendation system? Our solution was to retrieve public data about cities, and use this data to build a vector representation for each one; then, we can recommend travel destinations for a customer based on a given city they like. This solution solves all of our limitations, but how does it work?

Dataset extraction

We wanted to collect data on as many cities as possible to recommend unusual and unexplored places. However, since we had time constraints in the context of being a hackathon, we decided to limit our scope to Brazilian cities.

To work with somewhat complete and structured data, we sought to work with known open-source databases. Amongst the possibilities, the Wikimedia ecosystem proved to be the best option. The crowdsourced nature of the databases results in a rich and (mostly) reliable source of information about the cities. Furthermore, all Wikimedia components have APIs that help to retrieve the data without the need for a web scraper.

We extracted the public data on Brazilian cities from Wikidata using the Wikidata Query Service, like so:

import requests

request = requests.get(
    "https://query.wikidata.org/sparql",
    params={
        "query": WIKIDATA_QUERY,  # Query that retrieves city names
        "format": "json",         # and Wikipedia/Wikivoyage URLs
    },
    allow_redirects=True,
    stream=True,
)

This request is then parsed and organized so that it can be used as our base data. With it, we were able to retrieve the articles for each city from Wikipedia and Wikivoyage using their REST APIs. We realized that Portuguese Wikipedia had more information about each city than the English version, so we decided to use it to populate our dataset. From it, we extracted the summary and the “History”, “Geography”, and “Climate” sections from each page. On the other hand, the English versions of Wikivoyage pages were more complete than the Portuguese ones, so we used those instead; from them, we extracted the summary, the “See” section, and the “Do” section.
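To make the parsing concrete, here is a minimal sketch of that step. The binding names (cityLabel, wikipedia, wikivoyage) are assumptions; the real names depend on the variables selected in WIKIDATA_QUERY.

import pandas as pd

# SPARQL JSON results arrive as {"results": {"bindings": [...]}},
# where each binding maps a query variable to {"type": ..., "value": ...}.
bindings = request.json()["results"]["bindings"]
cities = pd.DataFrame(
    [
        {
            "city": row["cityLabel"]["value"],
            "wikipedia_url": row.get("wikipedia", {}).get("value"),
            "wikivoyage_url": row.get("wikivoyage", {}).get("value"),
        }
        for row in bindings
    ]
)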

Since both APIs were exactly the same, we developed a class called WikiExtractor that receives the desired Wikimedia project and its language:

extractor = WikiExtractor(wiki="wikivoyage", lang="en")

Then, we defined a method responsible for retrieving and formatting the data:

page_content = extractor.extract_content(
    "Rio de Janeiro", summary=True, sections=["See", "Do"]
)

This method sends a request to the /api/rest_v1/page/mobile-sections/<PAGE> endpoint, where PAGE is the name of the page (in the above example, “Rio de Janeiro”). The response is a JSON with all of the text data from the page, which is parsed by the method.
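For reference, here is a minimal sketch of what such a class could look like; only the endpoint comes from the text above, and the handling of the JSON response is our assumption about the mobile-sections shape:

import requests

class WikiExtractor:
    def __init__(self, wiki: str, lang: str):
        self.base_url = f"https://{lang}.{wiki}.org/api/rest_v1"

    def extract_content(self, page: str, summary: bool = True, sections: list = ()) -> str:
        response = requests.get(f"{self.base_url}/page/mobile-sections/{page}")
        data = response.json()
        texts = []
        if summary:
            # The "lead" block holds the summary sections of the page.
            texts += [s.get("text", "") for s in data["lead"]["sections"]]
        for section in data["remaining"]["sections"]:
            # "line" is the section heading; "text" is its HTML body.
            if section.get("line") in sections:
                texts.append(section.get("text", ""))
        return "\n".join(texts)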

Preprocessing

As we decided to suggest cities based on the similarity amongst them, computed from their textual data, a preprocessing step was necessary to remove unnecessary “noise”, standardize the texts, and prepare them so that we capture only the most relevant information. This process included the following steps (a condensed sketch of them follows the list):

  1. Translate the English texts from Wikivoyage to Portuguese, which we accomplished using the Google Translate API for Python;
  2. Lowercase the text;
  3. Remove unnecessary characters, such as punctuation, numbers, single characters, excessive whitespace, API messages, and HTML characters;
  4. Transliterate the text into ASCII;
  5. Remove Portuguese stopwords using the Natural Language Toolkit library.
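Steps 2 through 5 can be condensed into a function like this minimal sketch (the translation step is omitted, since any Google Translate client for Python can perform it):

import re
import unicodedata

from nltk.corpus import stopwords  # requires nltk.download("stopwords")

def strip_accents(text: str) -> str:
    # Transliterate to ASCII, e.g. "São Paulo" -> "Sao Paulo".
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()

# Transliterate the stopword list too, so it matches the ASCII text.
STOPWORDS = {strip_accents(word) for word in stopwords.words("portuguese")}

def preprocess(text: str) -> str:
    text = strip_accents(text.lower())
    # Keep only letters, collapsing everything else into single spaces.
    text = re.sub(r"[^a-z]+", " ", text).strip()
    # Drop stopwords and leftover single characters.
    return " ".join(w for w in text.split() if len(w) > 1 and w not in STOPWORDS)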

Furthermore, after analyzing and experimenting with the dataset, we ended up using only the cities that had data on their Wikivoyage page, which left us with about 440 Brazilian cities. This decision was made so that we could work with a “complete” dataset, and it is something we can improve in the future.

Modeling

As we only have texts about the cities and we need to measure the similarity amongst them, we chose to learn vector representations of the texts.

We used a pre-trained state-of-the-art language model based on transformers (BERTimbau, which was trained in Portuguese) to generate vector representations for the features of each city. We used the output of the last layer of the language model as the embedding of each feature; the last layer benefits from all the parameters learned during language model training, yielding a good extraction of the sentence’s semantics. Although the original BERT article suggests combining the last 4 hidden layers, for simplicity we used only the last one for feature extraction. The final vector representation for a given city is then defined as the mean of the city’s feature embeddings.
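As a sketch of this step, assuming the Hugging Face release of BERTimbau (neuralmind/bert-base-portuguese-cased) and mean pooling over tokens, both of which are our assumptions rather than details confirmed by the project:

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "neuralmind/bert-base-portuguese-cased"  # BERTimbau base
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        output = model(**inputs)
    # Mean-pool the last hidden layer into one vector per text.
    return output.last_hidden_state.mean(dim=1).squeeze(0)

# The city vector is the mean of its feature embeddings;
# feature_texts is a hypothetical list of preprocessed sections.
feature_texts = ["resumo da cidade...", "historia...", "clima..."]
city_vector = torch.stack([embed(t) for t in feature_texts]).mean(dim=0)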

Finally, the similarity between the vector representations of the cities can be computed as the Euclidean distance between an input query vector (the benchmark city) and all the other city vectors available in our dataset.
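In code, the retrieval step could look like this sketch, where vectors (one row per city) and names are hypothetical stand-ins for our embedding matrix and city list:

import numpy as np

def recommend(query_vector: np.ndarray, vectors: np.ndarray, names: list, k: int = 5) -> list:
    # Euclidean distance from the benchmark city to every city vector.
    distances = np.linalg.norm(vectors - query_vector, axis=1)
    # Sort ascending: the smallest distance is the most similar city.
    ranked = np.argsort(distances)
    return [names[i] for i in ranked[:k]]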

Deployment

The user interface was built using Streamlit; you can access it here. Below you can see an example of how the results are presented: you specify a city and a state, and how many recommendations you want; then, the system returns a list of recommendations, ordered by most similar.

The deployment of the project consists of two parts that can be separated into different stages. The first and most important one is the deployment of workflows in Flyte. The second part consists of publishing the app, which uses Streamlit to expose the search functionality to the user.

Flyte is an open-source orchestration platform that helps with the machine learning lifecycle. With it you define reproducible and strongly typed units of work — called tasks — and create flows by coupling tasks together, resulting in workflows. You can then schedule your workflows to run periodically. Flyte is a Kubernetes-native tool, and so each task runs in a Docker container, which guarantees isolation and reproducibility.

Using an orchestration tool is very important for machine learning projects, since data can change over time and model performance will probably decay. Scheduled workflows can prevent, or at least monitor, this decay automatically, making it easier to maintain ML models in production.

You can define Flyte tasks and workflows using pure Python, and that’s exactly what we did. Our system consisted of two workflows. The first one, called generate_dataset, is responsible for extracting data from the Wikimedia pages. The other one, called build_knowledge_base, is responsible for the preprocessing and the generation of the embeddings (the modeling step).
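As an illustration, a workflow like generate_dataset could be wired together as in the sketch below; the task bodies are placeholders, not the project’s actual extraction code:

import pandas as pd
from flytekit import task, workflow

@task(cache=True, cache_version="1.0")
def extract_wikipedia() -> pd.DataFrame:
    # Placeholder: the real task calls the Wikipedia REST API.
    return pd.DataFrame({"city": ["Rio de Janeiro"], "wikipedia_text": ["..."]})

@task(cache=True, cache_version="1.0")
def extract_wikivoyage() -> pd.DataFrame:
    # Placeholder: the real task calls the Wikivoyage REST API.
    return pd.DataFrame({"city": ["Rio de Janeiro"], "wikivoyage_text": ["..."]})

@task
def merge_sources(wp: pd.DataFrame, wv: pd.DataFrame) -> pd.DataFrame:
    return wp.merge(wv, on="city")

@workflow
def generate_dataset() -> pd.DataFrame:
    # The two extraction tasks have no dependency on each other,
    # so Flyte schedules them concurrently.
    return merge_sources(wp=extract_wikipedia(), wv=extract_wikivoyage())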

The overall project structure is represented in the image below. Each dashed block represents a Flyte workflow, while the gray blocks are Flyte tasks. As previously stated, the app lives in a different infrastructure and consumes the DataFrames generated by both of the workflows.

Our takeaways from Flyte were:

Pros:

  1. The cache system: Wikimedia pages don’t change frequently, so when the workflow is relaunched and everything is still the same, we won’t spend resources on it; and if something did change, the dataset gets updated.
  2. Ability to parallelize tasks: As seen above, extracting data from Wikipedia can be done at the same time as extracting data from Wikivoyage, since they are independent. Flyte automatically identifies and executes concurrent tasks, making the workflow more efficient.
  3. Trivial to request more resources: Training this complex model required a GPU, and we found it very easy to request more resources from the cluster, since it only takes a few parameters in the task decorator (see the sketch after this list).
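For example, asking for a GPU is just a matter of decorator parameters, as in this sketch with illustrative values:

from flytekit import Resources, task

@task(requests=Resources(gpu="1", mem="8Gi"), limits=Resources(gpu="1", mem="16Gi"))
def build_embeddings() -> None:
    ...  # the GPU-heavy embedding generation would run here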

Cons:

  1. At the time of this writing, the documentation for the platform is still a bit confusing, but it is improving over time.
  2. The deployment process can be tiresome for more complex projects, but it can be automated with scripts (such as the one given to us during the hackathon). Alternatively, you can use UnionML, which is built on top of Flyte and is more user-friendly.
  3. Given that the tool is still relatively new, there are some bugs here and there, but there is a great community and a Slack channel with people who can assist you. You can also report any bug you find as an issue in the platform’s official GitHub repository.

Conclusions and future works

During the study and development of the project, we managed to implement a model capable of answering the initial question raised: Given a city a user likes, which one should they visit?

During this time, we had the opportunity to try out Flyte, using it to implement the entire project in an end-to-end fashion. We used the platform’s strengths — like the scheduling of workflows and automatic concurrency — to deploy almost the entire system inside a Kubernetes cluster. And, using the remote API, we were able to retrieve the model and use it to build a UI with Streamlit.

However, due to time constraints, we did not implement any strategy to evaluate the model’s results. Furthermore, we have not yet developed techniques to further enrich the data that feeds our model. Thus, as future work, we should look for ways to enhance our training database and develop methodologies to evaluate the model. One way to evaluate it would be to ask customers for feedback on the recommendations; that feedback could also be used to improve the model.

If you want to know more, you can check out the project’s repository on Flyte’s GitHub here.

Acknowledgments

We would like to thank Hurb for supporting and encouraging our participation in the hackathon as training and recognition of the team’s potential. Hurb is a travel tech company from Brazil, and we look forward to connecting with people who are passionate about machine learning, so feel free to reach out! We are currently hiring in Brazil and soon in other countries; you can read more about our work at Hurb on our Medium page. Furthermore, Flyte’s team also helped us throughout the entirety of the hackathon, answering questions and providing all the needed resources.


Authors’ Bio: Sérgio Barreto, Renata Gotler, and Matheus Moreno are Machine Learning Engineers at Hurb; they are currently working on machine learning projects that affect different areas of the company: demand generation, operations, and MLOps, respectively. Meanwhile, Patrick Braz is a Data Engineer at Hurb, where he works with the rest of his team on ingestion pipelines, data quality, and data governance, the last of which is his main focus.
