Apartment hunting
18 Sep 2024
The repository can be found on GitHub; it contains the application code and the Terraform to deploy it to AWS.
Introduction
I was apartment hunting in a new area and didn’t really know what a good deal looked like. Normally, I would check apartment listings for a couple of months to get a feel for the market. But this time, I wanted to gather some data to make things easier in the future.
I noticed that some listings were taken down really fast, while others—despite being labeled as “new”—were just relisted. I thought it would be useful to track how long each listing stayed on the website. With that information, I could figure out which properties were selling quickly in different areas and spot the ones that kept getting relisted.
This post covers the app I built and deployed to AWS to help me with apartment hunting. Here’s a quick rundown of how it works:
- The app fetches property listing data from a property listing website
- The data is saved to a DynamoDB table
- It periodically checks whether the listings are still active. If a listing is gone, the entry gets updated
- The frontend allows users to interact with removed property listings

Application
Let’s go through the different parts of the application.
Fetching
First, the data needs to be fetched from the website. Scraping can be a heavyweight task: some websites are so JavaScript-heavy that they simply cannot be used without executing JavaScript. In those cases, something like Selenium might be needed, since it drives an actual browser that runs the JavaScript and renders the page. Thankfully, I was able to find the API endpoint by inspecting the requests made by my browser.
Since we want to get new data from the API, we need to pull it periodically. In my use case, pulling data every 30 minutes works perfectly. Because this task only requires a couple of seconds of compute every 30 minutes, it is a perfect use case for Lambda, which we can schedule with EventBridge.
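As a rough sketch, the fetching function could look something like this; the endpoint URL, query parameters, and field names are placeholders, not the real API:

```python
import requests  # assumed to be packaged via the function's requirements file

# Placeholder endpoint and parameters; the real ones were found by inspecting
# the browser's network requests on the listing website.
API_URL = "https://example.com/api/listings"


def handler(event, context):
    """Triggered every 30 minutes by an EventBridge schedule."""
    response = requests.get(API_URL, params={"location": "Helsinki"}, timeout=10)
    response.raise_for_status()
    listings = response.json()

    # The listings are then written to DynamoDB (see the Storage section).
    return {"fetched": len(listings)}
```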
Checking
After a while, we need to verify whether the listings we have gathered are still active on the website. Each listing has its own webpage, so one way to check whether a listing still exists is to open its page. There is nothing wrong with that, except when you want to check many pages: it would not only be slow but also waste your and the website owner's bandwidth. Instead of requesting the whole page, we should request only the headers. While the listing exists, we receive HTTP status code 200; once it has been removed, we get status code 410.
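A minimal sketch of the availability check, assuming the requests library is available in the function:

```python
import requests


def is_listing_active(listing_url: str) -> bool:
    """A HEAD request fetches only the headers, saving bandwidth on both ends."""
    response = requests.head(listing_url, timeout=10)
    if response.status_code == 200:  # listing still exists
        return True
    if response.status_code == 410:  # listing has been removed
        return False
    # Anything else is unexpected; the real function logs it and retries later.
    raise RuntimeError(f"Unexpected status {response.status_code} for {listing_url}")
```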
This job should not run very often, perhaps once a day, so it is again a good use case for Lambda, and we could schedule it with EventBridge as well. However, a Lambda function can run for a maximum of 15 minutes, which might not be enough time to check all listings. One way around this is to divide the listings into batches and check each batch separately. I decided to use a fan-out model: one Lambda function divides the data into batches and sends those batches to an SQS queue, and the checking function runs every time there is a message in the queue. Depending on the data and batch size, that could mean multiple Lambda functions checking listing availability at the same time. I wanted to limit that to two concurrent functions, since I don't want to stress the website.
My new AWS account had a limit of only 10 concurrent Lambda executions, which meant I could not restrict my function's concurrency until the account limit was increased. I opened a ticket with AWS, and a couple of days later the limit was raised to 100, so I was able to limit the checking function's concurrency to two.
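A sketch of the fan-out function, with placeholder names for the queue URL, batch size, and listing query (the checking function's concurrency limit of two is configured on the function itself, not here):

```python
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["CHECK_QUEUE_URL"]             # placeholder variable names
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "50"))


def get_active_listings() -> list[dict]:
    """Placeholder for the keys-only GSI query described in the Storage section."""
    return []


def handler(event, context):
    """Split the active listings into batches and queue one SQS message per batch."""
    listings = get_active_listings()
    for i in range(0, len(listings), BATCH_SIZE):
        batch = listings[i:i + BATCH_SIZE]
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(batch))
```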
Storage
I chose to use DynamoDB as the storage for listing data. This was my first time using DynamoDB. I was surprised by how much time I ended up investing in table design.
First, I had to list all the access patterns. Our goal is to show removed apartment listings to the user. After some thought, I decided I wanted the data to be searchable via three attributes: city, listing type (selling or renting), and the date range when the listing was removed. I chose the following partition and sort keys:
- Partition key: type#active
  - Possible values:
    - type: SELL or RENT
    - active: 0 = removed listing, 1 = active listing
  - Example: SELL#0
- Sort key: city#date published#id
  - Example: Helsinki#2024-02-11#17865000
This design allows us to search listings by type, city and date range. The ID was added to the sort key to ensure all entries are unique. It is also used by the checking function.
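As a sketch, writing a fetched listing with these keys could look roughly like this; the table and attribute names (pk, sk) are my placeholders, not necessarily the real schema:

```python
import boto3

table = boto3.resource("dynamodb").Table("listings")  # placeholder table name


def save_listing(listing: dict) -> None:
    """Store a fetched listing as active (active = 1)."""
    table.put_item(Item={
        "pk": f"{listing['type']}#1",  # flips to e.g. SELL#0 once the listing is removed
        "sk": f"{listing['city']}#{listing['published']}#{listing['id']}",
        # ...plus the rest of the listing attributes (price, size, rooms, ...)
    })
```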
Another access pattern is to get all active listings for the checking function. That is also possible with this design via two queries (one for renting and one for selling). However, the checking function only needs the type, city, and ID, so returning every attribute of every item would be wasteful. Instead, I chose to create a Global Secondary Index (GSI) that projects only the partition and sort keys. That way, around seven times less data is returned, making the queries cheaper.
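The checking function's query against that keys-only GSI could then look something like this; the index name and attribute names are again placeholders:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("listings")  # placeholder names
INDEX_NAME = "keys-only-index"


def get_active_listings(listing_type: str) -> list[dict]:
    """Return the keys of all active listings of one type (SELL or RENT)."""
    items = []
    kwargs = {
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": Key("pk").eq(f"{listing_type}#1"),
    }
    while True:
        response = table.query(**kwargs)
        items.extend(response["Items"])
        if "LastEvaluatedKey" not in response:  # paginate through all results
            return items
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
```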
Backend
For the backend, I chose to use Lambda and API Gateway. The backend simply fetches data from DynamoDB and returns it in JSON format. In addition to returning JSON data, the Lambda function also adds an Expires header to the response. With that header, we can control how CloudFront caches the response. Since the checking function runs only once a day, we can cache the result for up to 24 hours. The Expires header value is calculated by taking the next check start time and adding one hour to it. If the Lambda function returns an error, then the result is cached for only one minute.
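The Expires calculation could look something like the sketch below. The 03:00 UTC check time and the fetch_results helper are placeholders; the real schedule and query aren't shown here.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

CHECK_HOUR_UTC = 3  # placeholder for the real daily check start time


def fetch_results(event) -> str:
    """Placeholder for the DynamoDB query; returns a JSON string."""
    return "[]"


def next_expires() -> str:
    """Next check start time plus one hour, formatted as an HTTP date."""
    now = datetime.now(timezone.utc)
    next_check = now.replace(hour=CHECK_HOUR_UTC, minute=0, second=0, microsecond=0)
    if next_check <= now:
        next_check += timedelta(days=1)
    return format_datetime(next_check + timedelta(hours=1), usegmt=True)


def handler(event, context):
    try:
        body = fetch_results(event)
        headers = {"Content-Type": "application/json", "Expires": next_expires()}
        return {"statusCode": 200, "headers": headers, "body": body}
    except Exception:
        # On errors, cache for only one minute so CloudFront retries soon.
        short = format_datetime(datetime.now(timezone.utc) + timedelta(minutes=1), usegmt=True)
        return {"statusCode": 500, "headers": {"Expires": short}, "body": "{}"}
```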
The API Gateway is accessed via CloudFront. This gives us a simple URL for the API endpoint (/search) and caching.
Frontend
The frontend is an HTML/CSS/JS application stored in S3 and served via CloudFront. JavaScript is used to fetch data from the API and display it using the Tabulator JavaScript library. Tabulator was chosen because it is easy to use and supports column filters. While the data is fetched using only type, city, and date, column filters let the user easily apply more specific filters. You could, for example, search all sold apartments in Helsinki and then filter those down to apartments with sizes of 35–65 m² and build years newer than 2000. Since the data is already fetched, the filters are applied instantly.
Logging and monitoring
All Lambda functions have some error handling code that logs to CloudWatch. I wanted to be notified if any function runs into problems, so I ended up creating two Lambda functions: log_sqs and log_sns. log_sqs runs every time an error appears in the CloudWatch logs; it reads the error, formats it, and adds it to an SQS queue. I first had this function send the message directly to SNS, but that was not a good idea: when my application broke, I ended up receiving way too many emails. Now the messages are sent to an SQS queue, where they are read by the log_sns function. log_sns runs every X minutes to check whether there are any errors, and if there are, all messages are combined into one email. Monitoring is no longer real-time, but I receive at most one email every X minutes. This is not a critical application, so that is more than fine.
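As a sketch of log_sqs, assuming it is triggered by a CloudWatch Logs subscription filter (whose event payload is base64-encoded and gzip-compressed) and that the queue URL comes from an environment variable:

```python
import base64
import gzip
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["ERROR_QUEUE_URL"]  # placeholder variable name


def handler(event, context):
    """Decode the CloudWatch Logs payload, format it, and queue it for log_sns."""
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))

    function_name = payload["logGroup"].split("/")[-1]
    messages = "\n".join(e["message"].strip() for e in payload["logEvents"])
    body = (
        f"Function: {function_name}\n"
        f"LogGroup Name: {payload['logGroup']}\n"
        f"LogStream: {payload['logStream']}\n"
        f"Log Message(s):\n{messages}"
    )
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
```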
Example email alert message:
Function: <application_name>-<lambda_function>
LogGroup Name: /aws/lambda/<application_name>-<lambda_function>
LogStream: <year>/<month>/<day>/[$LATEST]<log stream hash>
Log Message(s):
[ERROR] <error message>
#########################################
Function: <application_name>-<lambda_function>
LogGroup Name: /aws/lambda/<application_name>-<lambda_function>
LogStream: <year>/<month>/<day>/[$LATEST]<log stream hash>
Log Message(s):
[ERROR] <error message number 2>
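And a sketch of the log_sns side, with placeholder environment variable names for the queue URL and SNS topic:

```python
import os

import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")
QUEUE_URL = os.environ["ERROR_QUEUE_URL"]   # placeholder variable names
TOPIC_ARN = os.environ["ALERT_TOPIC_ARN"]


def handler(event, context):
    """Runs on a schedule; drains the queue and combines errors into one email."""
    errors = []
    while True:
        response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = response.get("Messages", [])
        if not messages:
            break
        for message in messages:
            errors.append(message["Body"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])

    if errors:
        separator = "\n" + "#" * 41 + "\n"  # separator line as in the example above
        sns.publish(TopicArn=TOPIC_ARN, Subject="Lambda errors", Message=separator.join(errors))
```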
Terraform
This was my first time deploying a real application with Terraform. I wanted to use as few dependencies as possible. I chose to create my own modules for every resource created more than once.
I chose to create one Terraform file for every AWS service, except for Lambda, where I created a separate file for each function, named lambda_<function name>.tf. I think this keeps things quite readable. I didn't quite know how to name resources, so I ended up prefixing resource names with their filenames; for example, lambda_api_iam_policy_doc. I'm sure that is not the best way, but at least I can tell what a resource is and where to find it just by looking at its name.
I chose to use one repository for both the application and the Terraform. Terraform has its own directory, and there are separate directories for the Lambda and frontend code. The Lambda directory has a subdirectory for every Lambda function, containing the Python code and an optional requirements file. Both the subdirectory and the function file are named after the function.
I tried to avoid hardcoding values as much as possible and used variables instead. This makes the infrastructure more customizable and also lets you run multiple instances easily; you might, for example, want separate dev, QA, and prod environments.
Here is a diagram of the infrastructure deployed to AWS with Terraform:

Conclusion
I ran the application for a couple of months and quickly realized that most of the user-submitted data on the apartment listing website was neither validated nor normalized. Users would regularly misspell districts and street addresses, and the room configurations were a complete wild west: some used abbreviations, others didn't, and some did something in between.
These issues with misspellings and inconsistent room configurations could be addressed in code. Since the API gives us the zip code, we could use it to guess the correct district and street address when the submitted ones don't yield any results; most of the misspellings were minor typos. The room configurations could be handled with regex and a few cups of coffee. However, I didn't end up implementing these changes, at least not yet.