Data is one of the world’s newest and most precious resources. Most data gathered by companies is held privately and rarely shared with the public. This data can include a person’s browsing habits, financial information, or passwords. In the case of dating-focused companies such as Tinder or Hinge, this data contains personal information that users voluntarily disclosed for their dating profiles. Because of this, the information is kept private and made inaccessible to the public.
However, what if we wanted to create a project that uses this specific data? If we wanted to create a new dating application that uses machine learning and artificial intelligence, we would need a large amount of data that belongs to these companies. But these companies understandably keep their users’ data private and away from the public. So how would we accomplish such a task?
Well, since real user information from dating profiles is not available to us, we would need to generate fake user information for dating profiles. We need this forged data in order to attempt to use machine learning for our dating application. The origin of the idea for this application is described in the previous article:
The previous article dealt with the layout or format of our potential dating app. We would use a machine learning algorithm called K-Means Clustering to cluster each dating profile based on the user’s answers or choices for several categories. We also take into account what they mention in their bio as another factor in clustering the profiles. The theory behind this format is that people, in general, are more compatible with others who share the same beliefs (politics, religion) and interests (sports, movies, etc.).
With the dating app idea in mind, we can begin gathering or forging our fake profile data to feed into our machine learning algorithm. If something like this has been created before, then at least we would have learned a little something about Natural Language Processing (NLP) and unsupervised learning in K-Means Clustering.
The first thing we would need to do is to find a way to create a fake bio for each user profile. There is no feasible way to write thousands of fake bios by hand in a reasonable amount of time. To construct these fake bios, we will rely on a third-party website that generates them for us. There are numerous websites out there that do this. However, we won’t be showing the website of our choice because we will be applying web-scraping techniques to it.
We will be using BeautifulSoup to navigate the fake bio generator website, scrape the bios it generates, and store them in a Pandas DataFrame. By refreshing the page multiple times, we can generate the necessary number of fake bios for our dating profiles.
The first thing we do is import all the necessary libraries for our web scraper. Besides BeautifulSoup itself, the notable packages are requests (to access the webpage), time and random (to wait a randomized interval between refreshes), and tqdm (to display a progress bar).
The next part of the code involves scraping the webpage for the user bios. The first thing we create is a list of numbers ranging from 0.8 to 1.8. These numbers represent the number of seconds we will be waiting to refresh the page between requests. The next thing we create is an empty list to store all the bios we will be scraping from the page.
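As a minimal sketch of that setup, the two lists could be created as shown here. The name `seq` is the one used later in the article; `biolist` and the 0.1-second step between delays are assumptions made for illustration.

```python
# Delays (in seconds) to choose from between page refreshes: 0.8, 0.9, ..., 1.8.
# The 0.1 step is an assumption; the article only gives the range.
seq = [i / 10 for i in range(8, 19)]

# Empty list that will collect every bio scraped from the page.
# The name `biolist` is hypothetical.
biolist = []
```

Building the delays from integers and dividing by 10 avoids the rounding drift that repeated addition of 0.1 would introduce.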
Next, we create a loop that will refresh the page 1000 times in order to generate the number of bios we want (around 5000 different bios). The loop is wrapped in tqdm to create a progress bar that shows us how much time is left until the scraping finishes.
In the loop, we use requests to access the webpage and retrieve its content. The try statement is used because refreshing the webpage with requests sometimes returns nothing, which would cause the code to fail. In those cases, we simply pass to the next iteration. Inside the try statement is where we actually fetch the bios and add them to the empty list we previously instantiated. After gathering the bios on the current page, we use time.sleep(random.choice(seq)) to determine how long to wait until we start the next iteration. This way, the pause between refreshes is a randomly selected time interval from our list of numbers.
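The loop described above might look roughly like the following sketch. Since the generator site is not named, `URL` and the `div` elements with class `bio` are placeholders, and the helper names (`extract_bios`, `fetch_page`, `scrape_bios`) are hypothetical; the real page will need its own selector. This version also pauses after a failed request, which may differ from the original code.

```python
import random
import time

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

# Placeholder values: the article does not disclose the site or its markup.
URL = "https://example.com/fake-bio-generator"
BIO_CLASS = "bio"


def extract_bios(html):
    """Return the text of every <div class="bio"> element in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("div", class_=BIO_CLASS)]


def fetch_page(url):
    """Download one page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def scrape_bios(url, refreshes, delays, fetch=fetch_page):
    """Refresh `url` `refreshes` times, collecting bios and pausing in between."""
    bios = []
    for _ in tqdm(range(refreshes)):
        try:
            bios.extend(extract_bios(fetch(url)))
        except requests.RequestException:
            # A refresh occasionally fails; move on to the next iteration.
            pass
        # Wait a randomly chosen number of seconds before the next request.
        time.sleep(random.choice(delays))
    return bios
```

With the delay list from earlier, a full run would be `scrape_bios(URL, 1000, seq)`. Passing the fetch function as a parameter keeps the parsing and looping logic testable without touching the network.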
Once we have all the bios needed from the site, we will convert the list of the bios into a Pandas DataFrame.
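A short sketch of that conversion, using a stand-in list in place of the scraped bios. The column name `Bios` is an assumption.

```python
import pandas as pd

# Stand-in for the list filled by the scraper.
bios = ["Coffee lover.", "Avid hiker.", "Film buff."]

# One row per bio, in a single column (the name "Bios" is hypothetical).
bio_df = pd.DataFrame(bios, columns=["Bios"])
```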
In order to complete our fake dating profiles, we will need to fill in the other categories of religion, politics, movies, TV shows, etc. This next part is very simple as it does not require us to web-scrape anything. Essentially, we will be generating a list of random numbers to apply to each category.
The first thing we do is establish the categories for our dating profiles. These categories are stored in a list and then converted into another Pandas DataFrame. Next, we iterate through each new column we created and use numpy to generate a random number ranging from 0 to 9 for each row. The number of rows is determined by the number of bios we were able to retrieve in the previous DataFrame.
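These steps can be sketched as follows. The category names are the ones mentioned in this article, so the complete list is an assumption, and the three-row bio DataFrame is a stand-in for the scraped data. The seeded generator is used only to make the sketch reproducible.

```python
import numpy as np
import pandas as pd

# Stand-in for the scraped bios; the real DataFrame comes from the scraper.
bio_df = pd.DataFrame({"Bios": ["Coffee lover.", "Avid hiker.", "Film buff."]})

# Categories named in the article; the full list is an assumption.
categories = ["Religion", "Politics", "Movies", "TV", "Sports"]

# Empty DataFrame with one column per category and one row per bio.
cat_df = pd.DataFrame(index=bio_df.index, columns=categories)

rng = np.random.default_rng(42)  # seeded only for reproducibility
for col in cat_df.columns:
    # One random integer from 0 to 9 (inclusive) for each profile.
    cat_df[col] = rng.integers(0, 10, size=len(bio_df))
```

Note that the upper bound passed to `rng.integers` is exclusive, so `10` yields values from 0 through 9.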
Once we have the random numbers for each category, we can join the Bio DataFrame and the category DataFrame together to complete the data for our fake dating profiles. Finally, we can export our final DataFrame as a .pkl file for later use.
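A minimal sketch of the join and export, assuming both DataFrames share the same default row index. The tiny frames and the file name `profiles.pkl` are illustrative, and the file is written to a temporary directory so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-ins for the bio and category DataFrames.
bio_df = pd.DataFrame({"Bios": ["Coffee lover.", "Avid hiker."]})
cat_df = pd.DataFrame({"Religion": [3, 7], "Politics": [1, 9]})

# Both frames share the same row index, so join lines them up row by row.
profiles = bio_df.join(cat_df)

# Hypothetical file name; a temporary directory is used for this demo.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "profiles.pkl"
    profiles.to_pickle(path)
    restored = pd.read_pickle(path)  # read back to confirm the round trip
```

Pickle files should only be loaded from sources you trust, which is fine here since we create the file ourselves.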
Now that we have all the data for our fake dating profiles, we can begin exploring the dataset we just created. Using NLP (Natural Language Processing), we will be able to take an in-depth look at the bios for each dating profile. After some exploration of the data, we can begin modeling with K-Means Clustering to match the profiles with one another. Look out for the next article, which will deal with using NLP to explore the bios and perhaps K-Means Clustering as well.