Association Rule Mining

Overview

Association Rule Mining (ARM) is a popular data mining technique used to uncover interesting relationships, patterns, and associations among a set of items in large datasets. It is widely used in market basket analysis, where the goal is to find associations between different products purchased together by customers. ARM helps in identifying rules such as "If a customer buys item A, they are likely to buy item B". These rules are represented in the form of "if-then" statements, known as association rules.

Whereas clustering groups similar records together, ARM looks for items that tend to occur together.

ARM involves two main steps: finding frequent itemsets and generating association rules. Frequent itemsets are groups of items that frequently occur together in transactions. Once these itemsets are identified, association rules are generated to highlight the relationships between items. The strength of these rules is measured using metrics such as support, confidence, and lift. Support is the fraction of transactions in which an itemset appears. Confidence measures the likelihood of the consequent given the antecedent. Lift compares that confidence with how often the consequent occurs on its own, so a lift above 1 suggests the items appear together more often than random chance would predict, and a lift below 1 suggests they appear together less often.
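To make the three metrics concrete, here is a minimal sketch that computes them by hand in plain Python. The toy transactions and the function names (support, confidence, lift) are assumptions chosen for illustration; they are not taken from the analysis in this write-up.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent).

    Assumes the antecedent occurs at least once; otherwise this divides by zero.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


# Evaluate the rule "if bread, then milk".
print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
print(lift({"bread"}, {"milk"}, transactions))
```

With this toy data the rule "if bread, then milk" has support 0.5, confidence of about 0.67, and lift of about 0.89. The lift is below 1 because milk appears in 75% of all transactions but in only two of the three that contain bread.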

Because most of my data is numeric and does not lend itself to transactions, I decided it would be interesting to examine recent press coverage of climate change and the Colorado River instead. Using a combination of NewsAPI and Beautiful Soup, I extracted around 50 recent articles (as of October 2024) to analyze. The goal is to uncover the overall sentiment around these topics and whether there are any interesting connections among the articles.

Data Prep and Code

To perform ARM, the data must first be preprocessed into the correct format. This involves cleaning the text, removing irrelevant items, and converting the result into a transactional format, where each article becomes one transaction and its remaining words are the items. Below is the code used for data preparation and ARM:
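The sketch below is not the code used for this analysis. It is a minimal, standard-library illustration of the preparation steps described above, under stated assumptions: the two sample sentences, the short stop-word list, and the helper names to_transaction and frequent_pairs are all hypothetical. It counts co-occurring word pairs by brute force rather than running a full Apriori implementation.

```python
import re
from collections import Counter
from itertools import combinations

# Small illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for", "with", "that", "it"}


def to_transaction(text, stop_words=STOP_WORDS, min_len=3):
    """Lowercase the text, keep alphabetic words, drop stop words and short tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) >= min_len and w not in stop_words}


def frequent_pairs(transactions, min_support):
    """Return word pairs whose support (share of transactions) meets `min_support`."""
    counts = Counter(
        pair
        for t in transactions
        for pair in combinations(sorted(t), 2)  # sorted so each pair has one canonical order
    )
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Hypothetical article snippets standing in for the scraped text.
articles = [
    "The Colorado River is shrinking, and water cuts would follow.",
    "Climate change is going to reduce water in the river.",
]

# One transaction (a set of cleaned words) per article.
transactions = [to_transaction(a) for a in articles]
print(transactions)

# Pairs of words that appear together in every article.
print(frequent_pairs(transactions, min_support=1.0))
```

Representing each article as a set means a word counts once per article no matter how often it is repeated, which matches how support is defined over transactions.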

Conclusion (Non-technical)

Overall, this exercise was of limited use, but I did notice at least one interesting pattern. There tends to be a lot of discussion about what is "going" to happen, and much less about what is happening now or what "people" are doing to either help or hurt the climate situation. When "water" is mentioned, the discussion tends to center on what "would" happen with that water.

Some limitations: the results are constrained by the query I used to generate the transaction data, so the exercise is essentially confirming what I queried. Another limitation is how accurately the data can be cleaned and summarized. While I did attempt to clean up the values by keeping only English words, removing stop words and fillers, and summarizing the articles to their key points, there are still confusing or mixed results.

Improvements for future iterations: try social data. Perhaps articles are written with a degree of formality that masks citizens' true sentiment surrounding the Colorado River. I could also run entirely different queries, dropping keywords like "Colorado River" and focusing instead on "water" and "climate change".