Multi-Class Text Classification

collinmutembei, Tue Apr 18 2023 • artificial intelligence text classification

Suppose you have a dataset of financial transactions, and each transaction has a text notification associated with it. The text notification may contain information such as the name of the merchant, the amount of the transaction, the date and time of the transaction, and the type of transaction (e.g., purchase, withdrawal, deposit). For example, suppose you have the following transactions:

Text Notification	Category
Carrefour ...	Shopping
Java ...	Dining
Uber ...	Travel
ATM Withdrawal ...	Banking

Your task is to build a model that can automatically classify new transactions into one of these categories based on their text notifications.

To solve this problem, we will use scikit-learn, a popular machine learning library for Python. We will follow these steps:

Load the dataset
Preprocess the data
Extract features from the data
Train a multi-class classification model
Evaluate the model
Predict the category of new transactions

Let's get started!

Step 1: Load the dataset

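First, we need to load the dataset. We will use a CSV file containing a sample of financial transactions with their text notifications and categories. You can download the dataset here. The code in this post assumes the file is saved as transactions.csv and has two columns, Text Notification and Category.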

import pandas as pd
 
# Load the dataset
df = pd.read_csv('transactions.csv')
 
# Display the first few rows of the dataset
print(df.head())
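If you do not have the CSV at hand, the following sketch builds a small stand-in with the same two columns and checks that they are present. The rows are hypothetical and written only for illustration, and the names sample_csv and sample_df are assumptions that do not appear elsewhere in this post.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real transactions.csv may look different.
sample_csv = io.StringIO(
    "Text Notification,Category\n"
    "CARREFOUR PURCHASE 1234,Shopping\n"
    "JAVA HOUSE BILL 5678,Dining\n"
    "UBER TRIP 9012,Travel\n"
    "ATM WITHDRAWAL 3456,Banking\n"
)
sample_df = pd.read_csv(sample_csv)

# Fail early if the columns the later steps rely on are missing.
expected_columns = ["Text Notification", "Category"]
missing = [name for name in expected_columns if name not in sample_df.columns]
if missing:
    raise ValueError(f"Missing columns: {missing}")

# A quick look at how many examples each category has.
print(sample_df["Category"].value_counts())
```

Running the same column check on your own df right after loading can save a confusing KeyError in the later steps.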

Step 2: Preprocess the data

Next, we need to preprocess the data to prepare it for feature extraction. We will perform the following steps:

Convert the text to lowercase
Remove any punctuation and special characters
Tokenize the text into words
Remove stop words (common words such as "the" and "and")
Stem or lemmatize the words (reduce them to their base form)

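import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
 
# Fetch the stop word list on first use (does nothing if already downloaded)
nltk.download('stopwords', quiet=True)
 
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()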
 
def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    words = text.split()
    words = [word for word in words if word not in stop_words]
    words = [stemmer.stem(word) for word in words]
    return words
 
# The DataFrame loaded in Step 1 is named df
df['Text Notification'] = df['Text Notification'].apply(preprocess)
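To see what the cleaning steps do to a single notification, here is a simplified sketch that needs no NLTK download. It is not the preprocess function from this post: it uses a tiny hand-picked stop word set (DEMO_STOP_WORDS, an assumption) instead of NLTK's English list, and it skips stemming. The function name clean_tokens and the sample text are also hypothetical.

```python
import re

# Assumption: a tiny hand-picked stop word set standing in for NLTK's English list.
DEMO_STOP_WORDS = {"the", "and", "to", "a", "of", "in"}

def clean_tokens(text):
    """Lowercase, strip punctuation, tokenize, and drop stop words (no stemming)."""
    lowered = text.lower()
    # Remove everything that is neither a word character nor whitespace.
    no_punctuation = re.sub(r"[^\w\s]", "", lowered)
    tokens = no_punctuation.split()
    return [token for token in tokens if token not in DEMO_STOP_WORDS]

print(clean_tokens("Paid to the QUICK MART, purchase #1234!"))
# ['paid', 'quick', 'mart', 'purchase', '1234']
```

Note that digits survive the cleaning, so reference numbers such as 1234 become tokens too; whether that helps or hurts depends on your data.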

Step 3: Extract features from the data

To train a machine learning model for text classification, we need to extract features from the preprocessed text. The Bag-of-Words model is a popular method for feature extraction in text classification. It represents each document as a vector of word counts, where each element in the vector corresponds to a word in the vocabulary.
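To make the word-count vector concrete, here is a minimal pure-Python sketch of the Bag-of-Words idea. It is an illustration only, not how scikit-learn implements it; the function name bag_of_words and the two toy documents are assumptions.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count_vectors) for a list of token lists."""
    # The vocabulary is every distinct token, in a fixed (sorted) order.
    vocabulary = sorted({token for doc in documents for token in doc})
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        # One element per vocabulary word: how often it occurs in this document.
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors

# Hypothetical, already-preprocessed documents.
docs = [
    ["atm", "withdraw", "atm"],
    ["uber", "trip"],
]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['atm', 'trip', 'uber', 'withdraw']
print(vectors)     # [[2, 0, 0, 1], [0, 1, 1, 0]]
```

Word order is discarded: only the counts per vocabulary word remain, which is why the representation is called a "bag".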

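We can use scikit-learn's CountVectorizer to convert the preprocessed text into a matrix of word counts. Because preprocess returns a list of tokens and CountVectorizer expects strings, we join each list back into a space-separated string first: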

from sklearn.feature_extraction.text import CountVectorizer
 
# Extract features from the data
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Text Notification'].apply(lambda x: ' '.join(x)))
y = df['Category']

Step 4: Train a multi-class classification model

We will use the Multinomial Naive Bayes algorithm to train a multi-class classification model. Naive Bayes is a probabilistic algorithm: it treats the words in a notification as independent of one another given the category, estimates the probability of the notification belonging to each category, and selects the category with the highest probability.

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
 
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Train a multi-class classification model
model = MultinomialNB()
model.fit(X_train, y_train)
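To see the "highest probability" rule in action, the sketch below trains on four hand-written notifications and prints the per-category probabilities for a new one. The training texts, the query text, and the toy_ names are hypothetical and kept separate from the model trained above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, hand-written training texts; not taken from the dataset.
toy_texts = [
    "carrefour supermarket purchase",
    "java house coffee",
    "uber trip",
    "atm withdrawal",
]
toy_labels = ["Shopping", "Dining", "Travel", "Banking"]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# One row per input text, one column per category in toy_model.classes_.
query = toy_vectorizer.transform(["uber trip to airport"])
probabilities = toy_model.predict_proba(query)
for category, probability in zip(toy_model.classes_, probabilities[0]):
    print(f"{category}: {probability:.3f}")

# predict() returns the category whose probability is highest.
best = toy_model.classes_[np.argmax(probabilities[0])]
print(best)  # Travel
```

Words the vectorizer never saw during fitting ("to" and "airport" here) are ignored at prediction time, so the decision rests on "uber" and "trip" alone.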

Step 5: Evaluate the model

We will evaluate the performance of the model using the accuracy score, which measures the proportion of correctly classified transactions.

from sklearn.metrics import accuracy_score
 
# Evaluate the performance of the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

The output should look something like this (the exact value depends on your dataset and on the train/test split):

Accuracy: 0.80
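As a small worked example of what the accuracy score measures, the sketch below computes it by hand and with accuracy_score on five made-up labels, then prints a confusion matrix to show which categories get mixed up. The labels are hypothetical and not results from the model above; the variable names are assumptions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for five test transactions, for illustration only.
true_labels = ["Shopping", "Dining", "Travel", "Banking", "Dining"]
predicted_labels = ["Shopping", "Dining", "Travel", "Banking", "Shopping"]

# Accuracy is the share of positions where prediction and truth agree.
matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
manual_accuracy = matches / len(true_labels)
print(manual_accuracy)                                # 0.8
print(accuracy_score(true_labels, predicted_labels))  # 0.8

# Rows are true categories, columns are predicted categories.
categories = ["Banking", "Dining", "Shopping", "Travel"]
matrix = confusion_matrix(true_labels, predicted_labels, labels=categories)
print(matrix)
```

Accuracy alone can hide a category that is consistently misclassified, especially when categories are unevenly represented, so a per-category view like this is a useful complement.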

Step 6: Predict the category of new transactions

Finally, we can use the trained model to predict the category of new transactions based on their text notifications.

# Predict the category of new transactions
new_data = pd.DataFrame({
    'Text Notification': [
        'ATM WITHDRAWAL 1234',
        'QUICK MART PURCHASE #1234',
        'TRANSFER FROM JOHN',
    ]
})
 
new_data['Text Notification'] = new_data['Text Notification'].apply(preprocess)
X_new = vectorizer.transform(new_data['Text Notification'].apply(lambda x: ' '.join(x)))
 
y_new = model.predict(X_new)
print(y_new)

Bonus Step: Saving the model for reuse

In this example, we save the trained model to a file named text_classification_model.joblib, load it back with joblib.load(), and use it to make predictions on new data. We use joblib rather than the built-in pickle module because it is optimized for objects that carry large NumPy arrays, which is common for machine learning models. Note that X_new was produced by the fitted vectorizer, so to classify raw text in a new session you need to save and reload the vectorizer in the same way.

# joblib is a standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
import joblib
 
# ...training code here...
 
# Save the model
joblib.dump(model, 'text_classification_model.joblib')
 
# Load the model
model = joblib.load('text_classification_model.joblib')
 
# Predict using the loaded model
y_new = model.predict(X_new)
print(y_new)
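One way to keep the vectorizer and the model together is to wrap both in a scikit-learn Pipeline and save that single object. The sketch below is an alternative to the code above, not part of the original workflow: the training texts are hypothetical, the file name text_classification_pipeline.joblib is an assumption, and the custom preprocess step is left out so the pipeline accepts raw strings directly.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical, hand-written training texts; not taken from the dataset.
demo_texts = [
    "carrefour supermarket purchase",
    "java house coffee",
    "uber trip",
    "atm withdrawal",
]
demo_labels = ["Shopping", "Dining", "Travel", "Banking"]

# The pipeline vectorizes raw text and classifies it in one call.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
pipeline.fit(demo_texts, demo_labels)

# Save and reload the whole pipeline; a temporary directory keeps the demo tidy.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text_classification_pipeline.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored object accepts raw strings, no separate vectorizer needed.
print(restored.predict(["atm withdrawal 1234"]))  # ['Banking']
```

Fitting the pipeline on the training split only also ensures the vocabulary is learned without looking at the test set. As with pickle, only load joblib files from sources you trust, and expect to reload them with a compatible scikit-learn version.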

Conclusion

In this blog post, we've explored how to perform multi-class text classification using scikit-learn with an example dataset of financial transactions. We've learned how to preprocess the text, extract features using the Bag-of-Words model, train a machine learning model using the Multinomial Naive Bayes algorithm, and evaluate the performance using the accuracy score. We've also seen how to use the trained model to predict the category of new transactions based on their text notifications.

Text classification is a powerful tool for automating the categorization of text data and can be used in a variety of applications, from sentiment analysis to content moderation. With scikit-learn and the techniques outlined in this blog post, you can perform multi-class text classification on your own datasets.