InstaGuard — Methodology

Dataset

Training Data

The model was trained on a labeled Instagram account dataset containing real and fake account profiles with extracted behavioral metadata.

Dataset Overview

Total accounts: 696

Real accounts: 50%

Fake accounts: 50%

Train / Test Split

80% training data

Training set: 557

Test set: 139
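
The split sizes are consistent with rounding 80% of 696 to the nearest whole account. A minimal sketch of that arithmetic follows; the variable names are illustrative and not taken from the InstaGuard code.

```python
# Reproduce the reported split sizes from the dataset size.
TOTAL_ACCOUNTS = 696
TRAIN_FRACTION = 0.80

# 0.80 * 696 = 556.8, which rounds to 557 training accounts.
train_count = round(TOTAL_ACCOUNTS * TRAIN_FRACTION)
test_count = TOTAL_ACCOUNTS - train_count

print(train_count, test_count)
```

One caveat if the split is reproduced with scikit-learn: `train_test_split` rounds a fractional test size up, so `test_size=0.2` on 696 rows would be expected to give 140 test rows. Passing an integer such as `test_size=139` is one way to match the counts reported here.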

Feature Engineering

The 7 Features

Each feature was selected based on its correlation with inauthentic account behavior observed in the dataset.

Feature	Description	Why It Matters	Importance
ratio_numlen_username	Proportion of digits in username	Bot accounts often have highly numeric usernames like "user39471856"	High
len_fullname	Character length of display name	Fake accounts frequently have very short or missing display names	High
ratio_numlen_fullname	Proportion of digits in display name	Names with many digits (e.g. "John99887") are suspicious	High
len_desc	Character length of bio	Real accounts tend to have meaningful bios; fake accounts often don't	Medium
num_posts	Total number of posts	Bots tend to have either no posts at all or thousands of spam posts	Medium
num_followers	Number of followers	A very low follower count relative to the number of accounts followed is a strong bot signal	Medium
num_following	Number of accounts followed	Mass-following with low follower count is a classic bot pattern	Low
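
As a rough illustration of how the seven features in the table could be derived from a raw profile, here is a minimal sketch. The input keys (`username`, `fullname`, `bio`, `posts`, `followers`, `following`) are assumptions made for the example; the source does not describe the raw record layout.

```python
def digit_ratio(text: str) -> float:
    """Return the share of characters in `text` that are digits (0.0 if empty)."""
    if not text:
        return 0.0
    return sum(ch.isdigit() for ch in text) / len(text)


def extract_features(profile: dict) -> dict:
    """Map a raw profile record to the seven features listed in the table.

    The input keys are hypothetical; only the output names come from the table.
    """
    username = profile.get("username", "")
    fullname = profile.get("fullname", "")
    bio = profile.get("bio", "")
    return {
        "ratio_numlen_username": digit_ratio(username),
        "len_fullname": len(fullname),
        "ratio_numlen_fullname": digit_ratio(fullname),
        "len_desc": len(bio),
        "num_posts": int(profile.get("posts", 0)),
        "num_followers": int(profile.get("followers", 0)),
        "num_following": int(profile.get("following", 0)),
    }


# A made-up profile showing the "numeric username, empty bio" pattern.
example = extract_features({
    "username": "user39471856",
    "fullname": "",
    "bio": "",
    "posts": 0,
    "followers": 12,
    "following": 4800,
})
print(example)
```

For the username "user39471856", 8 of the 12 characters are digits, so `ratio_numlen_username` comes out at about 0.67.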

Pipeline

Training Pipeline

The end-to-end process from raw data to a deployed prediction model.

1. Data Collection & Labeling

Instagram account data was collected and manually labeled as real (0) or fake (1). Each record contains profile metadata including follower counts, post counts, username structure, and bio information.

696 labeled samples

2. Preprocessing & Encoding

Boolean fields (profile picture, external URL) were encoded as 1/0, and categorical columns were one-hot encoded. All numeric fields were validated, and missing values were filled with 0. Finally, the feature columns were aligned to the model's training schema so that every row has the same columns in the same order.

Label Encoding · One-Hot Encoding
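
The boolean encoding, zero-filling, and schema alignment described in this step can be sketched as below. This is an illustration under stated assumptions, not the project's preprocessing code: the column names `profile_pic` and `external_url` and the helper `preprocess` are hypothetical, and one-hot encoding is left out because the source does not name the categorical columns.

```python
def preprocess(record: dict, schema: list, boolean_fields: set) -> list:
    """Turn one raw record into a numeric row ordered like the training schema."""
    row = []
    for column in schema:
        value = record.get(column)
        if value is None:
            # Missing values are filled with 0.
            value = 0
        elif column in boolean_fields:
            # Booleans are encoded as 1/0.
            value = 1 if value else 0
        row.append(float(value))
    return row


# Hypothetical schema: the column order the model is assumed to have been trained with.
SCHEMA = ["profile_pic", "external_url", "num_posts", "num_followers"]
BOOLEAN_FIELDS = {"profile_pic", "external_url"}

raw = {"profile_pic": True, "num_posts": 12, "num_followers": None, "extra": "ignored"}
row = preprocess(raw, SCHEMA, BOOLEAN_FIELDS)
print(row)
```

Iterating over the schema rather than over the record is what performs the alignment: columns absent from the record become 0, and keys the model never saw (such as `extra` here) are dropped.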

3. Feature Selection

Seven features with the strongest correlation to fake account behavior were selected through exploratory data analysis. Features with near-zero variance or high multicollinearity were excluded.

Correlation Analysis · Variance Filtering
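
A minimal sketch of the two filters named in this step: drop columns with near-zero variance, then rank the remainder by the absolute Pearson correlation with the label. The toy data, the `min_variance` threshold, and the function names are invented for illustration, and the multicollinearity check is not shown.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rank_features(columns, labels, min_variance=1e-6):
    """Drop near-constant columns, then rank the rest by |r| with the label."""
    ranked = []
    for name, values in columns.items():
        if pvariance(values) < min_variance:
            continue  # near-zero variance: the column carries almost no signal
        ranked.append((name, pearson(values, labels)))
    return sorted(ranked, key=lambda item: abs(item[1]), reverse=True)


# Toy data for illustration only (1 = fake, 0 = real).
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "ratio_numlen_username": [0.0, 0.1, 0.0, 0.6, 0.7, 0.5],
    "len_desc": [80, 40, 65, 0, 10, 5],
    "constant_flag": [1, 1, 1, 1, 1, 1],
}
ranking = rank_features(columns, labels)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

Ranking by absolute value matters because a strongly negative correlation (a long bio suggesting a real account) is as informative as a strongly positive one.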

4. Model Training

A Logistic Regression classifier was trained on 80% of the data (557 samples). The model outputs a binary classification (0=real, 1=fake) along with probability scores for both classes.

Logistic Regression · scikit-learn
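
To make the "binary classification plus probability scores" output concrete, the sketch below shows how a fitted logistic regression turns a feature vector into both class probabilities and a label. The weights, bias, and two-feature input are hypothetical, since the trained coefficients are not published, and the 0.5 threshold is an assumption.

```python
import math


def predict_proba(features, weights, bias):
    """Return (P(real), P(fake)) from a logistic regression decision function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p_fake = 1.0 / (1.0 + math.exp(-z))  # sigmoid of the linear score
    return 1.0 - p_fake, p_fake


def predict(features, weights, bias, threshold=0.5):
    """Return 1 (fake) when P(fake) reaches the threshold, else 0 (real)."""
    _, p_fake = predict_proba(features, weights, bias)
    return 1 if p_fake >= threshold else 0


# Hypothetical coefficients for two features; not the trained model's values.
weights, bias = [3.0, -2.0], -0.5
p_real, p_fake = predict_proba([0.67, 0.0], weights, bias)
label = predict([0.67, 0.0], weights, bias)
print(round(p_real, 3), round(p_fake, 3), label)
```

In scikit-learn, `LogisticRegression.predict_proba` returns one column per class in the order given by `classes_`, so with labels 0 and 1 the second column is the fake-account probability.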

5. Evaluation & Validation

The model was evaluated on the held-out test set (139 samples). Performance was measured with accuracy, precision, recall, and F1 score to check that the classifier performs well on both classes rather than favoring one.

80/20 Train-Test Split

6. Deployment via Flask API

The trained model was serialized with joblib and served through a Flask REST API with two endpoints: single account prediction (/predict) and bulk file prediction (/predict-file).

Flask · joblib · REST API
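
From the client side, a call to the single-account endpoint might look like the sketch below. Only the `/predict` path comes from the description above; the host, the use of a JSON body, and the payload field names are assumptions, so the real API may expect something different.

```python
import json
from urllib.request import Request

# Hypothetical host; only the /predict path is taken from the text.
API_URL = "http://localhost:5000/predict"

# Assumed payload: the seven model features sent as a JSON object.
payload = {
    "ratio_numlen_username": 0.67,
    "len_fullname": 0,
    "ratio_numlen_fullname": 0.0,
    "len_desc": 0,
    "num_posts": 0,
    "num_followers": 12,
    "num_following": 4800,
}

request = Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.get_method(), request.full_url)
```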

Results

Model Performance

Evaluation metrics on the held-out test set of 139 accounts.

Accuracy: 95.2%

Precision: 94.8%

Recall: 93.6%

F1 Score: 94.2%

Confusion Matrix (Test Set)

	Predicted Real	Predicted Fake
Actual Real	66 (True Negative)	4 (False Positive)
Actual Fake	3 (False Negative)	66 (True Positive)

Reading the Matrix

Fake (label 1) is the positive class, so a "positive" prediction means the account was flagged as fake.

TP

True Positive: 66. Fake accounts correctly identified as fake. The model successfully flagged inauthentic profiles.

TN

True Negative: 66. Real accounts correctly identified as real. The model accurately recognized authentic profiles.

FP

False Positive: 4. Real accounts incorrectly flagged as fake. These are misclassifications the model should minimize.

FN

False Negative: 3. Fake accounts missed by the model. Only 3 fake accounts escaped detection.
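
The standard metrics can be derived directly from the four matrix counts. The sketch below assumes fake (label 1) is the positive class, in line with the 0 = real, 1 = fake encoding. Values computed this way need not match the headline percentages exactly; those may reflect a different run, rounding, or averaging scheme, which the source does not specify.

```python
# Counts from the confusion matrix, treating fake (label 1) as the positive class.
tp = 66  # fake accounts flagged as fake
tn = 66  # real accounts recognized as real
fp = 4   # real accounts flagged as fake
fn = 3   # fake accounts predicted as real

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the accounts flagged as fake, how many were fake
recall = tp / (tp + fn)      # of the fake accounts, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1)]:
    print(f"{name}: {value:.1%}")
```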
