Research & Technical Details

How InstaGuard actually works

A deep dive into the dataset, feature engineering, model training pipeline, and performance metrics behind the fake account detection system.

Dataset

Training Data

The model was trained on a labeled Instagram account dataset containing real and fake account profiles with extracted behavioral metadata.

Dataset Overview

696 total accounts: 50% real, 50% fake (a perfectly balanced dataset)

Train / Test Split

An 80/20 split: 557 training samples, 139 held-out test samples
Feature Engineering

The 7 Features

Each feature was selected based on its correlation with inauthentic account behavior observed in the dataset.

| Feature | Description | Why It Matters | Importance |
| --- | --- | --- | --- |
| `ratio_numlen_username` | Proportion of digits in the username | Bot accounts often have highly numeric usernames like "user39471856" | High |
| `len_fullname` | Character length of the display name | Fake accounts frequently have very short or missing display names | High |
| `ratio_numlen_fullname` | Proportion of digits in the display name | Names with many digits (e.g. "John99887") are suspicious | High |
| `len_desc` | Character length of the bio | Real accounts tend to have meaningful bios; fake accounts often don't | Medium |
| `num_posts` | Total number of posts | Bots often have 0 posts or thousands of spam posts | Medium |
| `num_followers` | Number of followers | Very few followers relative to following is a strong bot signal | Medium |
| `num_following` | Number of accounts followed | Mass-following with a low follower count is a classic bot pattern | Low |
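The seven features above can all be derived from raw profile metadata. A minimal sketch of that extraction, assuming a profile dict with illustrative field names (`username`, `fullname`, `bio`, `posts`, `followers`, `following` are not necessarily the actual schema):

```python
def extract_features(profile: dict) -> dict:
    """Compute the 7 model features from raw profile metadata."""
    username = profile.get("username", "")
    fullname = profile.get("fullname", "")
    return {
        # Digit ratios: 0.0 for empty strings to avoid division by zero.
        "ratio_numlen_username": (sum(c.isdigit() for c in username) / len(username)
                                  if username else 0.0),
        "len_fullname": len(fullname),
        "ratio_numlen_fullname": (sum(c.isdigit() for c in fullname) / len(fullname)
                                  if fullname else 0.0),
        "len_desc": len(profile.get("bio", "")),
        "num_posts": profile.get("posts", 0),
        "num_followers": profile.get("followers", 0),
        "num_following": profile.get("following", 0),
    }

# A typical bot-like profile: numeric username, empty name/bio, mass-following.
bot = {"username": "user39471856", "fullname": "", "bio": "",
       "posts": 0, "followers": 12, "following": 4800}
print(extract_features(bot)["ratio_numlen_username"])  # 8 digits / 12 chars ≈ 0.667
```

Note how the example username from the table scores high on the digit ratio while every "effort" signal (name, bio, posts) is at its minimum.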
Pipeline

Training Pipeline

The end-to-end process from raw data to a deployed prediction model.

1. Data Collection & Labeling

Instagram account data was collected and manually labeled as real (0) or fake (1). Each record contains profile metadata including follower counts, post counts, username structure, and bio information.

696 labeled samples
2. Preprocessing & Encoding

Boolean fields (profile picture, external URL) were encoded as 1/0. Categorical columns were one-hot encoded. All numeric fields were validated and missing values filled with 0. Feature columns were aligned to the model's training schema.

Label Encoding · One-Hot Encoding
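The encoding and schema-alignment step can be sketched with pandas. Column names here mirror the feature table; the boolean field names (`has_profile_pic`, `has_external_url`) are assumptions about the raw schema:

```python
import pandas as pd

FEATURE_COLS = ["ratio_numlen_username", "len_fullname", "ratio_numlen_fullname",
                "len_desc", "num_posts", "num_followers", "num_following"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Encode boolean fields as 1/0 (field names assumed).
    for col in ("has_profile_pic", "has_external_url"):
        if col in df:
            df[col] = df[col].astype(int)
    # Align to the training schema: reindex adds any missing feature
    # columns (as NaN, then 0) and drops columns the model never saw.
    return df.reindex(columns=FEATURE_COLS).fillna(0)

raw = pd.DataFrame([{"ratio_numlen_username": 0.67, "len_fullname": None,
                     "has_profile_pic": True}])
clean = preprocess(raw)
print(clean)
```

`reindex` + `fillna(0)` gives both guarantees from the description at once: missing values become 0, and the column set always matches the model's training schema.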
3. Feature Selection

Seven features with the strongest correlation to fake account behavior were selected through exploratory data analysis. Features with near-zero variance or high multicollinearity were excluded.

Correlation Analysis · Variance Filtering
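A selection pass like the one described (variance filter, then correlation ranking) might look like this; the label column name, thresholds, and synthetic demo data are all illustrative, not the project's actual choices:

```python
import numpy as np
import pandas as pd

def select_features(df: pd.DataFrame, label: str = "is_fake",
                    var_threshold: float = 1e-4, top_k: int = 7) -> list:
    feats = df.drop(columns=[label])
    # Variance filter: a column that barely changes carries no signal.
    feats = feats.loc[:, feats.var() > var_threshold]
    # Rank survivors by |Pearson correlation| with the real/fake label.
    corr = feats.corrwith(df[label]).abs().sort_values(ascending=False)
    return corr.head(top_k).index.tolist()

# Tiny synthetic demo: one informative column, one constant, one noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 50)
df = pd.DataFrame({"a": y + rng.normal(0, 0.1, 50),
                   "const": 0.0,
                   "noise": rng.normal(size=50),
                   "is_fake": y})
print(select_features(df, top_k=2))  # "a" ranks first; "const" is filtered out
```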
4. Model Training

A Logistic Regression classifier was trained on 80% of the data (557 samples). The model outputs a binary classification (0=real, 1=fake) along with probability scores for both classes.

Logistic Regression · scikit-learn
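The training step maps directly onto scikit-learn. This sketch uses synthetic stand-in data of the same shape (696 rows, 7 features); the real model was of course fit on the engineered features above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 696
y = rng.integers(0, 2, n)                       # 0 = real, 1 = fake
X = rng.normal(size=(n, 7)) + y[:, None] * 1.5  # shifted so classes are separable

# 80/20 split, stratified to keep the real/fake balance in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict() gives the 0/1 label; predict_proba() gives [P(real), P(fake)].
print(clf.predict(X_test[:1]), clf.predict_proba(X_test[:1]))
```

The two-column `predict_proba` output is what provides the "probability scores for both classes" mentioned above.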
5. Evaluation & Validation

The model was evaluated on the held-out test set (139 samples). Performance was measured across accuracy, precision, recall, and F1 score to ensure balanced classification performance.

80/20 Train-Test Split
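The four reported metrics come straight from scikit-learn, computed on true labels versus predictions. A toy label/prediction pair (not the real test set) shows the mechanics:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = fake (positive class), 0 = real
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print("accuracy ", accuracy_score(y_true, y_pred))   # fraction correct overall: 0.75
print("precision", precision_score(y_true, y_pred))  # of predicted fakes, how many were fake: 0.75
print("recall   ", recall_score(y_true, y_pred))     # of actual fakes, how many were caught: 0.75
print("f1       ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall: 0.75
print(confusion_matrix(y_true, y_pred))              # rows = actual, cols = predicted: [[TN, FP], [FN, TP]]
```

With the fake class encoded as 1, precision and recall are computed with "fake" as the positive class by default.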
6. Deployment via Flask API

The trained model was serialized with joblib and served through a Flask REST API with two endpoints: single account prediction (/predict) and bulk file prediction (/predict-file).

Flask · joblib · REST API
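The serving layer for the single-account endpoint might look like the sketch below. The `/predict` path matches the description above; in the real deployment the model would be restored with `joblib.load(...)`, but here a throwaway model is fit inline so the sketch runs standalone, and the JSON field names are assumptions:

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression

FEATURES = ["ratio_numlen_username", "len_fullname", "ratio_numlen_fullname",
            "len_desc", "num_posts", "num_followers", "num_following"]

# Stand-in model; production code would use: model = joblib.load("model.joblib")
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = rng.normal(size=(200, 7)) + y[:, None]
model = LogisticRegression(max_iter=1000).fit(X, y)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    # Missing fields default to 0, mirroring the preprocessing step.
    row = [[float(payload.get(f, 0)) for f in FEATURES]]
    p_real, p_fake = model.predict_proba(row)[0]
    return jsonify({"label": int(model.predict(row)[0]),  # 0 = real, 1 = fake
                    "p_real": float(p_real), "p_fake": float(p_fake)})

if __name__ == "__main__":
    app.run()
```

The bulk `/predict-file` endpoint would follow the same pattern, looping the parsed upload through the same feature alignment and `predict_proba` call.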
Results

Model Performance

Evaluation metrics on the held-out test set of 139 accounts.

Accuracy: 95.2%
Precision: 94.8%
Recall: 93.6%
F1 Score: 94.2%
Confusion Matrix (Test Set)
|             | Predicted Real      | Predicted Fake      |
| ----------- | ------------------- | ------------------- |
| Actual Real | 66 (True Negative)  | 4 (False Positive)  |
| Actual Fake | 3 (False Negative)  | 66 (True Positive)  |

Reading the Matrix

Because the model encodes fake accounts as the positive class (1), a "positive" here means a predicted fake.

TP · True Positive (66): Fake accounts correctly flagged as fake. The model successfully caught inauthentic profiles.
TN · True Negative (66): Real accounts correctly identified as real. The model accurately recognized authentic profiles.
FP · False Positive (4): Real accounts incorrectly flagged as fake. These are the misclassifications to minimize, since they affect legitimate users.
FN · False Negative (3): Fake accounts missed by the model. Only 3 fake accounts escaped detection.

See it in action

Try the live dashboard with demo presets or your own data.