A deep dive into the dataset, feature engineering, model training pipeline, and performance metrics behind the fake account detection system.
The model was trained on a labeled Instagram account dataset containing real and fake account profiles with extracted behavioral metadata.
Each feature was selected based on its correlation with inauthentic account behavior observed in the dataset.
| Feature | Description | Why It Matters | Importance |
|---|---|---|---|
| ratio_numlen_username | Proportion of digits in username | Bot accounts often have highly numeric usernames like "user39471856" | High |
| len_fullname | Character length of display name | Fake accounts frequently have very short or missing display names | High |
| ratio_numlen_fullname | Proportion of digits in display name | Names with many digits (e.g. "John99887") are suspicious | High |
| len_desc | Character length of bio | Real accounts tend to have meaningful bios; fake accounts often don't | Medium |
| num_posts | Total number of posts | Bots tend to have either 0 posts or thousands of spam posts | Medium |
| num_followers | Number of followers | Very few followers relative to following is a strong bot signal | Medium |
| num_following | Number of accounts followed | Mass-following with low follower count is a classic bot pattern | Low |
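As a rough illustration of the table above, the seven features could be derived from a raw profile along these lines. This is a minimal sketch: the `extract_features` helper and the input field names (`username`, `full_name`, `bio`, `posts`, `followers`, `following`) are hypothetical, since the original extraction code is not shown.

```python
def digit_ratio(text: str) -> float:
    """Fraction of characters in `text` that are digits (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isdigit() for ch in text) / len(text)


def extract_features(profile: dict) -> dict:
    """Map a raw profile (hypothetical field names) to the seven model features."""
    username = profile.get("username") or ""
    full_name = profile.get("full_name") or ""
    bio = profile.get("bio") or ""
    return {
        "ratio_numlen_username": digit_ratio(username),
        "len_fullname": len(full_name),
        "ratio_numlen_fullname": digit_ratio(full_name),
        "len_desc": len(bio),
        "num_posts": int(profile.get("posts") or 0),
        "num_followers": int(profile.get("followers") or 0),
        "num_following": int(profile.get("following") or 0),
    }


# A profile resembling the bot pattern described in the table.
example = extract_features({
    "username": "user39471856",
    "full_name": "",
    "bio": "",
    "posts": 0,
    "followers": 12,
    "following": 2400,
})
print(example)
```

For the username "user39471856", 8 of the 12 characters are digits, so `ratio_numlen_username` comes out at about 0.67.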
The end-to-end process from raw data to a deployed prediction model.
Instagram account data was collected and manually labeled as real (0) or fake (1). Each record contains profile metadata including follower counts, post counts, username structure, and bio information.
696 labeled samples

Boolean fields (profile picture, external URL) were encoded as 1/0. Categorical columns were one-hot encoded. All numeric fields were validated and missing values filled with 0. Feature columns were aligned to the model's training schema.
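The preprocessing rules described here can be sketched in plain Python. The column names, the `account_type` categorical field, and the set of values treated as "true" are assumptions made for illustration; the actual training schema is not listed in the text.

```python
import math

# Hypothetical training schema: numeric fields, boolean flags, and one-hot
# columns for a made-up categorical field `account_type`.
TRAINING_COLUMNS = [
    "num_posts", "num_followers", "num_following",
    "has_profile_pic", "has_external_url",
    "account_type_business", "account_type_personal",
]
NUMERIC_FIELDS = ["num_posts", "num_followers", "num_following"]
BOOLEAN_FIELDS = ["has_profile_pic", "has_external_url"]
CATEGORICAL_FIELDS = ["account_type"]
TRUTHY = (True, 1, "1", "true", "True", "yes")  # assumed spellings of "true"


def to_number(value) -> float:
    """Return `value` as a float, or 0.0 if it is missing, NaN, or not numeric."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return 0.0
    return 0.0 if math.isnan(number) else number


def preprocess(record: dict) -> list:
    """Turn one raw record into a row aligned to TRAINING_COLUMNS."""
    row = {}
    for field in NUMERIC_FIELDS:
        row[field] = to_number(record.get(field))
    for field in BOOLEAN_FIELDS:
        row[field] = 1 if record.get(field) in TRUTHY else 0
    for field in CATEGORICAL_FIELDS:
        value = record.get(field)
        if value is not None:
            row[f"{field}_{value}"] = 1  # one-hot: set the matching column
    # Align to the schema: unknown columns are dropped, absent ones become 0.
    return [row.get(col, 0) for col in TRAINING_COLUMNS]


row = preprocess({
    "num_posts": "12",
    "num_followers": None,      # missing value, filled with 0
    "has_profile_pic": True,
    "account_type": "business",
    "unexpected_field": 5,      # not in the schema, ignored
})
print(row)
```

Aligning every row to a fixed column list is what keeps prediction-time inputs compatible with the model even when a record lacks a field or contains a category that never appeared during training.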
Label Encoding · One-Hot Encoding

Seven features with the strongest correlation to fake account behavior were selected through exploratory data analysis. Features with near-zero variance or high multicollinearity were excluded.
Correlation Analysis · Variance Filtering

A Logistic Regression classifier was trained on 80% of the data (557 samples). The model outputs a binary classification (0=real, 1=fake) along with probability scores for both classes.
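A minimal sketch of this training step with scikit-learn follows. The data is synthetic (random numbers standing in for the 696 labeled accounts), and the `stratify`, `random_state`, and `max_iter` settings are assumptions rather than the project's documented configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 696 labeled accounts with 7 features each;
# the real dataset is not reproduced here.
rng = np.random.default_rng(0)
X = rng.normal(size=(696, 7))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=696) > 0).astype(int)

# An integer test_size reproduces the 557/139 split from the text; 20% of 696
# is 139.2, which a fractional test_size=0.2 would round up to 140.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=139, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

labels = model.predict(X_test)       # 0 = real, 1 = fake
probs = model.predict_proba(X_test)  # columns follow model.classes_, here [0, 1]

print(X_train.shape, X_test.shape)
print(labels[:5])
print(probs[:5].round(3))
```

`predict` returns the hard 0/1 label, while `predict_proba` returns one column per class, so the second column is the estimated probability that an account is fake.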
Logistic Regression · scikit-learn

The model was evaluated on the held-out test set (139 samples). Performance was measured across accuracy, precision, recall, and F1 score to ensure balanced classification performance.
80/20 Train-Test Split

The trained model was serialized with joblib and served through a Flask REST API with two endpoints: single account prediction (/predict) and bulk file prediction (/predict-file).
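The serving setup might look roughly like the sketch below, which covers only the `/predict` endpoint. The JSON request and response shapes, the `model.joblib` file name, and the stand-in model fitted on random data are all assumptions; the real API contract is not described in the text.

```python
import os
import tempfile

import joblib
import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression

FEATURES = [
    "ratio_numlen_username", "len_fullname", "ratio_numlen_fullname",
    "len_desc", "num_posts", "num_followers", "num_following",
]

# Stand-in model fitted on random data so the sketch runs on its own;
# in practice the file would hold the classifier trained on the real dataset.
rng = np.random.default_rng(0)
X = rng.random((50, len(FEATURES)))
y = (X[:, 0] > 0.5).astype(int)
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")  # hypothetical name
joblib.dump(LogisticRegression().fit(X, y), model_path)

app = Flask(__name__)
model = joblib.load(model_path)  # deserialize once at startup


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True) or {}
    # Missing features default to 0 (assumed to mirror the preprocessing step).
    row = [[float(payload.get(name, 0)) for name in FEATURES]]
    prob_real, prob_fake = model.predict_proba(row)[0]
    return jsonify({
        "prediction": int(model.predict(row)[0]),  # 0 = real, 1 = fake
        "probability_real": float(prob_real),
        "probability_fake": float(prob_fake),
    })


# Exercise the endpoint without starting a server.
with app.test_client() as client:
    response = client.post("/predict", json={"ratio_numlen_username": 0.9})
    print(response.get_json())
```

Loading the model once at import time, rather than inside the request handler, avoids re-reading the file on every call. A `/predict-file` endpoint would presumably apply the same logic to each row of an uploaded file.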
Flask · joblib · REST API

Evaluation metrics on the held-out test set of 139 accounts.
| | Predicted Real | Predicted Fake |
|---|---|---|
| Actual Real | 66 (True Negative) | 4 (False Positive) |
| Actual Fake | 3 (False Negative) | 66 (True Positive) |
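The four metrics named earlier follow directly from these counts. The arithmetic below takes fake (label 1) as the positive class, which matches the labeling convention and the false positive and false negative cells of the matrix; the resulting figures are derived here, not quoted from the original evaluation.

```python
# Counts from the confusion matrix, with fake (1) as the positive class.
tp = 66  # actual fake, predicted fake
fn = 3   # actual fake, predicted real
fp = 4   # actual real, predicted fake
tn = 66  # actual real, predicted real

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# total=139
# accuracy=0.9496 precision=0.9429 recall=0.9565 f1=0.9496
```

Precision and recall land within about 1.5 percentage points of each other, which is what a confusion matrix with nearly equal false positives (4) and false negatives (3) would suggest.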