What is Exploratory Factor Analysis (EFA)?
EFA is a data-driven statistical technique used to identify hidden patterns (factors) in a set of variables, such as survey questions. Instead of analyzing each variable separately, EFA finds relationships among them and groups similar ones together. This is useful in research where the underlying structure of data is unknown or needs to be explored.
🧩 Breaking It Down with an Example
Imagine you are conducting a survey on student well-being, and you ask students 10 different questions related to their daily life. The students rate their agreement on a scale from 1 (Strongly Disagree) to 5 (Strongly Agree).
Some of these questions might be about:
- Mental Health (e.g., happiness, stress, anxiety)
- Social Support (e.g., friendships, family support)
- Physical Health (e.g., sleep, exercise, nutrition)
At first, all these questions seem separate. But when you analyze the data, you realize some questions are strongly correlated with each other. This is where EFA helps:
🔄 How EFA Works in This Case
EFA groups related questions together into factors, such as:
1️⃣ Mental Well-Being Factor:
- “I feel generally happy with my life.”
- “I find it easy to manage academic stress.”
- “I feel anxious about meeting deadlines.” (Reverse-scored)
2️⃣ Physical Health Factor:
- “I get enough sleep regularly.”
- “I eat a balanced and nutritious diet.”
- “I feel physically active and healthy.”
3️⃣ Social Support Factor:
- “I feel supported by my friends and family.”
- “I feel connected to my peers at college.”
🚀 Why is This Useful?
- Instead of dealing with 10 separate survey questions, we now have 3 meaningful factors that summarize students’ well-being.
- This makes it easier to interpret results and use the data for further research, like predicting academic performance or mental health trends.
- If certain factors (like mental well-being) score low, universities can develop targeted interventions to improve student support.
📏 Understanding Latent Variables in Factor Analysis
Latent variables are one of the most important concepts in Exploratory Factor Analysis (EFA) and psychometric research. They allow researchers to measure abstract concepts that cannot be observed directly but can be inferred using related measurable indicators.
Alright, let’s break this down super simply with an everyday example.
What is a Latent Variable? 🤔
A latent variable is something we cannot directly see or measure, but we know it exists because it influences things we can measure.
Think of happiness. You can’t directly measure happiness like you can measure height or weight. But you can ask questions like:
✔ Do you smile often?
✔ Do you enjoy spending time with friends?
✔ Do you wake up feeling excited about the day?
If someone says “Yes” to all these questions, we can assume they are probably happy—even though we never measured happiness directly! That means happiness is a latent variable, and the survey questions are observable indicators of it.
Latent Variables in Research 🧐
In research, we often study things we cannot measure directly (like self-confidence, stress, or motivation). Instead, we create surveys or tests to measure behaviors that reflect these hidden traits.
Example:
- If we want to measure self-confidence, we might ask:
- Do you feel comfortable speaking in front of a crowd?
- Do you believe you can achieve your goals?
- Do you often doubt yourself? (Reverse scored)
The answers help us estimate a person’s self-confidence level without actually measuring confidence itself.
Why Do We Use Latent Variables?
✔ They help us understand abstract things (like intelligence or personality).
✔ They allow us to group related ideas (like different signs of happiness).
✔ They improve research accuracy by reducing error from a single question.
Super Simple Analogy: A Cake 🎂
You can’t see sugar in a cake after it’s baked, but you know it’s there because the cake is sweet.
The sweetness is like a latent variable—you can’t measure it directly, but you can tell it’s there by tasting the cake.
In the same way, you can’t directly see self-confidence, but you know it’s there based on a person’s behavior and responses!
🎯 How Does Factor Analysis Help?
Factor analysis identifies hidden structures in data by grouping highly correlated variables under a single factor (latent variable).
For example, if we conduct Exploratory Factor Analysis (EFA) on the self-confidence survey, we might find:
| Survey Question | Factor 1 (Self-Confidence) |
|---|---|
| I believe I can overcome obstacles. | 0.82 |
| I feel confident in my ability to succeed. | 0.89 |
| I feel good about myself. | 0.85 |
Since all three questions load highly on Factor 1, we conclude that this factor represents Self-Confidence.
🎭 Why Not Just Use a Single Question?
- Better Measurement: A single question may not fully capture the complexity of self-confidence.
- Reduces Error: Some students might misinterpret a single question, but using multiple indicators minimizes this risk.
- Improves Reliability: When multiple items consistently measure the same concept, the results are more stable.
📊 Examples of Latent Variables in Research
Latent variables are widely used in social sciences, business research, and psychology. Here are some examples:
| Field | Latent Variable | Measured Using… |
|---|---|---|
| Psychology | Depression | Symptoms (sleep, mood, appetite changes) |
| Education | Motivation to Learn | Questions on study habits, goal-setting, interest in subjects |
| Marketing | Brand Loyalty | Repeat purchase behavior, willingness to recommend |
| Human Resources | Job Satisfaction | Questions about work environment, salary, and growth opportunities |
💡 Key Takeaway:
Latent variables allow us to scientifically measure abstract human traits using observable data.
📌 Latent Variables vs. Observed Variables
To summarize the difference:
| Type of Variable | Definition | Example |
|---|---|---|
| Observed Variable | A variable we can directly measure | Weight, age, income |
| Latent Variable | A hidden concept that must be inferred from other data | Intelligence, anxiety, happiness |
👉 Factor analysis bridges the gap between observed data and latent constructs!
🔥 Why Are Latent Variables Important?
- They allow us to measure abstract concepts.
- Without latent variables, we couldn’t scientifically study things like happiness, confidence, or stress.
- They make survey research more reliable.
- Instead of using one question, factor analysis finds patterns across multiple related questions.
- They improve accuracy in statistical modeling.
- Latent variables filter out measurement errors, improving prediction models.
🔎 How to Identify Latent Variables Using Stata
Latent variables are hidden constructs that cannot be directly measured (like self-confidence, job satisfaction, or motivation). Instead, they are estimated using observable variables (like survey questions). In Stata, we can use Exploratory Factor Analysis (EFA) to identify latent variables by examining patterns in the data.
This guide will take you through the step-by-step process of identifying latent variables in Stata using EFA.
🛠 Step 1: Preparing the Data for Exploratory Factor Analysis (EFA)
Before conducting Exploratory Factor Analysis (EFA), we need to ensure that our dataset is structured correctly. This includes choosing appropriate survey questions, defining response scales, and understanding how latent variables might emerge from observed variables.
🔎 What Makes a Good Dataset for EFA?
A dataset suitable for EFA should have:
✅ Observed Variables (Survey Questions): These are measurable items that respondents answer.
✅ Potential Latent Variables: These are the hidden factors that might be driving the observed responses.
✅ Adequate Sample Size: At least 5–10 respondents per variable is recommended to get reliable factor extraction.
✅ Likert Scale or Continuous Data: EFA works best with ordinal (Likert-scale) or continuous variables rather than categorical ones.
✅ Correlations Among Variables: If variables aren’t related, factor analysis won’t be meaningful.
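The sample-size rule of thumb above is easy to sanity-check in code. A minimal Python sketch (the helper name `efa_sample_ok` is my own, not a Stata or library function):

```python
# Hypothetical helper illustrating the "5-10 respondents per variable" rule of thumb.
def efa_sample_ok(n_respondents: int, n_items: int, ratio: float = 5.0) -> bool:
    """Return True if the sample meets at least `ratio` respondents per item."""
    return n_respondents >= ratio * n_items

# A 10-item survey needs at least 50 respondents under the 5-per-item rule.
print(efa_sample_ok(120, 10))  # True
print(efa_sample_ok(30, 10))   # False
```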
📂 Example Dataset: College Student Well-Being
Let’s assume we are conducting a survey on college student well-being. The dataset consists of 10 questions, where students respond on a 1-5 Likert scale (1 = Strongly Disagree, 5 = Strongly Agree).
The goal is to use EFA to determine how these questions cluster into different underlying well-being factors.
📊 Survey Questions and Possible Latent Variables
| Variable | Survey Question | Possible Latent Factor |
|---|---|---|
| Q1 | I feel happy with my life. | 🧠 Mental Health |
| Q2 | I feel supported by friends and family. | 👥 Social Support |
| Q3 | I find it easy to manage academic stress. | 🧠 Mental Health |
| Q4 | I get enough sleep regularly. | 💪 Physical Health |
| Q5 | I eat a balanced and nutritious diet. | 💪 Physical Health |
| Q6 | I feel connected to my peers at college. | 👥 Social Support |
| Q7 | I have enough time for hobbies and relaxation. | 🧠 Mental Health |
| Q8 | I feel anxious about meeting deadlines. (Reverse-scored) | 🧠 Mental Health |
| Q9 | I feel physically active and healthy. | 💪 Physical Health |
| Q10 | I feel confident about my academic performance. | 🧠 Mental Health |
💡 Objective: Use Exploratory Factor Analysis (EFA) in Stata to identify if these survey questions group into meaningful latent variables like Mental Health, Physical Health, and Social Support.
📌 Understanding Response Scaling
Since we are measuring subjective experiences, we use a Likert Scale (1 to 5):
| Scale Value | Meaning |
|---|---|
| 1 | Strongly Disagree |
| 2 | Disagree |
| 3 | Neutral |
| 4 | Agree |
| 5 | Strongly Agree |
- Likert-scale data is treated as continuous in EFA because it measures the degree of agreement.
- If responses are binary (Yes/No), standard EFA is not appropriate; alternatives include factor-analyzing tetrachoric correlations or Categorical Principal Component Analysis (CATPCA).
Why Do We Reverse-Score Some Questions? 🤔
When designing surveys, some questions are negatively worded—meaning that a high score represents something negative rather than positive. But in factor analysis, we often want all high scores to mean the same thing (e.g., more well-being, more confidence, more happiness).
Let’s break it down with an example:
Imagine we have two questions on a mental health survey:
1️⃣ “I feel happy with my life.” (Positive statement)
- 1 = Strongly Disagree → Low happiness
- 5 = Strongly Agree → High happiness ✅
2️⃣ “I feel anxious about meeting deadlines.” (Negative statement)
- 1 = Strongly Disagree → Low anxiety (which is good!)
- 5 = Strongly Agree → High anxiety (which is bad!) ❌
The Problem 🤯
- In the first question, higher scores mean better well-being.
- In the second question, higher scores mean worse well-being.
- This creates confusion because the numbers don’t have the same meaning.
The Solution: Reverse-Scoring 🔄
To make sure all scores point in the same direction, we reverse-score negatively worded items. This means:
- A high score (5) on anxiety becomes a low score (1) (because high anxiety is bad).
- A low score (1) on anxiety becomes a high score (5) (because low anxiety is good).
How to Reverse-Score in Stata 📊
We use the formula:
gen Q8_rev = 6 - Q8
Here’s how it works:
- If a student answered 5 (high anxiety), the new score becomes 1 (low well-being).
- If they answered 4, the new score becomes 2.
- If they answered 3, it stays 3.
- If they answered 2, it becomes 4.
- If they answered 1, it becomes 5 (good well-being).
Final Result ✅
Now, all scores point in the same direction, making it easier to interpret results in factor analysis. Instead of mixing positive and negative meanings, all high values now consistently indicate good well-being.
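The same `6 - score` mapping is easy to verify outside Stata. A minimal Python sketch (the function name is hypothetical):

```python
def reverse_score(response: int, scale_max: int = 5) -> int:
    """Reverse a 1..scale_max Likert response: 1<->5, 2<->4, 3 stays 3."""
    return (scale_max + 1) - response

# Mirrors Stata's `gen Q8_rev = 6 - Q8` for a 1-5 scale.
print([reverse_score(r) for r in [5, 4, 3, 2, 1]])  # [1, 2, 3, 4, 5]
```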
Super Simple Analogy 🎭
Imagine a car’s speedometer:
- One car shows fast speeds with high numbers (100 mph).
- Another car shows fast speeds with low numbers (10 mph, reversed scale).
To compare them, we’d need to reverse the second car’s scale so that high numbers always mean high speed. This is exactly what we do with survey questions when we reverse-score! 🚗💨
🔎 Handling Missing Data Before Running Exploratory Factor Analysis (EFA) in Stata
Before running Exploratory Factor Analysis (EFA), we must ensure that our data meets key assumptions. One of the most important assumptions is that our dataset should not have too many missing values, as missing data can skew results and affect the accuracy of factor extraction.
✅ 1. Why is Missing Data a Problem?
Missing values in a dataset can lead to biased results in factor analysis. If too many responses are missing:
- Stata may exclude cases, reducing the sample size.
- Factor loadings may not reflect the true relationships between variables.
- The pattern of missingness may introduce systematic bias in the study.
Example Scenario: Missing Data in a Survey
Imagine you conducted a student well-being survey with 10 questions, and some students forgot to answer one or two questions.
| Student | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 4 | 5 | 3 | 5 | 5 | 2 | 3 | 3 | 3 | 5 |
| B | 4 | 3 | (missing) | 2 | 4 | 2 | 4 | 5 | 1 | 4 |
| C | 2 | 5 | 4 | 1 | (missing) | 3 | 3 | 2 | 4 | 4 |
Here, Student B did not answer Q3, and Student C did not answer Q5.
- If too many people skip a question, Stata may drop the variable entirely during analysis.
- If too many students have missing responses, Stata may drop those cases and reduce the sample size.
🛠 2. Checking for Missing Data in Stata
Before running EFA, we need to check if there are any missing values in the dataset.
🔹 (A) Use misstable summarize to Count Missing Values
misstable summarize
This command tells us:
- How many values are missing for each variable.
- The percentage of missing values in each column.
🔹 (B) Use misstable patterns to Identify Missing Data Patterns
misstable patterns
🚦 3. How Much Missing Data is Too Much?
Once we check for missing values, the next step is to decide whether to remove or impute the missing values.
General Rules for Handling Missing Data:
| % of Missing Data | What to Do? |
|---|---|
| Less than 5% | ✅ Safe to impute missing values. |
| 5% to 10% | ⚠️ Consider imputation or dropping cases if needed. |
| More than 10% | ❌ Too much missing data; reconsider using this variable in analysis. |
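The decision rules in the table can be sketched as a small Python helper (the name and return strings are my own, purely illustrative):

```python
def missing_data_advice(pct_missing: float) -> str:
    """Apply the rules of thumb: <5% impute, 5-10% borderline, >10% reconsider."""
    if pct_missing < 5:
        return "safe to impute"
    elif pct_missing <= 10:
        return "consider imputation or dropping cases"
    else:
        return "too much missing data; reconsider the variable"

print(missing_data_advice(3.0))
print(missing_data_advice(12.5))
```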
🛠 4. Handling Missing Data in Stata
There are multiple strategies to handle missing data, depending on how much is missing and why.
🔹 (A) Mean Imputation (Replace with the Average)
If a variable has less than 5% missing data, we can replace missing values with the average response.
summarize Q3
replace Q3 = r(mean) if missing(Q3)
💡 What This Does:
- If a student didn’t answer Q3, their missing value is replaced with the average of all other students’ responses to Q3.
- This keeps the sample size intact without creating major bias.
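To see what mean imputation does to the numbers, here is a small Python sketch of the same idea (`mean_impute` is a made-up name; `None` stands in for a missing response):

```python
from statistics import mean

def mean_impute(values):
    """Replace None (missing) entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

q3 = [4, 3, None, 2, 5]     # one respondent skipped Q3
print(mean_impute(q3))      # missing entry becomes (4+3+2+5)/4 = 3.5
```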
🔹 (B) Median Imputation (Replace with the Middle Value)
If the data is not normally distributed, using the median instead of the mean might be better.
egen Q3_median = median(Q3)
replace Q3 = Q3_median if missing(Q3)
💡 Why Use This?
- If data is skewed (e.g., salary, income), the median is a better estimate than the mean.
🔹 (C) Regression-Based Imputation
For more advanced cases, we can use regression imputation, where missing values are predicted based on other available data.
mi set wide
mi register imputed Q3
mi impute regress Q3 Q4 Q5, add(1)
💡 What This Does:
- Stata predicts the missing values in Q3 based on Q4 and Q5.
- This is useful when the missing data depends on other related variables.
🔹 (D) Dropping Observations with Too Many Missing Values
If a respondent didn’t answer multiple questions, we might need to remove them.
drop if missing(Q1) | missing(Q2) | missing(Q3)
💡 When to Use This?
- If a student skipped half the survey, it’s better to drop them instead of guessing too many answers.
🚀 5. Final Checklist Before Running EFA
Before proceeding with factor analysis, ensure:
✅ Missing values are handled (either imputed or removed).
✅ No more than 5-10% of responses are missing for any variable.
✅ You checked for patterns of missingness (if certain groups skipped certain questions).
✅ Variables are still relevant after cleaning (no highly incomplete variables).
1️⃣ Checking for Outliers and Normality in Stata (Expanded Guide)
Before running factor analysis, we must ensure that our variables meet key assumptions, including normality and absence of extreme outliers. If the data is not normally distributed, it could distort factor loadings and affect the interpretation of latent variables. Here’s a step-by-step expanded guide on how to check for outliers and normality in Stata.
📌 Why Do We Check for Normality & Outliers?
Factor Analysis Assumptions
- Factor analysis assumes multivariate normality (especially if you plan to use Maximum Likelihood Estimation).
- Severe outliers can distort results, leading to misleading factor structures.
- If normality is violated, we may need transformations (log, square root, etc.).
🔍 Step 1: Check Summary Statistics for Normality
To get a detailed summary of each variable, run:
summarize S1_IC1 S1_PD1 S1_UA1, detail
📊 What to Look for in the Output
| Statistic | Meaning |
|---|---|
| Mean & Median | If very different, the data might be skewed. |
| Minimum & Maximum | Look for unusually large or small values (potential outliers). |
| Skewness | Measures asymmetry (should be close to 0 for normality). |
| Kurtosis | Measures tail heaviness (should be around 3 for normality). |
📌 Rule of Thumb:
- Skewness > ±1 → Data is not normally distributed (too skewed).
- Kurtosis > 3 → Data has heavy tails (many outliers).
👉 Example Interpretation:
- If Skewness = 2.1, the data is positively skewed (long right tail).
- If Kurtosis = 5.0, the distribution has too many extreme values (heavy tails).
- If Mean ≠ Median, the data is likely skewed.
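Stata's `summarize, detail` reports these statistics for you; the Python sketch below just shows how the skewness and kurtosis numbers arise (population formulas, illustrative only):

```python
from statistics import mean, pstdev

def skewness(xs):
    """Population skewness: the average cubed z-score (0 for symmetric data)."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def kurtosis(xs):
    """Population kurtosis: the average fourth-power z-score (about 3 if normal)."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs)

print(abs(skewness([1, 2, 3, 4, 5])) < 1e-9)   # True — symmetric data
print(skewness([1, 1, 1, 1, 5]) > 1)           # True — long right tail
```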
🔍 Step 2: Generate Histograms to Visualize Distributions
histogram S1_IC1, normal
histogram S1_PD1, normal
histogram S1_UA1, normal
- If a histogram looks roughly symmetrical and bell-shaped, the data is approximately normal.
- If a histogram is skewed, consider a log transformation to improve normality.
🔍 Step 3: Check for Outliers Using Boxplots
graph box S1_IC1 S1_PD1 S1_UA1
- If extreme values appear as dots beyond the whiskers, you may have outliers.
- Outliers can distort factor analysis, so consider winsorizing or transforming the data.
Checking Correlations Between Variables in Stata (Expanded Guide)
Before running Bartlett’s Test, it’s important to check if your variables are correlated. If your variables aren’t correlated, Bartlett’s Test will likely fail, meaning factor analysis may not be appropriate.
🔍 Why Do We Check Correlations?
Factor analysis is only useful if your variables share a common underlying pattern. If variables are not related, they should not be grouped together.
- Strong correlation (above 0.3) → Indicates that variables may belong to the same latent factor.
- Weak correlation (close to 0 or negative) → Suggests variables are unrelated, and factor analysis may not work well.
📌 How to Check Correlations in Stata
To generate a correlation matrix, run:
corr S1_MF1 S1_MF2 S1_MF3
Example Output of a Correlation Matrix
After running the command, Stata will return something like this:
|  | S1_MF1 | S1_MF2 | S1_MF3 |
|---|---|---|---|
| S1_MF1 | 1.000 | 0.45 | 0.32 |
| S1_MF2 | 0.45 | 1.000 | 0.28 |
| S1_MF3 | 0.32 | 0.28 | 1.000 |
🔍 How to Interpret the Correlation Matrix
Look at the numbers above 0.3 to decide whether Bartlett’s Test will likely succeed.
✅ Good Correlation (Above 0.3) → Suitable for Factor Analysis
- S1_MF1 & S1_MF2 = 0.45 (Strong correlation ✅)
- S1_MF1 & S1_MF3 = 0.32 (Acceptable correlation ✅)
- S1_MF2 & S1_MF3 = 0.28 (Slightly weak, but close to 0.3)
📌 What this means: Since most values are above 0.3, Bartlett’s Test should pass, and we can continue with factor analysis.
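Stata's `corr` computes these values for you; the Python sketch below shows the Pearson formula behind each cell of the matrix (the sample values are made up):

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

mf1 = [1, 2, 3, 4, 5]
mf2 = [2, 2, 3, 5, 5]            # tends to move with mf1
print(pearson(mf1, mf2) > 0.3)   # True — clears the 0.3 rule of thumb
```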
❌ Weak Correlation (Below 0.3) → Factor Analysis May Fail
If the correlation matrix looks like this:
|  | S1_MF1 | S1_MF2 | S1_MF3 |
|---|---|---|---|
| S1_MF1 | 1.000 | 0.12 | 0.05 |
| S1_MF2 | 0.12 | 1.000 | 0.10 |
| S1_MF3 | 0.05 | 0.10 | 1.000 |
📌 What this means:
- The correlations are weak (< 0.3), meaning Bartlett’s Test may fail.
- This suggests that factor analysis may not work well, and you might need to remove weakly correlated variables.
🛠 What to Do if Correlations Are Weak?
🔹 Option 1: Remove Weakly Correlated Variables
If some variables are not correlated, remove them and rerun the test:
drop S1_MF3
corr S1_MF1 S1_MF2
📌 Why? Removing weak variables makes factor analysis more effective.
🔹 Option 2: Use Principal Component Analysis (PCA) Instead
If variables are weakly correlated but still important, try PCA instead of factor analysis:
pca S1_MF1 S1_MF2 S1_MF3
📌 Why? PCA works even if variables are not strongly correlated.
🔹 Option 3: Combine Weakly Correlated Variables
If two weakly correlated variables measure the same concept, combine them:
gen new_var = (S1_MF2 + S1_MF3) / 2
corr S1_MF1 new_var
📌 Why? This improves correlation strength and makes factor analysis more reliable.
🚀 Summary of Key Steps
1️⃣ Run corr S1_MF1 S1_MF2 S1_MF3 to check correlations.
2️⃣ If most values are above 0.3, Bartlett’s Test should pass ✅.
3️⃣ If correlations are weak (< 0.3), consider dropping weak variables ❌.
4️⃣ If factor analysis fails, try PCA instead.
🔍 Why is Bartlett’s Test Important?
Factor analysis only works if variables are correlated with each other.
- If variables aren’t correlated, they don’t share underlying patterns, and factor analysis will be meaningless.
- Bartlett’s Test helps confirm whether enough correlation exists to continue with factor analysis.
Imagine You’re Organizing a Party 🎉
You invite three groups of friends to your party:
1️⃣ Sports Friends (they love playing soccer and basketball)
2️⃣ Movie Buffs (they talk about films and TV shows)
3️⃣ Music Fans (they love concerts and discussing albums)
If you group random people together with nothing in common, the conversation will be awkward. But if you group similar people together (sports friends, movie buffs, and music fans), the conversation will flow naturally!
How Does This Relate to Bartlett’s Test?
Factor analysis is like organizing a party—it works only if variables belong together (are correlated).
- Bartlett’s Test checks if the variables are related (like checking if people at your party have common interests).
- If the test fails, it means your variables are too different (like forcing a sports fan to talk about classical music).
Example 1: Bartlett’s Test for Student Performance 📚
Let’s say a school collects data on student performance:
✔ Math Scores
✔ Science Scores
✔ Reading Scores
✔ Favorite Movie Genre
You want to see if these variables are related so you can use factor analysis to group them into academic ability and personal interests.
🔹 If Bartlett’s Test Passes (p < 0.05) → Math, Science, and Reading are correlated ✅ (Factor Analysis can proceed!)
🔹 If Bartlett’s Test Fails (p > 0.05) → Math and Movie Genres aren’t related ❌ (Factor Analysis won’t work!)
Example 2: Employee Job Satisfaction Survey 💼
A company surveys employees about work satisfaction:
✔ Do you enjoy your work?
✔ Do you feel valued by your boss?
✔ Do you feel secure in your job?
✔ Do you like pineapple on pizza? 🍕
You want to group similar questions together using factor analysis.
🔹 If Bartlett’s Test Passes (p < 0.05) → The first three questions are related to job satisfaction ✅
🔹 If Bartlett’s Test Fails (p > 0.05) → “Pineapple on pizza” is unrelated to job satisfaction ❌ (Remove this question from analysis!)
Super Simple Explanation
Bartlett’s Test checks if variables are related before running factor analysis.
- If p < 0.05, the test says “Yes! These variables belong together.” ✅
- If p > 0.05, the test says “No! These variables are too different.” ❌
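Under the hood, Bartlett's statistic compares the determinant of the correlation matrix with 1 (the determinant of an identity matrix, i.e. no correlation at all). A Python sketch of the standard formula χ² = −(n − 1 − (2k + 5)/6)·ln|R|, shown here for k = 3 variables; 7.815 is the χ² critical value for df = 3 at α = 0.05:

```python
import math

def det3(R):
    """Determinant of a 3x3 matrix (cofactor expansion along the first row)."""
    (a, b, c), (d, e, f), (g, h, i) = R
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def bartlett_sphericity(R, n, k=3):
    """Bartlett's test statistic for a k x k correlation matrix R, sample size n."""
    return -(n - 1 - (2 * k + 5) / 6) * math.log(det3(R))

# Three moderately correlated items (r = 0.5 everywhere), 100 respondents
R = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
chi2 = bartlett_sphericity(R, 100)
print(chi2 > 7.815)   # True — well past the df=3 cutoff, so proceed with EFA
```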
1️⃣ What Does the KMO Test Measure?
The KMO test checks if your dataset has enough common variance to perform factor analysis by comparing:
- The sum of squared correlations (shared variance) among variables.
- The sum of squared partial correlations (unique variance).
📌 Interpretation of KMO Score:
- High KMO (≥ 0.7) → The dataset is well-suited for factor analysis.
- Low KMO (≤ 0.6) → Factor analysis may not work well because variables don’t share enough variance.
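That ratio can be computed by hand for a small case. The Python sketch below derives the partial correlations from the inverse of a 3x3 correlation matrix and forms KMO = Σr² / (Σr² + Σp²); this is illustrative only, since in practice Stata computes it for you:

```python
def inv3(R):
    """Inverse of a 3x3 matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = R
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[x / det for x in row] for row in adj]

def kmo(R):
    """Overall KMO: squared correlations vs. squared correlations + squared partials."""
    P = inv3(R)
    r2 = p2 = 0.0
    for i in range(3):
        for j in range(3):
            if i != j:
                r2 += R[i][j] ** 2
                partial = -P[i][j] / (P[i][i] * P[j][j]) ** 0.5
                p2 += partial ** 2
    return r2 / (r2 + p2)

R = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
print(round(kmo(R), 2))   # 0.69 — "mediocre" by the usual thresholds
```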
2️⃣ How to Run the KMO Test in Stata
Stata does not have a standalone kmo command. Instead, first run Principal Component Analysis (PCA), then call the estat kmo postestimation command:
pca var1 var2 var3 var4 var5
estat kmo
📌 Explanation of Commands
- pca var1 var2 var3 var4 var5 → Runs Principal Component Analysis (PCA), which organizes variables based on shared variance.
- estat kmo → Calculates the KMO statistic for overall sampling adequacy.
3️⃣ How to Interpret KMO Results
After running estat kmo, Stata will display a KMO statistic between 0 and 1.
| KMO Value | Interpretation |
|---|---|
| ≥ 0.90 | Excellent (Ideal for factor analysis) ✅ |
| 0.80 – 0.89 | Great (Very good for factor analysis) ✅ |
| 0.70 – 0.79 | Good (Acceptable for factor analysis) ✅ |
| 0.60 – 0.69 | Mediocre (May need variable removal) ⚠️ |
| ≤ 0.59 | Unacceptable (Factor analysis is not appropriate) ❌ |
📌 If KMO is below 0.6, consider:
- Removing weakly correlated variables (check the correlation matrix).
- Running the test again after dropping problematic variables.
4️⃣ Example Interpretation of KMO Output
Let’s say after running estat kmo, Stata gives the following output:
KMO measure of sampling adequacy = 0.72
✔ Interpretation:
- Since 0.72 > 0.70, your dataset is suitable for factor analysis.
- You don’t need to remove variables, and you can proceed with Exploratory Factor Analysis (EFA).
5️⃣ What to Do If KMO Is Too Low?
If KMO < 0.6, your dataset is not ideal for factor analysis. Here’s how to fix it:
Option 1: Remove Weak Variables
If some variables do not correlate well, remove them and rerun the test:
drop var3   // remove the weakly correlated variable
pca var1 var2 var4 var5
estat kmo
📌 Goal: Removing low-correlation variables should increase the KMO value.
Option 2: Check Correlations
Run:
corr var1 var2 var3 var4 var5
- If a variable is weakly correlated (< 0.3) with the others, remove it.
- If the remaining variables are adequately correlated (> 0.3), rerun pca and estat kmo.
🚀 Next Steps
1️⃣ If KMO > 0.7 (Good to Excellent)
✅ Proceed with Bartlett’s Test and factor extraction. (Bartlett’s test of sphericity is not built into official Stata; the community-contributed factortest command from SSC reports it alongside the KMO statistic.)
ssc install factortest
factortest var1 var2 var3 var4 var5
factor var1 var2 var3 var4 var5
2️⃣ If KMO < 0.6 (Poor Sampling Adequacy)
❌ Fix the issue by:
- Dropping weakly correlated variables (drop var3)
- Rechecking the correlation matrix (corr var1 var2 var3 var4 var5)
- Running the test again (pca followed by estat kmo)
Final Summary
1️⃣ KMO checks if factor analysis is appropriate.
2️⃣ Run pca followed by estat kmo in Stata.
3️⃣ If KMO > 0.7, proceed with factor analysis. ✅
4️⃣ If KMO < 0.6, remove weak variables and rerun. ❌
✅ 1️⃣ What Does EFA Do?
When you run EFA in Stata using the factor command, it performs the following tasks:
✔ Identifies the number of factors in your data
✔ Shows how strongly each variable loads onto each factor
✔ Reduces many observed variables into fewer latent factors
✔ Helps determine if variables should be grouped together in further analysis
Example Use Cases of EFA:
📌 In Psychology: Grouping personality traits into broader dimensions (e.g., “Openness,” “Extraversion”).
📌 In Business: Identifying key customer satisfaction drivers from survey data.
📌 In Education: Understanding how different test questions contribute to subjects like “Math Ability” or “Reading Comprehension.”
🛠 2️⃣ How to Run EFA in Stata
To perform factor analysis on selected variables, run:
factor var1 var2 var3 var4 var5
What Happens When You Run This?
- Stata will compute factor loadings, which tell us which variables strongly associate with each factor.
- It will also show eigenvalues, which help determine how many factors to keep.
📊 3️⃣ Interpreting Factor Analysis Output
Once you run factor, you’ll see a table with factor loadings for each variable.
📌 What Are Factor Loadings?
Factor loadings indicate how strongly a variable is associated with a particular factor.
- Values close to ±1.0 → The variable strongly belongs to the factor.
- Values close to 0 → The variable does not belong to the factor.
Example Factor Loading Output
| Variable | Factor 1 | Factor 2 |
|---|---|---|
| Math_Score | 0.85 | 0.12 |
| Science_Score | 0.78 | 0.18 |
| Reading_Score | 0.30 | 0.75 |
| Writing_Score | 0.22 | 0.80 |
📌 How to Interpret This Table
- Math_Score & Science_Score load highly onto Factor 1, meaning they likely represent a “STEM Ability” factor.
- Reading_Score & Writing_Score load onto Factor 2, meaning they likely represent a “Literacy Ability” factor.
- A variable should load at least 0.4 onto one factor to be considered meaningful.
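The "assign each variable to its strongest factor" reading of such a table can be sketched in Python (a made-up helper for illustration; real analyses also watch for cross-loadings):

```python
def assign_to_factors(loadings, threshold=0.4):
    """Give each variable the factor with its largest absolute loading,
    provided that loading reaches the 0.4 rule of thumb."""
    assignment = {}
    for var, row in loadings.items():
        best = max(range(len(row)), key=lambda j: abs(row[j]))
        if abs(row[best]) >= threshold:
            assignment[var] = best + 1   # factors numbered from 1
    return assignment

loadings = {
    "Math_Score":    [0.85, 0.12],
    "Science_Score": [0.78, 0.18],
    "Reading_Score": [0.30, 0.75],
    "Writing_Score": [0.22, 0.80],
}
print(assign_to_factors(loadings))
# {'Math_Score': 1, 'Science_Score': 1, 'Reading_Score': 2, 'Writing_Score': 2}
```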
📉 4️⃣ Determining the Number of Factors
(A) Eigenvalues (Kaiser’s Rule)
After running factor, check the eigenvalues to see how many factors to keep:
factor var1 var2 var3 var4 var5, mineigen(1)
📌 Rule: Keep factors with Eigenvalues > 1.
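Kaiser's rule itself is a one-liner; a Python sketch with hypothetical eigenvalues:

```python
def kaiser_keep(eigenvalues):
    """Kaiser's rule: retain factors whose eigenvalue exceeds 1."""
    return [ev for ev in eigenvalues if ev > 1]

# Hypothetical eigenvalues from a 5-variable analysis: keep the first two factors.
print(kaiser_keep([2.8, 1.4, 0.4, 0.25, 0.15]))  # [2.8, 1.4]
```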
(B) Scree Plot (Elbow Rule)
To visualize the number of factors:
screeplot
📌 Look for the “elbow point” where the curve levels off → Keep factors before the elbow.
🔄 5️⃣ Improve Factor Structure Using Rotation
Rotation helps make factors more interpretable by adjusting how variables load onto factors.
(A) Varimax Rotation (If Factors Are Uncorrelated)
If you believe your factors do not overlap, use varimax rotation:
```stata
rotate, varimax
```
📌 What this does: Ensures each variable loads strongly onto only one factor, making the interpretation clearer.
(B) Promax Rotation (If Factors Are Correlated)
If factors are expected to be related, use promax rotation:
```stata
rotate, promax
```
📌 Example: In psychology, “Extraversion” and “Agreeableness” might be related, so Promax rotation is more appropriate.
📁 6️⃣ Save Factor Scores for Further Analysis
If you plan to use the factors in regression, clustering, or machine learning, generate factor scores:
```stata
predict factor1 factor2 factor3
```
📌 What this does:
- Creates new variables (`factor1`, `factor2`, etc.) in your dataset that represent the underlying latent factors.
- These can be used as independent variables in regression:
```stata
regress job_satisfaction factor1 factor2
```
📌 Example: Predicting Job Satisfaction based on extracted work environment factors.
🚀 Final Step: Summary of Running EFA
Step | Command | Purpose |
---|---|---|
1️⃣ Run Factor Analysis | factor var1 var2 var3 | Extracts factor loadings |
2️⃣ Check Eigenvalues | factor ..., mineigen(1) | Selects the number of factors |
3️⃣ Create Scree Plot | screeplot | Visualizes the factor cutoff |
4️⃣ Rotate Factors | rotate, varimax or rotate, promax | Improves factor clarity |
5️⃣ Save Factor Scores | predict factor1 factor2 | Generates new factor variables |
Determine the Number of Factors in Stata (Expanded Guide)
After running Exploratory Factor Analysis (EFA), the next critical step is determining how many factors to keep. If we keep too few factors, we may lose important information. If we keep too many factors, the model may become too complex.
🔍 Why is This Important?
When analyzing survey data, psychological tests, or business metrics, we want to reduce many observed variables into a few underlying dimensions (factors). This step ensures we only keep meaningful factors while removing unnecessary ones.
✅ 1️⃣ Method 1: Using Eigenvalues (Kaiser’s Criterion)
Eigenvalues measure how much variance a factor explains in the data. The larger the eigenvalue, the more variance that factor explains.
📌 Kaiser’s Rule (Mineigen Rule):
- Keep only factors with Eigenvalues > 1.0
- Drop factors with Eigenvalues < 1.0 (they explain less variance than a single variable)
How to Run This in Stata
```stata
factor var1 var2 var3 var4 var5, mineigen(1)
```
📌 What Happens?
- Stata will list factors and their eigenvalues.
- You should keep only factors with eigenvalues greater than 1.0.
Example Output (Eigenvalues Table)
Factor | Eigenvalue | % Variance Explained |
---|---|---|
Factor 1 | 3.45 | 34.5% |
Factor 2 | 2.10 | 21.0% |
Factor 3 | 1.20 | 12.0% |
Factor 4 | 0.75 | 7.5% |
Factor 5 | 0.55 | 5.5% |
✔ Interpretation:
- Factor 1, Factor 2, and Factor 3 have eigenvalues > 1.0, so we keep them.
- Factor 4 and Factor 5 have eigenvalues < 1.0, so we discard them.
📌 Final Decision: Retain 3 factors because they explain enough variance.
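Kaiser’s rule is a one-line filter; this Python sketch applies it to the example eigenvalues from the table above:

```python
# Kaiser's rule: retain only factors whose eigenvalue exceeds 1.0.
# Eigenvalues are the example values from the table above.
eigenvalues = [3.45, 2.10, 1.20, 0.75, 0.55]

retained = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]
print(f"Keep factors: {retained}")  # factors 1, 2 and 3
```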
📈 2️⃣ Method 2: Using Scree Plot (Elbow Rule)
A Scree Plot visually represents eigenvalues to help determine how many factors to keep.
How to Run the Scree Plot in Stata
```stata
screeplot
```
📌 What This Does:
- Stata will generate a graph of eigenvalues.
- The graph plots factors on the x-axis and their eigenvalues on the y-axis.
How to Interpret the Scree Plot
- Look for the “elbow” point where the eigenvalues drop significantly.
- Keep the factors before the elbow, as they explain the most variance.
- Factors after the elbow contribute little variance and should be removed.
📌 Example Scree Plot Interpretation:
```
Eigenvalue
│
│ *
│ *  *
│ *  *
│ *  *  *
│ *  *  *  *      <-- Elbow (Keep first 3 factors)
│ *  *  *  *  *
│------------------------------------
   1  2  3  4  5   (Factors)
```
✔ Interpretation:
- The elbow occurs at Factor 3, meaning Factors 1, 2, and 3 should be kept.
- Factors beyond the elbow contribute little and should be discarded.
🔄 3️⃣ Method 3: Parallel Analysis (More Advanced)
Parallel analysis compares your eigenvalues against randomly generated data to determine how many factors are actually meaningful.
How to Run Parallel Analysis in Stata
First, install the `paran` command if it’s not already installed:
```stata
ssc install paran
```
Then, run:
```stata
paran var1 var2 var3 var4 var5, iterations(1000) centile(95)
```
📌 What This Does:
- Stata compares your real eigenvalues against random eigenvalues.
- If a factor’s eigenvalue is greater than the randomly generated eigenvalue, keep it.
- If a factor’s eigenvalue is lower than the random data, discard it.
✔ Parallel analysis is one of the most accurate methods to determine true factors.
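The decision rule itself is easy to express. This Python sketch compares the example eigenvalues against a set of 95th-percentile random eigenvalues — the random values here are made up for illustration; in practice `paran` simulates them:

```python
# Parallel analysis decision rule: keep a factor only if its observed
# eigenvalue exceeds the corresponding eigenvalue from random data.
observed  = [3.45, 2.10, 1.20, 0.75, 0.55]
random_95 = [1.35, 1.18, 1.05, 0.95, 0.85]  # hypothetical simulated values

keep = [i + 1 for i, (obs, rnd) in enumerate(zip(observed, random_95))
        if obs > rnd]
print(f"Factors to retain: {keep}")
```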
🎯 Final Decision: How Many Factors to Keep?
Summary of Rules
Method | How Many Factors to Keep? |
---|---|
Eigenvalues (Kaiser’s Rule) | Keep factors with eigenvalues > 1.0 |
Scree Plot (Elbow Rule) | Keep factors before the elbow |
Parallel Analysis | Keep factors above random eigenvalues |
Example Final Decision
Factor | Eigenvalue | Scree Plot Position | Parallel Analysis | Final Decision |
---|---|---|---|---|
Factor 1 | 3.45 | Before Elbow ✅ | Above Random Data ✅ | Keep ✅ |
Factor 2 | 2.10 | Before Elbow ✅ | Above Random Data ✅ | Keep ✅ |
Factor 3 | 1.20 | Before Elbow ✅ | Above Random Data ✅ | Keep ✅ |
Factor 4 | 0.75 | After Elbow ❌ | Below Random Data ❌ | Drop ❌ |
Factor 5 | 0.55 | After Elbow ❌ | Below Random Data ❌ | Drop ❌ |
📌 Final Model: Keep 3 factors and drop the rest.
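Combining the criteria into a single decision is straightforward. This Python sketch keeps a factor only when both Kaiser’s rule and parallel analysis agree (the random-data benchmarks are hypothetical values for illustration):

```python
# Keep a factor only if it passes Kaiser's rule (eigenvalue > 1.0)
# AND parallel analysis (eigenvalue above the random-data benchmark).
observed  = [3.45, 2.10, 1.20, 0.75, 0.55]
random_95 = [1.35, 1.18, 1.05, 0.95, 0.85]  # hypothetical simulated values

decisions = ["Keep" if obs > 1.0 and obs > rnd else "Drop"
             for obs, rnd in zip(observed, random_95)]
print(decisions)  # first three factors kept, last two dropped
```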
🚀 Next Steps in Stata
1️⃣ If You’ve Identified the Number of Factors: Rotate Factors
Once you decide how many factors to keep, apply rotation to improve factor clarity.
```stata
rotate, varimax
```
📌 Varimax rotation helps ensure each variable loads strongly onto only one factor.
If you expect factors to be correlated, use Promax rotation instead:
```stata
rotate, promax
```
2️⃣ Save Factor Scores for Further Analysis
If you want to use these factors in regression or clustering:
```stata
predict factor1 factor2 factor3
```
📌 Creates new variables (`factor1`, `factor2`, etc.) representing each factor.
🚀 Final Summary: Stata Commands for Determining Factors
Step | Command | Purpose |
---|---|---|
Check Eigenvalues | factor var1 var2 var3, mineigen(1) | Identify factors with eigenvalues > 1 |
Generate Scree Plot | screeplot | Identify the “elbow” point |
Run Parallel Analysis | paran var1 var2 var3, iterations(1000) centile(95) | Compare real vs. random factors |
Rotate Factors | rotate, varimax or rotate, promax | Improve factor structure |
Save Factor Scores | predict factor1 factor2 | Create new variables for each factor |
Step 5: Improving Factor Structure Using Rotation (Expanded Guide)
After extracting factors in Exploratory Factor Analysis (EFA), the next step is rotation, which makes it easier to interpret the factor loadings. Without rotation, factor loadings can be spread across multiple factors, making it hard to determine which variables truly belong together.
1️⃣ Why Do We Need Rotation in Factor Analysis?
Factor rotation adjusts the factor structure so that:
- Each variable loads strongly on only one factor
- Cross-loadings are minimized, making factor interpretation clearer
- The results become more stable and replicable
📌 Example: Before Rotation (Difficult to Interpret)
Variable | Factor 1 | Factor 2 |
---|---|---|
Math Skill | 0.50 | 0.40 |
Science Skill | 0.55 | 0.30 |
Reading Skill | 0.45 | 0.50 |
Writing Skill | 0.40 | 0.60 |
📌 Problem:
- Each variable loads on multiple factors, making it unclear which factor represents what.
- Math & Science should belong together, and Reading & Writing should belong together, but the results are messy.
2️⃣ Types of Factor Rotation in Stata
Factor rotation can be orthogonal (Varimax) or oblique (Promax) depending on whether factors are related or not.
✅ (A) Varimax Rotation (If Factors Are Uncorrelated)
📌 Use Varimax if you assume that factors are independent (not related to each other).
How to Apply Varimax Rotation in Stata
```stata
rotate, varimax
```
✔ What Varimax Does:
- Maximizes the variance of factor loadings, making large values larger and small values smaller.
- Ensures each variable loads on only one factor, improving interpretation.
- Assumes that factors are not correlated (e.g., in a study on cognitive abilities, math and reading might be unrelated).
✅ (B) Promax Rotation (If Factors Are Correlated)
📌 Use Promax if you expect that some factors might be related.
How to Apply Promax Rotation in Stata
```stata
rotate, promax
```
✔ What Promax Does:
- Allows factors to be correlated (e.g., “Math Ability” and “Science Ability” might be related).
- Improves interpretability without forcing independence.
- Better for psychological and behavioral research, where factors often have some relationship.
3️⃣ Example: How Rotation Improves Factor Loadings
📌 Before Rotation (Confusing Interpretation)
Variable | Factor 1 | Factor 2 |
---|---|---|
Math Skill | 0.50 | 0.40 |
Science Skill | 0.55 | 0.30 |
Reading Skill | 0.45 | 0.50 |
Writing Skill | 0.40 | 0.60 |
📌 After Varimax Rotation (Clearer Interpretation)
Variable | Factor 1 (STEM Skills) | Factor 2 (Language Skills) |
---|---|---|
Math Skill | 0.80 | 0.10 |
Science Skill | 0.75 | 0.15 |
Reading Skill | 0.12 | 0.85 |
Writing Skill | 0.10 | 0.82 |
✔ Now, each variable loads on only one factor, making interpretation much clearer.
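A quick way to see what Varimax optimizes is to compute its criterion — the summed variance of the squared loadings in each factor column — for both tables. This Python sketch (pure arithmetic on the example loadings above) shows the rotated solution scores higher, i.e. its loadings are pushed toward 0 or ±1:

```python
# Varimax criterion: sum over factors of the variance of squared loadings.
# Higher values mean a "simpler" structure (loadings near 0 or near ±1).
def varimax_criterion(loadings):
    n = len(loadings)                  # number of variables (rows)
    total = 0.0
    for j in range(len(loadings[0])):  # one column per factor
        sq = [row[j] ** 2 for row in loadings]
        mean = sum(sq) / n
        total += sum((s - mean) ** 2 for s in sq) / n
    return total

before = [[0.50, 0.40], [0.55, 0.30], [0.45, 0.50], [0.40, 0.60]]
after  = [[0.80, 0.10], [0.75, 0.15], [0.12, 0.85], [0.10, 0.82]]

print(varimax_criterion(before), varimax_criterion(after))
# the rotated (after) table yields the larger criterion value
```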
4️⃣ When to Use Varimax vs. Promax?
Scenario | Best Rotation Method | Reason |
---|---|---|
Factors are unrelated (e.g., Math vs. Reading) | Varimax | Makes loadings clearer, forces independence |
Factors are expected to be correlated (e.g., Extraversion & Agreeableness) | Promax | Allows overlap, better for psychology research |
Exploratory research with unknown relationships | Start with Varimax, then test Promax | Helps determine if correlation exists |
5️⃣ Next Steps After Rotation
(A) Check the Rotated Factor Loadings
After applying rotation, inspect the factor loadings to see which variables belong together:
```stata
factor var1 var2 var3 var4 var5
rotate, varimax
```
📌 Keep only variables that load strongly (>0.4) on a single factor.
(B) Save Rotated Factor Scores for Further Analysis
If you want to use the factors in regression analysis or clustering:
```stata
predict factor1 factor2
```
📌 This creates new variables (`factor1`, `factor2`, etc.) that represent the underlying factors.
🚀 Final Summary: How to Use Rotation in Stata
Step | Command | Purpose |
---|---|---|
Run Factor Analysis | factor var1 var2 var3 | Extracts factor loadings |
Apply Varimax Rotation | rotate, varimax | Ensures factors remain uncorrelated |
Apply Promax Rotation | rotate, promax | Allows factors to be correlated |
Check Rotated Loadings | estat rotatecompare | Compare rotated and unrotated loadings |
Save Factor Scores | predict factor1 factor2 | Creates new factor variables for further analysis |