WGU Capstone
Scenario
An educational agency intends to develop an application that predicts graduate school admission chances using a deep learning process. By estimating these chances for its students, the agency enables them to set more informed goals. The application should take inputs such as University Rating, Letter of Recommendation Strength, and Research Experience to predict the likelihood of admission.
However, the agency encountered a challenge: crucial columns, namely GRE Scores, Undergraduate GPA, and TOEFL Scores, were absent from its dataset. Fortunately, the agency still possesses old survey data, namely Average Daily Study Hours, English Self-Confidence Scale, Address (proximity), and Age, that may correlate with admission chances. By leveraging a State Knowledge Graph (SKG) of the data, a Large Language Model (LLM) will recommend new columns that are likely to align with admission chances. This addition is expected to enhance the deep learning process.
During the evaluation phase, we will assess how the deep learning predictions perform with the newly suggested columns compared to how they perform without them.
Dataset
Content
The dataset contains several parameters that are considered important when applying for Master's programs. These parameters include:

- GRE Scores (out of 340)
- TOEFL Scores (out of 120)
- University Rating (out of 5)
- Statement of Purpose and Letter of Recommendation Strength (out of 5)
- Undergraduate GPA (out of 10)
- Research Experience (either 0 or 1)
- Chance of Admit (ranging from 0 to 1)
Source
https://www.kaggle.com/datasets/mohansacharya/graduate-admissions
Acharya, M. S., Armaan, A., & Antony, A. S. (2019, February). A comparison of regression models for prediction of graduate admissions. In 2019 international conference on computational intelligence in data science (ICCIDS) (pp. 1-5). IEEE.
Deep learning trial 1: training Artificial Neural Network (ANN) with missing data
The scenario dictates that the input data for deep learning are missing three important columns: GRE Scores, Undergraduate GPA, and TOEFL Scores.
Before we enhance the data by leveraging the SKG, let's see how the admission prediction model performs with these columns missing, so that we have a baseline for comparison.
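Below is a minimal sketch of what Trial 1 could look like, assuming the Kaggle CSV filename (Admission_Predict_Ver1.1.csv) and a small Keras regression network; the actual notebook's architecture, column handling, and hyperparameters may differ.

```python
# Minimal sketch of Trial 1: train an ANN without GRE Score, TOEFL Score, and CGPA.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

df = pd.read_csv("Admission_Predict_Ver1.1.csv")
df.columns = df.columns.str.strip()  # some versions of the CSV have trailing spaces in headers

# Drop the identifier and the three "missing" columns from the scenario.
df = df.drop(columns=["Serial No.", "GRE Score", "TOEFL Score", "CGPA"])

X = df.drop(columns=["Chance of Admit"]).values
y = df["Chance of Admit"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# A small fully connected network for regressing the admission chance.
model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(X_train, y_train, epochs=100, batch_size=16, validation_split=0.1, verbose=0)
```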
ChatGPT analyzes SKG: expanding connections
Now that we have seen how the prediction model performs with the missing data, the next step is to add data to the dataset that may help the model perform better.
According to the scenario, in addition to the original dataset, there are old surveys that include information about Average Daily Study Hours, English Self-Confidence Scale, Address (proximity), and Age. Each student answered these surveys.
Let's see the current SKG, then ask ChatGPT to expand the connections in that SKG.
SKG with the missing data
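As a minimal sketch, assuming the SKG is represented as a networkx graph with one object node ("Student") connected to its known states, the graph before expansion could be built like this:

```python
# Sketch of the SKG before expansion: one object node plus its known state nodes.
import networkx as nx
import matplotlib.pyplot as plt

skg = nx.Graph()
skg.add_node("Student", kind="object")

states = [
    "University Rating",
    "Letter of Recommendation Strength",
    "Statement of Purpose",
    "Research Experience",
    "Chance of Admit",
]
for state in states:
    skg.add_node(state, kind="state")
    skg.add_edge("Student", state)

nx.draw(skg, with_labels=True, node_color="lightblue", font_size=8)
plt.show()
```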
Result:
Ask ChatGPT
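A hypothetical sketch of how the SKG could be serialized into a prompt and sent to ChatGPT via the OpenAI Python client; the model name, prompt wording, and client usage are assumptions, not the notebook's actual code.

```python
# Hypothetical prompt construction; adjust the model name and wording as needed.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompt = (
    "Here is a state knowledge graph (SKG) for a graduate-admission dataset.\n"
    "Object: Student. Known states: University Rating, Letter of Recommendation Strength, "
    "Statement of Purpose, Research Experience, Chance of Admit.\n"
    "Candidate survey states: Average Daily Study Hours, English Self-Confidence Scale, "
    "Address (proximity), Age.\n"
    "Which candidate states should be connected to Chance of Admit, and how strongly?"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```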
ChatGPT answer
Try it yourself:
1. In the following code, remove the lines of ChatGPT's previous response between "=== what LLM suggests Start ===" and "=== what LLM suggests End ===".
2. Copy and paste a new response you generate (ChatGPT often gives a different answer each time) in its place.
SKG with the expanded connections
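Continuing the earlier networkx sketch, the survey states suggested by ChatGPT could be attached to the graph as new state nodes; the edge-strength labels below are illustrative assumptions, not values taken from the model's answer.

```python
# Sketch of the expanded SKG: base states plus the LLM-suggested survey states.
import networkx as nx
import matplotlib.pyplot as plt

skg = nx.Graph()
known = ["University Rating", "Letter of Recommendation Strength",
         "Statement of Purpose", "Research Experience", "Chance of Admit"]
skg.add_edges_from(("Student", s) for s in known)

# Survey states flagged as most likely to matter (strength labels are illustrative).
suggested = {
    "Average Daily Study Hours": "strong",
    "English Self-Confidence Scale": "strong",
    "Address (proximity)": "weak",
    "Age": "weak",
}
for state, strength in suggested.items():
    skg.add_edge("Student", state, strength=strength)

nx.draw(skg, with_labels=True, node_color="lightgreen", font_size=8)
plt.show()
```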
Result:
Deep learning trial 2: training ANN with a reinforced dataset
Now that the expanded SKG has suggested that "Average Daily Study Hours" and "English Self-Confidence Scale" (among other survey data) may have the strongest correlations with admission chances, let's add those variables to the dataset.
Since the scenario's premise that "there was survey data that were answered by each student" is hypothetical, we need to synthesize the new "Average Daily Study Hours" and "English Self-Confidence Scale" columns. They are derived from the "missing" data: "Average Daily Study Hours" is generated with a Pearson product-moment correlation coefficient of 0.94 with Undergraduate GPA and 0.79 with GRE Scores, while "English Self-Confidence Scale" is generated with a correlation coefficient of 0.96 with TOEFL Scores. These coefficients are arbitrary values, but deriving the new columns from the "missing" data is natural: if you study more, your grades will likely improve, and if you are confident in English, you will likely achieve a better English test score.
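As shown in the sketch below, one simple way to synthesize a column with an approximate target Pearson correlation r to an existing column is to standardize the existing column and mix it with independent Gaussian noise weighted by sqrt(1 - r^2). The mixing weights used to combine the CGPA and GRE components are illustrative assumptions, and the actual notebook may generate the columns differently; `df` is the reduced dataset from the Trial 1 sketch.

```python
# Sketch: synthesize survey columns with approximate target correlations to the "missing" data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def correlated_with(base: pd.Series, r: float) -> np.ndarray:
    """Return a standardized series whose Pearson correlation with `base` is approximately r."""
    z = (base - base.mean()) / base.std()
    noise = rng.standard_normal(len(base))
    return r * z.to_numpy() + np.sqrt(1 - r**2) * noise

full = pd.read_csv("Admission_Predict_Ver1.1.csv")  # version that still has CGPA, GRE, TOEFL
full.columns = full.columns.str.strip()

# "Average Daily Study Hours": mostly driven by CGPA (r ~ 0.94), partly by GRE Score (r ~ 0.79).
study_hours = (0.7 * correlated_with(full["CGPA"], 0.94)
               + 0.3 * correlated_with(full["GRE Score"], 0.79))

# "English Self-Confidence Scale": driven by TOEFL Score (r ~ 0.96).
confidence = correlated_with(full["TOEFL Score"], 0.96)

# Add the synthesized columns to the reduced dataset used in Trial 1.
df["Average Daily Study Hours"] = study_hours
df["English Self-Confidence Scale"] = confidence
```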
Evaluation: Deep learning trial 1 vs trial 2
| Deep Learning Trial | MAE | MSE | RMSE | Explained Variance Score |
|---|---|---|---|---|
| Trial 1 | 0.0620 | 0.0072 | 0.0849 | 0.6707 |
| Trial 2 | 0.0480 | 0.0045 | 0.0674 | 0.7863 |
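For reference, the four metrics in the table can be computed with scikit-learn as in the sketch below, using the trained model and held-out test split from the earlier trial sketches.

```python
# Sketch: compute MAE, MSE, RMSE, and explained variance for a trained model.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, explained_variance_score

y_pred = model.predict(X_test).ravel()

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
evs = explained_variance_score(y_test, y_pred)

print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}  Explained variance={evs:.4f}")
```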
Conclusion
We achieved a better prediction model by leveraging the SKG. In this case, the SKG was quite simple, consisting of only one object node with several states, so we could easily have expanded its connections ourselves.
In reality, however, many things are interconnected in complex ways, and the right solutions don't always surface immediately. For instance, if you're unable to use your car for an appointment, an SKG could suggest alternative transportation options, such as a taxi or a flight, or even postponing the appointment.
An SKG is a useful structure that helps identify meaningful connections in the environment (see Advantage of SKG 1: expanding connections).
The application: predict your admission chance
The application is based on the trained model from Deep Learning Trial 2.
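A hypothetical sketch of the prediction step; the saved-model filename, feature order, and example inputs are assumptions rather than the actual application code.

```python
# Sketch: predict an admission chance for one applicant with the Trial 2 model.
import numpy as np
from tensorflow import keras

model = keras.models.load_model("admission_trial2.keras")  # hypothetical saved-model filename

# One applicant's inputs, in the same feature order used during Trial 2 training.
# In practice, the same StandardScaler fitted at training time would be applied first.
applicant = np.array([[
    4.0,  # University Rating (1-5)
    4.5,  # Statement of Purpose strength (1-5)
    4.0,  # Letter of Recommendation Strength (1-5)
    1.0,  # Research Experience (0 or 1)
    1.2,  # Average Daily Study Hours (standardized synthetic value)
    0.8,  # English Self-Confidence Scale (standardized synthetic value)
]])

chance = float(model.predict(applicant)[0, 0])
print(f"Estimated chance of admission: {chance:.2%}")
```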
Result: