Referendum

Analysis of the socio-cultural influence on the voting result of a referendum

Introduction

In this project, we were given a dataset named ‘Referendum.csv’, which contains the results by municipality of a mock referendum held in France in 2013, and three links to the website of INSEE, the French National Institute of Statistics and Economic Studies.

We want to answer the following question: What can we show that is instructive about this data?

For the complete notebook of this project, please visit my GitHub.

Approach

Data set #1: Data related to a fictional Referendum in France in 2013, including department code, department label, city, number of registered voters, abstention rate, choice A count, choice B count, etc. (Referendum.csv)

Data set #2: Evolution and structure of the population in France in 2013, including population breakdown by gender and age groups, professions, etc. (https://www.insee.fr/fr/statistiques/2044751#dictionnaire)

Data set #3: Income and poverty of households in 2013, including poverty rate, fiscal households, median income level, etc. (https://www.insee.fr/fr/statistiques/2388572)

Data set #4 (additional): Education and training data, including the number of people enrolled by age group, educational attainment level, etc. (https://www.insee.fr/fr/statistiques/2386698)

Data set #5 (additional): France department boundaries for mapping purposes (GitHub repository)

The questions we will try to answer are:

How are election votes distributed in France?
Are socio-cultural factors likely to influence a vote? If yes, which ones?
In the context of a predictive model, which factor(s) carry the most weight in the prediction?

DataFrame creation:

Merge of all dataframes on the ‘Department Code’
Aggregation of data at the department level instead of municipalities
Removal of overseas territories
Addition of certain columns, including percentages (professions, education) relative to the total population per department
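The preparation steps above can be sketched roughly as follows. This is a minimal sketch on toy data, not the notebook's actual code: the column names (`Inscrits`, `Population`, `Sans diplôme`) and the rule used to spot overseas territories (codes starting with `Z`) are assumptions made for illustration.

```python
import pandas as pd

# Toy municipality-level results (column names are assumptions, not the real schema)
votes = pd.DataFrame({
    "Code du département": ["75", "75", "93", "ZA"],
    "Inscrits": [1000, 500, 800, 300],
    "Choix A": [300, 200, 400, 100],
    "Choix B": [500, 200, 200, 150],
})

# Toy department-level socio-economic data (also hypothetical)
insee = pd.DataFrame({
    "Code du département": ["75", "93", "ZA"],
    "Population": [2000, 1500, 400],
    "Sans diplôme": [300, 450, 120],
})

# 1. Aggregate municipalities up to the department level
by_dept = votes.groupby("Code du département", as_index=False).sum()

# 2. Merge the sources on the department code
df = by_dept.merge(insee, on="Code du département", how="inner")

# 3. Drop overseas territories (assumed here to carry codes starting with 'Z')
df = df[~df["Code du département"].str.startswith("Z")].reset_index(drop=True)

# 4. Add a percentage column relative to the total population of the department
df["Sans diplôme %"] = 100 * df["Sans diplôme"] / df["Population"]

print(df)
```

Aggregating before merging keeps one row per department on both sides, so the merge cannot duplicate rows.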

For Machine Learning:

Creation of a column holding, for each department, the option with the highest share among Choice A %, Choice B % and Blank and null %. This column is the target variable.
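One possible way to build such a target column is pandas' `idxmax`, which returns the name of the column holding the largest value in each row. The sketch below uses invented percentages and hypothetical column names (`Choix A %`, `Choix B %`, `Blancs et nuls %`, `Gagnant`); the notebook may proceed differently.

```python
import pandas as pd

# Invented shares, for illustration only
df = pd.DataFrame({
    "Département": ["Paris", "Seine-Saint-Denis", "Gironde"],
    "Choix A %": [30.0, 48.0, 45.0],
    "Choix B %": [55.0, 35.0, 40.0],
    "Blancs et nuls %": [15.0, 17.0, 15.0],
})

share_cols = ["Choix A %", "Choix B %", "Blancs et nuls %"]

# idxmax(axis=1) gives, for each row, the label of the column with the largest value;
# the trailing " %" is stripped so the target reads as a class name
df["Gagnant"] = df[share_cols].idxmax(axis=1).str.replace(" %", "", regex=False)

print(df[["Département", "Gagnant"]])
```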

Data Visualization

We used ipywidgets to create a drop-down menu that lets us draw the map of France colored by any of the variables available for each department.

Note: we refer to departments directly by name. The list of French departments can be found here.

% of Choice A per department

% of Choice B per department

% of blank and null per department

% of Abstention per department



import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.colors import LinearSegmentedColormap
from mpl_toolkits.axes_grid1 import make_axes_locatable
import ipywidgets as widgets
from IPython.display import display, clear_output


cmap = LinearSegmentedColormap.from_list('rg', ["red", "yellow", "lawngreen"], N=256)

def plot_data(choice):
    # Keep only Paris and its three neighboring departments for the zoomed-in map
    codes_to_show = ['75', '92', '93', '94']
    filtered_data = df_final[df_final['Code du département'].isin(codes_to_show)]

    # Use one color scale, based on all departments, for both maps and both colorbars
    vmin, vmax = df_final[choice].min(), df_final[choice].max()
    norm = mpl.colors.Normalize(vmin=vmin, vmax=vmax)

    # Create the subplots: whole of France on the left, Paris area on the right
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6))

    for ax, data in ((ax1, df_final), (ax2, filtered_data)):
        # Reserve a narrow axis to the right of the map for the colorbar
        divider = make_axes_locatable(ax)
        cax = divider.append_axes("right", size="5%", pad=0.1)

        # Passing vmin/vmax keeps the map colors in line with the colorbar
        data.plot(column=choice, cmap=cmap, vmin=vmin, vmax=vmax,
                  linewidth=0.8, ax=ax, edgecolor='0.8', legend=False)

        cbar = plt.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=cmap), cax=cax)
        cbar.set_label(f'{choice}', fontsize=15)
        ax.set_axis_off()

    plt.show()
    
def on_change(change):
    if change['name'] == 'value' and (change['new'] != change['old']):
        clear_output()
        display(dropdown)
        plot_data(change['new'])

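choices = df_final.columns.tolist()
# Column shown on first display (index kept from the original code)
initial_choice = choices[-14]
dropdown = widgets.Dropdown(options=choices, value=initial_choice, description='Choice:')
# Only react to changes of the selected value
dropdown.observe(on_change, names='value')

display(dropdown)
plot_data(initial_choice)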

From these figures, one can already make some observations:

The South-West and Seine-Saint-Denis lean towards Choice A.
Abstention is high in Seine-Saint-Denis and in the North of France.
In the Ile-de-France region, there is a clear contrast between the western department (Hauts-de-Seine, 92) and the north-eastern one (Seine-Saint-Denis, 93), both in the choice made and in the abstention rate.

Distribution of the level of education by departments in France

% of population with no diploma

% of population with a higher education diploma

Poverty rate by department

The most highly educated population is found in Paris and the neighboring Hauts-de-Seine (92) department. By contrast, in Seine-Saint-Denis, north-east of Paris, the share of higher-education graduates is low and the poverty rate is higher. Does this translate into the voting result? To some extent, yes: we saw that the abstention rate is high in Seine-Saint-Denis and that those who do vote tend to pick Choice A. In the richest parts of Ile-de-France, Choice B had the highest percentage and abstention was lowest.
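A visual impression like this one can be cross-checked with a simple correlation table. The sketch below is only an illustration: the figures are made up and the column names (`poverty_rate`, `higher_education_pct`, `abstention_pct`) are hypothetical, not those of the project's dataframe.

```python
import pandas as pd

# Made-up department-level figures, for illustration only
df = pd.DataFrame({
    "poverty_rate": [10.0, 12.0, 15.0, 20.0, 28.0],
    "higher_education_pct": [45.0, 35.0, 30.0, 22.0, 18.0],
    "abstention_pct": [18.0, 20.0, 21.0, 25.0, 30.0],
})

# Pearson correlation of each indicator with the abstention rate
corr = df.corr(method="pearson")["abstention_pct"].drop("abstention_pct")

print(corr.round(2))
```

A strong coefficient only indicates that two indicators move together across departments; it does not show that one causes the other.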

Machine Learning

Which socio-cultural category would be predominant in predicting the outcome of a vote?

Training on 4 models with GridSearch

We created a target column holding, for each department, the option with the higher share between Choice A % and Choice B %, so we are dealing with a binary classification problem. In such a problem, the model’s goal is to assign each instance (here, each department) to one of two classes.

We trained 4 different models: Logistic Regression, Random Forest, Decision Tree and XGBoost.



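from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier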

# Encode the target variable
le = LabelEncoder()
target_encoded = le.fit_transform(target)

x_train, x_test, y_train, y_test = train_test_split(features, target_encoded, test_size=0.2, random_state=42)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

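# Not every solver supports every penalty. With GridSearchCV's default error_score,
# combinations that fail to fit are scored as NaN (with a warning) instead of
# stopping the search.
params_lr = {'C': [10**(i) for i in range(-4, 3)],
             'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
             'penalty': ['l1', 'l2', 'elasticnet', 'none']}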

params_rf = {'n_estimators': [10, 50, 100, 200],
             'max_depth': [None, 10, 20, 30],
             'min_samples_split': [2, 5, 10],
             'min_samples_leaf': [1, 2, 4]}

params_dt = {'criterion': ['gini', 'entropy'],
             'max_depth': [None, 10, 20, 30],
             'min_samples_split': [2, 5, 10],
             'min_samples_leaf': [1, 2, 4]}

params_xgb = {'n_estimators': [10, 50, 100, 200],
              'learning_rate': [0.01, 0.05, 0.1, 0.2],
              'max_depth': [3, 4, 5, 6, 7]}

params_classifiers = {
    'Logistic Regression': (LogisticRegression(), params_lr),
    'Random Forest': (RandomForestClassifier(), params_rf),
    'Decision Tree': (DecisionTreeClassifier(), params_dt),
    'XGBoost': (XGBClassifier(), params_xgb)
}

# Train and evaluate classifiers using GridSearchCV
for clf_name, (clf, params) in params_classifiers.items():
    grid_search = GridSearchCV(clf, param_grid=params, cv=5)
    grid_search.fit(x_train_scaled, y_train)
    y_pred = grid_search.predict(x_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'{clf_name} Accuracy: {accuracy:.4f}')
    print(classification_report(y_test, y_pred, target_names=le.classes_))
    print(f'Best parameters for {clf_name}: {grid_search.best_params_}')

Here are the results:

Logistic Regression Accuracy: 0.8500
              precision    recall  f1-score   support

     Choix A       0.50      0.33      0.40         3
     Choix B       0.89      0.94      0.91        17

    accuracy                           0.85        20
   macro avg       0.69      0.64      0.66        20
weighted avg       0.83      0.85      0.84        20

Best parameters for Logistic Regression: {'C': 10, 'penalty': 'l2', 'solver': 'newton-cg'}

Random Forest Accuracy: 0.8500
              precision    recall  f1-score   support

     Choix A       0.00      0.00      0.00         3
     Choix B       0.85      1.00      0.92        17

    accuracy                           0.85        20
   macro avg       0.42      0.50      0.46        20
weighted avg       0.72      0.85      0.78        20

Best parameters for Random Forest: {'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 10}

Decision Tree Accuracy: 0.7000
              precision    recall  f1-score   support

     Choix A       0.20      0.33      0.25         3
     Choix B       0.87      0.76      0.81        17

    accuracy                           0.70        20
   macro avg       0.53      0.55      0.53        20
weighted avg       0.77      0.70      0.73        20

Best parameters for Decision Tree: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 10}

XGBoost Accuracy: 0.8500
              precision    recall  f1-score   support

     Choix A       0.00      0.00      0.00         3
     Choix B       0.85      1.00      0.92        17

    accuracy                           0.85        20
   macro avg       0.42      0.50      0.46        20
weighted avg       0.72      0.85      0.78        20

Best parameters for XGBoost: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 200}

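Based on accuracy alone, XGBoost, Logistic Regression and Random Forest tie at 0.85. Note, however, that Random Forest and XGBoost reach this score by predicting Choice B for every test department (recall of 0.00 for Choice A), so Logistic Regression is the only one of the three that identifies any Choice A department.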
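Since the test set holds only 3 Choice A departments against 17 Choice B departments, it helps to compare these scores with a trivial baseline. The following sketch uses only the numbers printed in the reports above. The baseline and the balanced accuracy (the mean of the per-class recalls) are additions for illustration, not part of the original notebook.

```python
# Test-set class counts, taken from the "support" column of the reports above
support = {"Choix A": 3, "Choix B": 17}
total = sum(support.values())

# A trivial baseline that always predicts the most frequent class
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# Per-class recalls copied from the classification reports
recalls = {
    "Logistic Regression": {"Choix A": 0.33, "Choix B": 0.94},
    "Random Forest": {"Choix A": 0.00, "Choix B": 1.00},
    "Decision Tree": {"Choix A": 0.33, "Choix B": 0.76},
    "XGBoost": {"Choix A": 0.00, "Choix B": 1.00},
}

# Balanced accuracy: mean of the per-class recalls (the "macro avg" recall)
balanced = {name: sum(r.values()) / len(r) for name, r in recalls.items()}

print(f"Majority baseline ({majority_class}): {baseline_accuracy:.2f}")
for name, score in sorted(balanced.items(), key=lambda kv: -kv[1]):
    print(f"{name}: balanced accuracy ~ {score:.2f}")
```

Seen this way, an accuracy of 0.85 is exactly what always answering Choice B would achieve, and Logistic Regression comes out ahead once both classes are weighted equally. With only 20 test departments, these differences rest on very few samples.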

Feature importance for XGBoost:

Feature importance for Random Forest:
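Rankings of this kind can be read from the `feature_importances_` attribute that fitted scikit-learn tree ensembles expose (XGBoost's scikit-learn wrapper offers an attribute of the same name). The sketch below is a self-contained illustration on synthetic data with hypothetical feature names. It does not reproduce the project's figures.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the department-level features (names are hypothetical)
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "poverty_rate": rng.normal(15, 5, 200),
    "higher_education_pct": rng.normal(30, 8, 200),
    "noise": rng.normal(0, 1, 200),
})

# Toy target driven only by the poverty rate, so that feature should rank first
y = (X["poverty_rate"] > 15).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances, normalized so that they sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

Impurity-based importances describe what the fitted model relied on, not causal effects, and they can be unstable when features are strongly correlated, as poverty and education levels often are.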

Conclusion

The abstention rate, poverty rate, and education rate play significant roles in elections, and they’re often intertwined.

Abstention Rate: The abstention rate refers to the percentage of eligible voters who choose not to vote. This is critical as high abstention rates can indicate dissatisfaction with the available candidates, disenchantment with the political process, or barriers to voting. Abstention can skew election results if certain groups are more likely to abstain than others, as it means that the voters may not accurately represent the preferences of the entire population.
Poverty Rate: The poverty rate reflects the percentage of the population whose income falls below the poverty line. These individuals may have different priorities compared to wealthier voters, often focusing on immediate economic relief, job security, access to affordable healthcare, and social safety nets. Their voting behavior can greatly influence election outcomes, especially in regions where the poverty rate is high.
Education Rate: The education level of a population can significantly affect its political leanings. Often, more educated individuals are more likely to vote as they may have a better understanding of the electoral process, the significance of voting, and the policies at stake. Additionally, they might be more likely to engage in political discussions and consume news media, further reinforcing their likelihood to vote.

When it comes to less-educated, poorer, and retired individuals, their participation in the voting process is incredibly important for a few reasons:

Representation: If these groups do not vote, their needs and concerns may not be adequately represented in the government. Voting ensures their voices are heard and their issues are addressed.
Impact of Policies: Government policies often directly affect these groups. For example, social security policies impact retired individuals, education policies can provide opportunities for those less educated to gain skills, and economic policies can address income inequality and poverty.
Social Inclusion: Voting can also foster a sense of social inclusion and empowerment. It gives individuals the opportunity to influence the direction of their country, which can be particularly important for groups that may feel marginalized.

Therefore, it’s crucial to encourage these groups to vote and to work on removing any barriers that might prevent them from doing so, such as lack of information about the voting process, physical accessibility issues, or inconvenient voting hours.


Published by Adrian Chmielewski on 05/05/2023
