In this project, we show several workflows that are useful for EELS data analysis. We cover background removal, the Hartree-Slater cross-section function, low-loss analysis (band gap, plasmon shift calculations), high-loss analysis (L2/L3 ratio analysis), and more.

Data loading and background subtraction

This script handles a dataset of Electron Energy Loss Spectroscopy (EELS) spectra. Here’s what the code does:

  1. Import necessary modules and packages: This includes numpy for general numerical operations, pylab and matplotlib.pyplot for plotting and visualization, sympy for symbolic math, scipy.optimize and scipy.integrate for curve fitting and integration, glob for file handling, re for regular expression operations, nmmn.plots for specialized plots, and hyperspy.api for EELS data analysis.
  2. Define paths and load data: The script sets paths to load EELS data files and a path for saving results. The EELS spectra are then loaded into a list named spectra from files named according to a specified pattern.
  3. Define and apply shifts: The script defines shifts in nanometers due to the drift of the sample during multiple line scan acquisitions. These shifts are then applied to the loaded spectra.
  4. Slice and rebin spectra: The script identifies the smallest navigation and signal dimensions across all loaded spectra. It then "rebins" each spectrum in the list to match these smallest dimensions, so that all spectra end up with the same dimensions.
  5. Sum spectra: The script creates a deep copy of the first rebinned spectrum, and then adds the rest of the rebinned spectra to it.
  6. Plot averaged spectrum: Finally, the script plots the summed spectra. A commented line of code indicates that at one point, the author considered averaging the signal across the navigation axis before plotting.

Keep in mind that some portions of the script are commented out, which means they are not currently executed. They were left in for reference or potential future use.
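Steps 4 and 5 of the list above can be sketched with plain NumPy arrays. This is only an illustration under stated assumptions: each line scan is represented as a 2D array of shape (positions, energy channels), simple cropping stands in for the slicing and rebinning done on the HyperSpy signals, and the function names and array sizes are hypothetical.

```python
import numpy as np

def crop_to_common_shape(scans):
    """Crop every (n_positions, n_channels) array to the smallest shape in the list."""
    n_positions = min(scan.shape[0] for scan in scans)
    n_channels = min(scan.shape[1] for scan in scans)
    return [scan[:n_positions, :n_channels] for scan in scans]

def sum_scans(scans):
    """Copy the first scan, then add the remaining scans to the copy."""
    total = scans[0].copy()
    for scan in scans[1:]:
        total += scan
    return total

# Three hypothetical line scans with slightly different sizes.
rng = np.random.default_rng(0)
shapes = [(50, 2048), (48, 2048), (49, 2040)]
scans = [rng.poisson(100, size=shape).astype(float) for shape in shapes]

cropped = crop_to_common_shape(scans)
summed = sum_scans(cropped)
print(summed.shape)
```

Copying the first scan before summing keeps the original arrays untouched, which mirrors the deep copy mentioned in step 5.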

When processing Electron Energy-Loss Spectroscopy (EELS) data, one of the key steps is background removal. This step matters because it isolates the edge signal from the smoothly decaying background underneath it, which improves the accuracy of subsequent analyses. The Python code examined here implements this process.

The code first extracts the individual spectra from the summed EELS spectrum, so that each spectrum can be processed on its own.

Having isolated each spectrum, the code retrieves the energy value of every data point in each spectrum. These energy values are used throughout the following steps.

Next, the code defines a power-law function, which is used to model the background of the EELS spectra.

The next task is to choose the energy range over which the power-law function is fitted to the data, by defining start and end points for the fit. The code then selects the data within this energy range for each spectrum and stores them in arrays for the fitting step.

With the fitting range defined and the data prepared, the code uses the curve_fit function from the SciPy package to fit the power-law function to the background of each spectrum within the defined energy range. This is the central step of the background removal process.

The fitted background is then subtracted from each spectrum, leaving the background-free signal that the later analysis relies on.
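The fit-and-subtract steps described above can be sketched as follows. This is a minimal example, not the script itself: it assumes the usual power-law form A * E**(-r), uses a synthetic spectrum, and the window limits (590 to 630 eV), the peak position, and all function names are hypothetical. The log-log straight-line fit used for the starting guess is an addition of this sketch.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(energy, amplitude, exponent):
    """Power-law background model: amplitude * energy**(-exponent)."""
    return amplitude * energy ** (-exponent)

def subtract_power_law(energy, intensity, fit_start, fit_end):
    """Fit the power law inside [fit_start, fit_end] and subtract it everywhere."""
    window = (energy >= fit_start) & (energy <= fit_end)
    # Starting guess from a straight-line fit in log-log space.
    slope, intercept = np.polyfit(np.log(energy[window]),
                                  np.log(intensity[window]), 1)
    p0 = (np.exp(intercept), -slope)
    popt, _ = curve_fit(power_law, energy[window], intensity[window], p0=p0)
    return intensity - power_law(energy, *popt), popt

# Synthetic spectrum: power-law background plus one Gaussian-shaped peak.
energy = np.linspace(580.0, 700.0, 601)
background = power_law(energy, 2.0e11, 3.0)
peak = 50.0 * np.exp(-0.5 * ((energy - 642.0) / 1.5) ** 2)
spectrum = background + peak

corrected, (amp_fit, exp_fit) = subtract_power_law(energy, spectrum, 590.0, 630.0)
print(f"fitted exponent: {exp_fit:.3f}")
```

The fitting window has to sit entirely before the edge; if it overlaps the peak, the fitted background absorbs part of the signal.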

Once the background has been subtracted, the code plots the resulting spectra to visualize the effect of the background removal.

Because there are multiple spectra, they must share the same energy axis to be compared. The code therefore resamples each background-subtracted spectrum onto a new, common energy axis so that all spectra have the same dimension along the energy axis.
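One way to picture this resampling is linear interpolation onto a shared axis. The sketch below uses np.interp and restricts the common axis to the energy range covered by every spectrum; both choices are assumptions, since the text does not say how the script interpolates, and the function name is hypothetical.

```python
import numpy as np

def resample_to_common_axis(energy_axes, spectra, n_points=None):
    """Interpolate every spectrum onto one energy axis covered by all of them."""
    start = max(axis[0] for axis in energy_axes)
    stop = min(axis[-1] for axis in energy_axes)
    if n_points is None:
        n_points = min(len(axis) for axis in energy_axes)
    common = np.linspace(start, stop, n_points)
    resampled = np.array([np.interp(common, axis, spectrum)
                          for axis, spectrum in zip(energy_axes, spectra)])
    return common, resampled

# Two hypothetical spectra recorded on slightly different energy axes.
axes = [np.linspace(630.0, 670.0, 401), np.linspace(631.0, 669.0, 381)]
spectra = [2.0 * axis + 1.0 for axis in axes]

common_energy, stacked = resample_to_common_axis(axes, spectra)
print(stacked.shape)
```

After this step the spectra can be stacked into a single 2D array, which is the form needed to build one signal object from them.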

The final stage creates a new EELSSpectrum object from the background-subtracted, resampled data. Its metadata and axes properties are updated to match those of the original summed EELS spectrum, so that the result stays directly comparable to the original.

The process ends by plotting the final EELSSpectrum after background removal.

The next piece of Python code analyzes the spectra with a focus on peak analysis. The aim is to isolate and analyze two distinct peaks within each spectrum via Gaussian fitting, a common technique for bell-shaped peaks.

The first segment of the code outlines two critical functions – a simple linear function and a double Gaussian function. The linear function serves as a model for a linear background, while the double Gaussian function models two distinct peaks in the data, identified by their heights, positions, and widths.

In the data visualization phase, the code plots the spectra in various colors for better distinction. It also defines two fitting windows within which the Gaussian peaks are expected to reside. The ‘ratios’ list is also initialized, which will store the ratio of the integrated signals under the two Gaussian curves.

The heart of the analysis lies within a loop that cycles through each spectrum in the data. Here, the code slices the energy range to focus on the region of interest. Boolean masks are also created to mark the fitting windows for the Gaussian peaks.

By using curve fitting, the code extracts initial guesses for the parameters of the linear background model based on the two fitting windows. These initial estimates help in forming a more accurate fit for the background subtraction later on.

The code then calculates an estimate of the background, a combination of two linear functions across different energy ranges. This background is then subtracted from the original intensity to give a background-subtracted spectrum, ready for Gaussian peak analysis.

Fitting the subtracted data with two Gaussian peaks is the next major task. The code sets initial guesses and bounds for the parameters to keep the fit stable, then stores the positions of the two Gaussian peaks for later plotting.

To gauge the prominence of the two peaks, the code calculates the integrated signals under each Gaussian curve. This measure essentially quantifies the area under each peak, providing an insight into their relative magnitudes.

Next, a ratio of the two integrated signals is calculated and stored in the ‘ratios’ list. This ratio serves as a quantitative measure of the balance between the two peaks in each spectrum, a vital piece of information for further analysis.
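The double-Gaussian fit and the area ratio can be sketched like this. It is a minimal example on synthetic, already background-subtracted data; the peak positions, initial guesses, and bounds are hypothetical values chosen for illustration, and the area of each Gaussian is taken from its closed form (height * width * sqrt(2 * pi)) rather than from numerical integration.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(energy, height, center, width):
    return height * np.exp(-0.5 * ((energy - center) / width) ** 2)

def double_gaussian(energy, h1, c1, w1, h2, c2, w2):
    """Sum of two Gaussians, each described by height, position and width."""
    return gaussian(energy, h1, c1, w1) + gaussian(energy, h2, c2, w2)

def gaussian_area(height, width):
    """Closed-form integral of a Gaussian over the whole energy axis."""
    return height * abs(width) * np.sqrt(2.0 * np.pi)

# Synthetic background-subtracted spectrum with two peaks.
energy = np.linspace(630.0, 670.0, 801)
signal = double_gaussian(energy, 100.0, 642.0, 1.5, 40.0, 653.0, 2.0)

# Initial guesses and bounds keep each Gaussian inside its own window.
p0 = [80.0, 641.0, 1.0, 30.0, 654.0, 1.0]
lower = [0.0, 638.0, 0.1, 0.0, 649.0, 0.1]
upper = [np.inf, 646.0, 5.0, np.inf, 657.0, 5.0]
popt, _ = curve_fit(double_gaussian, energy, signal, p0=p0, bounds=(lower, upper))

h1, c1, w1, h2, c2, w2 = popt
ratio = gaussian_area(h1, w1) / gaussian_area(h2, w2)
print(f"peak positions: {c1:.2f}, {c2:.2f}  area ratio: {ratio:.3f}")
```

In a loop over spectra, each ratio would be appended to the 'ratios' list described above.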

At the end of each loop iteration, the spectrum is plotted, now background-subtracted and overlaid with the Gaussian peak fits. This makes the two distinct peaks within each spectrum easy to see.

After processing all spectra, the elapsed time for the operation is printed, which gives an indication of the computational cost of the analysis. Lastly, the final plot of the processed spectra is displayed.

Background subtraction with Hartree-Slater cross-section function and fitting with double Gaussians:

Batch fitting plot:

Mn L2&L3 edge shifts plot and L3/L2 ratio:

Data augmentation

This script implements a convolutional neural network (CNN) for a multiclass classification task using Keras. The data are spectra acquired from experiments and stored in pickle files. The first part of the script imports the necessary libraries and modules. These include widely used scientific libraries such as numpy and pandas, machine learning libraries such as Keras and sklearn, and specific functions and classes from these libraries.

The next phase is data loading, in which three datasets, ‘Mn2_C’, ‘Mn3_C’, and ‘Mn4_C’, are read from a specified directory path. Each dataset represents a different class. The datasets are first loaded into pandas dataframes, then combined into a single dataframe Mn_All, which is converted into a numpy array.

Simultaneously, a list labels is generated in which the spectra from the ‘Mn2_C’, ‘Mn3_C’, and ‘Mn4_C’ datasets are labelled as 0, 1, and 2 respectively. Following this, the data is divided into training and testing sets with an 85% to 15% ratio, using the train_test_split() function from sklearn.
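The labelling and the 85%/15% split can be illustrated with NumPy alone. The sketch below is a stand-in for sklearn's train_test_split, not a copy of the script: the class sizes, spectrum length, seed, and function names are all hypothetical.

```python
import numpy as np

def make_labels(class_sizes):
    """Label the spectra of class k with the integer k (0, 1, 2, ...)."""
    return np.concatenate([np.full(n, label) for label, n in enumerate(class_sizes)])

def split_train_test(data, labels, test_fraction=0.15, seed=0):
    """Shuffle the sample indices and hold out test_fraction of them."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    n_test = int(round(test_fraction * len(data)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return data[train_idx], data[test_idx], labels[train_idx], labels[test_idx]

# Hypothetical data: 100, 120 and 80 spectra for the three classes.
spectra = np.random.default_rng(1).random((300, 700))
labels = make_labels([100, 120, 80])

X_train, X_test, y_train, y_test = split_train_test(spectra, labels)
print(X_train.shape, X_test.shape)
```

Shuffling before splitting matters here because the combined array is ordered by class; a split without shuffling would leave whole classes out of one of the sets.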

Data augmentation then takes place, leveraging principal component analysis (PCA). A noise model is produced and applied with diverse signal-to-noise ratios (SNR) to the training data to generate additional training samples. This stage is crucial to the robustness of the model.
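The idea of generating extra training samples at several signal-to-noise ratios can be sketched as below. Note the simplification: plain Gaussian noise is used here in place of the PCA-based noise model described above, and SNR is defined as the maximum of each spectrum divided by the noise standard deviation. Both are assumptions of this sketch, as are the function names and SNR values.

```python
import numpy as np

def add_noise(spectra, snr, rng):
    """Add Gaussian noise with standard deviation max(spectrum) / snr."""
    signal_level = spectra.max(axis=1, keepdims=True)
    noise = rng.normal(0.0, 1.0, size=spectra.shape) * signal_level / snr
    return spectra + noise

def augment(spectra, labels, snr_values, seed=0):
    """Return the original spectra plus one noisy copy per SNR value."""
    rng = np.random.default_rng(seed)
    noisy = [add_noise(spectra, snr, rng) for snr in snr_values]
    augmented = np.concatenate([spectra] + noisy)
    augmented_labels = np.concatenate([labels] * (len(snr_values) + 1))
    return augmented, augmented_labels

# Hypothetical training set: 10 flat spectra of 500 channels each.
base = np.ones((10, 500))
base_labels = np.arange(10) % 3

augmented, augmented_labels = augment(base, base_labels, snr_values=[5, 10, 20])
print(augmented.shape)
```

Only the training set is augmented, so the test set still measures performance on unmodified spectra.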

Before the data is fed to the CNN, the script undertakes several preprocessing steps. The spectra are cropped and then reshaped according to the input format that Keras expects. The data is also mean-centered and normalized to ensure the scale of input features is compatible.
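A possible version of these preprocessing steps is shown below. The text does not say how the normalization is done, so this sketch assumes per-spectrum mean-centering followed by scaling to unit L2 norm; the crop start index and the function name are hypothetical, and the length of 500 matches the (500, 1) input shape of the model.

```python
import numpy as np

def preprocess(spectra, start, length=500):
    """Crop, mean-center and normalize each spectrum, then add a channel axis."""
    cropped = spectra[:, start:start + length]
    centered = cropped - cropped.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    normalized = centered / np.where(norms == 0, 1.0, norms)
    # Conv1D layers expect input of shape (samples, steps, channels).
    return normalized[..., np.newaxis]

# Hypothetical raw spectra: 6 spectra of 700 channels.
raw = np.random.default_rng(0).random((6, 700)) + 10.0
prepared = preprocess(raw, start=100)
print(prepared.shape)
```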

The labels for the classes (0, 1, and 2) are one-hot encoded in preparation for use in the CNN model. This encoding process transforms each integer label into a binary vector with the index of the integer label marked as 1 while the remainder of the vector is populated with 0’s. This format is more appropriate for classification tasks where classes are not ordinal.
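As a small illustration of one-hot encoding, the NumPy sketch below indexes an identity matrix with the integer labels. Keras provides to_categorical for the same purpose; the helper name one_hot is hypothetical.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Row i of the result is all zeros except for a 1 at index labels[i]."""
    return np.eye(n_classes)[np.asarray(labels)]

encoded = one_hot([0, 2, 1], n_classes=3)
print(encoded)
```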

Finally, the script plots one spectrum from the training set and one from the testing set as a visual representation. While this marks the end of the data preparation process, the script is primed to move forward with defining the CNN model’s architecture, compiling it, and subsequently training it using the prepared data.

Creation of the model

The Sequential model is a linear stack of layers that allows the user to build a neural network layer by layer, from input to output. Here, each layer has exactly one input tensor and one output tensor.

The model architecture is defined by sequentially adding various layers:

  1. A 1-dimensional convolutional layer (Convolution1D) is added as the input layer of the model, taking an input shape of (500, 1). This layer has 2 filters, each of size 9, and uses a rectified linear unit (ReLU) activation function.
  2. An average pooling layer (AveragePooling1D) is added next. This layer down-samples the input along its temporal dimension (here, the energy axis of the spectrum) by taking the average value over a window in the input.
  3. Batch normalization (BatchNormalization) is then applied which normalizes the activations of the previous layer at each batch, i.e., it applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1. This often improves the model’s performance.
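To make the pooling and normalization steps concrete, here is a NumPy sketch of what the two layers compute on a batch of shape (samples, steps, channels). It is a simplification: the pool size of 2 is inferred from the halved lengths in the model summary, and the batch normalization omits the learnable scale and offset and the moving averages that the Keras layer maintains.

```python
import numpy as np

def average_pool_1d(x, pool_size=2):
    """Average non-overlapping windows of pool_size along the steps axis."""
    batch, length, channels = x.shape
    n = (length // pool_size) * pool_size
    windows = x[:, :n].reshape(batch, n // pool_size, pool_size, channels)
    return windows.mean(axis=2)

def batch_normalize(x, eps=1e-3):
    """Normalize each channel to zero mean and unit variance over the batch."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Hypothetical activations coming out of the first convolution.
activations = np.random.default_rng(0).normal(5.0, 3.0, size=(8, 492, 2))
pooled = average_pool_1d(activations)
normalized = batch_normalize(pooled)
print(pooled.shape)
```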

The same structure (convolution, pooling, batch normalization) is then repeated four more times with different parameters. Across the five blocks, the convolutional kernels get smaller (from size 9 down to size 3) while the number of filters grows (from 2 to 8), as the output shapes and parameter counts in the model summary show.

A dropout layer (Dropout) is added after the fifth convolutional block to prevent overfitting. It randomly sets 10% of input units to 0 at each update during training time.

Next, another 1D convolutional layer is added, with 3 filters (one per class), followed by a global average pooling layer (GlobalAveragePooling1D). This layer averages its input along the length of the spectrum for each feature and produces a 2D tensor with one value per class for every sample. This reduces the dimensionality of the data and helps limit overfitting and computational cost.

Finally, a softmax activation function is applied which transforms the output to a probability distribution over the target classes. This function is typically used in multiclass classification problems.
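The last two operations are simple enough to write out in NumPy. The sketch below averages a batch of feature maps over the length axis and applies a softmax; the input values are random and purely illustrative.

```python
import numpy as np

def global_average_pool(features):
    """Average over the steps axis: (samples, steps, channels) -> (samples, channels)."""
    return features.mean(axis=1)

def softmax(logits):
    """Turn each row of scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical output of the last convolution: 4 samples, 12 steps, 3 classes.
features = np.random.default_rng(0).normal(size=(4, 12, 3))
scores = global_average_pool(features)
probabilities = softmax(scores)
print(probabilities.sum(axis=1))
```

The predicted class of a sample is the index of the largest probability in its row.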

The model is then compiled with the Adam optimizer, using categorical cross-entropy as the loss function (which is suitable for multiclass classification tasks) and tracking accuracy as the metric.

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv1d (Conv1D)             (None, 492, 2)            20        
                                                                 
 average_pooling1d (AverageP  (None, 246, 2)           0         
 ooling1D)                                                       
                                                                 
 batch_normalization (BatchN  (None, 246, 2)           8         
 ormalization)                                                   
                                                                 
 conv1d_1 (Conv1D)           (None, 240, 2)            30        
                                                                 
 average_pooling1d_1 (Averag  (None, 120, 2)           0         
 ePooling1D)                                                     
                                                                 
 batch_normalization_1 (Batc  (None, 120, 2)           8         
 hNormalization)                                                 
                                                                 
 conv1d_2 (Conv1D)           (None, 114, 4)            60        
                                                                 
 average_pooling1d_2 (Averag  (None, 57, 4)            0         
 ePooling1D)                                                     
                                                                 
 batch_normalization_2 (Batc  (None, 57, 4)            16        
 hNormalization)                                                 
                                                                 
 conv1d_3 (Conv1D)           (None, 53, 4)             84        
                                                                 
 average_pooling1d_3 (Averag  (None, 26, 4)            0         
 ePooling1D)                                                     
                                                                 
 batch_normalization_3 (Batc  (None, 26, 4)            16        
 hNormalization)                                                 
                                                                 
 conv1d_4 (Conv1D)           (None, 24, 8)             104       
                                                                 
 average_pooling1d_4 (Averag  (None, 12, 8)            0         
 ePooling1D)                                                     
                                                                 
 batch_normalization_4 (Batc  (None, 12, 8)            32        
 hNormalization)                                                 
                                                                 
 dropout (Dropout)           (None, 12, 8)             0         
                                                                 
 conv1d_5 (Conv1D)           (None, 12, 3)             27        
                                                                 
 global_average_pooling1d (G  (None, 3)                0         
 lobalAveragePooling1D)                                          
                                                                 
 loss (Activation)           (None, 3)                 0         
                                                                 
=================================================================
Total params: 405
Trainable params: 365
Non-trainable params: 40
_________________________________________________________________
None
CNN Model created.
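The output shapes and parameter counts in this summary can be checked with a few lines of arithmetic. In the sketch below, the kernel sizes after the first layer (7, 7, 5, 3, then 1 for the final convolution) are not stated in the text; they are inferred from the summary, assuming unpadded convolutions, a pool size of 2, and four parameters per channel in each batch normalization layer, two of which are trainable.

```python
def conv1d_valid(length, channels_in, kernel_size, filters):
    """Output length, output channels and parameter count of an unpadded Conv1D."""
    out_length = length - kernel_size + 1
    params = kernel_size * channels_in * filters + filters  # weights + biases
    return out_length, filters, params

# (kernel_size, filters) for the five conv/pool/batch-norm blocks.
blocks = [(9, 2), (7, 2), (7, 4), (5, 4), (3, 8)]

length, channels = 500, 1
total_params = 0
trainable_params = 0
conv_shapes = []
for kernel_size, filters in blocks:
    length, channels, conv_params = conv1d_valid(length, channels, kernel_size, filters)
    conv_shapes.append((length, channels))
    length //= 2                      # average pooling with pool size 2
    bn_params = 4 * channels          # gamma, beta, moving mean, moving variance
    total_params += conv_params + bn_params
    trainable_params += conv_params + bn_params // 2

# Final convolution: kernel size 1, one filter per class.
length, channels, conv_params = conv1d_valid(length, channels, 1, 3)
total_params += conv_params
trainable_params += conv_params

print(conv_shapes, (length, channels))
print(total_params, trainable_params, total_params - trainable_params)
```

The 40 non-trainable parameters are the moving means and variances of the five batch normalization layers.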

Model fitting

This block of code is fitting (i.e., training) the previously defined model to the data and saving the weights for the best model during training.

Here’s a step-by-step explanation:

  1. np.random.seed(seed): This line is setting the seed for the random number generator in numpy. Setting the seed ensures that the random numbers generated will be the same each time the code is run. This is important for the reproducibility of experiments in machine learning.
  2. best_model_file = '/Users/chmilew/Documents/Python Projects/EELS/Mn L2_L3 dans STO film - LSMO/Mn_Classifier_CNNs-master/best_weights/highest_val_acc_weights_epoch199-train_loss0.026_.h5': This line defines the filepath where the weights of the best model (i.e., the model with the highest validation accuracy) will be saved during training.
  3. best_model = ModelCheckpoint(best_model_file, monitor='val_acc', verbose = 1, save_best_only = True): Here, a Keras ModelCheckpoint callback is created. It checks the validation accuracy (monitor='val_acc') after every epoch and, because save_best_only = True, writes the file only when that value improves on the best one seen so far. Note that ModelCheckpoint writes only the weights when save_weights_only=True is passed; the call shown leaves that argument at its default, so the whole model is written to the file.
  4. hist = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=batch_size, callbacks = [best_model], shuffle = True, verbose=1): This is the line where the model training happens. The model is trained on X_train and y_train, with a specified number of epochs and batch size. The validation_data parameter makes the model evaluate its performance on the test data (X_test, y_test) after each epoch. Setting shuffle to True means that the order of the training samples is shuffled in each epoch. The callbacks parameter contains the best_model callback, which is invoked after each epoch. Finally, verbose=1 means that the training progress is printed to the console.
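The save-best-only behaviour of the checkpoint in step 3 amounts to a running maximum. The pure-Python sketch below mimics that logic on a made-up list of validation accuracies; it does not call Keras, and the function name and numbers are hypothetical.

```python
def epochs_that_trigger_save(val_accuracies):
    """Return the 1-based epochs at which a save-best-only checkpoint would write."""
    best = float("-inf")
    saved = []
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:  # strict improvement only
            best = accuracy
            saved.append(epoch)
    return saved

saved_epochs = epochs_that_trigger_save([0.61, 0.70, 0.68, 0.74, 0.74, 0.80])
print(saved_epochs)
```

Because every save goes to the same file path, the file left on disk after training corresponds to the last epoch in this list.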

Testing model accuracy

The table below is the confusion matrix for the test set. It summarizes the performance of the classification model by showing how the instances of each class were classified. Each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class.

Here’s a breakdown of this specific confusion matrix:

  • For the Mn2+ class, 155 instances were correctly classified (true positives), while 2 instances were wrongly classified as Mn3+ and none were classified as Mn4+.
  • For the Mn3+ class, 141 instances were correctly classified (true positives), and none were misclassified as other classes.
  • For the Mn4+ class, 118 instances were correctly classified (true positives), while 58 instances were wrongly classified as Mn3+.

The overall accuracy of the model is 87.34%, meaning that it correctly classified 414 of the 474 test instances.

From the confusion matrix, we can conclude that the model performs well for the Mn2+ and Mn3+ classes but struggles with the Mn4+ class. Particularly, the model seems to misclassify a significant number of Mn4+ instances as Mn3+, which is an area that could potentially be improved in future iterations of the model. A detailed analysis could involve checking the features of these misclassified instances and adjusting the model’s architecture or training strategy accordingly.

Confusion Matrix of Test Set
      Mn2+  Mn3+  Mn4+
Mn2+   155     2     0
Mn3+     0   141     0
Mn4+     0    58   118
Accuracy: 87.34%
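The accuracy, and the per-class figures behind the discussion above, can be recomputed directly from this matrix. The sketch below uses the standard definitions of recall (correct predictions divided by the row total) and precision (correct predictions divided by the column total); the variable names are illustrative.

```python
import numpy as np

classes = ["Mn2+", "Mn3+", "Mn4+"]
# Rows are actual classes, columns are predicted classes.
confusion = np.array([[155,   2,   0],
                      [  0, 141,   0],
                      [  0,  58, 118]])

accuracy = np.trace(confusion) / confusion.sum()
recall = np.diag(confusion) / confusion.sum(axis=1)
precision = np.diag(confusion) / confusion.sum(axis=0)

print(f"Accuracy: {accuracy:.2%}")
for name, r, p in zip(classes, recall, precision):
    print(f"{name}: recall {r:.3f}, precision {p:.3f}")
```

The low recall for Mn4+ and the low precision for Mn3+ are two views of the same 58 misclassified spectra.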

Data Augmentation effect
