.. _`chap:repo2`:

Chemistry Simplifier
====================

Chemistry Simplifier is software dedicated to improving the understanding of micro-analytical maps of rock thin sections (mounts, pucks) obtained by analysing large sample surfaces (mm\ :math:`^{2}`). The software dimensionally reduces the input image stack and saves false-colour maps as single RGB images (TIF format). The main objective is documenting and discovering new, often cryptic, patterns within the mineral assemblage that are stored in a vast input image stack. This is possible by summarising the sample content into three dimensions (red-green-blue channels) to ease and speed up descriptions and characterisations. The software therefore gives people and AI-enabled software a head start: in only a few minutes it delivers high-level information about the rock (veining, fabric, matrix) and minerals (chemical zonation, resorption, diffusion pattern, sector zoning) that would otherwise require browsing the maps for hundreds of hours. For instance, an SEM-EDX image (10’000 x 20’000 px) with 100 chemical elements (characteristic X-ray lines) can be processed in 15 minutes.

The user can import chemical maps as single-channel images (in most formats) after custom search, selection, and sorting during the Analysis Definition step. The available models are Principal Component Analysis (PCA), Deep Sparse Autoencoder (DSA), and Uniform Manifold Approximation and Projection (UMAP). Each has strengths and weaknesses that complement the others and depend on the sample ‘composition’, making the software a good choice to study high-resolution and highly dimensional geochemical maps and other data types (any microscopy/spectroscopy technique, satellite imagery, art paintings, etc.).

The results are saved in a folder hierarchy to ease saving output trials with different tags and metadata.
This allows remembering the processing parameters and re-loading previous unsupervised learning models to enable inter-sample scalability (same parameters applied to new samples).

After extensive trialling on already well-characterised samples, I demonstrated that:

- PCA can be more robust than DSA to noisy and/or artefact-affected image inputs. Therefore, PCA is the default choice (for initial assessment and image registration).
- PCA tends to highlight mineralogy, while DSA is attentive to both mineralogy and compositional zonation within crystals.
- UMAP is good at distinguishing mineralogy and better than PCA at denoising the output (grain/zone boundaries) because it is a non-linear method.
- UMAP slows down when fitting the manifold to millions of pixels, making it slower than PCA and DSA.

The image alignment section requires control points that have been placed manually using the ImageJ BigWarp plugin (Bogovic et al., 2016). The plugin allows exporting a ``landmarks.csv`` file containing the ID and locations (X, Y) of the placemarks across the moving and fixed images (at least 4 points are required to fit the transform models). This repository contains an example CSV showing the required format (see ``landmarks_bse_xpl.csv``).

Image processing
================

Data management
---------------

- Input: Folder path containing the input images. Browse shows both folders and files and allows selecting a specific folder.
- Image format: File extension of the input images.
- Search (optional): Python regular expression used to match the required experimental files. Use it when files have a particular prefix/suffix.
- Output: Name of the results folder. We call these folders running trials; each is created after defining the Image Processing step and is generated when refreshing the application. After running the application, the output folder contains the following metadata, intermediate image folders, and outputs:

  - : Linear, natural log, and original pyramidal stacks (BigTiff format).
  - : Montages of every input saved using the chosen settings (contrast, colourmaps).
  - : Tag sub-folder outputs.
  - descriptiveStats.csv: Statistics of the input images and calculated min/max boundaries for reproducing the colour schema.
  - tileConfiguration.csv: Bounding boxes of every tile to be extracted from the BigTiffs.

- Output sub-folder: Name of the iteration folder. We call these folders running tags; each is created after defining the Analysis and Output adjustments and is generated when running the application. It is found as

  - input_run.csv: List of chemical maps used, kept for future reference.
  - ``{model}_model.{extension}``: Dimensionality reduction models saved in different formats like PKL (for PCA), XML (for UMAP) and TAR (for DSA).
  - ``{model}_tiles``: Transformed tiles ready for stitching.
  - ``montage_{model}``: Outputs saved as float and uint8 images (flat TIF) for opening downstream.
  - Descriptive stats of the montages (CSV): Metadata required for reproducing the colour schema.

Image pyramid operations
------------------------

An image pyramid is a hierarchical representation of a very large original image at different spatial scales. Modern image analysis software requires this format to speed up processing.

- Tile size: Size (pixels) of the square image pyramid tiles.
- Bit-depth scaling: Check boxes for the first pixel re-scaling operation. The image is scaled within min/max bounds to vary between 0 and 1 (float32). The scaling can be linear (arithmetic operation), natural logarithm (non-linear operation), or original (no scaling applied).
- Contrast 1: The min/max bounds above are found using percentiles. This is the percentage clipped at the top and bottom of each input montage channel histogram to reduce noise and outlier pixels.
- Median filter: Sliding window or kernel size (3x3 px) used to calculate the median value that replaces the centre pixel of the neighbourhood. It eliminates salt-and-pepper noise while preserving edges.
- Normalisation: Second pixel re-scaling operation to fit the 0-1 range.

  - Normalisation (Min-Max scaling) subtracts the minimum of the whole image and divides by the range, assuming the data have similar magnitudes.
  - Standardisation subtracts the mean and divides by the standard deviation, assuming the data have different magnitudes and follow Gaussian distributions.

Analysis definition
===================

Input list
----------

- Scaling: Selection of the first pixel re-scaling to be processed further. The log scale can only be used if it has been pre-computed.
- Selected chemical maps: Custom list of file paths of the input images. They can be located in different folders. After running, the metadata saves them in exactly the same order (see ``inputs_run.csv``).
- Refresh button: Searches for the images specified in the Image Processing step.
- Add, Remove, Clear all buttons: Customise the image list.

Dimensionality reduction
------------------------

A transformation of data (pixel channels) from a high-dimensional space into a low-dimensional space (representation) that minimises the loss of information. We use the 3D embedded space to represent red, green, and blue colours in the image space. The checkboxes enable the desired representations for the tag folder. The model (hyper-)parameters can be adjusted by clicking the gearbox buttons. Optionally, previously saved models can be retrieved from a previous iteration (tag) using the ‘...’ button or by filling in the path text box of the corresponding model.

- **Principal component analysis (PCA):** Factorisation method to obtain the rotation and translation operations required to represent the data in the space of maximum variability (orthogonal principal component axes) using a combination of eigenvectors calculated from the input variables.
- **Deep sparse autoencoder (DSA):** Neural network model that learns a meaningful representation of the input data. It has symmetrical, fully-connected layers where the input and output nodes are identical, surrounding a central bottleneck (3 channels). A regularisation term in the cost function centres the batch node activations around a target value (KL sparsity target :math:`\rho \approx 0.5`). This keeps roughly half of the nodes deactivated (moderately dense representation) and provides sufficient colour contrast if the learning rate is not too high (see ``loss_plot.png``).
- **Uniform manifold approximation and projection (UMAP):** Graph-based method that preserves both local and global data structure. It uses fuzzy topology to create a high-dimensional node graph and iteratively optimises a low-dimensional layout (embedding) to reproduce that structure. The local connectivity of the data points is quantified by the number of neighbours and the distance metric (Euclidean).

Processed volume
~~~~~~~~~~~~~~~~

Users often need results as fast as possible, or within only a few iterations. This section allows a fast turnaround by reducing the computational load, with a trade-off in the quality of the image representation. This feature was inspired by the QuPath software.

- Downsampling: Re-sizing factor of the input and output images.

  - Scale: Factor controlling the size of the input image used to fit or train the models.
  - Resolution: Factor controlling the size of the output image that the models will have to predict.

- Subsampling: Pixels are randomly sampled from the downsampled image. The spin boxes (0-1) determine the fraction of pixels used to factorise (PCA), connect (UMAP), and fit (DSA) the models. Pixels with a constant value across all channels are ignored (background).

Output adjustments
==================

Colour (RGB) and image alignment modifications applied to the dimensionally reduced montages. The final option is the number of cores to use when running the parallel prediction of the underlying image pyramid tiles.
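
The parallel tile prediction mentioned above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the software's actual code: ``PROJECTION``, ``predict_tile``, and ``predict_tiles`` are hypothetical names, the fixed 3 x 4 matrix stands in for a fitted PCA, DSA, or UMAP model, and a thread pool is used only to keep the example short (the application may well distribute tiles across processes instead).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical 3 x 4 projection: one row of weights per output channel
# (R, G, B) for 4 input channels. A fitted model would replace it.
PROJECTION = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]


def predict_tile(tile):
    """Map every pixel (a list of channel values) of a tile to 3 values."""
    return [
        [sum(w * v for w, v in zip(row, pixel)) for row in PROJECTION]
        for pixel in tile
    ]


def predict_tiles(tiles, n_workers=2):
    """Predict all tiles concurrently; map() preserves the tile order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(predict_tile, tiles))


# Two toy tiles of two pixels each, four channels per pixel.
tiles = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
]
rgb_tiles = predict_tiles(tiles, n_workers=2)
```

Because the results come back in the same order as the input tiles, they can be stitched into the montage using the stored tile bounding boxes.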

Montage operations
------------------

- Recoloured chemical maps: Whether or not to save the re-scaled images as recoloured maps.
- Colormap: Selection of the heatmap used to recolour those maps, following Python naming.
- Contrast 2: The percentage clipped at the top and bottom of each output montage channel histogram to reduce noise and outlier pixels.
- Flip upside down: Whether or not to flip the image vertically, in case the image was saved in a different orientation after the instrument acquisition.

Image alignment
---------------

- Control points: File path to the landmarks file (CSV) saved from the ImageJ BigWarp plugin.
- Moving images: Selection to include the recoloured, dimensionally reduced, and original images in the image alignment loop.
- Moving image (optional): Additional image that needs to be included in the loop. For example, if all input images are chemical images, the optional image could be the Backscattered Electron (BSE) image, which represents density (not chemistry).
- Fixed image: File path to the fixed (reference) image in the image registration process.
- Target WxH: Width and height of the fixed image. If the fixed image is missing, only its original size is needed.
- Transformation: Selection of the linear (Affine, Similarity) or non-linear (Thin-plate spline) image transform for the inverse estimation.
- Interpolation: Selection of the pixel value interpolation method used to estimate the registered moving image pixels. It can be nearest neighbour (original values), bi-linear, or bi-cubic interpolation, in order of increasing computational cost and spatial detail.
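
To illustrate how control points constrain a transform, the following minimal sketch fits an affine model to landmark pairs by least squares. It is a hedged illustration only: ``solve3``, ``fit_affine``, and ``apply_affine`` are hypothetical helpers, the point coordinates are made up, and the sketch maps moving points onto fixed points, whereas the application estimates the transform in the inverse direction and may use a different solver.

```python
def solve3(a, b):
    """Solve a 3x3 system a @ x = b (Gaussian elimination, partial pivoting)."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = (m[i][3] - s) / m[i][i]
    return x


def fit_affine(moving, fixed):
    """Least-squares affine transform mapping moving (x, y) onto fixed (x, y)."""
    rows = [(x, y, 1.0) for x, y in moving]
    # Normal equations (A^T A) p = A^T b, solved once per output coordinate.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in range(2):
        atb = [sum(r[i] * f[k] for r, f in zip(rows, fixed)) for i in range(3)]
        params.append(solve3(ata, atb))
    return params  # [[a, b, tx], [c, d, ty]]


def apply_affine(params, point):
    """Apply the fitted affine parameters to a single (x, y) point."""
    x, y = point
    return tuple(p[0] * x + p[1] * y + p[2] for p in params)


# Made-up landmark pairs: the fixed points are the moving points scaled by 2
# and shifted by (5, 3).
moving = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
fixed = [(5.0, 3.0), (25.0, 3.0), (5.0, 23.0), (25.0, 23.0)]
affine = fit_affine(moving, fixed)
mapped = apply_affine(affine, (5.0, 5.0))
```

With more landmarks than unknowns, the least-squares fit averages out small placement errors in the manually positioned control points.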