grama.tran package

Submodules

grama.tran.tran_matminer module

grama.tran.tran_matminer.tran_feat_composition

Featurize a dataset using matminer

Featurize chemical composition using matminer package.

Parameters:
  • df (DataFrame) – Data to featurize
  • var_formula (string) – Column in df with chemical formula; formula given as string
  • append (bool) – Append results to original columns?
  • preset_name (string) – Matminer featurization preset
Kwargs:
  • ignore_errors (bool) – Do not raise an error while parsing formulae; set to True to return NaN for invalid formulae.
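The ignore_errors contract can be illustrated with a minimal sketch. The parser and the atom-count "feature" below are hypothetical stand-ins and are not matminer's implementation; they only show the difference between raising on an invalid formula and returning NaN for it.

```python
import math
import re

# Hypothetical stand-in for a featurizer; this is NOT matminer's parser.
# It only understands flat formulae such as "C6H12O6".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def count_atoms(formula):
    """Return the total atom count of a simple formula string."""
    pos, total = 0, 0
    while pos < len(formula):
        match = _TOKEN.match(formula, pos)
        if match is None:
            raise ValueError(f"cannot parse formula: {formula!r}")
        total += int(match.group(2) or 1)  # a missing count means 1
        pos = match.end()
    if total == 0:
        raise ValueError("empty formula")
    return total


def featurize(formulas, ignore_errors=False):
    """Featurize each formula; optionally map parse failures to NaN."""
    results = []
    for formula in formulas:
        try:
            results.append(float(count_atoms(formula)))
        except ValueError:
            if not ignore_errors:
                raise
            results.append(math.nan)
    return results


# The invalid entry becomes NaN instead of stopping the whole run.
values = featurize(["C6H12O6", "not a formula"], ignore_errors=True)
```

With ignore_errors=False (the behavior sketched as the default here), the same call would raise on the second entry.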

Notes

  • A pre-processor and wrapper for matminer.featurizers.composition

References

Ward, L., Dunn, A., Faghaninia, A., Zimmermann, N. E. R., Bajaj, S., Wang, Q., Montoya, J. H., Chen, J., Bystrom, K., Dylla, M., Chard, K., Asta, M., Persson, K., Snyder, G. J., Foster, I., Jain, A., Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60-69 (2018).

Examples:

import grama as gr
from grama.tran import tf_feat_composition
(
    gr.df_make(FORMULA=["C6H12O6"])
    >> tf_feat_composition(var_formula="FORMULA")
)

grama.tran.tran_scikitlearn module

grama.tran.tran_scikitlearn.tran_tsne

t-SNE dimension reduction of a dataset

Apply the t-SNE algorithm to reduce the dimensionality of a dataset.

Parameters:
  • df (DataFrame) – Data on which to perform dimension reduction
  • var (list or None) – Variables in df on which to perform dimension reduction. Use None to compute with all variables.
  • out (string) – Name of reduced-dimensionality output; indexed from 0 .. n_dim-1
  • keep (bool) – Keep unused columns (outside var) in new DataFrame?
  • append (bool) – Append results to original columns?
  • n_dim (int) – Target dimensionality
Kwargs:
  • n_iter (int) – Maximum number of iterations for optimization. As Wattenberg et al. note, this is the most important parameter in using t-SNE. If you see strange “pinched” shapes, increase n_iter.
  • perplexity (int) – Usually between 5 and 50. Low perplexity means local variations dominate; high perplexity tends to merge clusters.
  • early_exaggeration (float)
  • learning_rate (float)
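As background for the perplexity setting: t-SNE defines the perplexity of a neighbor distribution as 2 raised to its Shannon entropy (in bits), which can be read as an effective number of neighbors. The helper below is an illustrative sketch of that definition only; it is not part of grama or scikit-learn.

```python
import math


def perplexity(probabilities):
    """Perplexity 2**H of a discrete distribution, with H in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy


# Four equally weighted neighbors: the effective neighbor count is four.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# One dominant neighbor: the effective neighbor count is close to one.
peaked = perplexity([0.97, 0.01, 0.01, 0.01])
```

A low target perplexity therefore asks each point to attend to only a few close neighbors, which matches the "local variations dominate" behavior described above.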

Notes

  • A wrapper for sklearn.manifold.TSNE

References

Scikit-learn: Machine Learning in Python, Pedregosa et al. JMLR 12, pp. 2825-2830, 2011.

Wattenberg, Viégas, and Johnson, “How to use t-SNE effectively” (2016) Distill.pub

Examples:

grama.tran.tran_umap module

grama.tran.tran_umap.tran_umap

UMAP dimension reduction of a dataset

Apply the UMAP algorithm to reduce the dimensionality of a dataset.

Parameters:
  • df (DataFrame) – Data to summarize
  • var (list or None) – Variables in df on which to perform dimension reduction. Use None to compute with all variables.
  • out (string) – Name of reduced-dimensionality output; indexed from 0 .. n_dim-1
  • keep (bool) – Keep unused columns (outside var) in new DataFrame?
  • append (bool) – Append results to original columns?
  • n_dim (int) – Target dimensionality
Kwargs:
  • n_neighbors (int) – Assumed number of nearest neighbors in clusters. A smaller value emphasizes local structure; a larger value emphasizes global structure. Coenen and Pearce claim this is the most important hyperparameter for UMAP. default=15
  • min_dist (float) – Minimum distance between mapped points. default=0.1
  • metric (str or function) – Metric used for distance computations. See url: https://umap-learn.readthedocs.io/en/latest/parameters.html#metric
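The naming rule for the out parameter (a prefix indexed from 0 to n_dim-1) can be sketched as follows. The helper name is hypothetical, and the "xi" prefix is an assumption taken from the column names used in the example on this page.

```python
def reduced_names(out, n_dim):
    """Build output column names: the prefix followed by 0 .. n_dim-1."""
    return [f"{out}{i}" for i in range(n_dim)]


# Two target dimensions with the prefix "xi" (assumed from the example).
names = reduced_names("xi", 2)
```

These are the column names to pass to a plotting call after the reduction, one per target dimension.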

Notes

  • A wrapper for umap.UMAP

References

McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018

Andy Coenen, Adam Pearce “Understanding UMAP” url: https://pair-code.github.io/understanding-umap/

Examples

import grama as gr
from grama.data import df_diamonds
from grama.tran import tf_umap
(
    df_diamonds
    >> gr.tf_sample(1000)  # For speed
    >> tf_umap(var=["x", "y", "z", "carat"])
    >> gr.ggplot(gr.aes("xi0", "xi1"))
    + gr.geom_point()
)

Module contents