One of the most exciting uses of machine learning is scientific discovery. Sometimes, long before a discovery is officially made, people have written about similar ideas, and one can speculate about what might have happened if some dots had been connected sooner.

This is exactly what the mat2vec model (code, paper) seeks to do with the help of AI. It is a word embedding model trained on materials science literature that is able to capture latent knowledge. For example, the authors show that you can predict material properties such as thermoelectricity from the similarity of the embedding of “thermoelectric” to the embeddings of different materials.

Another project, HyperFoods, uses graph neural networks with the goal of identifying food-based cancer-beating molecules.

Inspired by these projects, I hacked together veg2vec, a word embedding model trained on scientific abstracts about the health effects of different plant-based foods.

Contents

Veg2vec

Steps for developing the model

  • I retrieved abstracts from the PubMed API, using various plant-based ingredients as search keywords
  • Then, with some help from a family member, I annotated about 1000 of the abstracts, following an approach similar to the mat2vec paper: relevant if about plants and human health or diet, and not relevant otherwise (with a small number marked as doubtful)
  • Neither of us is a subject matter expert, so although it often seemed quite obvious how an article should be labelled, a scientist in this field might have labelled some of them differently.
  • Then I trained a simple logistic regression model using TF-IDF features, with 5-fold cross-validation and a small held-out test set on which it achieved an F1 score of 0.8
  • I then used the model to filter the abstracts from over 520k to around 128k.
  • Then I modified the gensim word2vec model used in mat2vec and trained it on the filtered abstracts
  • I evaluated it on the grammar analogies they had provided and also made some new ones, for instance pairing common names of ingredients with scientific names, e.g. common name to species: asparagus - asparagus_officinalis + brassica_oleracea ≈ broccoli

  • The accuracies are lower than those presented in the mat2vec paper, suggesting that the analogies are not as well aligned with the dataset and/or that the dataset could be improved:
relationship               accuracy
-------------------------  --------
common-family              0.252381
common-genus               0.019231
common-species             0.318182
family-genus               0.125000
gram2-opposite             0.339827
gram3-comparative          0.457895
gram4-superlative          0.345455
gram5-present-participle   0.220816
gram7-past-tense           0.256354
gram9-plural-verbs         0.478333
total                      0.274692

Food embeddings

Ideally, the word embeddings for foods should be close if the foods have similar nutritional properties and health effects, but they can also be expected to group in other ways.

Before we look at results, please note:

  • I am not an expert in health, nutrition, plant science, or related fields, so there are likely to be mistakes.
  • The results are based on co-occurrence of words in the dataset and must not be considered to have scientific validity - do not rely on them in any way!

Presence of cancer beating molecules

  • I found a table, which I understand is based on the HyperFoods project, containing the common and scientific names of several plant-based ingredients and, for each one, a list of the cancer-beating molecules (CBMs) found in it. I augmented it with food groups from FooDB
  • To match the scientific names with the words in the vocabulary, I used difflib
  • This method matches words using an algorithm that, according to the documentation, aims to compute a “human-friendly diff” between two sequences. Sometimes the match is wrong, for instance when a similar enough result does not exist in the vocabulary: actinidia chinensis (kiwi) was not present as a single ngram, and was matched to schizandra chinensis (five-flavour berry) instead
  • In the results below, I filtered out matches where the scientific term did not match correctly or where I was unsure whether the match was correct
  • In the plot below, the vectors for the scientific names are used, and the size of the points is based on the number of CBMs present in an ingredient according to the dataset
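The name-matching step above can be sketched with the standard library's difflib. The vocabulary here is a toy stand-in for the real veg2vec vocabulary, and the normalization (lowercasing, joining words with underscores) is an assumption about how the ngrams were tokenized:

```python
import difflib

# Toy vocabulary of species-name ngrams (stand-in for the model vocabulary).
vocabulary = [
    "asparagus_officinalis",
    "brassica_oleracea",
    "schizandra_chinensis",
]

def match_scientific_name(name, vocab, cutoff=0.6):
    """Match a scientific name to the closest vocabulary token.

    get_close_matches returns the best 'good enough' matches above the
    cutoff, or an empty list if none qualify.
    """
    token = name.lower().replace(" ", "_")
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A name present in the vocabulary matches exactly, but a species absent
# from it, like actinidia chinensis (kiwi), may be matched to a merely
# similar token such as schizandra_chinensis - hence the manual filtering
# of dubious matches described above.
```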