If you’re anything like me, you’re not satisfied with just one recipe. I want to make carrot cake, but there are a zillion carrot cake recipes online. Which one do I choose? Sure, I can go with a resource I trust, say, Bruno Albouze. But in this endless onslaught of new content and the push to “innovate”, how do I know whether I’m getting a tried-and-true, run-of-the-mill carrot cake or a spongy hydrocolloid pudding of the same name? The point is, I’m not sure whether I want a recipe for carrot cake, or to drill down to the basic ratios that make up a carrot cake. You follow?

The thing is, with carrot cake, and, I’d venture to say, most generic baked goods, the mass of recipes on the internet will tend to be more similar than they are different. Sure, recipe A might have 1 ½ cups of flour to 2 cups of sugar, and recipe B 1 ¾ cups of flour, but you can rest assured that neither will have 6 cups of flour to those 2 cups of sugar. If you could look at a large pool of these recipes, the trends would start to become apparent to you.

Another way of putting it is that recipes are data: data tied up in weird formats and data buried under long personal anecdotes, but data nonetheless. I mean this in the sense that, when aggregated, seemingly meaningless instances suddenly speak of a phenomenon. Although I have no first-hand experience to confirm this, I have no doubt recipe developers already employ this tactic, i.e., Excel sheet-ing every published version of the dish they’re developing and probing the limits of ingredient ratios, cook times, and so on.

I’m spitballing about a world where looking up a recipe isn’t about finding the most exciting-looking version published online. Instead, I’d like to be able to look at a statistical model that describes the most common ingredients and the extremes of the ratios. What’s the most fat as a percentage of the weight of the flour that is used in carrot cake? As the fat increases, does that affect the amount of eggs? How about sugar?

This article is going to tackle pretty much none of that. However, to even scratch the surface of this approach, we would need to parse recipes into quantities, units, and ingredients, so that they could be machine processed en masse. To do that, we’ll need to perform Named Entity Recognition (NER), and that’s what this article is about: building a ready-to-use, trainable NER model for recipes.


To start with, we’re going to need some data. If you’ve ever wanted to machine process recipes en masse, you’ve probably noticed that there’s little in the way of annotated data related to parsing recipes. Enter TASTEset.

It consists of 700 recipe ingredient lists annotated with named entities from 9 classes, along with utilities to convert the annotations to BIO and BILOU formats and the steps to train a BERT-powered NER model. The work they’ve done is great. As far as I know, this is the first dataset of its kind, and it’s freely available. There are functions available to process the data into nearly any common format, and a model training pipeline has already been laid out for you. What more could you ask for?
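If BIO is new to you, the idea is that every token gets a tag: B- for the first token of an entity, I- for the tokens that continue it, and O for everything outside of one. Here is a minimal sketch of that conversion, assuming character-offset annotations and plain whitespace tokenization. The example line and its spans are hand-made for illustration rather than pulled from TASTEset, the QUANTITY and UNIT label names are my assumption, and `spans_to_bio` is a hypothetical helper, not one of the repo’s utilities.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag whitespace-separated tokens with BIO labels from character spans."""
    tagged = []
    cursor = 0
    for token in text.split():
        # Locate the token in the original string to recover its offsets.
        start = text.index(token, cursor)
        end = start + len(token)
        cursor = end
        tag = "O"  # default: the token is outside every entity
        for span_start, span_end, label in spans:
            if start >= span_start and end <= span_end:
                # First token of a span gets B-, later tokens get I-.
                tag = ("B-" if start == span_start else "I-") + label
                break
        tagged.append((token, tag))
    return tagged


# Hand-made example; label names are assumed for illustration.
example = "2 cups all-purpose flour"
annotations = [(0, 1, "QUANTITY"), (2, 6, "UNIT"), (7, 24, "FOOD")]

for token, tag in spans_to_bio(example, annotations):
    print(f"{token:<12} {tag}")
```

BILOU works the same way, with extra tags for the last token of an entity (L-) and for single-token entities (U-).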

Well, I still think there are some gaps. Don’t get me wrong; the TASTEset dataset and repo are a blessing. I just think we can better appropriate this resource for our own purposes. Specifically, I want to address 2 issues:

  1. BERT is (maybe?) overkill. It’s a bit of a flamethrower-to-light-a-candle approach. Compared with plain language, recipes follow a relatively standardized, albeit not completely rigid, format. That means a simpler, and therefore lighter, model should be able to tackle the vast majority of cases. On the academic front, lots of energy goes into benchmarking, while folks like myself are primarily interested in “good enough”.

  2. The TASTEset code isn’t exactly plug-and-play. There isn’t a way to get that system up-and-running for inference without a couple hours of leafing through the repo’s code to pry out the important bits. That’s to say nothing of the time you’d spend writing code to optimize the model for use in a production environment.

Again, these aren’t flaws, they just speak to the authors’ purposes, which are different from ours. So how are we going to appropriate the model?

  1. Training in a spaCy environment.

  2. Bundling into a pipeline that can be used freely for inference.

I’ve chosen not to include the code, line-by-line, in this article. I find it sort of redundant, and it tends to look a little clunky. There’ll be a Colab notebook with explanations of all the steps. You can think of this article as the companion to that notebook. So without further ado, let’s get into it.


This section will defer to the companion notebook. There’s little to be said about this procedure, other than that it’s more or less entirely taken from the training procedure in the spaCy documentation. You can think of this tutorial as a handpicking of the parts that are relevant to NER. Where it deviates from the documentation is in data pre-processing. Fortunately, the TASTEset authors have done a lot of the heavy lifting on that front. Our job is less about hand-processing data than it is a matter of calling the right utility functions. Again, wherever you need extra clarification, you can check the Colab notebook for line-by-line code.


How’d our model perform? Better yet, how do we go about looking at that? spaCy gives us a JSON file with a number of performance metrics, again, sparing us having to do any real calculations. Here’s a table with the precision, recall, and F1-score for each entity. I’ve tacked on the total scores at the end as well:
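For reference, a table like this can be rebuilt straight from the metrics file. The sketch below assumes the JSON uses the keys I’d expect from spaCy v3 (`ents_per_type` with `p`, `r`, and `f` per label, plus overall `ents_p`, `ents_r`, and `ents_f`); check your own file if the names differ. `format_ner_scores` is a hypothetical helper, and the numbers in `sample` are round placeholders, not the scores from this model.

```python
def format_ner_scores(metrics: dict) -> str:
    """Render per-entity and overall precision/recall/F1 as a plain-text table."""
    rows = [f"{'entity':<18}{'P':>8}{'R':>8}{'F1':>8}"]
    for label, scores in sorted(metrics["ents_per_type"].items()):
        rows.append(
            f"{label:<18}{scores['p']:>8.2f}{scores['r']:>8.2f}{scores['f']:>8.2f}"
        )
    # Overall scores go on the last row.
    rows.append(
        f"{'TOTAL':<18}{metrics['ents_p']:>8.2f}"
        f"{metrics['ents_r']:>8.2f}{metrics['ents_f']:>8.2f}"
    )
    return "\n".join(rows)


# Placeholder values for illustration only; NOT this article's results.
sample = {
    "ents_p": 0.75,
    "ents_r": 0.75,
    "ents_f": 0.75,
    "ents_per_type": {
        "FOOD": {"p": 0.5, "r": 0.5, "f": 0.5},
        "PART": {"p": 0.25, "r": 0.25, "f": 0.25},
    },
}

print(format_ner_scores(sample))
```

In practice you would load the real file with `json.load` and pass the resulting dictionary in place of `sample`.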

And look at that, an F1 of 92%! The way I see it, the entities that perform the best are the ones we’re most interested in. The model is having a seemingly hard time recovering PART and PURPOSE entities. PART would be attributes like “thighs” in “boneless, skinless chicken thighs”. PURPOSE is for disclaimers, like “to serve” in “mint leaves to serve”.

These entities are being mislabeled, which isn’t necessarily that bad, depending on what they’re being mislabeled as. I’m fine with PART being mashed in with FOOD. In fact, I think I’d rather “egg yolk” be labeled as FOOD, rather than FOOD - PART. As for PURPOSE, the model could drop that altogether, for all I care. A confusion matrix should give us some more clarity:

As for PURPOSE, the fact that it doesn’t show occurrences of mislabeling tells us that PURPOSE spans aren’t being recovered as entities at all. That’s what we’d hoped for. As for PART, it looks like it’s being confused for FOOD in all but one instance. Again, that’s what we wanted. Frankly, in a future iteration I might do away with the odd entities and expand the FOOD category.

In fact, if you look at PROCESS and PHYSICAL_QUALITY, it looks like a lot of the more niche entities are tending towards being considered part of the FOOD labeling. I suppose that’s what you get with a rough-and-tumble model trained on such a modest dataset.
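If you want to tally a confusion matrix like this yourself, it boils down to counting (gold label, predicted label) pairs. Here is a minimal sketch, assuming you already have token-level gold and predicted labels aligned one-to-one, with “O” standing in for “no entity”. The helper names are hypothetical and the toy labels are made up to mirror the PART and PURPOSE behavior described above; they are not the model’s real output.

```python
from collections import Counter
from typing import List


def confusion_counts(gold: List[str], predicted: List[str]) -> Counter:
    """Count how often each gold label was predicted as each label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be the same length")
    return Counter(zip(gold, predicted))


def format_matrix(counts: Counter) -> str:
    """Lay the counts out with gold labels as rows and predictions as columns."""
    labels = sorted({label for pair in counts for label in pair})
    width = max(len(label) for label in labels) + 2
    rows = [" " * width + "".join(f"{label:>{width}}" for label in labels)]
    for gold_label in labels:
        # A Counter returns 0 for pairs that never occurred.
        cells = "".join(f"{counts[(gold_label, pred)]:>{width}}" for pred in labels)
        rows.append(f"{gold_label:<{width}}{cells}")
    return "\n".join(rows)


# Toy, hand-made labels for illustration only.
gold = ["QUANTITY", "UNIT", "FOOD", "PART", "PART", "PURPOSE", "PURPOSE"]
predicted = ["QUANTITY", "UNIT", "FOOD", "FOOD", "FOOD", "O", "O"]

counts = confusion_counts(gold, predicted)
print(format_matrix(counts))
```

Reading along a row tells you where a gold label ended up: here PART lands in the FOOD column, while PURPOSE lands in the “O” column, meaning it was dropped rather than mislabeled.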

Inference (or, actually using the model)

So, we’ve got our model. Let’s go test it out in the wild. The first thing we’re going to need is a recipe. Grabbing an ingredient list from an online recipe tends to be pretty straightforward. Barring any full-fledged discussions on recipe scraping, what you need to know is that most recipe webpages store metadata (including the ingredients list) conveniently in a <script> tag with type application/ld+json. If you want to read up on it, here’s a primer, including a discussion on approaching recipes that aren’t available in application/ld+json tags. Bear in mind that you’ll need to know a little bit about HTML and DOM traversal to make it work, but that’s well worth your while. Fortunately for us, this work has been bundled into a super-easy-to-use open-source Python package (that the TASTEset authors just so happen to have used to build the dataset!).
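To make the metadata idea concrete, here is a rough standard-library sketch of pulling the ingredients out of such a tag by hand. It assumes the page embeds schema.org Recipe data with a `recipeIngredient` list, either at the top level, inside a list, or under an `@graph` key. The class and function names are hypothetical, the sample page is a toy, and this is not how the scraping package works internally.

```python
import json
from html.parser import HTMLParser
from typing import List


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[str] = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def find_ingredients(html: str) -> List[str]:
    """Return the first recipeIngredient list found in the page, or []."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed metadata
        if isinstance(data, dict):
            candidates = data.get("@graph", [data])
        elif isinstance(data, list):
            candidates = data
        else:
            continue
        for item in candidates:
            if isinstance(item, dict) and "recipeIngredient" in item:
                return list(item["recipeIngredient"])
    return []


# Toy page for illustration; real pages carry far more metadata.
page = """
<html><head>
<script type="application/ld+json">
{"@type": "Recipe", "name": "Toy cake",
 "recipeIngredient": ["1.5 cups pecans, finely chopped", "2 large eggs"]}
</script>
</head><body></body></html>
"""

print(find_ingredients(page))
```

Real-world pages are messier than this, which is exactly why leaning on a maintained scraping package is the better choice.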

The recipe I’ll be using here is Chef John's Pecan Sour Cream Coffee Cake from Allrecipes. It’s one of those coffee cakes with the gooey, nutty streusel filling that makes you go “why bother with the cake?”. Oh, and if you don’t know who Chef John is, what are you doing reading this article? Go lose yourself down the rabbit hole of his hypnotic repertoire of soothing and witty cooking tutorials on his Food Wishes YouTube channel.
So, a couple lines of code later and we’ve got ourselves the ingredients as a Python list:

['1.5 cups pecans, finely chopped',
 '0.33333334326744 cup white sugar',
 '0.33333334326744 cup packed light brown sugar',
 '3 tablespoons melted butter',
 '1 teaspoon cinnamon',
 '0.125 teaspoon salt',
 '1.875 cups all-purpose flour',
 '1 teaspoon baking powder',
 '0.75 teaspoon baking soda',
 '0.5 teaspoon fine sea salt',
 '1 cup white sugar',
 '0.5 cup unsalted butter, softened',
 '2 large eggs',
 '1 cup sour cream or creme fraiche',
 '1.5 teaspoons vanilla extract']

Do you see the problem? Not yet? Alright, let’s try parsing it into entities. And for the purposes of our display below, entities are highlighted according to their kind.

Note: The scraper we used doesn’t separate components of the cake (e.g., the cake part and the crumble part); that’s why you’ll see some ingredients repeated (e.g., sugar in the Food Wishes recipe).

Notice that the model can’t consistently recognize decimals as quantities. 

It seems like the fractions are being converted to floating points in the application/ld+json tag, including that ugly thing computers tend to do with floats. The training data doesn’t contain anything like this, and our model can’t seem to get past it. It might be tempting to just avoid these recipes when using our model, but I can say, at least anecdotally, that Allrecipes seems to have a near monopoly on user-submitted recipes in English, and I doubt we’ll get much use out of a recipe parser that can’t deal with its formatting.
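As for where a number like 0.33333334326744 comes from: it is consistent with 1/3 having been stored as a 32-bit float somewhere along the way and printed back out with more digits than that format can vouch for. That’s an inference from the digits, not something I’ve confirmed about the site. A small sketch to see the effect, using a hypothetical `through_float32` helper:

```python
import struct


def through_float32(value: float) -> float:
    """Round-trip a Python float (64-bit) through 32-bit storage."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Fractions whose denominator is a power of two survive the trip intact;
# 1/3 does not, because it has no finite binary representation.
for label, value in [("1/3", 1 / 3), ("1/8", 1 / 8), ("1 7/8", 15 / 8)]:
    print(f"{label:>6} -> {through_float32(value)!r}")
```

That also explains why 0.125 and 1.875 look clean in the list above while the thirds don’t.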

The simple way around this issue is to add a text preprocessing function that runs before we pass the ingredients to our spaCy pipeline:

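from fractions import Fraction
import re

def fraction_to_mixed_number(fraction: Fraction) -> str:
    """Render a Fraction as a mixed number, e.g. 15/8 becomes '1 7/8'."""
    if fraction.numerator >= fraction.denominator:
        # Factor out the whole part; divmod returns (quotient, remainder).
        whole, remainder = divmod(fraction.numerator, fraction.denominator)
        if remainder == 0:
            return str(whole)
        return f"{whole} {Fraction(remainder, fraction.denominator)}"
    return str(fraction)

def convert_floats_to_fractions(text: str) -> str:
    """Replace every decimal number in text with its mixed-number form."""
    # limit_denominator() snaps float noise such as 0.33333334326744 to 1/3.
    return re.sub(
        r'\b-?\d+\.\d+\b',
        lambda match: fraction_to_mixed_number(
            Fraction(float(match.group())).limit_denominator()
        ),
        text,
    )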

The nearly unreadable convert_floats_to_fractions uses a regular expression to grab floats, convert them to fractions, and then factor out whole numbers from said fractions, to return what I’ve just learned is called a mixed number. As an added bonus, it was written by everyone’s new favorite dystopian NLP robot. In terms of how to implement this, we have a few options. Each one comes with its advantages and disadvantages, which I’ll discuss here.

Add a pipe in the spaCy pipeline before the ner pipe.

This one is advantageous for the fact that the normalization function is baked into the pipeline, meaning it’ll work out of the box. I’m not crazy about this option, though. Any function that cleans up the text will have to loop through the tokenized document and replace characters. On top of that, I’ve poked around online, and middle-of-the-pipeline normalization just doesn’t seem to be a standard practice.

Overwrite the pipeline’s tokenizer.__call__() method.

From an engineering perspective, this one is ideal. The tokenizer is optimized for efficiency, and if we could work the text normalization logic into the function, then we could replace the decimals in the same loop as the initial tokenization. Unfortunately, efficiency comes at the cost of inaccessibility; I would hardly know where to begin with this. It would mean tracking down the Tokenizer class definition in the spaCy source code, writing some Cython (the tokenizer is compiled, not plain Python) to fit our purposes, and possibly rebuilding the package with our changes. If this is in your wheelhouse, or you’re up for a challenge, then go for it! That said, for an article that champions simplicity of implementation, this approach is out of the question.

Create an external text normalization function that runs before the text enters the pipeline.

On the surface, this seems like the worst option, but it’s the one I’m going with. Let me explain: I’m not overriding the tokenizer.__call__(), meaning we’ve already settled on compromised efficiency. For reasons of transparency and convention, I think it’s best to give the user full control over whether and how this normalization function is implemented, even if that comes at the cost of convenience.
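In code, this option amounts to a thin wrapper that applies the normalizer, if one is given, and then hands each line to the parser. A minimal sketch follows. `parse_lines` is a hypothetical helper of my own naming; in practice `parser` would be the loaded spaCy pipeline (which is callable on a string) and `normalize` would be `convert_floats_to_fractions`. To keep the example self-contained, it uses simple stand-ins for both.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

Parsed = TypeVar("Parsed")


def parse_lines(
    lines: Iterable[str],
    parser: Callable[[str], Parsed],
    normalize: Optional[Callable[[str], str]] = None,
) -> List[Parsed]:
    """Optionally normalize each ingredient line, then parse it."""
    results = []
    for line in lines:
        # The caller decides whether, and how, the text gets cleaned up.
        text = normalize(line) if normalize is not None else line
        results.append(parser(text))
    return results


# Stand-ins for illustration: str.split plays the role of the pipeline, and a
# single replace() plays the role of the real normalization function.
def demo_normalize(line: str) -> str:
    return line.replace("0.5", "1/2")


print(parse_lines(["0.5 cup unsalted butter, softened"], str.split, demo_normalize))
print(parse_lines(["0.5 cup unsalted butter, softened"], str.split))
```

Keeping the normalizer as an argument means anyone who doesn’t want it, or wants a different one, can swap it out without touching the pipeline itself.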

Here’s what the recipe looks like after pre-processing:

['1 1/2 cups pecans, finely chopped',
 '1/3 cup white sugar',
 '1/3 cup packed light brown sugar',
 '3 tablespoons melted butter',
 '1 teaspoon cinnamon',
 '1/8 teaspoon salt',
 '1 7/8 cups all-purpose flour',
 '1 teaspoon baking powder',
 '3/4 teaspoon baking soda',
 '1/2 teaspoon fine sea salt',
 '1 cup white sugar',
 '1/2 cup unsalted butter, softened',
 '2 large eggs',
 '1 cup sour cream or creme fraiche',
 '1 1/2 teaspoons vanilla extract']

Much better! This is a bit of a compromise. The preprocessing function turns all float-like numbers into mixed fractions (“mixed” meaning that the whole numbers are factored out). I think that’s the sensible choice for volumetric measurements, but it might be less common for imperial weight measurements. Units like pounds and ounces tend to be expressed as decimals rather than fractions. I ran the model on imperial weight measurements converted to fractions to see if it would work nonetheless, and it went swimmingly. That is, if you can get past some ugly conversions (16.3 oz. converts to 16 3/10 oz. Yuck.). As far as compromises go, I’d say we did pretty well.

Okay, one more recipe for good measure? Let’s go with another coffee cake. This time it’s from the wholesome, upbeat John Kanell, of the YouTube channel/blog Preppy Kitchen. You’ll note that this domain isn’t officially supported by our scraper. As long as we enable wild_mode, we get the ingredients without any problems. Here are the entities that the model coughed up. Again, as above, entities are highlighted according to their kind.

I’d say that’s pretty good! It’s not perfect. There are inaccuracies in the ingredients, at least if you consider light brown sugar and brown sugar to be significantly different. It also misses two measurements. Fortunately, those are alternates, and the primary measurement is being registered. The only glaring omission is the pecans, which seem to have gone missing from the data.

Considering how little time it took to set up and train, and the incredibly modest footprint of our model, I’d say this is pretty good!


I hope this has expanded your idea of what constitutes data. What’s more, I hope this has helped bridge the gap between theoretical NLP and its practical applications. Using a bunch of out-of-the-box code and functions, we were able to train a quick, lightweight, and personalized model in, like, a half hour of total work. That’s not so bad at all.

You may have noticed that I’m running the model separately on individual lines of recipes, whereas it was trained on whole recipes joined by spaces. I tried it both ways, and I prefer the results of going line-by-line. Best practice would probably dictate that I train the model on individual lines if that’s how I’m planning on using it. But I didn’t, because of laziness. It worked well enough here, and I don’t see a reason to bother.

That’s the story here; I cut corners and hacked together a quick program, and we got respectable results. Folks tend to shy away from data science projects, intimidated by the endless technicalities, conventions, and stipulations. Let this be a testament to the value of putting the cart before the horse: just try it out; you’ve got nothing to lose.
