Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval

Fangda Han
3 min read · Jan 16, 2019


Link

https://dl.acm.org/citation.cfm?id=3240627

Task

Given a recipe (title + instructions + ingredients), they retrieve the corresponding image from the dataset, and vice versa (image-to-recipe retrieval).

Method

During training, both the recipe and the image are available. For each component of the recipe, they first encode it with a Bi-GRU; then a learned weight matrix W_w and bias b_w map each word (sentence) feature to a vector u_t, which is used to compute attention.
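Here is a minimal PyTorch sketch of how I read this encoding step. The sizes, the variable names, and the tanh nonlinearity after the linear layer are my own assumptions, not details from the paper.

    import torch
    import torch.nn as nn

    # Hypothetical sizes -- the paper's actual dimensions may differ.
    vocab_size, emb_dim, hidden_dim = 10000, 300, 256

    embed = nn.Embedding(vocab_size, emb_dim)
    encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
    # W_w and b_w: a linear layer (here followed by tanh, a common choice).
    proj = nn.Linear(2 * hidden_dim, 2 * hidden_dim)

    tokens = torch.randint(0, vocab_size, (1, 8))   # one title with 8 words
    h, _ = encoder(embed(tokens))                   # h: (1, 8, 2 * hidden_dim)
    u = torch.tanh(proj(h))                         # u_t for every word t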

Attention here means a weight for each word. Taking the title as an example, each word in the title gets its own attention, so each word should end up with a scalar representing how important it is.

In order to get the attention of each word (sentence), u_t has to be turned into a scalar. There are a few ways to do this:

  1. Let W_w directly output a scalar instead of a vector; in other words, u_t itself is a scalar and is used as the attention.
  2. Learn a shared vector u_w (the so-called context vector) with the same length as u_t, then compute the attention as the dot product of u_t and u_w. This is what the paper does.

For method 2, the context vector u_w is learned, so the hope is that it becomes a good general representation of what an informative word looks like across all the titles. To be honest, I could not fully grasp the intuition behind this part.
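For concreteness, here is how I would sketch method 2 in PyTorch. The softmax normalization over words and the weighted sum over the Bi-GRU outputs are my assumptions about the standard way this kind of attention is pooled; the paper's exact formulation may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d = 2 * 256                               # assumed Bi-GRU output size
    u_w = nn.Parameter(torch.randn(d))        # shared, learned context vector

    def attend(h, u):
        # h: Bi-GRU outputs (batch, T, d); u: u_t vectors, same shape
        scores = (u * u_w).sum(dim=-1)            # dot product with u_w -> (batch, T)
        alpha = F.softmax(scores, dim=1)          # one attention weight per word
        return (alpha.unsqueeze(-1) * h).sum(1)   # weighted sum -> component feature

    # e.g. title_vec = attend(h, u) with h, u from the encoding sketch above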

The rest is straightforward: a classic ResNet-50 extracts the image feature, and cosine similarity between the recipe and image embeddings is used for the loss.
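As a rough sketch of what a cosine-similarity objective could look like (this is my simplification for matched pairs only; the paper likely also involves mismatched pairs or a ranking scheme that I am not reproducing here):

    import torch.nn.functional as F

    def cosine_loss(recipe_emb, image_emb):
        # Push matched recipe/image embeddings toward cosine similarity 1.
        return (1 - F.cosine_similarity(recipe_emb, image_emb, dim=-1)).mean()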

Experiment

In Table 3, they achieve state of the art on the Recipe1M dataset; however, they do not report standard deviations, which makes the results harder to trust. Taking MedR for im2recipe as an example: if the standard deviation of attention+SR in the 1K setting is more than 0.5, then a direct comparison between 5.2 (JNE+SR) and 4.6 (attention+SR) does not mean much.

Limitation

  1. In Sec. 4.6 they mention using the raw recipe instead of the processed one; it is unclear to me whether they also change the ingredients at the same time.
  2. The image feature is fixed during training; fine-tuning the image branch or adding image attention might further improve the results.
