Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval
Link
https://dl.acm.org/citation.cfm?id=3240627
Task
Given a recipe (title + instructions + ingredients), they try to retrieve the corresponding image in the dataset, and vice versa.
Method
During training, they have both the recipe and the image. For each component of the recipe, they first encode it with a Bi-GRU; then they learn a weight matrix W_w and bias b_w that map each word (or sentence) feature to a vector u_t, which serves as the attention vector.
Attention here means a weight for each word. Taking the title as an example, each word in the title has its own attention, so each word should get a scalar representing its attention weight.
In order to get the attention of each word (sentence), they need to reduce u_t to a scalar. There are a few ways to do this:
- Let W_w directly output a scalar instead of a vector; in other words, u_t itself is a scalar and is used as the attention weight.
- Learn a shared vector u_w (the so-called context vector) with the same length as u_t, then compute the attention by a dot product. This is what they do in the paper.
For the second method, the context vector u_w is learned, so the hope is that it becomes a good general representation of all the words across all the titles. Honestly, I could not understand this part well.
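The two scoring options can be sketched as follows. This is a minimal NumPy sketch under the standard hierarchical-attention formulation (u_t = tanh(W_w h_t + b_w), weights = softmax of the scores), not the authors' actual code; all dimensions and variable names are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, H, A = 5, 8, 8            # words in the title, Bi-GRU feature dim, attention dim (assumed)
h = rng.normal(size=(T, H))  # word features from the Bi-GRU, one row per word

# Shared projection to attention vectors: u_t = tanh(W_w h_t + b_w)
W_w = rng.normal(size=(H, A))
b_w = np.zeros(A)
u = np.tanh(h @ W_w + b_w)   # (T, A)

# Option 1: let W_w output a scalar directly (i.e. A = 1)
w_scalar = rng.normal(size=(H, 1))
scores1 = np.tanh(h @ w_scalar).ravel()  # (T,) one scalar per word
alpha1 = softmax(scores1)

# Option 2 (the paper's choice): dot product with a learned context vector u_w
u_w = rng.normal(size=A)
scores2 = u @ u_w                        # (T,) one scalar per word
alpha2 = softmax(scores2)

# Attention-weighted title representation
title_vec = alpha2 @ h                   # (H,)
print(alpha2)  # the per-word attention weights, summing to 1
```

Either way, the scalar scores are normalized with a softmax so the weights over the words sum to 1 before pooling.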
The rest is straightforward: a classic ResNet-50 extracts the image feature, and cosine similarity is used as the loss.
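At retrieval time, cosine similarity between the recipe embedding and each candidate image embedding gives the ranking. A minimal sketch, assuming a shared embedding space (the dimension, names, and random data here are all made up for illustration):

```python
import numpy as np

def cosine_sim(query, gallery):
    # cosine similarity between one vector and each row of a matrix
    query = query / np.linalg.norm(query)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return gallery @ query

rng = np.random.default_rng(1)
D, N = 1024, 10                        # shared embedding dim and gallery size (assumed)
recipe_emb = rng.normal(size=D)        # attention-pooled recipe embedding
image_embs = rng.normal(size=(N, D))   # ResNet-50 features mapped into the shared space

sims = cosine_sim(recipe_emb, image_embs)
ranking = np.argsort(-sims)            # best-matching image first
print(ranking[0])
```

During training the same similarity is pushed up for matching recipe-image pairs and down for non-matching ones; the ranking above is just the inference step.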
Experiment
In Table 3, they achieve SotA on the Recipe1M dataset; however, they do not report standard deviations, which makes the results less trustworthy. Taking MedR for im2recipe as an example: if the std of attention+SR in the 1K experiment is more than 0.5, then directly comparing 5.2 (JNE+SR) with 4.6 (attention+SR) does not mean much.
Limitation
- In Sec. 4.6, they mention using the raw recipe instead of the processed recipe. My confusion is whether they also changed the ingredients at the same time.
- In their paper, the image feature is fixed during training; fine-tuning the image branch or adding image attention might further improve the results.