So ... how does one evaluate the quality of a framework + benchmark without results like this? I feel like this would be a lot more compelling if they had at least some reasonable baseline models, with result metrics and analysis showing that the tasks are both tractable and good at revealing differences in performance and behavior (i.e. the problems should be neither too easy nor too hard). For example, what if Amazon purchase/churn behavior is largely independent of review behavior, and review behavior is mostly dictated by "did some company use this account to generate fake 5-star reviews?"