Hacker News

I agree with the suggestion to de-bias the application and not the representation itself.

Recently I was using a version of Conceptnet Numberbatch (word embeddings built from ConceptNet, word2vec, and GloVe data that perform very well on evaluations) as an input to sentiment analysis. So its input happens to include a crawl of the Web (via GloVe) and things that came to mind as people played word games (via ConceptNet). All of this went into a straightforward support vector regression with AFINN as training data.
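The pipeline above can be sketched roughly as follows. This is a minimal stand-in, not the real system: the tiny random vectors play the role of Numberbatch embeddings, and the hand-written integer scores play the role of the AFINN lexicon.

```python
# Sketch: word embeddings as features, AFINN-style scores as targets,
# fed into a plain support vector regression. Embeddings and scores
# here are made-up stand-ins, not real Numberbatch / AFINN data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
vocab = ["good", "great", "bad", "awful", "table", "run"]
embeddings = {w: rng.normal(size=8) for w in vocab}  # stand-in vectors

# AFINN-style integer sentiment labels for the training words only
afinn = {"good": 3, "great": 3, "bad": -3, "awful": -3}

X = np.array([embeddings[w] for w in afinn])
y = np.array([afinn[w] for w in afinn], dtype=float)

model = SVR(kernel="linear").fit(X, y)

# Any word with an embedding now gets a predicted score, including
# words AFINN never labeled -- which is where the bias leaks in.
for w in ["table", "run"]:
    print(w, model.predict(embeddings[w].reshape(1, -1))[0])
```

The point of the sketch is the last step: because the regression generalizes over the embedding space, every word inherits a sentiment score, including ethnicity and nationality terms the lexicon never mentioned.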

You can probably see where this is going. The resulting sentiment classification of words such as "Mexican", "Chinese", and "black" would make Donald Trump blush.

I think the current version is less extreme about it, but there is still an effect to be corrected: it ends up with slightly negative scores for most words that describe groups of people, and the effect grows the more dissimilar a group is from the American majority.

So my correction is to add words that describe groups of people to the sentiment analyzer's training data, with a lot of weight, constraining their output to be 0.
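One way that correction can look in code, under the same stand-in assumptions as before (made-up embeddings and labels, and a hypothetical short list of identity terms), is to append the identity words with a target of 0 and a large `sample_weight`, which sklearn's `SVR.fit` accepts directly:

```python
# Sketch of the correction: identity terms are appended to the training
# set with target sentiment 0 and heavy sample weight, pulling the
# regression toward neutrality on them. All data here is made up.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
dim = 8
afinn_words = {"good": 3, "bad": -3, "happy": 2, "terrible": -3}
identity_words = ["mexican", "chinese", "black", "white"]  # target = 0

emb = {w: rng.normal(size=dim) for w in list(afinn_words) + identity_words}

X = np.array([emb[w] for w in afinn_words] +
             [emb[w] for w in identity_words])
y = np.array(list(afinn_words.values()) + [0.0] * len(identity_words))

# Heavy weight on the identity terms makes their 0 target dominate.
weights = np.array([1.0] * len(afinn_words) +
                   [10.0] * len(identity_words))

model = SVR(kernel="linear").fit(X, y, sample_weight=weights)

for w in identity_words:
    print(w, model.predict(emb[w].reshape(1, -1))[0])
```

With the weights set high enough, the regression treats the neutral targets as nearly hard constraints, so group-describing words land near 0 while ordinary sentiment words keep their learned scores.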





