Ask HN: A tool for writing English that checks “popularity” of used sentences?
126 points by twa927 on Oct 20, 2016 | 48 comments
As a non-native English speaker, I find that the best way to check grammar is to google whole parts of sentences (in apostrophes - exact match). That's because there are multiple exceptions to language rules, and some wording can just feel "not right" despite being correct.

Is there a tool that does something like this automatically?

I thought about writing such a tool myself, but it seems there are no good-quality, free search engine APIs that allow many calls. Or maybe there are some open APIs to book dumps or something similar?




You might like to check out Writefull: http://writefullapp.com/


Wow, perfect. I've been looking for something like this for years.


Great app. Are you working on Android version as well?


Not my app, I'm afraid :) I just found out about it through my company.


AFAIK, an ex-Googler had that very same itch and he founded http://www.linguee.com to try to solve it.


I've found Linguee very useful for English -> French translation.

I think it draws heavily from the huge corpus of professionally translated EU regulations and documents.


Agreed, it has been extremely useful to me too for translating various Hungarian technical terms into English. Naming classes and database tables is much easier this way, because most often the right term cannot be found in even the most detailed technical dictionaries, but Linguee somehow just knows it. And it also shows the context, so you can be very confident in your choice.


There are quite a few Ngram datasets available https://www.google.com/search?q=download+n-gram+dataset

... these are almost certainly used in many spelling and grammar checkers (to help with cases where the same spelled word is used in different contexts).

http://www.aclweb.org/anthology/W12-0304
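
As a rough Python sketch of how a checker might use such counts (the counts dict, threshold, and function name here are purely illustrative; in practice you would load real frequencies from one of those datasets):

  from typing import Dict, List, Tuple

  def suspicious_trigrams(sentence: str, counts: Dict[str, int],
                          threshold: int = 40) -> List[Tuple[str, int]]:
      """Return (trigram, frequency) pairs rarer than `threshold` in the corpus."""
      words = sentence.lower().split()
      flagged = []
      for i in range(len(words) - 2):
          tri = " ".join(words[i:i + 3])
          freq = counts.get(tri, 0)
          if freq < threshold:
              flagged.append((tri, freq))
      return flagged

  # Toy counts: "despite of the" should surface as a rare (suspicious) trigram.
  demo_counts = {"despite the rain": 5200, "despite of the": 12}
  print(suspicious_trigrams("despite of the rain we left early", demo_counts))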


Yes, I remember trying to use the Google Books Ngram Dataset [1], but it was too tedious for me to set up and maintain a server with the data just for a quick-and-dirty tool (that's why I asked for a ready-made API). Still, using it is probably a nice idea for a more ambitious side project or even a startup.

EDIT: Actually, I would happily pay for a tool that implements the idea. Grammarly has paid plans, but $30/month is too steep for my type of usage, and the grammar checks it performs are not exactly what I need (which is knowing what real people in real situations actually use).

[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2....


We (foxtype) actually have a dev tool that does exactly this.

If we publish it as an online tool, do you think people would find it useful?

We have multiple corpora, some neural-network language models, etc.


LanguageTool has limited support for using Google's n-gram data to find spelling errors. It only uses 3-grams, and only for a list of commonly confused words. I'm not aware of any Free Software that does better.

http://wiki.languagetool.org/finding-errors-using-n-gram-dat...
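
Roughly how that kind of confusion-set check works, as I understand it - a toy Python sketch (the counts and names here are made up, not LanguageTool's actual code):

  from typing import Dict, List

  def best_variant(left: str, right: str, confusion_set: List[str],
                   counts: Dict[str, int]) -> str:
      """Pick the confusion-set member whose surrounding trigram is most frequent."""
      scored = [(counts.get(f"{left} {w} {right}", 0), w) for w in confusion_set]
      return max(scored)[1]

  counts = {"over there now": 900, "over their now": 15, "over they're now": 3}
  print(best_variant("over", "now", ["their", "there", "they're"], counts))
  # -> "there"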


I wonder if there is a tool like this:

1. You enter a sentence

2. It gives out 5 different ways to say the exact same thing.

Such a tool would help not only ESL speakers but also native speakers looking for more relaxed or formal versions of a sentence.


We're building a tool that does something similar, but for email. Currently we're targeting cold sales emails: the idea is that you enter a recipient's email address, we aggregate data about them, and we surface relevant, personable sentences that you can use in the email. You'll also be able to change the tone of these sentences (funny, professional, casual, etc.).

Learn more: http://emailfox.co


That's ... disgustingly creepy on first glance.


How do you mean? A tool to help you relate to the folks you're cold-emailing? It seems like a way to find common ground.

Maybe I am misreading you, but perhaps you think the effect is a dishonest one? That because the language EmailFox helps you find isn't the phrasing you improvised at first blush, it's not your copy?

If the UX is strong, think of the wonders this could do for non-fiction writing!


Something like this will only work if your clients don't screw it up by spamming the same targets over and over.

If you can somehow get it through to your clients that they should only ever spam one target once, then hell, I'm for it.


Yup, agreed. I think over time we'll also try to compensate for this by building an ML model from all the emails being sent, to provide more 'fuzziness'.


I don't want to sound harsh, but the copy on your website does not inspire much confidence in a tool that I am supposed to use for writing.


You're not being harsh at all. I'd appreciate feedback: which parts do you think we should improve?

I'll admit we kind of rushed the sign up page as we've been busy building the product.


It's a pretty hard problem to solve for the general case. It's actually quite rare for sentences to not be unique. Example: https://www.google.de/search?rls=en&q=%22It+gives+out+5+diff...


Yet both "It gives out" and "different ways to say the exact same thing" give many thousands of results. So "5" should be recognized as a template variable or the meaning should be combined from popular fragments.
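
A quick Python illustration of the template-variable idea (the <num> placeholder is just my choice):

  import re

  def normalise(phrase: str) -> str:
      """Replace standalone numerals with a placeholder before the frequency lookup."""
      return re.sub(r"\b\d+\b", "<num>", phrase.lower())

  print(normalise("It gives out 5 different ways to say the exact same thing"))
  # -> "it gives out <num> different ways to say the exact same thing"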


Perhaps one way to make such a tool would be to try replacing each word with one of its synonyms, but that wouldn't change the form of the sentence.

I thought at first that it might be worth using Google Translate (or Bing, etc.) to translate a sentence from English into another language and back again. I was surprised to find that most of the results are grammatically incorrect, and at least one, while almost grammatically correct, has a quite different meaning (Eng. -> Latin -> Eng.).

And unfortunately there wasn't much variation to be seen either.

It seems that the verb "to be" is easily mangled, and some languages seem to require a definite article where the original English does not.

Original English: It's actually quite rare for sentences to not be unique.

Spanish: En realidad es bastante raro para frases para no ser único.

Back to English: It's actually quite rare for phrases not to be unique.

German: Es ist eigentlich ziemlich selten für Sätze nicht einzigartig sein.

Back to English: It's actually quite rare for phrases not to be unique.

Japanese: 文章は一意ではないことは実際には非常にまれです。

Back to English: Sentence is very rare in practice is not unique.

French: Il est en fait assez rare pour les phrases à ne pas être unique.

Back to English: It is actually quite rare for phrases not to be unique.

Polish: To rzeczywiście dość rzadko zdania nie być unikalna.

Back to English: It's actually quite rare sentence not be unique.

Hebrew: זה בעצם די נדיר משפטים לא להיות ייחודיים.

Back to English: It's actually quite rare sentences not be unique.

Italian: In realtà è abbastanza raro per le frasi di non essere unico.

Back to English: It's actually quite rare for the sentences not to be unique.

Latin (but which era?): Suus 'vere non esse unica sententia admodum rarum.

Back to English: It's really not very rare, is a single sentence.

Romanian (which some say is similar to Latin!): Este de fapt destul de rar pentru fraze să nu fie unic.

Back to English: It's actually quite rare for phrases is not unique.
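
For anyone who wants to repeat the experiment, here's a rough script, assuming the unofficial googletrans Python package still behaves the same way (the service behind it changes, so results will drift):

  from googletrans import Translator  # pip install googletrans

  def round_trip(sentence: str, via: str) -> str:
      """Translate English -> `via` -> English and return the result."""
      t = Translator()
      forward = t.translate(sentence, src="en", dest=via).text
      return t.translate(forward, src=via, dest="en").text

  original = "It's actually quite rare for sentences to not be unique."
  for lang in ["es", "de", "ja", "fr", "pl"]:
      print(lang, "->", round_trip(original, lang))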


Check out http://foxtype.com - it does some of that, but with more grammar-like heuristics such as conciseness and complexity.

On a side note, I'm part of a team working on http://emailfox.co, which will suggest 'Smart Sentences' for you when composing an email, based on the recipient, allowing you to write personal, relevant emails faster.


People on mobile: it looks like foxtype is already a product; it's a Chrome extension. [0]

On mobile it just asks for your email so they can tell you when they launch (on mobile?). It's horrible to have landing pages like that. Absolutely useless product page on mobile.

[0] https://chrome.google.com/webstore/detail/foxtype/npcfiblhbj...


I like the idea of composing sentences from high-level pre-checked blocks (the new Google Assistant seems to do this too). But this doesn't fit my use case, because when I pasted a sentence with a grammar error, it didn't tell me there was an error.


Try http://www.netspeak.org/?locale=en - it seems to do some of the things you asked about. It is implemented on top of n-gram corpora.


It looks helpful, but I would like to paste a whole document and be told which fragments look suspicious because of low popularity. The site requires you to insert wildcards and completes only a single n-gram.


I did not get that from your question. This takes parts of a sentence (I think at most 5-grams) and a few operators, e.g.:

If you ask for words similar to 'much' in a fragment:

  'and knows ... #much ...'

  =>
  
  and knows a lot, 3.500, 65,2%
  and knows a lot about, 2.100, 39,5%
  and knows a great deal, 690, 12,6%
  and knows much, 630, 11,5%
  and knows lots, 380, 7,1%
  and knows lots of, 300, 5,5%
  and knows a good deal, 100, 1,9%
  and knows practically, 53, 1,0%
  and knows very much, 45, 0,8%


You could probably use some of the Ngram datasets to figure this out. Parse some books from https://www.gutenberg.org/ or use the Google Ngrams corpus. Pay attention to the year(s) you wish to model English from - grammar and form keep changing!
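
A minimal sketch of that approach using NLTK's bundled Gutenberg sample (note the sample skews towards 19th-century prose, so the counts reflect older usage):

  from collections import Counter

  import nltk
  from nltk.corpus import gutenberg
  from nltk.util import ngrams

  nltk.download("gutenberg", quiet=True)
  words = [w.lower() for w in gutenberg.words("austen-emma.txt") if w.isalpha()]
  trigram_counts = Counter(ngrams(words, 3))

  print(trigram_counts[("i", "do", "not")])   # common pattern, large count
  print(trigram_counts[("i", "does", "not")]) # ungrammatical, essentially zero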


I have been thinking of doing something like this (using Ngrams for grammar checking for non-natives) for a while. I would be happy to fund development if you or somebody else is interested in working on it.


From xkcd itself, an editor that only allows common words: https://xkcd.com/simplewriter/


www.grammarly.com (I haven't tried it, though). In the demo they showed it turning a sentence into a more colloquial one.

I'm a native English speaker, and I'd like to know appropriate punctuation for a given combination of words. I'd like to search through a list.


Thank you, I especially like the macOS editor.


When I'm conflicted about different phrasings (for instance, whether or not there is a hyphen when writing compound words), I usually just do a Google search and go with whichever result has the most hits. That could be a suitable enough proxy for your use case, and perhaps you could just use the Google search service as an API...

Of course, the RIGHT way to do this would be to use the n-gram datasets that people here have suggested :-)


In the FAQ: "Why does Google Books only provide feedback on 5 tokens or less?"

You mean "..feedback only for 5 tokens or FEWER?" Use your app! ;) //runs away


Something like this: http://corpus.byu.edu/bnc/ ?


To improve the qualitative aspects of writing, in this case for job listings primarily, check out https://textio.com/. There's no API, but I think it will help you think about what "popular" language means.


What you want is a language model. This will give you the probability on a word-by-word basis.

Something like [1] is pretty much state-of-the-art. It's worth noting that the kind of writing you are doing changes the probabilities significantly. [2] shows this quite well.

[1] https://colinmorris.github.io/lm-sentences/#/billion_words

[2] https://colinmorris.github.io/lm-sentences/#/brown_romance
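
Those pages use much stronger models, but even a toy bigram model with add-one smoothing over NLTK's Brown corpus shows what "probability on a word-by-word basis" means (nothing here is specific to [1] or [2]):

  import math
  from collections import Counter

  import nltk
  from nltk.corpus import brown

  nltk.download("brown", quiet=True)
  words = [w.lower() for w in brown.words()]
  unigrams = Counter(words)
  bigrams = Counter(zip(words, words[1:]))
  vocab = len(unigrams)

  def word_logprob(prev: str, word: str) -> float:
      """Add-one-smoothed log P(word | prev)."""
      return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))

  sentence = "it is actually quite rare for sentences to not be unique".split()
  for prev, word in zip(sentence, sentence[1:]):
      print(f"P({word} | {prev}) = {math.exp(word_logprob(prev, word)):.6f}")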


Bah, if you have good reason to be confident that your sentence is correct even if English speakers might feel it is wrong, then I say you should just write it anyway.

I like to read such things because they make me think about what is being said and how the language works. If we always use "popular" patterns, then our writing becomes clichéd and boring, and people's eyes will glide right over it.


You have a point. But as a non-native speaker trying to learn a language, you aim to become so fluent that people will not notice you're foreign. You want to be able to play with the language.

A big part of learning a language is becoming familiar with frequent speech patterns and slang. A language is not a sterile set of words with attached grammar but a slippery, gelatinous blob that molds itself to the culture and people. Spoken languages are quite lively. If I want to integrate myself and joke around with natives, I need to learn to mold it the same way natives do. And to learn how to do that, you first have to start by imitating.


Perhaps it also depends on your learning history.

Right now I live in Germany and speak pretty ungrammatically, but from being here I copy a lot of everyday idiom without really understanding it. So what I would like is the opposite of what you are looking for: confidence that my German sentences (especially written ones) are formally correct.

I don't mind if that makes me look like a well taught foreigner. Right now I sound like a badly taught foreigner.


If you can read Chinese, there's an interesting tool:

http://www.pigai.org/guest2016.html

It extracts common phrases from sentences, with explanations, suggestions, and usage counts from a corpus.


I never found such a tool, but if you build it, count me in as a user. Same issue, same solution.


Thanks for the mention above (foxtype.com).

We're currently building an online editor checks


Oops, accidentally sent.

We're currently building an online editor that:

1. Checks for compatibility of words in a sentence (essentially popularity).

2. Gives example sentences for a certain word.

3. Suggests words depending on context.

Language models would be a decent way to check popularity, though it would be noisy. Sentence-level rewrites would be hard unless you make them template-driven.


Incidental: use quotes (") for exact match, not apostrophes (').


https://github.com/rickyhan/bodine

This is a tiny tool I wrote a long time ago. There's also writefullapp.com which is closed source.


I can suggest http://samedaypapers.com/. It always helps me.



