mdpacer's comments

mdpacer · on March 7, 2016

> What kinds of NLP technique does this system use?

It depends on your interpretation of NLP. In a sense, all of the rules are hard coded, and so it does string token processing that happens to be informed by contributed interpretations of style guides' rules for usage. Thus, most of the NLP has been performed by the human programmers interpreting those rules.
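To make "hard coded" concrete, here is a minimal sketch of what a rule of this kind can look like. It is not proselint's actual source: the `RULES` table, the rule ids, the messages, and the `lint` function are all hypothetical names chosen for illustration.

```python
import re

# Hypothetical rule table: (rule id, regex pattern, message).
# These entries are illustrative and are not copied from proselint.
RULES = [
    ("redundancy.very_unique", r"\bvery unique\b",
     "'unique' does not take degrees."),
    ("cliches.end_of_the_day", r"\bat the end of the day\b",
     "Overused phrase."),
]


def lint(text):
    """Return (start, end, rule_id, message) tuples, sorted by position."""
    errors = []
    for rule_id, pattern, message in RULES:
        # Each rule is a plain pattern match; the "NLP" happened when a
        # person read a style guide and wrote the pattern down.
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            errors.append((match.start(), match.end(), rule_id, message))
    return sorted(errors)
```

Calling `lint("This is a very unique idea.")` would flag the span covering "very unique", while a sentence matching none of the patterns returns an empty list.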

Though we are interested in extensions in the direction of robust machine NLP approaches able to meet the other goals of proselint, that presents many challenges (including some I mention in response to your third question). Nonetheless, this is an active area of research.

> Is it possible to specify new rules in a high-level way?

In short, no, but it is an area of active research on our part to develop a rule-templating engine for exactly this purpose. "High-level" is subjective though, so there may always be someone who intends to ask about a level higher than the interface that we provide at the time that this question is asked.

> Can it learn from examples?

In a sense, yes, all of the rules have been learned by people from the example text in guides and translated to linting rules. But I do not think that was your intended question.

If instead you mean that you would provide it a set of examples of your writing and it would induce a rule: no, it does not do that currently, and may not for quite some time.

Stylistic rule induction is a difficult – though interesting – problem (as is rule induction more generally). It is not something we are intrinsically opposed to, but the simplest version of learning from examples would violate two core principles of the design of proselint.

First, our rules are taken from and organised around the advice provided by respected authors in their writing on linguistic style.

Second, any inductive method will be intrinsically uncertain about the rules that it induces. This uncertainty will always be opposed to our aim of having a low false alarm rate, making inductive methods possible but subject to extensive tuning and testing. This suggests that further development of a test set outside of the examples provided would be needed, to ensure coverage of any of the rules that the examples would suggest inducing.

Additionally, almost all state-of-the-art machine learning systems would require a set of relevant labeled examples of usage errors and non-errors that would somehow generalise to the examples that you would like to provide it. Even specifying the data format would be difficult. If you have any insights as to how this would be done, please develop them below; it can only be helpful and aid progress in this direction.

> Does it work on a sentence-by-sentence basis only, or does it "grasp" complete paragraphs?

I think the easiest way for you to answer this question is for you to see it in action at this website: http://proselint.com/write/

I should mention that longer-range dependencies require greater computational power, which brushes up against another aim of proselint: to be fast enough to run on reasonably large files as a real-time linter. This constraint may not hold in all instantiations of proselint, but for now it does.

If you have paragraph-level rules that you might want to suggest (like the issue I just created when writing this response: https://github.com/amperser/proselint/issues/310), please do! It is even more helpful if you can find an authoritative reference to include as part of your issue, because that will be needed to incorporate the rule into proselint.

mdpacer · on March 7, 2016

This is a fair concern of style recommenders in general. Yes, we want to shape text. And what follows is merely a partial response, but it should address some of your concerns.

First, much of the advice is that certain word sequences are problematic, without suggesting any particular replacement text. There are a few reasons for this (including the computational natures of error-detection vs. solution-recommendation problems). The reason most relevant to your concern is that solution-recommendations are more likely to produce a homogenizing effect because they have a driving effect, wherein using a particular set of words is deemed superior to another set of words. Much as the diversity of life-forms arose under selective pressures, eliminating only the least fit combinations of words lets the native variation in writing flourish all the more readily.

The goal is not to homogenize text for the sake of uniformity, but rather to flag those cases that respected authors and usage guides have identified as specifically problematic. Any text that is sufficiently artful and compelling to have not been specifically addressed by these sources should not be caught by the linter. Novelty will continue to introduce new usages, and some of them will be poor. Authors identified as trustworthy may point these out, but this will only be in retrospect. If you do not trust a guide's point of view, our strongest recommendation would be to turn off the modules associated with that guide. You can see some of the module names and a high-level description here: http://proselint.com/checks/.

Finally, I will modify a quote from the Foreword[^fn2] by Robert Bringhurst in The Elements of Typographic Style (version 3.2, 2004):

> [Language usage] thrives as a shared concern — and there are no paths at all where there are no shared desires and directions. A [language user] determined to forge new routes must move, like other solitary travelers, through uninhabited country and against the grain of the land, crossing common thoroughfares in the silence before dawn. The subject [of proselint] is not [stylistic] solitude, but the old, well-traveled roads at the core of the tradition: paths that each of us is free to follow or not, and to enter and leave when we choose — if only we know the paths are there and have a sense of where they lead. That freedom is denied us if the tradition is concealed or left for dead. Originality is everywhere, but much originality is blocked if the way back to earlier discoveries is cut or overgrown.

[^fn2]: Only because we are on the topic of historical traditions and stylistic guides, it should be mentioned that a foreword – according to book design tradition – would be written by an individual other than the author about the author, the book, and usually the relation between them. In this case, the section in Bringhurst's masterpiece labeled "Foreword" would likely be better described as "Preface" or "Introduction". Given his knowledge of book design, I shall assume that this was a conscious departure from the road of tradition, even if I cannot appreciate the new view that it offers.

mdpacer · on March 7, 2016

One of the goals of proselint is to minimize the number of false positives that traditionally clutter the results of style checkers and lead users to ignore the warnings when they see them. We want to be reasonably certain before raising an alarm. You can read more about the precise metric[^fn1] we use here: http://proselint.com/lintscore/.
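As a rough illustration of what a parametric family of metrics that punishes false alarms can look like, here is a sketch. The formula below is an assumption made for illustration, not a reproduction of the definition on the linked page; the function name `lintscore` and the parameter `k` are hypothetical.

```python
def lintscore(true_positives, false_positives, k=2):
    """Score a linter run: reward hits, penalize false alarms.

    Assumed form (illustrative only): the hit count scaled by
    precision raised to the power k. Larger k punishes false
    alarms more harshly, so each k gives a different metric.
    """
    flagged = true_positives + false_positives
    if flagged == 0:
        # Nothing was flagged, so there is nothing to score.
        return 0.0
    precision = true_positives / flagged
    return true_positives * precision ** k
```

Under this assumed form, a linter with 10 hits and no false alarms keeps all 10 points for any `k`, while one with 10 hits and 10 false alarms keeps only 5 points at `k=1` and 2.5 at `k=2`.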

And yes, `python3` for the win. :)

[^fn1]: If you wanted to be truly precise, it's a parametric family of metrics.