
Thanks for shedding a little more light. Ignoring the semantic difference between an apostrophe and a right single quote has resulted in many documents containing information that can not be parsed unambiguously: we have to pick a glyph, and picking an ambiguously defined one loses contextual information.

As a side-effect, since GPTs are based on the examples we give, they can't encode the proper punctuation for many phrases that use British English quotation mark styles, making them unable to "curl" the quotation mark properly. For example, none can curl this paragraph correctly:

    ''E's got a 'ittle box 'n a big 'un,' she said, 'wit' th' 'ittle 'un 'bout 2'×6". An' no, y'ain't cryin' on th' "soap box" to me no mo, y'hear. 'Cause it 'tweren't ever a spec o' fun!' I says to my frien'.

The other downside to using ' or ' is that most fonts treat them as straight quotes, making for "improper" English typography when typeset into a book.



> many documents containing information that can not be parsed unambiguously

Well, and I'd suggest the unambiguous information was usually never there in the first place. It's less of an encoding problem, and more of an input "problem". People type either a single quote/apostrophe, or a double quote, and let smart quotes sort it out.

And sure, smart quotes will fail spectacularly with your spectacularly pathological example! Heck, it took me a few seconds to figure out what on earth was going on with the first 5 characters. :)
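
To make the failure concrete, here is a minimal sketch of the kind of context-free rule smart quotes rely on. It is an assumption about how such heuristics generally work, not any particular editor's implementation, and `naive_smart_quotes` is a made-up name. A straight quote that starts the text or follows whitespace or an opening bracket is treated as opening; every other one is treated as closing.

```python
import re

OPEN_DOUBLE, CLOSE_DOUBLE = "\u201c", "\u201d"
OPEN_SINGLE, CLOSE_SINGLE = "\u2018", "\u2019"

# A quote is "opening" if it starts the text or follows whitespace/an opening bracket.
_OPENING_CONTEXT = r"(^|[\s(\[{])"


def naive_smart_quotes(text: str) -> str:
    """Curl straight quotes by looking only at the preceding character."""
    # Double quotes: opening context first, then everything left over closes.
    text = re.sub(_OPENING_CONTEXT + '"', lambda m: m.group(1) + OPEN_DOUBLE, text)
    text = text.replace('"', CLOSE_DOUBLE)
    # Single quotes: same rule, which is exactly what leading apostrophes break.
    text = re.sub(_OPENING_CONTEXT + "'", lambda m: m.group(1) + OPEN_SINGLE, text)
    text = text.replace("'", CLOSE_SINGLE)
    return text


if __name__ == "__main__":
    print(naive_smart_quotes("'Cause it 'tweren't fun"))
```

Fed `'Cause it 'tweren't fun`, this curls both leading apostrophes as opening quotes (‘Cause, ‘tweren’t), where a typesetter would use ’ in all three places.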

Your example would usually be typeset properly in a physical published book because it's done with professionals manually reviewing the typography.

Just throw it in the bucket of hyphens vs. minuses vs. dashes em and en, x's versus multiplication signs... our symbols are full of ambiguities, it's not just apostrophes.


> let smart quotes sort it out.

Smart quotes fail in simple cases, too.

https://gitlab.com/DaveJarvis/KeenQuotes/-/tree/main/src/tes...

I've developed a lexer/parser that can disambiguate most cases, but wow was it a chore to write.

> Well, and I'd suggest the unambiguous information was usually never there in the first place.

Interesting. Isn't the text ambiguous because the glyphs lack the semantics to capture the usage of apostrophes versus closing single quotes? It's a Catch-22, isn't it? If UNICODE had semantics for apostrophes versus right single quotes, then our documents would be unambiguous. But we can't make them unambiguous because UNICODE doesn't capture these semantics.


> If UNICODE had semantics for apostrophes versus right single quotes, then our documents would be unambiguous.

No -- as I said before, it's an input problem before anything else. There aren't separate keys for apostrophe and right single quote on the keyboard. We don't even have separate keys for left and right quotes. So even if there were encodings for them, they wouldn't be used correctly. They'd be used correctly about as often as people type a proper minus sign rather than a hyphen for subtraction, which is almost never.


> it's an input problem before anything else

I see where we have our wires crossed. I'm not considering the input problem because my software (KeenQuotes) parses the source document's apostrophes into their correct semantics (99.9% of the time).

My issue is that, having discerned the correct English single quotation mark (straight, apostrophe, or closing), I have no way of encoding it into a document that retains the semantics while typesetting it using common fonts (to match double quotes). My point is that if UNICODE had a way of capturing the semantics, it would at least be technically possible to create unambiguous documents, input notwithstanding.
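
For reference, a small Python sketch that lists the code points in play, using only the standard `unicodedata` module (`describe` is just an illustrative helper name). It shows that the character in a curled contraction and the character closing a quotation are the same code point, U+2019, which is where the semantics get lost. U+02BC MODIFIER LETTER APOSTROPHE exists, but it is categorized as a letter rather than punctuation.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


if __name__ == "__main__":
    # Straight apostrophe, left/right single quotes, modifier letter apostrophe.
    for ch in ["\u0027", "\u2018", "\u2019", "\u02bc"]:
        print(*describe(ch))

    contraction = "don\u2019t"    # apostrophe, as usually curled
    quotation = "\u2018yes\u2019"  # closing single quotation mark
    # Same code point in both roles, so the text alone cannot tell them apart.
    print(contraction[3] == quotation[-1])
```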

Matthew Butterick, a typographer, states, "I’ve never seen any LaTeX-created documentation that’s gotten this right":

https://practicaltypography.com/straight-and-curly-quotes.ht...

I sent him a screenshot showing my software typesetting the quotation marks properly, albeit with a document that has incorrect semantics (as per our discussion):

https://i.ibb.co/p3TM7QM/curly-quotes.png


I totally appreciate the desire for more semantic encoding. I mean, it would be a dream if every sentence was semantically delimited, if every word was annotated with which hyphenation pattern it should follow for splitting across lines (when there are multiple), and whether the capital letter at the start of a sentence should remain capital even when converted to lowercase, because it's a capitalized proper noun. I could go on.

But that's not what Unicode is for. The apostrophe situation is just one of 100 things I could think of off the top of my head. Unicode encodes characters, not semantics. And this is by design, because people don't input, or want to input, semantics -- they just want to type something that looks right. Something other people can read, not something computers can semantically parse.

So we have a bunch of heuristic and AI and manual tools we use to try to annotate things semantically, and we put that information at the level of something like XML, not Unicode. Which is infinitely more flexible, because you can define and use whatever semantics you want, not limited to whatever the Unicode body decided.

If KeenQuotes gets apostrophes right 99.9% of the time, then just use that to automatically analyze all your input text and then store and process it in some kind of XML notation, like "Peter<apos>’</apos>s" or "<possessive>Peter’s</possessive>" or "<word>Peter’s</word>" or something. Unicode is the wrong level of abstraction.
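
A rough sketch of that idea in Python, assuming some disambiguator has already reported which character positions are apostrophes. The `mark_apostrophes` name and the index-set interface are hypothetical; they are not KeenQuotes' API, and the `<apos>` element is the ad hoc notation from the example above, not a standard.

```python
from xml.sax.saxutils import escape


def mark_apostrophes(text: str, apostrophe_indexes: set[int]) -> str:
    """Wrap each position classified as an apostrophe in an <apos> element.

    `apostrophe_indexes` is assumed to come from an external disambiguator.
    """
    out = []
    for i, ch in enumerate(text):
        if i in apostrophe_indexes:
            # Emit the curled glyph, with the semantics carried by the markup.
            out.append("<apos>\u2019</apos>")
        else:
            # Escape &, <, > so the result stays well-formed XML text.
            out.append(escape(ch))
    return "".join(out)


if __name__ == "__main__":
    print(mark_apostrophes("Peter's", {5}))
```

The glyph stays U+2019 for display, and the distinction between apostrophe and closing quote lives in the element rather than in the character.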


> process it in some kind of XML notation

The output from KeenQuotes is used by KeenWrite. KeenWrite can generate text, HTML, XHTML, and PDF documents. Those output document formats lack the correct semantics because of UNICODE. As much as rolling my own XML notation would be fun, it won't work in practice---nobody would be able to publish their exported documents for viewing or general consumption. We'll have to agree to disagree on this one: I think UNICODE dropped the ball on English apostrophes where it didn't have to. Having one more character for curled apostrophes would have kept open the possibility of encoding unambiguous HTML documents (with respect to apostrophes/right single quotes for quotations; your point about other characters I quite appreciate).



