I work in ecommerce and this guy and others like him have been creating this stuff for 3-4 years. It's a massive annoyance for us, as the only real identifier we get from suppliers is the publisher name. We blocked them for a while, but then we started seeing a massive range of publisher names coming through to get around the blocks (I assume every other retailer was blocking them as well).
When we first started seeing the automated content books, literally overnight our product set increased by several million titles - I now estimate that about 20 million of the 50 million products we have are automated books (they come out with new "editions" all the time as well). This obviously has a massive impact on our search results - these books have keyword-laden titles and descriptions, and without a solid identifier it was very difficult to get rid of them. Thankfully, the suppliers that print these products have recently started flagging them as crapware.
As for the customer response to this type of product - it's definitely negative, with a massive return rate. As far as I am concerned, this is a massive scam - banking on the fact that some people are too lazy to return the books.
I wonder if it wouldn't be a good idea to "hellban" them. When they visit the site, they see the books - but for all other users they'd effectively be invisible. It might not be worth the programming effort, though. I wonder how many books this guy sells.
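A minimal sketch of what that hellban filter might look like, in Python. Everything here is hypothetical: the `Listing` type, the `visible_listings` function, and the assumption that the site can tell when the viewer is the banned publisher's own account. Given how easily publisher names get swapped, a real system would probably need a sturdier identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    title: str
    publisher: str  # hypothetical identifier; assumed stable for this sketch


def visible_listings(listings, viewer, hellbanned):
    """Return the listings a given viewer should see.

    Listings from hellbanned publishers are shown only to that publisher,
    so the banned account sees nothing unusual.
    """
    return [
        item
        for item in listings
        if item.publisher not in hellbanned or item.publisher == viewer
    ]


catalog = [
    Listing("A Hand-Written Novel", "small-press"),
    Listing("Generated Outlook Report", "auto-books"),
]
banned = {"auto-books"}

for_shopper = visible_listings(catalog, "some-shopper", banned)
for_banned = visible_listings(catalog, "auto-books", banned)
```

Here `for_shopper` contains only the hand-written title, while `for_banned` still contains both, which is the whole point: the banned publisher has no signal that anything changed.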
Just what Amazon needs: 800,000 computer-generated books crapping up the listings along with the thousands of on-demand printed copies of Wikipedia articles. Wonderful.
I can't believe this guy has no sense of humor[1]! He's completely serious about providing a service and making tons of money from it.
In one question the interviewer asks: 'could you make a novel for instance?' to which he replies 'well, novels don't usually make money'. Wouldn't you think the correct answer would be 'well, it wouldn't be very good'?
> "to which he replies 'well, novels don't usually make money'. Wouldn't you think the correct answer would be 'well, it wouldn't be very good'?"
Perhaps the same question can be posed to many startup founders in our community. Very often people in the startup scene seem much more interested in extracting profit (or really, just making the exit) rather than creating something good (in whatever way that you might personally define goodness).
Plenty of developers write software never hoping to make a dime from it or hire anyone else to work on it. But they have little reason to worry about incorporation or branding or 'pivoting' or 'series a funding' or 'growth hacking', and they self-identify as 'open source developers' rather than 'start up entrepreneurs'.
Agreed. It's a shame to think about all the interesting things that might be getting built instead. Programming a novel generator is a great example. What a fantastic challenge being overlooked. It reminds me of this great video of Vonnegut graphing the emotional shape of stories. http://www.youtube.com/watch?v=oP3c1h8v2ZQ&feature=youtu...
It's not that he has no sense of humor, it's that the fact that the books aren't good is a foregone conclusion. The idea that one of his books could be good is just not even on his radar.
But it seems the opposite from the interview, no? He seems to believe that he is creating legitimate value. There is no mention of the error in the process.
Your second link: I have no idea what it says. I read the citation and the first four paragraphs. It sounds like it came out of a postmodernism generator.
I'm sorry to hear that you're not familiar enough with cultural theory to understand the post. Just as somebody unfamiliar with mathematics would be hard-pressed to understand (and understand the point of) the jumble of symbols, it requires some immersion in the field to extract any value from cultural theory. If you can make the time in your life, I recommend it.
Or perhaps we need to find a filter to replace the traditional publisher's editor/market filters. Completely frictionless publishing of anything will just tend to increase the noise and make the signal harder to find.
First and foremost -- 1) Like Google it has a "see no evil" business model. ("If humans aren't in the approval loop, we're not responsible.")
2) Its behavior as a company is generally reprehensible (e.g. strong-arming state governments into not collecting sales taxes -- it wants to have its cake and eat it w.r.t. affiliate marketing)
3) Its business models reek of dumping (using profits in one domain to unfairly compete in others). The main difference is Amazon doesn't actually make profits, so it's burning investor capital to win new markets so it can burn more investor capital to win even more new markets, etc. (It will make it up on volume...) The fact that investors buy into this is odd, to say the least, but it's still pretty odious.
4) As a small publisher, I'm still ticked off by the way it allegedly "competes" in the ebook market. (Kindles are the hardest ebook readers on which to read ePubs, KDP does not in fact pay competitive royalties, but Amazon has bamboozled the tech press into believing it does, and even seems to have several governments helping it maintain its near monopoly on epublishing.)
These aren't actually 'books'. They're mostly business reports, which I hate to say, are already programmatically compiled by interns repetitively applying formulas in spreadsheets and building graphs from database queries. That's why they cost more than paperbacks, because "The 2009-2014 Outlook for Plastics Lamp Shades in the United States" is essential for the people in the business.
He also has medical dictionaries which seem to confuse his cause, as he explains "For health titles, only the format editing and production side is automated. The text in the health books was written by medical professionals and edited by a professional editor; the computer expedited formatting using about 50 odd routines (the preface, chapter intros, glossaries, indexes, headings, margins, etc.)"
Aw, man. I used to hate this guy when I worked on the title authority team. The programmatic titles alone often caused the title matchers to infer that all the books were somehow related, and cluster them all together. Oh, the happy hours spent unpicking the resulting mess. The Wikipedia guy was the worst in terms of sheer pointlessness though. Personally I'd delete the lot of them. Hard to think that your job is worthwhile when it consists of cleaning up after other people's crappy perl scripts...
Why does Amazon allow it? It's somewhere between outright fraud and DoSing of Amazon's search functionality; it's clearly something that constitutes abuse.
Similarly, if I were to fund a human team of title-writers to create a hundred plausible original titles per hour, and fill the actual pages with random words from the OED... Advertising that as a book on a particular topic would likely be some type of fraud. The result is not far from what I've seen of this guy's work.
This extends into lots of areas on the marketplace. I spend a bit of time looking at some of the listings our products have been matched to by UPC and working out just which product the listing is trying to sell. Often the title, image, and description each describe a different product.
My favorite page:
http://www.totopoetry.com/search.asp?word=truth
I have used this approach to write definitions as well (www.websters-online-dictionary.org)
The following contrasts definitions of zealously:
1. In a zealous manner. [Human]
2. In an enthusiastic, fervid, ardent or fervent manner. [graph theoretic]
3. In a fanatical manner. [graph theoretic]
In many of the languages/subjects we work in there is very little content published. We avoid areas where a substantial number of titles is already available.
In the video it can be seen that a fair amount of what the automated system does is versatile formatting ... in Word and Excel. But why oh why?
(Semi)-automated formatting of documents is a problem that has been solved with expert results in LaTeX.
Also, deviations in the data are recognized and highlighted, but not (yet?) examined and elaborated upon. No doubt this will be possible in the near future.
So, kudos for the general approach. Can't wait for the automated reading programs for digesting these books. And that is meant only half jokingly. The data is there, the general knowledge is there and there is enough reasoning power to draw conclusions. The next step would be to make automated decisions based on the available data, so politicians could join the authors in being unemployed...
Well I for one welcome our new text blasting overlords.
At 29 USD for 38 pages, Basketry[0] doesn't look like a particularly useful buy.
Also, from the description: "…editorial decisions to include or exclude events is purely a linguistic process." Is it really correct to describe that as an editorial decision? (Not to mention "editorial decisions…is"?)
The index for this book is also useless. It seems to index every word in the text, whether it's significant or not; for example, "September". Also the index has entries for "Nova" and "Scotia", but not "Nova Scotia" (which is presumably where these two words came from).
I don't think that real authors would face any significant competition from this guy.
So you weren't talking about the index containing every word in the book? Or, if you were, isn't that exactly what the most naive indexing algorithm would do?
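For illustration, here is roughly what "the most naive indexing algorithm" could look like in Python. This is a guess at the general technique, not the publisher's actual code; `naive_index`, `filtered_index`, and the tiny `STOPWORDS` set are all made up for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "in", "of", "a", "and", "was", "to"}


def naive_index(pages):
    """Map every single word to the sorted page numbers it appears on."""
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in re.findall(r"[A-Za-z]+", text):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def filtered_index(pages):
    """Same as naive_index, but drop stopwords (case-insensitive)."""
    return {
        word: numbers
        for word, numbers in naive_index(pages).items()
        if word.lower() not in STOPWORDS
    }


pages = ["The fair was held in Nova Scotia in September."]
index = naive_index(pages)
```

Because the text is split into single words, `index` ends up with separate entries for "Nova" and "Scotia" and none for "Nova Scotia", and "September" gets an entry like any other word. Dropping stopwords removes "The" and "in" but does nothing about the missing phrase; that would need some kind of phrase or named-entity detection.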
The reviews tend to paint a pretty grim view of the quality of these books, but the sarcastic ones are some of the funniest reviews I've ever read.
Here's an honest one that shines a light on the quality:
"The description for this book is TOTALLY misleading. It is NOT a book of quotations and phrases. It is a reference book of where to look to find possible quotations- like the old filecard cabinets in the library. On the few pages where you can actally find a quote, it reads like this one: Jack London, from Jerry of the Islands, "I am writing these lines in Honolulu, Hawaii." Huh?? That's it. That's all there is! I'm not sure who would use this book. Certianly not me! I was very disappointed as I was looking for a collection of quotes from notables like Mark Twain, Jack London, etc."
Edit: Sort by avg rating. Goldmine of comedy in the reviews for "Butts" and "Scrotum" books. Wow.
Here's the review for "The 2009-2014 Outlook for Plastics Lamp Shades in the United States":
"(4/5 stars)An instant classic in the Icon style
While this outlook hardly holds a candle to comparable classics such as The World Market for Silica Sands and Quartz Sands: A 2009 Global Trade Perspective, the information is invaluable for any red-blooded American.
The five-year span is parsed in fascinating prose, and the 176 pages fly by, feeling almost like a 150-page work.
Don't let your lack of background knowledge deter you - there isn't too much reference to the 2004-2009 report, and most of the important information is explained in exposition.
Luther Blaze runs the show in this non-stop thrill ride of an economic adventure. His last outing ended in the government setting up a secret commission to investigate his possible wrongdoing in stopping the mysterious project Mantis, but now he's back, and ready to run roughshod over anyone in his way.
At a price of approximately 2.81 per page, you know you're getting your money's worth with this paperback. It's a perfect read for the park, or a lazy Sunday afternoon.
On a side note, this book is a real pick-up gem! I personally attracted no less than three beautiful women, all of whom wanted my thoughts on the challenging themes and motifs. They all gave their numbers, and there's been no looking back!
The biggest problems with this book are largely physical aspects of the book. I didn't care for the font too much. And the beige background on the cover betrays the intrigue within.
The Icon International group has hit another classic out of the park. I look forward to the next book with baited breath, and I can't wait to see how Agent 71 and Dash get out of this jam."
The funniest part of "Scrotum" is that it's priced at $28.95. For an 84-page paperback, whose sole purpose is to trace the usage of the word "scrotum" throughout an oddly specific period in English linguistic history (1678 - 2007). I mean, perhaps the content farmer should also write a pricing algorithm?
So how soon before the algorithm then uses the comments as part of the next revision of the book? It's an interesting way to demonstrate the value proposition.
In this case algorithmically selecting stuff from around the web generates random books, but they have little value for the most part. Books that are researched and curated on the other hand have higher values. The only difference being the curation, not the basic information. So it gives you an insight into pricing curation.
Next up, a bot that sends out 800,000 DMCA takedown notices. Oh wait, we already have that for YouTube.
I think using the word "value" is a bit too strong at this point, but maybe someday algorithms along these lines (and the ones I mentioned to you in email, article spinning) will be able to do more good than harm. The trouble is, at present, they are more harmful than beneficial. Machine generated, or guided, curation may eventually become a good thing, but as you know all too well, there are far too many uses of such tech that are harmful (e.g. poisoning search results, &c).
I agree with you, clogging up Amazon's stuff with 800,000 titles is a sad effect.
It suggests that crooks in the 21st century don't rob banks, rather they rob a million people of 50 cents because none of them feels ripped off enough to prosecute and even if they did they are only out 50 cents.
Thank you, you made my day. This is really fantastic, I particularly appreciate the nonsensical scales on the graphs (time expressed in dB, number of CPU in Joules) and the hilarious fake reference papers scattered with well known names (Even Erdos :).
What a greedy misappropriation of an otherwise incredible tool.
I assume what pushed him into the market was the software's economic analysis of the latent market for spam books on Amazon 2010-2014
As a medical student, I particularly love books like this[0] with their description: "If your time is valuable, this book is for you. First, you will not waste time searching the Internet while missing a lot of relevant information. Second, the book also saves you time indexing and defining entries. Finally, you will not waste time and money printing hundreds of web pages."
How does it determine the price? By the pound? Amazon should clearly mark these as computer generated. I'll bet most of the purchases are from people who didn't know what they were even getting.
Some people tend to assign a fair bit of trust to the Amazon brand (which may be eroded by things like this). The fact that a book is on Amazon may convince some people that it is a book of a quality you would usually expect from traditionally published works.
For anyone not familiar with his work, I highly recommend his collection of short stories titled "The Great Automatic Grammatizator". The major plot of the eponymous story is about a hacker who builds a novel-writing machine so that he can drive authors out of the market by out-producing them.
"Prose is a form of language which applies ordinary grammatical structure and natural flow of speech rather than rhythmic structure (as in traditional poetry)."
"In the philosophy of language, a natural language (or ordinary language) is any language which arises in an unpremeditated fashion as the result of the innate facility for language possessed by the human intellect."
The answer is no, since the text did not come from human intellect but computer programming; they can't be classified as books.
If you go looking for overly specific definitions of words in natural language, which is effectively defined by its usage, any conclusions you draw from these 'definitions' will be flawed at best. Obtaining your fundamental definitions from selected lines in wikipedia adds bias as well as inaccuracies.
Also, even assuming your definitions were reasonable, you fail to explore the possibility that his books are poetry.
So this guy has been able to find a common schema across several different domains and add rules to it to churn out content. I like the concept. I watched the video in the article and could see how this could be used for instruction in cases where resources like trained teachers are scarce.
I've had a feeling for a long time that, due to the predictability of humans and our processes, this is inevitable. I think it's great that he can do this for things like instruction, but if this were to get "smart" enough, that would put a lot of people out of work.
I'm curious about the licensing - he's scraping some stuff from Wikipedia, but it's not clear which bits.
I'm also confused about the super high price. Is this deliberate to avoid having to refund to very many unhappy customers?
And while these books are probably awful he's going to be known by the future people as one of the innovators of auto-generated content. At least he's not breaking spam filters with Markov chains.
It's gently odd that AI got stuck for a while; I very much hope that AI research and practice gets a bit more attention and funding.
Kinda reminds me of the analogy about monkeys writing on typewriters, where a million monkeys would in a thousand years come up with something like Shakespeare. I may be going out on a limb here, but by the end of the next decade the most popular book of the holiday season might be one that was actually written by a computer...
"The Policeman's Beard is Half Constructed," written in 1983 by a program named Racter. Bestseller? It's very well known, but I don't know how well it sold.
Racter was more of a William S. Burroughs cut-up program; it mostly randomizes and regurgitates its input. Policeman's Beard required carefully selected input and is on the border of being a hoax.
This is an interesting web page that convincingly makes the case that "The Policeman's Beard is Half Constructed" was written largely by the humans involved, with the program having a minimal role, and that the Racter program they sold to the public was incapable of reproducing the novel.
> "The Library of Babel" (Spanish: La biblioteca de Babel) is a short story by Argentine author and librarian Jorge Luis Borges (1899–1986), conceiving of a universe in the form of a vast library containing all possible 410-page books of a certain format.
Usually I would not say something like this, but I hope he gets sued for this. Big time. He claims to sell books to doctors and patients - there is no way he should be allowed to. With no control of the content, chances are big that he gives wrong advice. This makes me feel bad.
If Amazon's search and discovery work correctly, no one should see these books until they start generating sales. So it might almost be like they don't exist until/unless the "publisher" can generate some interest and sales.
I could see this being useful to pull resources together for an abstract paper.
Putting this on Amazon is definitely overkill for now. He should have proven the value by first generating books under his name and seeing the response.
I assume Amazon can nuke this experiment and future clones simply by requiring a modest payment, even a dollar or the price for a single copy for listing, right?
"Beginning in March 1998, he launched a private initiative (dubbed the “K to 12 +2 project”) after directing a workshop for the World Bank which considered illiteracy. One aspect of this problem is the lack of educational materials in local languages (the smaller the language in population, the less likely the publishing industry will find it profitable to serve such communities, leaving some 1000 written languages without basic textbooks). Using automated authoring processes he pioneered, his international business publications have funded a variety of multilingual educational materials including a free online multilingual dictionary, PC games, videos and ebooks. He has applied this approach to support projects sponsored by the Bill and Melinda Gates Foundation creating thousands of factsheets (using meta analysis) on tropical plants, and is now working on automated rural radio scripts, call center materials serving smallholder farmers, and SMS content engines working with the GSM Association, the Grameen Foundation, and Farm Radio International in Kenya, Uganda, Malawi and India, among other developing countries. As a hobby, he has applied graph theory to automatically author hundreds of thousands of didactic poems (limericks, sonnets, haiku, acrostics, etc.), and is working on fiction and academic studies."
Interesting research area. Unfortunate that such laudable motivations ended up in so much ridiculous spam... Perhaps a separate category on Amazon or hosting the content on another site would be better for everyone?
> Unfortunate that such laudable motivations ended up in so much ridiculous spam... Perhaps a separate category on Amazon or hosting the content on another site would be better for everyone?
Agreed, when I'm searching for books on the outlook for wooden toilet seats in China, I really want to be able to filter down to the legitimate guides only.
Libraries in Sweden, and probably in other countries as well, have their own "algorithms" to decide what to purchase - at a university it could be everything related to a particular topic, for instance. This means that these computer-generated books are in stock in many libraries (at the taxpayer's expense).
Hey, this gives me a great idea: I could write a program that will gather email addresses on the web and automatically email my sales pitch to every single one of them. And I could sell both this program and my email lists to other companies willing to pay for them.
Thanks for the link, this was really interesting. Compiling and creating content in a human-readable form can be extremely valuable. I wonder if this system will be used for artificial intelligence purposes some day; learning about any topic on the web.
People, this is all fake. It's not real or he's just a scammer. Look at the reviews he's got: they're all by the same person, or by reviewers who have reviewed other titles of his. INSEAD is a real place though, so dunno why he's a real prof there.
Of course but it's nothing more than bureaucratic red tape. This was simply the call that one junior reviewer made at one point based on a strict interpretation of the rules that had to be appealed. Obviously there will be some false negatives. The quality of the App Store is certainly better than Google Play.
The red tape is not the executive apparatus, but rather the choice of the executive apparatus. And since I'm not an ideologue I can see the pros and cons.
Well, when you use a computer to compile your source code it is still protected by copyright. You have affixed your creation to media. (There are some occurrences where you don't have to affix to media to get copyright -- yoga and dance choreography)
Copyright is supposed to protect an expression of an idea. I'm not sure how far you can bend the argument to say that computer reorganization of facts is expression.
For Pixar and animations, the end result is an expression of artists' designs.
Can you have copyright without an expressive aspect? Is there any human expression in this output?
Assuming all this will be judged under American law, as Amazon is an American company, the interesting cases here are Feist v Rural:
> Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991),[1] commonly called Feist v. Rural, is an important United States Supreme Court case establishing that information alone without a minimum of original creativity cannot be protected by copyright.
(The specific holding was that you can't copyright the phone listings in a phone book because a simple alphabetical listing isn't creative enough.)
and Bridgeman v. Corel:
> Bridgeman Art Library v. Corel Corp., 36 F. Supp. 2d 191 (S.D.N.Y. 1999), was a decision by the United States District Court for the Southern District of New York, which ruled that exact photographic copies of public domain images could not be protected by copyright in the United States because the copies lack originality. Even if accurate reproductions require a great deal of skill, experience and effort, the key element for copyrightability under U.S. law is that copyrighted material must show sufficient originality.
(Again, a court stated that in order to create a new copyright, the person had to exercise some creative spark; no machine alone can do that.)
> I'm not sure how far you can bend the argument to say that computer reorganization of facts is expression.
The Supreme Court seems to have said that it can't be a purely mechanical reordering. Of course, the whole point of having a judge is to create new judgements based on new fact patterns, so precedent isn't an absolute guide to future rulings.
IANAL either but that's an interesting question. I think machines must be able to generate IP though, otherwise the output of something like Pixar's render farms wouldn't be considered copyrightable, only the source models, animations, textures, shaders etc.
The output is considered to be a derivative work of the input, so the same copyright as the input applies. In the case of Pixar, the original models are definitely an original creative work, so the output is as well.
If you are simply using public domain databases as the input, and passing them through an automated process with no creative input, then you probably won't be able to get (or more to the point, enforce) a copyright on it.
Now, whether the databases he's using are copyrighted or not is an interesting question; as well as whether database copyrights even apply. Not all jurisdictions have the notion of database copyrights. Individual facts cannot be copyrighted; but collections of them can, at least in some jurisdictions. However, I don't know if the databases he is using are copyrighted or not.
In the US at least, facts cannot be copyrighted. Various cities have tried to stop alternate subway maps or transit apps from being sold, but courts have ruled that mass transit routes and schedules are facts and not protected by copyright.
I guess it makes sense that these auto generated books may fall under the same umbrella.