Unicode tools have (or at least should have) ways to determine visually similar letters. Maybe someone more knowledgeable about Unicode than I am can pull out the term for it (maybe "homograph"?). For example: 'ö' and 'o' should 'match' using this method. It also allows you to do things like make sure that μ (mu) and µ (micro sign) match.
So in response to:
> I'm not sure if interchangeable is the right word. 'phởne' and 'phone' yield different result lists.
The results are different because it's matching some things that are an exact character match, and others that are just visually similar (homographs?).
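For illustration, here is a minimal Python sketch of this kind of loose matching using Unicode normalization: NFKC folds compatibility characters (the micro sign becomes Greek mu), and stripping combining marks after NFD folds 'ö' to 'o'. This is just one way to approximate homograph-style matching, not necessarily what the search in question actually does.

    import unicodedata

    def loose_key(s: str) -> str:
        # Fold a string so that visually/semantically similar characters
        # compare equal. An approximation, not a full homograph table.
        s = unicodedata.normalize("NFKC", s)          # e.g. micro sign -> Greek mu
        decomposed = unicodedata.normalize("NFD", s)  # split base letters from accents
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        return stripped.casefold()

    assert loose_key("µ") == loose_key("μ")        # micro sign vs. mu
    assert loose_key("ö") == loose_key("o")
    assert loose_key("phởne") == loose_key("phone")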
Does it mean that it supports only basic common denominator features of every language? I am a fan of attributes in C# - are Haxe annotations as powerful?
Out of the box, Haxe's metadata [1] doesn't appear to me to be as powerful as C#'s attributes. The big difference is that it is not type-safe and is not type-checked by the compiler.
You can, however, use any expression you want in them, have macros read them, confirm they are correct, and do whatever you want with them. For example, I do some validation on my models:
@:validate( _.length>0 && _.indexOf(' ')==-1 )
public var username:String;
As for the bigger question - they are limited in what they can add, but Haxe has many features not found in its target languages; it just implements them in a syntax-heavy way on the targets that lack such features. Pattern matching is a good example [2].
You can replace all commas with a placeholder (e.g. "#COMMA#"), replace the delimiter with a comma, parse the document, and then replace all placeholders in the data with ",".
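For concreteness, a small Python sketch of those steps, assuming a semicolon-delimited file and assuming (as discussed below) that the placeholder never occurs in the data; Python's csv module stands in here for whatever comma-only downstream tool you have:

    import csv, io

    PLACEHOLDER = "#COMMA#"   # assumed never to occur in the data

    def parse_with_placeholder(text, delimiter=";"):
        # 1. protect literal commas   2. turn the real delimiter into a comma
        # 3. parse as ordinary CSV    4. restore the literal commas
        protected = text.replace(",", PLACEHOLDER).replace(delimiter, ",")
        rows = csv.reader(io.StringIO(protected))
        return [[field.replace(PLACEHOLDER, ",") for field in row] for row in rows]

    print(parse_with_placeholder("name;comment\nbob;hello, world\n"))
    # [['name', 'comment'], ['bob', 'hello, world']]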
That does not work, unless that first replacement magically ignores the commas that are part of field separators. If you know how to write the code that does that, your problem is solved.
I was referring to "What if the character separating fields is not a comma?".
And there it clearly works. I have used this technique a few times with success. If you find a CSV file that has mixed field-separator types, then you have probably found a broken CSV file.
You just choose a placeholder that does not appear in the data. You could even implement it in such a way that a placeholder which does not appear in the data is selected automatically upfront.
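One way that automatic selection could look, as a sketch only (this is my illustration, not what the commenter actually uses): scan the data first and extend a candidate placeholder until it no longer occurs anywhere.

    def choose_placeholder(text, base="#COMMA"):
        # Append a counter to the candidate until it does not occur in the data.
        candidate, n = base + "#", 0
        while candidate in text:
            n += 1
            candidate = f"{base}{n}#"
        return candidate

    print(choose_placeholder("a;b;c with #COMMA# already inside"))  # -> '#COMMA1#'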
When it comes to parsing, the thing is that you usually have to make some assumptions about the document structure.
What if there is #COMMA, in one of the fields (but no #COMMA#)?
Yes, the assumption you have to make is called the grammar, and you had better have a parser that always does what the grammar says. Global text replacement is a technique that is easy to get wrong, difficult to prove correct, and completely unnecessary at that.
> What if there is #COMMA, in one of the fields (but no #COMMA#)?
What should happen? Since #COMMA is not #COMMA#, it does not get replaced, because it does not match.
Please keep in mind that I replied to suni's very specific question and did not try to start a discussion about general parser theory. In practice, we find a lot of files that do not respect the grammar, but we still need to find a way to make the data accessible.
What would happen is that you would first replace "#COMMA," with "#COMMA#COMMA#" and then later replace that with ",COMMA#", thus garbling the data.
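To make that concrete, a quick Python illustration of the failure mode, using the same #COMMA# placeholder as above:

    field = "#COMMA,"                                 # literal text that happens to end in a comma
    protected = field.replace(",", "#COMMA#")         # -> '#COMMA#COMMA#'
    restored = protected.replace("#COMMA#", ",")      # -> ',COMMA#'
    print(restored != field)                          # True: the original '#COMMA,' is lost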
The way to make the data accessible is to request that the producer be fixed; it's that simple. If that is completely impossible, you'll have to figure out the grammar of the data that you actually have and build a parser for that. Your suggested strategy does not work.
Usually the person parsing the CSV data doesn't have control over the way the data gets written. If he did, he would probably prefer something like protocol buffers. CSV is the lowest common denominator, so it's a useful format for exchanging data between different organizations that are producing and consuming the data.
https://github.com/dbro/csvquote is a small and fast script that can replace ambiguous separators (commas and newlines, for example) inside quoted fields, so that other text tools can work with a simple grammar. After that work is done, the ambiguous commas inside quoted fields get restored. I wrote it so I could use unix shell tools like cut, awk, ... with CSV files containing millions of records.
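For readers who want the gist, here is a rough Python sketch of the same idea (not csvquote itself; the sentinel bytes below are my own choice): commas and newlines inside quoted fields are swapped for characters assumed never to occur in the data, line-oriented tools then see one record per line, and the substitution is undone afterwards so the round trip is the identity. A real implementation also has to deal with escaped quotes and streaming input; the point here is just the reversible transformation.

    SUB_COMMA = "\x1f"     # sentinel choices are assumptions, not csvquote's actual bytes
    SUB_NEWLINE = "\x1e"

    def sanitize(text):
        # Replace commas/newlines that occur inside double-quoted fields.
        out, in_quotes = [], False
        for ch in text:
            if ch == '"':
                in_quotes = not in_quotes
                out.append(ch)
            elif in_quotes and ch == ",":
                out.append(SUB_COMMA)
            elif in_quotes and ch == "\n":
                out.append(SUB_NEWLINE)
            else:
                out.append(ch)
        return "".join(out)

    def restore(text):
        return text.replace(SUB_COMMA, ",").replace(SUB_NEWLINE, "\n")

    row = 'id,note\n1,"hello, world"\n'
    assert restore(sanitize(row)) == row    # reversible round trip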
You tend to have more control over the way the data is produced than you think, and you should make use of it. It's idiotic to work around broken producers over and over and over again, each time with a high risk of introducing some bugs, instead of pushing back and getting the producer fixed once and for all. Often the problem is simply in the perception that somehow broken output is just "not quite right", and therefore nothing to make a fuss about. That's not how reliable data processing works. You have a formal grammar, and either your data conforms to it or it does not, and if it doesn't, good software should simply reject it.
Your csvquote is something completely different, though it seems like you yourself might be confused about what it actually is when you use the word "ambiguous". There is nothing ambiguous about commas and newlines in CSV fields; if there were, that would be a bug in the grammar. It just so happens that many unix shell tools cannot handle CSV files in any meaningful way, because that is not their input grammar.

Now, what your csvquote actually does is translate between CSV and a format that is compatible with that input grammar on some level, in a reversible manner. The thing to recognize is that that format is _not_ CSV, and that you are actually parsing the input according to the CSV grammar, which is what makes the translation reversible. Such a conversion between formats is obviously perfectly fine - as long as you can prove that the conversion is reversible, that the round trip is the identity function, that the processing you do on the converted data is actually isomorphic to what you conceptually want to do, and so on.
BTW, I suspect that code would be quite a bit faster if you didn't use a function pointer in that way and/or made the functions static. I haven't checked what compilers do with it, but chances are they keep that pointer call in the inner loop, which would be terribly slow. Also, you might want to review your error checking; there are quite a few opportunities for errors to go undetected and thus silently corrupt data.
I used that strategy for parsing gigabytes of CSVs containing arbitrary natural language from the web - try to get these files fixed, or figure out a grammar for gigabytes of fuzzy data...
My approach never failed for me, so telling me that my strategy does not work is a strong claim when it reliably did the job.
Your examples are all valid, but what you are describing are theoretical attacks on the method, while the method works in almost all cases in practice. We are talking about two different viewpoints: dealing with large amounts of messy data on one hand and parser theory in an ideal cosmos on the other hand.
How do you know that the strategy worked reliably if you never compared the results to the results obtained using a reliable method (which you presumably didn't, because then you could just have used the reliable method)? The larger the data you have to deal with, the more likely it is that corner cases will occur in it, and the less likely that you will notice anomalies, thus the more important that you are very strict in your logic if you want to derive any meaningful results.
As such, the two viewpoints really are: not really caring about the soundness of your results and solving the actual problem.
Now, maybe you really can show that the bugs in the methods you use only cause negligible noise in your results, in which case it might be perfectly fine to use those methods. But just ignoring errors in your deduction process because you don't feel like doing the work of actually solving the problem at hand is not pragmatism. You'll have to at least demonstrate that your approach does not invalidate the result.
As I wrote above, by choosing a placeholder that does not appear in the data, I make sure that it does not cause the issues you describe. And if I am wrong about that assumption, I can at least minimize the effect by choosing a very unlikely sequence as the placeholder.
I really see no issue here. How do you find valid grammars for fuzzy data in practice?
> He continued: "I don't know how to measure it, but it gives us an idea that what we're doing is being understood by some. And there are some good peers of mine also, who are very high-ranking in the film business and the music business, sending me a lot of good will. It's been real positive."
So Wu-Tang Clan fans in Kazakhstan or Tanzania (or even every country other than the U.S.) will probably never be able to listen to this album...? I guess these will be the people who don't "understand" what Wu-Tang Clan is doing, while only the privileged ones "understand" the concept.
That's artificial scarcity, not art (not talking about the music itself).
My perspective on art is a reaction to the elitism of the art scene, so basically my comments are art.
Edit/addition: Honestly, I could have much more respect for this project if Wu-Tang made it accessible only to homeless people, or only to prisoners, but effectively they are making it accessible only to the rich. I really do like the Wu-Tang Clan, but I am really not impressed by this stunt.
Sure - and I think Sony e-readers were even among the first to allow side-loading of non-DRM ebooks.
The problem is the economics. There just haven't been financially successful examples of standalone readers, and Sony has been no exception.
Amazon can offer cheap Kindles because they can sell ebooks onto the devices directly and make money from the content. Apple can charge more for their books because people are bought into their iPad investment, and because competitors have to pay Apple 30% to sell directly onto an iOS device.
Sony, meanwhile, continues to fail to bring an integrated value proposition to the market. Everything with them involves friction, compared to the competition. Last time I was at CES, I couldn't even find a Sony reader in their booth.
> Is Microsoft circa 2014 worse than Google, Apple, or Facebook? We're not nearly as organized as we'd need to be to be as evil as you might think we are.
Microsoft is not any worse than the other companies. They are all at the same terrible level.
But Microsoft has become a bit better over the last few years, I would say.
I usually think of a business model whereby a company goes around and tells others "pay me for patent X or I'll sue you" as extortion. And shockingly enough, that's part of the modern, saintly Microsoft business model. Not to mention the suspicious fast-tracking of the Word document standard.
Another thing is trust. Why would you trust an organization which still had at least one famous C-level executive who executed a number of business strategies you have a problem with? Why would you trust an organization run by Steve Ballmer? Now, maybe new management will turn Microsoft around, but it has done enough dodgy things over the years that yes, it will have to "bend over backwards" for a long time before I trust it again.
I'll add that, on the other hand, I have the utmost respect for Microsoft Research, which keeps churning out amazing results. I just wish it were called something else. I think I'd like the idea of a chartered research establishment, like the BBC, with a secure amount of public EU money, focusing on the future of computing rather than scrounging for grants.
> We also show that automated attempts at circumventing stylometry using machine translation may not be as effective, often altering the meaning of text while providing only small drops in accuracy.
Yes, Angular can use jQuery if it's present in your app when the application is being bootstrapped. If jQuery is not present in your script path, Angular falls back to its own implementation of the subset of jQuery that we call jQLite.
Due to a change to use on()/off() rather than bind()/unbind(), Angular 1.2 only operates with jQuery 1.7.1 or above.
> Human Corrected Translations for 1 cent per word
"Per word" of the source language, or the target language? Sum of both? What about languages which have a different concept of "words" in written text (e.g. Chinese, Turkish, ...).
And by the way... "cent" of which currency? :)
Edit: I just saw that the list of supported languages does not contain languages with "exotic" types of word boundaries (yet).
We are still trying to figure that out; there is no clear answer regarding how to base the pricing. Perhaps it should be based on the words that were actually corrected? So for the time being we are basing it only on source-language words. For Chinese, for example, we will probably do it based on characters, but ideas are welcome.
I think only pricing the source language makes sense. That way the consumer knows the cost going in and there's no incentive for you to provide more verbose translations.
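A toy sketch of that pricing rule, counting only the source side so the cost is known up front; the per-word and per-character rates below are illustrative assumptions, not the service's actual prices:

    PRICE_PER_SOURCE_WORD = 0.01   # USD, illustrative
    PRICE_PER_CJK_CHAR = 0.005     # USD, illustrative assumption for character-counted languages

    def quote_price(source_text, lang):
        # Price on the source text only, so the customer knows the cost going in.
        if lang in {"zh", "ja"}:                          # no space-delimited words
            units = len(source_text.replace(" ", "").replace("\n", ""))
            rate = PRICE_PER_CJK_CHAR
        else:
            units = len(source_text.split())
            rate = PRICE_PER_SOURCE_WORD
        return round(units * rate, 2)

    print(quote_price("The quick brown fox jumps over the lazy dog", "en"))  # 9 words -> 0.09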