The data fields kind of confuse me. I assumed that there would be one medallion per operating cab, right? And I assume that one cab, i.e. one medallion, can make only one pickup in a given second, so pickup_datetime and medallion together should form a unique key, right?
But I get dupes. In the 2013 January dataset, medallion "DD336C4ADA65CCBD2284F63BF348A4F0" makes 3 pickups at 7:09 AM on Jan 13, 2013. The dropoff time is recorded as the exact same second as the pickup, yet the pickup coordinates and the dropoff coordinates are all different. There are all kinds of wonky real-life situations (a driver accidentally starting the fare and shutting it off during a trip, perhaps), but this seems to indicate an error in how the data is recorded. Even if the machine is doing something that breaks business rules, the timestamp should still be consistent with real time, right? Anyway, it's something to be wary of when doing analysis.
And it also looks like the hack licenses are all different. I'm not even sure how that works, but it doesn't seem to be simply a case of 3 different drivers being registered to the same medallion...
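For anyone who wants to reproduce the check, here is a minimal sketch of the uniqueness test described above: group rows by (medallion, pickup_datetime) and report any key that appears more than once. The column names are assumed to match the trip data headers, and the sample rows are synthetic placeholders, not real records.

```python
from collections import defaultdict


def find_duplicate_pickups(rows):
    """Return {(medallion, pickup_datetime): [rows]} for keys seen more than once."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["medallion"], row["pickup_datetime"])].append(row)
    return {key: grp for key, grp in groups.items() if len(grp) > 1}


# Synthetic placeholder rows (assumed column names, made-up values).
sample_rows = [
    {"medallion": "MEDALLION_A", "hack_license": "DRIVER_1",
     "pickup_datetime": "2013-01-13 07:09:00"},
    {"medallion": "MEDALLION_A", "hack_license": "DRIVER_2",
     "pickup_datetime": "2013-01-13 07:09:00"},
    {"medallion": "MEDALLION_A", "hack_license": "DRIVER_3",
     "pickup_datetime": "2013-01-13 07:09:00"},
    {"medallion": "MEDALLION_B", "hack_license": "DRIVER_4",
     "pickup_datetime": "2013-01-13 07:09:00"},
]

dupes = find_duplicate_pickups(sample_rows)
for (medallion, pickup), grp in dupes.items():
    # Distinct hack licenses within one duplicate group.
    licenses = sorted({r["hack_license"] for r in grp})
    print(medallion, pickup, len(grp), licenses)
```

On a real file you could feed the function rows from `csv.DictReader`; for a full month of data it is probably cheaper to do the same grouping in SQL.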
Heh, this is far from the worst problem in this dataset. Wait until you find the clusters of lat-long coordinates that are off by 1 degree in either direction (or 0.1 degrees).
Then there are some pickups clustered around weird areas of NJ and CT (NYC cabs are not supposed to pick up passengers outside NY). Lots of weirdness to sink your teeth into.
But it's a really fun dataset. You can see Hurricane Sandy, big city events, construction, etc.
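A rough sketch of how one might flag those coordinate oddities: test each point against a bounding box, and for points outside it, see whether undoing a 1 or 0.1 degree shift along one axis would bring the point back inside. The box limits below are my own approximate guess at NYC's extent, and the shift test is only a heuristic; a point that passes it is a candidate for a shifted coordinate, not a proven one.

```python
# Approximate NYC bounding box: lat_min, lat_max, lon_min, lon_max.
# These limits are an assumption for illustration, not official boundaries.
NYC_BOX = (40.49, 40.92, -74.27, -73.68)


def in_box(lat, lon, box=NYC_BOX):
    lat_min, lat_max, lon_min, lon_max = box
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


def classify_point(lat, lon):
    """Label a point 'ok', 'shifted_by_<delta>', or 'outside'."""
    if in_box(lat, lon):
        return "ok"
    for delta in (1.0, 0.1):
        # Try moving the point by delta along a single axis, in each direction.
        for dlat, dlon in ((delta, 0.0), (-delta, 0.0), (0.0, delta), (0.0, -delta)):
            if in_box(lat + dlat, lon + dlon):
                return f"shifted_by_{delta}"
    return "outside"


print(classify_point(40.75, -73.98))  # inside the box
print(classify_point(41.75, -73.98))  # latitude one degree too far north
print(classify_point(40.95, -73.98))  # just north of the box
print(classify_point(0.0, 0.0))       # nowhere near the box
```

Note that a 0.1 degree shift is only detectable near the edge of the box, since most such shifts still land inside it, and legitimate NJ or CT points near the boundary would be labeled the same way.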
Sure, it wouldn't be hard to reload the data into a new dataset/table with more accurate types. Strings are just the default. I just wasn't sure how clean the data was and didn't want to debug bad data before loading. I was lazy, basically.
But if you wanted to do it, it's pretty straightforward.
In the meantime, I haven't found it too hard to just use INTEGER(field) and FLOAT(field) wherever I need to.
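If you pull the string-typed rows out into Python instead, the same cast-on-read idea looks roughly like this. It's only a sketch: the helper names are made up, the field names passenger_count and trip_distance are assumed, and returning None for unparseable values is a choice made here so that dirty rows don't abort the whole run.

```python
def to_int(value):
    """Parse a string as int, returning None if it isn't a clean integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def to_float(value):
    """Parse a string as float, returning None if it can't be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Synthetic row with every field stored as a string (assumed field names).
row = {"passenger_count": "2", "trip_distance": "1.7"}
typed = {
    "passenger_count": to_int(row["passenger_count"]),
    "trip_distance": to_float(row["trip_distance"]),
}
print(typed)

# Bad values come back as None instead of raising.
print(to_int(""), to_float("n/a"))
```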
I was playing around with Google Maps and this data: you can see a heatmap of taxi pickups by day for the first 6 days of July. It could also be useful for showing Uber drivers where to find passengers, based on last year's data.
The three data rows in question, for medallion "DD336C4ADA65CCBD2284F63BF348A4F0" at 7:09 AM on Jan 13, 2013: