
It's a 2.64GB CSV file with the following columns:

    handle
    type
    name
    website
    founded
    industry
    specialties
    size
    city
    state
    country_code

15,263,246 rows.

I think the main listing for Google is this one (as an example):

10361050:company/google,Public Company,Google,goo.gle,,Software Development,"search, ads, mobile, android, online video, apps, machine learning, virtual reality, cloud, hardware, artificial intelligence, youtube, and software","10,001+",Mountain View,California,US
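A quick way to check how that line maps onto the 11 columns is to parse it with a real CSV parser, since the specialties and size fields contain commas inside quotes. This is only a sketch using Python's standard `csv` module; the `COLUMNS` list and the `record` dict are my own names, and the line is copied exactly as pasted above, prefix included.

```python
import csv
import io

# Column names as listed at the top of this comment.
COLUMNS = ["handle", "type", "name", "website", "founded", "industry",
           "specialties", "size", "city", "state", "country_code"]

# The example line exactly as pasted above (the leading prefix is kept as-is).
line = ('10361050:company/google,Public Company,Google,goo.gle,,'
        'Software Development,"search, ads, mobile, android, online video, '
        'apps, machine learning, virtual reality, cloud, hardware, '
        'artificial intelligence, youtube, and software","10,001+",'
        'Mountain View,California,US')

# csv.reader honours the double quotes, so embedded commas stay in one field.
fields = next(csv.reader(io.StringIO(line)))
record = dict(zip(COLUMNS, fields))

# A naive split on commas breaks the quoted fields apart.
naive = line.split(",")

print(len(fields), "fields via csv.reader,", len(naive), "via str.split")
print("size:", record["size"])
print("founded is blank:", record["founded"] == "")
```

The blank `founded` value here matches the large "Blank rows" counts in the stats: missing values are empty strings, not NULLs.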




Ran https://sqlite-utils.datasette.io/en/stable/cli.html#cli-ana... to figure out the most common values in each column:

    company.type: (2/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 5032965

      Distinct values: 189

      Most common:
        5128934: Privately Held
        5032965: 
        1202598: Self-Owned
        1007806: Partnership
        952111: Public Company
        763992: Nonprofit
        749117: Self-Employed
        319885: Educational
        84829: Government Agency
        2423: De financiación privada

    company.website: (4/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 3043640

      Distinct values: 10675220

      Most common:
        3043640: 
        81214: facebook.com
        50174: instagram.com
        41050: business.site
        27134: linktr.ee
        24024: indiamart.com
        19683: wixsite.com
        16546: negocio.site
        15864: linkedin.com
        13201: yelp.com

    company.founded: (5/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 7626823

      Distinct values: 1551

      Most common:
        7626823: 
        488964: 2020
        438343: 2017
        417847: 2018
        404720: 2016
        398489: 2019
        386049: 2015
        383686: 2021
        338120: 2014
        298768: 2013

    company.industry: (6/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 1274455

      Distinct values: 2591

      Most common:
        1274455: 
        793016: IT Services and IT Consulting
        626267: Advertising Services
        623184: Construction
        424481: Real Estate
        417648: Business Consulting and Services
        416398: Software Development
        401791: Retail
        337914: Financial Services
        305029: Wellness and Fitness Services

    company.specialties: (7/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 9681220

      Distinct values: 5226043

      Most common:
        9681220: 
        1678: Real Estate
        1353: Education
        557: Software Development
        537: real estate
        516: Recruitment
        456: Marketing
        420: Property Management
        396: Digital Marketing
        393: Hospitality

    company.size: (8/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 2531526

      Distinct values: 226

      Most common:
        6317189: 2-10
        3459571: 11-50
        2531526: 
        1170922: 51-200
        730818: 1 employee
        417504: 201-500
        262702: 1
        149520: 501-1,000
        130678: 1,001-5,000
        43399: 10,001+

    company.city: (9/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 3039470

      Distinct values: 393107

      Most common:
        3039470: 
        262475: London
        116915: Paris
        110220: New York
        96914: São Paulo
        72455: Los Angeles
        66837: Madrid
        65075: Toronto
        59321: New Delhi
        58151: Dubai

    company.state: (10/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 4326168

      Distinct values: 55773

      Most common:
        4326168: 
        670232: England
        567647: California
        318576: Texas
        283576: New York
        276783: Florida
        215523: São Paulo
        172516: Maharashtra
        164761: Ontario
        161408: Île-de-France

    company.country_code: (11/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 2858360

      Distinct values: 272

      Most common:
        3943846: US
        2858360: 
        1204699: GB
        816378: IN
        691296: FR
        638895: BR
        443326: DE
        401268: NL
        386207: ES
        373460: CA

    company_fts.specialties: (13/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 9681220

      Distinct values: 5226043

      Most common:
        9681220: 
        1678: Real Estate
        1353: Education
        557: Software Development
        537: real estate
        516: Recruitment
        456: Marketing
        420: Property Management
        396: Digital Marketing
        393: Hospitality


Very interesting, thanks!


How did you compute this? I just did another check to verify (wc -l) and it's coming to 15,980,531.


I used wc -l at first, but I've just imported into SQLite and the count(*) is 15,263,246 - updated my previous comment (which had said 15,263,251).

I downloaded the CSV and ran:

    sqlite-utils insert companies.db company companies-dataset-2023-02-ckgENv.csv --csv
    sqlite-utils enable-fts companies.db company name specialties
    sqlite-utils analyze-tables --save companies.db

This lets me run searches against the name and specialties columns, and gives me those aggregate stats too.
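For anyone who wants similar numbers without installing sqlite-utils, here is a rough sketch of the same idea using only Python's standard library: load a CSV into SQLite, then compute total, null, blank, distinct and most common values for one column. It is an approximation of what the output above reports, not how sqlite-utils implements it. The sample data and the `load_csv` and `column_stats` helpers are made up for illustration, and the full-text search step is left out.

```python
import csv
import io
import sqlite3

# Tiny made-up sample standing in for the real 2.64GB file.
SAMPLE_CSV = """handle,type,city
company/a,Privately Held,London
company/b,,London
company/c,Privately Held,
company/d,Nonprofit,Paris
"""


def load_csv(conn, table, text):
    """Create a table from the CSV header and insert every row as text."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    return header


def column_stats(conn, table, column, top=10):
    """Return total/null/blank/distinct counts plus the most common values."""
    total, nulls, blanks, distinct = conn.execute(
        f'SELECT count(*), sum("{column}" IS NULL), '
        f"sum(\"{column}\" = ''), count(DISTINCT \"{column}\") "
        f'FROM "{table}"'
    ).fetchone()
    most_common = conn.execute(
        f'SELECT "{column}", count(*) AS n FROM "{table}" '
        f'GROUP BY "{column}" ORDER BY n DESC, "{column}" LIMIT ?',
        (top,),
    ).fetchall()
    return {"total": total, "null": nulls, "blank": blanks,
            "distinct": distinct, "most_common": most_common}


conn = sqlite3.connect(":memory:")
load_csv(conn, "company", SAMPLE_CSV)
stats = column_stats(conn, "company", "type")
print(stats)
```

Note that `count(DISTINCT ...)` here counts the empty string as a value; I have not checked whether the "Distinct values" figure in the real output treats blanks the same way.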


Ok. I'm not sure how this happened, but I think the dataset was somehow mislabeled. It appears that this dataset is the Q1 version, not the latest Q2. Can you please try re-downloading it?

We're probably going to have to make a public announcement about this...


OK, that one has 15,948,996 rows.

Here's what I got from running the same "sqlite-utils analyze-tables companies2.db company" command against it:

    company.handle: (1/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 0

      Distinct values: 15948996

    company.type: (2/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 5253878

      Distinct values: 92

      Most common:
        5311279: Privately Held
        5253878: 
        1290064: Self-Owned
        1055857: Partnership
        987045: Public Company
        828643: Self-Employed
        799655: Nonprofit
        334552: Educational
        87681: Government Agency
        35: De financiación privada

    company.name: (3/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 1591

      Distinct values: 15549439

      Most common:
        1591: 
        1098: .
        277: A
        246: -
        164: None
        155: X
        142: N/A
        132: ...
        128: x
        122: 1

    company.website: (4/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 3249552

      Distinct values: 11272926

      Most common:
        3249552: 
        86957: facebook.com
        57769: instagram.com
        46404: business.site
        31397: linktr.ee
        27882: indiamart.com
        21852: wixsite.com
        19008: negocio.site
        17366: linkedin.com
        13224: yelp.com

    company.founded: (5/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 8040264

      Distinct values: 1561

      Most common:
        8040264: 
        524236: 2020
        451742: 2017
        441748: 2018
        426575: 2019
        418318: 2021
        411391: 2016
        389487: 2015
        339212: 2014
        299038: 2013

    company.industry: (6/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 1334901

      Distinct values: 421

      Most common:
        1334901: 
        820156: IT Services and IT Consulting
        651746: Construction
        651557: Advertising Services
        465857: Software Development
        455111: Business Consulting and Services
        447922: Real Estate
        435151: Retail
        355049: Financial Services
        312937: Wellness and Fitness Services

    company.size: (7/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 2655086

      Distinct values: 123

      Most common:
        6646929: 2-10
        3584483: 11-50
        2655086: 
        1197530: 51-200
        1091094: 1 employee
        421595: 201-500
        150053: 501-1,000
        129373: 1,001-5,000
        44755: 10,001+
        27742: 5,001-10,000

    company.city: (8/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 3155391

      Distinct values: 410985

      Most common:
        3155391: 
        269708: London
        124059: Paris
        113135: New York
        99314: São Paulo
        75428: Los Angeles
        69691: Madrid
        67328: Toronto
        63738: Dubai
        63456: New Delhi

    company.state: (9/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 4524015

      Distinct values: 58563

      Most common:
        4524015: 
        691167: England
        584141: California
        329584: Texas
        291639: New York
        286723: Florida
        222552: São Paulo
        185925: Maharashtra
        171885: Ontario
        171657: Île-de-France

    company.country_code: (10/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 2961064

      Distinct values: 272

      Most common:
        4059985: US
        2961064: 
        1232403: GB
        885302: IN
        756411: FR
        664235: BR
        467501: DE
        414433: NL
        410372: ES
        389535: CA


The thing I find most interesting is this:

        269708: London
        124059: Paris
        113135: New York
        99314: São Paulo
        75428: Los Angeles
        69691: Madrid
        67328: Toronto
        63738: Dubai
        63456: New Delhi

I would not have expected São Paulo to come fourth in this list, after New York but in front of Los Angeles. I just learned it's the 4th largest city in the world https://en.wikipedia.org/wiki/List_of_largest_cities - after Tokyo, Delhi, Shanghai - but I guess it has much more of a representation on LinkedIn than those other cities.


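wc -l counts every newline character in the file, including newlines that sit inside quoted CSV fields (and the header line), so it can report more lines than there are records. Perhaps some fields, such as company descriptions, contain newlines?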
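To illustrate the difference, here is a small sketch with made-up data: one record has a newline inside a quoted field, so a raw newline count (what wc -l reports) comes out higher than the number of records a CSV parser sees. The sample rows are hypothetical and not taken from the dataset.

```python
import csv
import io

# Made-up CSV: the first data row has a newline inside a quoted field.
data = (
    "handle,name,specialties\n"
    'company/a,Acme,"widgets,\ngadgets"\n'
    "company/b,Beta,\n"
)

# Equivalent of `wc -l`: count raw newline characters.
newline_count = data.count("\n")

# A CSV parser treats the quoted newline as part of the field.
records = list(csv.reader(io.StringIO(data, newline="")))
data_rows = len(records) - 1  # subtract the header row

print("newlines:", newline_count)
print("data rows:", data_rows)
print("multi-line field:", repr(records[1][2]))
```

Running the same kind of comparison against the real file (raw line count versus parsed record count) would show whether embedded newlines explain the gap between the wc -l figure and the SQLite count(*).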


I don't have any work email. Can anyone please share some other source to download?


Not an answer to your question, but I did find their justification for why they require a work email address intriguing.

https://blog.bigpicture.io/how-we-stopped-spam-signups/

I've been working through this issue a lot lately. There tend to be two camps, the "Make the experience good for the person trying to access the data, because good relationships are more important than the ability to contact someone" versus "Why should I give someone something for free when there is zero chance I can ever make a sale to that person?"



