A raw dump of companies from all over the world by LinkedIn handle

babblingfish · on May 17, 2023

It's funny how OP does not address where this data comes from even though it's obviously from LinkedIn. I see many people in the comments asking questions so I will add my two cents as someone who is currently employed by LinkedIn and has an interest in web scraping.

This dataset was taken from scraping the company pages from LinkedIn. A company has to pay to have this page, so this certainly does not include all companies. If you have a premium account your search is not rate limited so you can iteratively scrape anything you want even though it's technically a violation of the terms of service.

There are many companies that sell data scraped from LinkedIn as a product. LinkedIn won a court case against hiQ Labs for scraping member data and other things[1]. I am not trying to compare this court case to the OP's website, just something worth mentioning.

In any case, web scraping is a sort of gray area of the law. In my opinion, this data set does not contain member data and is not being monetized so it feels kosher to me.

(Opinions expressed are solely my own and do not express the views or opinions of my employer.)

[1] https://www2.staffingindustry.com/Editorial/IT-Staffing-Repo...

simonw · on May 17, 2023

> A company has to pay to have this page, so this certainly does not include all companies.

I found a listing in there for my old startup derived from the LinkedIn page, and we never paid for that LinkedIn listing.

babblingfish · on May 18, 2023

Yeah apparently it's free, my bad, I remembered it incorrectly

prepend · on May 17, 2023

> In any case, web scraping is a sort of gray area of the law.

I don’t think it’s grey. It seems to be legal as the data are made freely available and the only grey part is that companies don’t want this to happen and would rather charge and not have people scrape.

ddingus · on May 19, 2023

It reminds me of a similar conflict in public records.

Many municipalities charge for copies of public records. One can go to viewing rooms and examine the records at no charge. We all own those records as members of the public that funded them, and the place where they are kept.

Many municipalities want to prohibit photography because people taking their own picture of a public record does not involve the copy fee.

The municipality confuses access to records as a part of their fee, and that is where attempts to limit photography come from.

The fee actually funds the work necessary for an official copy to be made, optionally stamped to be admissible in court or accepted as an "original copy."

A photocopy of a death certificate is no different than the one the city clerk made, except for the stamp and clerk being able to testify about making the record copy.

Some companies and government agencies will accept a death cert no matter what. Others want an official copy.

Getting back to scraping:

Clearly people have access and can make their own data copies.

Maybe the answer is for companies to make official data products available, or something along those lines.

JohnFen · on May 17, 2023

It is a gray area (in the US) in the sense that there is no clear consensus about it in the courts. There have been court rulings in both directions.

hallqv · on May 17, 2023

What rulings are you referring to? Been reaffirmed that scraping LinkedIn is legal multiple times, even by Supreme Court. https://techcrunch.com/2022/04/18/web-scraping-legal-court/

JohnFen · on May 17, 2023

I believe that decision was that web scraping doesn't violate the CFAA unless there are access controls (such as the need to log in) to get to it, but they specifically said that the ruling doesn't comment about other possible claims against it.

That's why I think it's still a gray area. But I could be wrong -- I haven't been following this stuff all that closely.

prepend · on May 17, 2023

What rulings have been against it?

You don’t need consensus to allow something, you need consensus to be against something, otherwise it’s allowed.

This “scraping is Grey” sounds like FUD to me. Legally, you can scrape anything publicly available as long as you don’t damage the server. You understand that google is just a giant web scraper right?

And all the other search engines spidering the web. Do you think they are risking their business being in the “grey?”

tyingq · on May 17, 2023

There does seem to be some distinction if you log into and scrape:

"Only three claims remained for the final order - the violation of the CAN-SPAM Act, violation of the CFAA and California Penal Code.

The district court then granted summary judgment to Facebook on all three of the remaining Facebook claims. The district court awarded statutory damages of $3,031,350"

https://en.wikipedia.org/wiki/Facebook,_Inc._v._Power_Ventur....

prepend · on May 18, 2023

That’s completely different though as they are scraping private info that’s proprietary to Facebook.

I think there’s no grey area for whether scraping publicly available data is legal as it is legal.

SnowHill9902 · on May 17, 2023

So it’s not gray, it’s melange.

hallqv · on May 17, 2023

Multiple court reaffirms in favor of scraping, including by the SC, seems more like verdant to me..

blitzar · on May 18, 2023

http get requests are legal.

mcenedella · on May 17, 2023

LinkedIn lost the HiQ case. Your GC claiming it’s a win doesn’t change that you lost, repeatedly, in multiple courts of law: https://en.m.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

noizejoy · on May 17, 2023

> LinkedIn lost the HiQ case. Your GC claiming it’s a win doesn’t change that you lost, repeatedly, in multiple courts of law

That would appear to be false, since after appeals, there was a settlement which would seem to make it pretty clear that hiQ lost:

https://www.natlawreview.com/article/hiq-and-linkedin-reach-...

foolswisdom · on May 17, 2023

According to that wiki page, the courts eventually did rule that HiQ violated LinkedIn's terms of service, and a settlement agreement was reached.

henryfjordan · on May 17, 2023

Linkedin lost on the CFAA claims, there were others that they did not lose.

dheera · on May 17, 2023

> In any case, web scraping is a sort of gray area of the law.

I feel like it's also only a matter of time before we have decentralized web scrapers that can consume data and assemble datasets and store the results in a decentralized fashion in a way that is completely non-enforceable. LinkedIn data would be aggregated in all sorts of interesting ways, and there would be literally no name you could sue, no address you could serve papers to, it would be a hundred thousand bots in 100 countries that collectively assembled the dataset and put it on some IPFS or torrent.

If data is visible with human eyes, it's visible by decentralized bots, the only missing piece is the technical and financial complexity of the current state of decentralized compute.

SXX · on May 18, 2023

> If you have a premium account your search is not rate limited so you can iteratively scrape anything you want even though it's technically a violation of the terms of service.

Well. This is certainly not true.

Even with a Premium subscription you can hit a hard limit on personal profile views while just actively going through profiles without any automation. And LinkedIn will temporarily restrict and threaten to ban your account.

oldtownroad · on May 17, 2023

I have lots of LinkedIn pages for various companies (alive and dead). I’ve never had to pay: while there are limits on page creation for free accounts (which appear to be there to prevent abuse), there’s no direct payment required to create a company page.

I’m not sure if you’re accidentally leaking a new strategy that LinkedIn are about to launch or if you’re mistaken but given you’re claiming to work for LinkedIn, probably worth correcting your comment either way.

babblingfish · on May 17, 2023

Definitely not the latter. I am referring to the former point.

oldtownroad · on May 18, 2023

The limit only applies if you’ve created… I can’t remember the number exactly and it’s not listed anywhere but I recall hitting a limit around 5 pages.

You can see that it’s free to make a page here — the only requirement is a free account: https://www.linkedin.com/help/linkedin/answer/a543852/create...

babblingfish · on May 18, 2023

B1FF_PSUVM · on May 17, 2023

> and is not being monetized

It seems to have a paid version with more data: https://docs.bigpicture.io/docs/free-datasets/companies/#fie...

(just the fact, I don't have an opinion)

joshspankit · on May 18, 2023

I’m on the same page. It also requires users to create an account in order to get the free version, and I would argue that in most cases that’s part of a monetization funnel.

babblingfish · on May 18, 2023

dang · on May 17, 2023

The submitted title was "World's largest open source company dataset", but (1) "world's largest" is linkbait and the article walks it back, (2) "open source" could be worded better per https://news.ycombinator.com/item?id=35979581, and (3) the only thing left in the title after taking those out would be "company dataset", which is too generic to be a good title.

I've therefore replaced the title above with what appears to be an accurate description from https://news.ycombinator.com/item?id=35978156.

mfrye0 · on May 17, 2023

Hey dang. That's fair.

To be frank, we went back and forth on this, but in the end, thought the original title was ok. The only other large, "open source" dataset we could find was 9M. So after researching, we came to the conclusion that it sounded clickbaity, but was likely accurate.

And yes on "open source". We fully intend for this to be "open source" in the full meaning of the word, but it seems we were moving too fast and missed adding the formal license.

mfrye0 · on May 17, 2023

Hey HN, we're thrilled to announce our latest project - the World's Largest Open Source Company Dataset. Our team has been working hard on this product for the past few months, and we're excited to finally share it with you all.

We started off years ago trying to build a B2B app, but getting basic company data at scale was a huge barrier for us. This 15M+ record dataset attempts to solve that and has all the key company fields like name, industry, size, location, LinkedIn handle, etc. We aim to update it quarterly to ensure that you always have the most up-to-date information.

Disclaimer: Okay, we have to admit, we didn't exactly comb through every dataset out there to verify that ours is the world's largest, but we did our research, and we're pretty sure it might be. Whether or not that's true, we believe this dataset is a robust and invaluable resource for anyone interested in company data.

rickette · on May 17, 2023

Out of curiosity: Would you be willing to share how you acquired this data? Website scraping or other means?

Atlas22 · on May 17, 2023

Data is very likely to be from LinkedIn if you look at the field descriptions and stats. The only field that is 100% available is the one based on the LinkedIn URL. I would guess scraping unless LinkedIn provides an API for this data that I can't find.

jamesgill · on May 17, 2023

They (Microsoft) have APIs for app/website integration, but that's all I know about.

https://developer.linkedin.com/

yolo3000 · on May 17, 2023

Most likely from scraping (crunchbase, yahoo,etc), unless they bought it from somewhere. In most countries you can get it from the chamber of commerce. Dun and Bradstreet and other similar companies. Some of these data aggregators will have partnership with other companies, and you can also (illegally) scrape it from there.

mfrye0 · on May 17, 2023

So this blew up today. Reviewing all the comments now.

1. Yes, this is scraped from public sources. 2. Yes, this is free to use / is open source in the broadest sense. Apologies for the confusion on the lack of a license and no mention about this in our TOS. We probably should update our TOS to be clearer here. 3. This is a raw dump of companies from all over the world by LinkedIn handle. The handles are deduped, but the website is not.

justinclift · on May 17, 2023

> Yes, this is free to use ...

Including for commercial purposes?

mfrye0 · on May 17, 2023

tyingq · on May 17, 2023

Seems at odds with your tos. https://bigpicture.io/terms

jsty · on May 17, 2023

What's the license the dataset is released under? I poked around the documentation a bit but couldn't find it - apologies if it's in there!

conzept · on May 17, 2023

The data is not licensed as opensource (only non-commercial use): "The Service and its entire contents, features and functionality (including but not limited to all information, software, text, displays, images, video and audio, and the design, selection and arrangement thereof), are owned by the Company, its licensors or other providers of such material and are protected by United States and international copyright, trademark, patent, trade secret and other intellectual property or proprietary rights laws. You are permitted to use the Service for Your personal, non-commercial use, or legitimate business purposes related to Your role as a customer of BigPicture.io." - https://bigpicture.io/terms

paxys · on May 17, 2023

You use "open source" multiple times in the post, HN title, HN comments, but:

1. The source code for the project isn't shared anywhere.

2. The data isn't shared under any standard open source license.

3. The terms of your site explicitly prohibit commercial use of this data.

So what exactly makes this "open source in the broadest sense"?

time_to_smile · on May 17, 2023

It's open source in the sense of OSINT [0]. Clearly confusing on a site like Hacker News, but this has been standard usage of the term for that community for a long time now.

0. https://en.wikipedia.org/wiki/Open-source_intelligence

turtleyacht · on May 18, 2023

Thank-you for the clarification. "Open-source" is definitely different from "Open Source."

Meaning Open-source (sourced from open sources) but claiming Open Source is disingenuous.

Wikipedia isn't helpful either, because it refers to OSS as Open-source Software [1].

Open Source meaning may be more useful in comparison to Free Software. Stallman refers to "Open-source" (hyphenated) only once in this article, but only to refer to it as confusing versus free software [2].

It's possible "OSINT as Open-source" has been in use for longer than Stallman's use of "open source," but definitely they are different.

It's strange a site would sell up a feature on HN as "Open-source content, in the meaning of OSINT" without being up-front about it. The default assumption would be "open source as code that is free to modify, etc."

The mental gymnastics would be

  1. They claim it is "open source."
  2. They are talking about _content._
  3. It must be the OSINT kind of "open."

This could be a pattern, because they're always needing to add another comment, "Just kidding, we meant OSINT open; we're not sharing the code."

... documentation could be open source too, though--in the sense of "free to modify, etc" and not "sourced from freely available data."

Could it be both? Only if they accept contributions, I guess.

[1] https://en.m.wikipedia.org/wiki/Open-source_software

[2] https://www.gnu.org/philosophy/open-source-misses-the-point....

nxqs · on May 17, 2023

"It's 'open source'"... https://youtu.be/dTRKCXC0JFg?t=6

mfrye0 · on May 17, 2023

Fair points. I agree the wording could be better.

No, the source code is not available. This dataset is a subset of the raw data our system collects. Our final product made available via the API does a variety of processing steps on the raw data (dedupes, joins, ML predictions, etc). The final, processed data is the piece that is proprietary / subject to the terms.

We will update the site terms to reference this dataset, as we aim to continue releasing an updated version each quarter. I'll have to double check with the lawyers, but it will most likely be MIT licensed.

slabity · on May 17, 2023

What exactly does "open-source" mean to you? Because it sounds like there's absolutely nothing open about this other than a small scraping of LinkedIn data (which you should probably ask your lawyers if you're even allowed to license out).

The wording isn't just misleading, it's a complete lie.

EDIT: Nevermind, the title has been updated to accurately reflect this being a small datadump.

simonw · on May 17, 2023

It's a 2.64GB CSV file with the following columns:

    handle
    type
    name
    website
    founded
    industry
    specialties
    size
    city
    state
    country_code

15,263,246 rows.

I think the main listing for Google is this one (as an example):

10361050:company/google,Public Company,Google,goo.gle,,Software Development,"search, ads, mobile, android, online video, apps, machine learning, virtual reality, cloud, hardware, artificial intelligence, youtube, and software","10,001+",Mountain View,California,US
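A minimal sketch of why this row needs a real CSV parser rather than a plain split on commas: the specialties and size fields are quoted and contain commas themselves. The column list comes from the comment above. Treating the leading `10361050:` as a line-number prefix (as from `grep -n`) and dropping it is an assumption, and `parse_row` is a hypothetical helper name.

```python
import csv
import io

# Column names as listed in the comment above.
COLUMNS = ["handle", "type", "name", "website", "founded", "industry",
           "specialties", "size", "city", "state", "country_code"]

# The Google row quoted above. Assumption: the leading "10361050:" is a
# line-number prefix and not part of the data, so it is left out here.
ROW = ('company/google,Public Company,Google,goo.gle,,Software Development,'
       '"search, ads, mobile, android, online video, apps, machine learning, '
       'virtual reality, cloud, hardware, artificial intelligence, youtube, '
       'and software","10,001+",Mountain View,California,US')


def parse_row(line):
    """Parse one CSV line into a dict keyed by the dataset's column names."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(COLUMNS, values))


record = parse_row(ROW)
print(record["size"])         # the quoted size bucket survives intact
print(record["specialties"])  # one field, despite the embedded commas
print(len(ROW.split(",")), "pieces from a naive split vs", len(record), "fields")
```

The empty `founded` value comes back as an empty string, which matches the "Blank rows" (rather than "Null rows") that show up in the stats further down the thread.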

simonw · on May 17, 2023

Ran https://sqlite-utils.datasette.io/en/stable/cli.html#cli-ana... to figure out the most common values in each column:

    company.type: (2/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 5032965

      Distinct values: 189

      Most common:
        5128934: Privately Held
        5032965: 
        1202598: Self-Owned
        1007806: Partnership
        952111: Public Company
        763992: Nonprofit
        749117: Self-Employed
        319885: Educational
        84829: Government Agency
        2423: De financiación privada

    company.website: (4/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 3043640

      Distinct values: 10675220

      Most common:
        3043640: 
        81214: facebook.com
        50174: instagram.com
        41050: business.site
        27134: linktr.ee
        24024: indiamart.com
        19683: wixsite.com
        16546: negocio.site
        15864: linkedin.com
        13201: yelp.com

    company.founded: (5/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 7626823

      Distinct values: 1551

      Most common:
        7626823: 
        488964: 2020
        438343: 2017
        417847: 2018
        404720: 2016
        398489: 2019
        386049: 2015
        383686: 2021
        338120: 2014
        298768: 2013

    company.industry: (6/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 1274455

      Distinct values: 2591

      Most common:
        1274455: 
        793016: IT Services and IT Consulting
        626267: Advertising Services
        623184: Construction
        424481: Real Estate
        417648: Business Consulting and Services
        416398: Software Development
        401791: Retail
        337914: Financial Services
        305029: Wellness and Fitness Services

    company.specialties: (7/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 9681220

      Distinct values: 5226043

      Most common:
        9681220: 
        1678: Real Estate
        1353: Education
        557: Software Development
        537: real estate
        516: Recruitment
        456: Marketing
        420: Property Management
        396: Digital Marketing
        393: Hospitality

    company.size: (8/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 2531526

      Distinct values: 226

      Most common:
        6317189: 2-10
        3459571: 11-50
        2531526: 
        1170922: 51-200
        730818: 1 employee
        417504: 201-500
        262702: 1
        149520: 501-1,000
        130678: 1,001-5,000
        43399: 10,001+

    company.city: (9/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 3039470

      Distinct values: 393107

      Most common:
        3039470: 
        262475: London
        116915: Paris
        110220: New York
        96914: São Paulo
        72455: Los Angeles
        66837: Madrid
        65075: Toronto
        59321: New Delhi
        58151: Dubai

    company.state: (10/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 4326168

      Distinct values: 55773

      Most common:
        4326168: 
        670232: England
        567647: California
        318576: Texas
        283576: New York
        276783: Florida
        215523: São Paulo
        172516: Maharashtra
        164761: Ontario
        161408: Île-de-France

    company.country_code: (11/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 2858360

      Distinct values: 272

      Most common:
        3943846: US
        2858360: 
        1204699: GB
        816378: IN
        691296: FR
        638895: BR
        443326: DE
        401268: NL
        386207: ES
        373460: CA

    company_fts.specialties: (13/22)

      Total rows: 15263246
      Null rows: 0
      Blank rows: 9681220

      Distinct values: 5226043

      Most common:
        9681220: 
        1678: Real Estate
        1353: Education
        557: Software Development
        537: real estate
        516: Recruitment
        456: Marketing
        420: Property Management
        396: Digital Marketing
        393: Hospitality

tacker2000 · on May 17, 2023

Very interesting, thanks!

mfrye0 · on May 17, 2023

How did you compute this? I just did another check to verify (wc -l) and it's coming to 15,980,531.

simonw · on May 17, 2023

I used wc -l at first, but I've just imported into SQLite and the count(*) is 15,263,246 - updated my previous comment (which had said 15,263,251).

I downloaded the CSV and ran:

    sqlite-utils insert companies.db company companies-dataset-2023-02-ckgENv.csv --csv
    sqlite-utils enable-fts companies.db company name specialties
    sqlite-utils analyze-tables --save companies.db

This lets me run searches against the name and specialties columns, and gives me those aggregate stats too.
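For anyone without sqlite-utils installed, here is a rough sketch of the same idea using only the Python standard library: load CSV rows into SQLite, then count the most common values in a column. It is not how sqlite-utils is implemented, and `load_csv`, `most_common`, and the tiny `SAMPLE` data are made up for illustration.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Create a table from the CSV header and insert every data row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    return header


def most_common(conn, table, column, limit=10):
    """Return (value, count) pairs, most frequent first, ties by value."""
    sql = (f'SELECT "{column}", COUNT(*) AS n FROM "{table}" '
           f'GROUP BY "{column}" ORDER BY n DESC, "{column}" LIMIT ?')
    return conn.execute(sql, (limit,)).fetchall()


# Hypothetical stand-in for the real file; one row has a blank city.
SAMPLE = """handle,city
company/a,London
company/b,Paris
company/c,London
company/d,
"""

conn = sqlite3.connect(":memory:")
load_csv(conn, "company", SAMPLE)
total = conn.execute("SELECT COUNT(*) FROM company").fetchone()[0]
top = most_common(conn, "company", "city")
print(total, top)
```

For the real file you would pass an open file handle to `csv.reader` instead of a string, and the blank value shows up in the ranking just like it does in the analyze-tables output.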

mfrye0 · on May 17, 2023

Ok. I'm not sure how this happened, but I think the dataset was somehow mislabeled. It appears that this dataset is the Q1 version, not the latest Q2. Can you please try re-downloading it?

We're probably going to have to make a public announcement about this...

simonw · on May 17, 2023

OK, that one has 15,948,996 rows.

Here's what I got from running the same "sqlite-utils analyze-tables companies2.db company" command against it:

    company.handle: (1/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 0

      Distinct values: 15948996

    company.type: (2/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 5253878

      Distinct values: 92

      Most common:
        5311279: Privately Held
        5253878: 
        1290064: Self-Owned
        1055857: Partnership
        987045: Public Company
        828643: Self-Employed
        799655: Nonprofit
        334552: Educational
        87681: Government Agency
        35: De financiación privada

    company.name: (3/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 1591

      Distinct values: 15549439

      Most common:
        1591: 
        1098: .
        277: A
        246: -
        164: None
        155: X
        142: N/A
        132: ...
        128: x
        122: 1

    company.website: (4/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 3249552

      Distinct values: 11272926

      Most common:
        3249552: 
        86957: facebook.com
        57769: instagram.com
        46404: business.site
        31397: linktr.ee
        27882: indiamart.com
        21852: wixsite.com
        19008: negocio.site
        17366: linkedin.com
        13224: yelp.com

    company.founded: (5/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 8040264

      Distinct values: 1561

      Most common:
        8040264: 
        524236: 2020
        451742: 2017
        441748: 2018
        426575: 2019
        418318: 2021
        411391: 2016
        389487: 2015
        339212: 2014
        299038: 2013

    company.industry: (6/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 1334901

      Distinct values: 421

      Most common:
        1334901: 
        820156: IT Services and IT Consulting
        651746: Construction
        651557: Advertising Services
        465857: Software Development
        455111: Business Consulting and Services
        447922: Real Estate
        435151: Retail
        355049: Financial Services
        312937: Wellness and Fitness Services

    company.size: (7/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 2655086

      Distinct values: 123

      Most common:
        6646929: 2-10
        3584483: 11-50
        2655086: 
        1197530: 51-200
        1091094: 1 employee
        421595: 201-500
        150053: 501-1,000
        129373: 1,001-5,000
        44755: 10,001+
        27742: 5,001-10,000

    company.city: (8/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 3155391

      Distinct values: 410985

      Most common:
        3155391: 
        269708: London
        124059: Paris
        113135: New York
        99314: São Paulo
        75428: Los Angeles
        69691: Madrid
        67328: Toronto
        63738: Dubai
        63456: New Delhi

    company.state: (9/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 4524015

      Distinct values: 58563

      Most common:
        4524015: 
        691167: England
        584141: California
        329584: Texas
        291639: New York
        286723: Florida
        222552: São Paulo
        185925: Maharashtra
        171885: Ontario
        171657: Île-de-France

    company.country_code: (10/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 2961064

      Distinct values: 272

      Most common:
        4059985: US
        2961064: 
        1232403: GB
        885302: IN
        756411: FR
        664235: BR
        467501: DE
        414433: NL
        410372: ES
        389535: CA

simonw · on May 17, 2023

The thing I find most interesting is this:

        269708: London
        124059: Paris
        113135: New York
        99314: São Paulo
        75428: Los Angeles
        69691: Madrid
        67328: Toronto
        63738: Dubai
        63456: New Delhi

I would not have expected São Paulo to come fourth in this list, after New York but in front of Los Angeles. I just learned it's the 4th largest city in the world https://en.wikipedia.org/wiki/List_of_largest_cities - after Tokyo, Delhi, Shanghai - but I guess it has much more of a representation on LinkedIn than those other cities.

rahimnathwani · on May 17, 2023

wc -l will count all newlines, even those that are escaped. Perhaps some company descriptions have newlines?
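A small sketch of the difference, assuming the embedded newlines sit inside quoted CSV fields: counting newline characters (what `wc -l` does) versus counting parsed CSV records. The `SAMPLE` data and both function names are made up for illustration, and this does not show that the real file actually contains such fields.

```python
import csv
import io


def count_newlines(text):
    """Count newline characters, which is what `wc -l` reports."""
    return text.count("\n")


def count_records(text, has_header=True):
    """Count CSV records, treating newlines inside quoted fields as data."""
    rows = sum(1 for _ in csv.reader(io.StringIO(text, newline="")))
    return rows - 1 if has_header and rows else rows


# Hypothetical sample: the first company's specialties field spans two lines.
SAMPLE = (
    "handle,name,specialties\n"
    'company/a,Acme,"widgets,\ngadgets"\n'
    "company/b,Bolt,\n"
)

print(count_newlines(SAMPLE), count_records(SAMPLE))
```

Here the newline count is higher than the record count for two reasons: the header line is counted, and so is the line break inside the quoted field.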

GirishSharma643 · on May 17, 2023

I don't have any work email. Can anyone please share some other source to download?

jehb · on May 17, 2023

Not an answer to your question, but I did find their justification for why they require a work email address intriguing.

https://blog.bigpicture.io/how-we-stopped-spam-signups/

I've been working through this issue a lot lately. There tend to be two camps, the "Make the experience good for the person trying to access the data, because good relationships are more important than the ability to contact someone" versus "Why should I give someone something for free when there is zero chance I can ever make a sale to that person?"

decide1000 · on May 17, 2023

Not sure why you need an account. Download it here:

https://bigpicture-datasets-public.s3.us-west-2.amazonaws.co...

maxlin · on May 18, 2023

New link: https://bigpicture-datasets-public.s3.us-west-2.amazonaws.co...

sundarurfriend · on May 17, 2023

There's a comment below saying it's a 2.64GB CSV file, while gzip shows the uncompressed size of this one to be 1.97GB.

medvezhenok · on May 17, 2023

Presumably as lead generation for their main upsell, which is the enriched version of the same data set :)

opportune · on May 18, 2023

This whole thing is amateur to the core, it’s an improperly licensed (are you open source or not? What license? You don’t get to say “probably eventually MIT” lol) freemium scrape of linkedin masquerading as a product

ttul · on May 17, 2023

Oh boy, DMCA request incoming...

maxlin · on May 18, 2023

That is literally their own site. I don't think they would DMCA themselves

Thanks for the link. The work email requirement for just the download is just bullshit (yes I did read their piece, with that kind of logic they could also require credit card info etc while still calling it free like many scammy sites do)

nologic01 · on May 17, 2023

Good luck with your launch! This reminded me of a similar project, the opencorporates database (https://opencorporates.com/), though the target use cases seem different.

photochemsyn · on May 17, 2023

"With over 15 million global companies included..."

What distinguishes a global from a non-global company? Also, how many of these are anonymous Delaware/Nevada/South Dakota/etc-based shell companies, or are those excluded from the dataset somehow?

Veen · on May 17, 2023

I suspect it means "15 million companies from around the world".

chirau · on May 17, 2023

I think they meant 'from across the globe' as in not necessarily from the US only or other specific geographies.

gorbachev · on May 17, 2023

Is LinkedIn scraped data open source?

Murrawhip · on May 17, 2023

On your home page you list Microsoft as being one of your clients. I'm pretty impressed that you managed to sell them what appears to be (mostly) their own data.

mpeg · on May 17, 2023

We went through this in the ad world before GDPR existed. Companies were selling datasets for advertising and, I remember one of my clients (one of the largest ad agencies in the world) telling me they had over 100 sources of data they augmented their publisher data with but they tested and saw there was only about 20% uniqueness on average between sources.

They were buying the same data again and again, recycled and repackaged.
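To make the 20% figure concrete, here is one possible way to measure it: the share of a vendor's records that are not already in the data you hold. The comment does not say how uniqueness was actually computed, so this definition, the `uniqueness` function, and the sample domains are all assumptions.

```python
def uniqueness(existing, candidate):
    """Fraction of candidate records that do not already appear in existing."""
    candidate = set(candidate)
    if not candidate:
        return 0.0
    return len(candidate - set(existing)) / len(candidate)


# Hypothetical keys; in practice these might be domains or hashed IDs.
owned = {"a.com", "b.com", "c.com", "d.com"}
vendor = {"a.com", "b.com", "c.com", "d.com", "e.com"}

print(uniqueness(owned, vendor))  # 0.2: only one record in five is new
```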

1024core · on May 17, 2023

What would be really interesting is to turn this into a graph based on, say, past experience of CEOs/big dealings with each other/etc.

tomalaci · on May 17, 2023

The moment those graph connections will show negative view of some of those CEOs they will start screaming about privacy issues and/or sue the graph creator.

With scraping LinkedIn alone it should already be possible to create such graphs. Not sure why I haven't seen any. Either scraping LinkedIn isn't as straightforward or such graph-creating attempts have been shut down due to the potential for defaming someone.

1024core · on May 17, 2023

I would say that scraping LinkedIn is not easy.

givemeethekeys · on May 17, 2023

Are the entries deduped? If one company owns another, is that represented as well?

ricardo81 · on May 17, 2023

'open source'

scrape crunchbase

scrape companies house

scrape wherever else

scrape linkedin

frontier company... or maybe not.

tuukkah · on May 17, 2023

A simple Wikidata query can return same type of information in case you prefer open data: https://w.wiki/6ify

data_maan · on May 17, 2023

I looked at the attributes they say the dataset has. Not too many (e.g. number of people, location). The really interesting ones, like who is doing business with whom, are missing.

mikecoles · on May 17, 2023

I'd like to see parent company listed. How many brands fall under TTI or Unilever?

For example, a hierarchy of tool brands: https://www.protoolreviews.com/power-tool-manufacturers-who-...

NoboruWataya · on May 17, 2023

https://search.gleif.org/#/record/549300MKFYEKVRWML317

(Not brands, but shows you subsidiaries.)

cdkmoose · on May 17, 2023

From the deeper dataset documentation, it looks like this free dataset is a field subset of their paid product.

tomalaci · on May 17, 2023

How hard is it to scrape LinkedIn for all its public profile data? Do you need special developer access? Do you need to sign some contract with MSFT for anything nontrivial?

borkborkimacat · on May 18, 2023

semi-direct link as there's some tomfoolery going on with this: https://wetransfer.com/downloads/b937345cd81d96654cb2d2bb43d...

Wronnay · on May 17, 2023

I always get "Oops! We ran into an error. Contact us at support@bigpicture.io" when I try to sign-up

mfrye0 · on May 18, 2023

Apologies. We've had a huge problem with bots, so we have a number of security measures in place. The Google Captcha component is probably flagging you as a bot.

Try disabling your VPN if you're on one, or use a different IP.

seanhunter · on May 18, 2023

Oh the irony.

companydataguy · on May 17, 2023

This is Duedil + Company Check + Open Corporates.

Duedil and CC were (mostly) powered by Creditsafe data which is much better in Europe at least than D&B

Open Corporates sold their data to Creditsafe for low 5 figures.

Interesting point re D&B in the EU: it’s mostly a license of the brand name and owns little of the data or the business.

r3trohack3r · on May 17, 2023

This is awesome and in the ballpark of something I'm working on right now.

I currently have a list of developer handles, their associated aliases, and their associated email addresses - trying to map that set to employment history.

Do folks know of any good data sets for this?

andylynch · on May 17, 2023

How do you plan to identify or differentiate between legal entities? E.g. a big company like your Uber example will often have numerous subsidiaries, in many jurisdictions. Do you plan on including well-known identifiers like LEIs in your model?

visarga · on May 17, 2023

I searched for a dataset like this for a long time, trying to use it to augment named entity recognition tasks for documents. But now that GPT is on the market, this works out of the box. It's still useful as reference for validation.

pimlottc · on May 17, 2023

I'm confused, most of what's in this dataset has nothing to do with RedHat.

speedgoose · on May 17, 2023

The blog picture looks to be generated with dalle2. The quality was mind blowing less than a year ago, while it is now a mess of artefacts compared to dalle2.5 (Bing), adobe firefly, stable diffusion, and of course MidJourney.

wg0 · on May 17, 2023

What I am curious to know about is - what company buys from whom and the whole dependency graph to visualise how complex our modern economy is.

But not sure that kind of information is in there.

byyll · on May 17, 2023

Closest I can think of: https://www.importyeti.com

yolo3000 · on May 17, 2023

That sort of data is normally not public

sam_lowry_ · on May 17, 2023

Individual countries know this, and even share rather successfully across unions.

EU exchanges VAT information which is exactly about who buys from whom... down to transactions of 120€ and more if I am not mistaken.

Commercial databases are more limited, but even those can trace ownership of companies to ultimate beneficial owners. Orbis from Bureau Van Dijk is probably the biggest of such databases.

Rastonbury · on May 19, 2023

15 million is small, for example there are vendors who have indexed several hundred million, not speaking from data quality but size wise

brentis · on May 18, 2023

Can we rename to Glengarry leads?

seanhunter · on May 18, 2023

You don’t get the Glengarry leads. Because to give them to you … would be throwing them away.

whoomp12342 · on May 17, 2023

sounds fun until I have to make an account. then nope. too lazy.