6 datasets found
  1. World’s Top 2% of Scientists list by Stanford University: An Analysis of its Robustness

    • data.mendeley.com
    Updated Nov 17, 2023
    Cite
    JOHN Philip (2023). World’s Top 2% of Scientists list by Stanford University: An Analysis of its Robustness [Dataset]. http://doi.org/10.17632/td6tdp4m6t.1
    Explore at:
    Dataset updated
    Nov 17, 2023
    Authors
    JOHN Philip
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    John Ioannidis and co-authors [1] created a publicly available database of the world's top-cited scientists. This database, intended to address the misuse of citation metrics, has generated a lot of interest among the scientific community, institutions, and media. Many institutions have used it as a yardstick to assess the quality of researchers. At the same time, some view the list with skepticism, citing problems with the methodology. Two separate databases were created, based on career-long and single recent-year impact. The database is built from Scopus data from Elsevier [1-3]. The scientists included are classified into 22 scientific fields and 174 sub-fields. The parameters considered for this analysis are total citations from 1996 to 2022 (nc9622), h-index in 2022 (h22), c-score, and world rank based on c-score (Rank ns). Citations without self-citations are considered in all cases (indicated as ns). In the single-year case, citations during 2022 (nc2222) are considered instead of nc9622.

    To evaluate the robustness of c-score-based ranking, I carried out a detailed analysis of the metric parameters of the last 25 years (1998-2022) of Nobel laureates in physics, chemistry, and medicine, and compared them with the top 100 rank holders in the list. The latest career-long and single-year databases (2022) were used for this analysis. The details are presented below. Though the article says the selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field, the actual career-based ranking list has 204,644 names [1], and the single-year database contains 210,199 names; the published list therefore covers roughly the top 4% of scientists. In the career-based rank list, the person with the lowest rank (4,809,825) had nc9622, h22, and c-score values of 41, 3, and 1.3632, respectively, whereas the person ranked No. 1 had 345,061, 264, and 5.5927. Three people on the list had fewer than 100 citations during 1996-2022, 1,155 had an h22 below 10, and 6 had a c-score below 2.
    In the single-year rank list, the person with the lowest rank (6,547,764) had nc2222, h22, and c-score values of 1, 1, and 0.6, respectively, whereas the person ranked No. 1 had 34,582, 68, and 5.3368. On this list, 4,463 people had fewer than 100 citations in 2022, 71,512 had an h22 below 10, and 313 had a c-score below 2. The entry of many authors with single-digit h-indices and very meager citation totals points to serious shortcomings in the c-score-based ranking methodology.
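    The robustness check above amounts to scanning the ranking tables for entries whose underlying metrics are implausibly weak. A minimal sketch of that filter in Python, using hypothetical field names (`nc`, `h`, `c_score`) rather than the dataset's actual column headers:

```python
# Sketch: flag ranking entries with weak underlying metrics.
# Field names (nc, h, c_score) are illustrative, not the dataset's schema.

def flag_weak_entries(rows, min_citations=100, min_h=10, min_c=2.0):
    """Return entries whose metrics fall below any of the thresholds."""
    return [
        r for r in rows
        if r["nc"] < min_citations or r["h"] < min_h or r["c_score"] < min_c
    ]

sample = [
    {"name": "A", "nc": 345061, "h": 264, "c_score": 5.5927},  # top-ranked profile
    {"name": "B", "nc": 41, "h": 3, "c_score": 1.3632},        # lowest-ranked profile
]
weak = flag_weak_entries(sample)  # only "B" is flagged
```

    The thresholds mirror those used in the analysis above (100 citations, h-index of 10, c-score of 2).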

  2. Medicare 20% [2006-2018] Enrollment/Summary (MBSF)

    • redivis.com
    application/jsonl +7
    Updated Dec 17, 2021
    Cite
    Stanford Center for Population Health Sciences (2021). Medicare 20% [2006-2018] Enrollment/Summary (MBSF) [Dataset]. http://doi.org/10.57761/wnn9-b060
    Explore at:
    Available download formats: avro, spss, sas, application/jsonl, csv, arrow, parquet, stata
    Dataset updated
    Dec 17, 2021
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Center for Population Health Sciences
    Time period covered
    Jan 1, 1999 - Dec 31, 2018
    Description

    Abstract

    Master Beneficiary Summary Files (MBSF)

    Usage

    This dataset page includes some of the tables from the Medicare Data in PHS's possession. Other Medicare tables are included on other dataset pages on the PHS Data Portal. Depending upon your research question and your DUA with CMS, you may only need tables from a subset of the Medicare dataset pages, or you may need tables from all of them.

    The location of each of the Medicare tables (i.e. a chart of which tables are included in each Medicare dataset page) is shown here.

    Before Manuscript Submission

    All manuscripts (and other items you'd like to publish) must be submitted to phsdatacore@stanford.edu for approval prior to journal submission. We will check your cell sizes and citations.

    For more information about how to cite PHS and PHS datasets, please visit:

    https://phsdocs.developerhub.io/need-help/citing-phs-data-core

    Documentation

    Metadata access is required to view this section.

    Section 2

    Metadata access is required to view this section.

    Usage Notes

    Metadata access is required to view this section.

  3. Medicare RIF 20%

    • redivis.com
    application/jsonl +7
    Updated Apr 9, 2018
    Cite
    Stanford Center for Population Health Sciences (2018). Medicare RIF 20% [Dataset]. http://doi.org/10.57761/2g49-b240
    Explore at:
    Available download formats: application/jsonl, stata, parquet, spss, avro, arrow, sas, csv
    Dataset updated
    Apr 9, 2018
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Center for Population Health Sciences
    Description

    Abstract

    Stanford has a 20% sample of CMS data. These data are hosted on our secure servers and can only be accessed after completing a reuse application with CMS. You can explore these data using our CMS Public files which have no restrictions.

    Documentation

    **A checklist for the steps in gaining access to the CMS RIF 20% sample can be found here:** CMS RIF 20% Sample Access Checklist

    ResDAC has full and current **CMS File Availability and Documentation**

    The Stanford Center for Population Health Sciences has purchased a 20% sample (linked) of all records for the files listed below. Where available, we have purchased all data from 2006-2018, though for some files not all years are available. N/A indicates that we have not purchased the file.

    Medicare Claims
    • Inpatient: N/A
    • Outpatient: 2006-2018
    • SNF: N/A
    • Hospice: 2006-2018
    • Home Health: 2006-2018
    • Carrier: 2006-2018
    • DMERC: 2006-2018

    Part D (Event with actual Prescriber/Pharmacy identifiers)
    • Drug Characteristics: 2006-2018
    • Prescriber Characteristics File: N/A
    • Formulary File: 2010-2018
    • Plan Characteristics Files: 2006-2018

    MEDPAR
    • All (SS/LS/SNF): 2006-2018

    Enrollment/Summary Files
    • Master Beneficiary Summary File: All years
    • Base Beneficiary Summary File A/B/C/D: 2006-2018
    • Chronic Conditions: 2006-2018
    • Cost & Utilization: 2006-2018
    • Other Chronic or Potentially Disabling Conditions: 2006-2018
    • National Death Index: N/A
    • EDB User View: Current
    • Vital Status File: Current

    Miscellaneous
    • MDPPAS: 2008-2018

  4. Medicare 20% [2019-2020] Enrollment/Summary

    • redivis.com
    • stanford.redivis.com
    application/jsonl +7
    Updated Jul 27, 2023
    Cite
    Stanford Center for Population Health Sciences (2023). Medicare 20% [2019-2020] Enrollment/Summary [Dataset]. http://doi.org/10.57761/xg2t-1343
    Explore at:
    Available download formats: avro, arrow, application/jsonl, parquet, spss, sas, csv, stata
    Dataset updated
    Jul 27, 2023
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Center for Population Health Sciences
    Time period covered
    Feb 11, 1815 - Dec 31, 2020
    Description

    Usage

    This dataset page includes some of the tables from the Medicare Data in PHS's possession. Other Medicare tables are included on other dataset pages on the PHS Data Portal. Depending upon your research question and your DUA with CMS, you may only need tables from a subset of the Medicare dataset pages, or you may need tables from all of them.

    The location of each of the Medicare tables (i.e. a chart of which tables are included in each Medicare dataset page) is shown here.

    Before Manuscript Submission

    All manuscripts (and other items you'd like to publish) must be submitted to phsdatacore@stanford.edu for approval prior to journal submission. We will check your cell sizes and citations.

    For more information about how to cite PHS and PHS datasets, please visit:

    https://phsdocs.developerhub.io/need-help/citing-phs-data-core

    Documentation

    Metadata access is required to view this section.

  5. Dockerfiles

    • kaggle.com
    Updated Jun 22, 2018
    Cite
    Stanford Research Computing Center (2018). Dockerfiles [Dataset]. https://www.kaggle.com/datasets/stanfordcompute/dockerfiles
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 22, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Stanford Research Computing Center
    Description

    Context

    The Dockerfiles dataset is a set of approximately 130,000 Dockerfiles extracted in early summer 2018 across a sampling of search prefixes. This dataset is released under an MIT license.

    $ find data -type f -name Dockerfile | wc -l
    129519
    

    The files are hosted as public images on Docker Hub and thus freely available for download and parsing.

    Content

    The files are currently provided in their raw format, each named Dockerfile under an organization by the Docker Hub username. For example, here is the top level of folders under "data" in the repository:

    data
    ├── 0
    ├── 1
    ├── 2
    ├── 3
    ├── 4
    ├── 5
    ├── 6
    ├── 7
    ├── 8
    ├── 9
    ├── a
    ├── b
    ├── c
    ...
    
    ├── w
    ├── x
    ├── y
    └── z
    36 directories, 0 files
    

    and within each, we have folders that represent Docker Hub usernames:

    data/a
    ├── a13r
    ├── a13xx
    ├── a1exanderjung
    ...
    ├── azuresdk
    ├── azzanatsu
    └── azzra
    

    And then each Dockerhub username has subfolders with container names, and the subfolders contain the Dockerfiles (no pun intended).

    data/a/a13r
    ├── waecm-2018-group-16-bsp-1-backend
    │  └── Dockerfile
    ├── waecm-2018-group-16-bsp-1-frontend
    │  └── Dockerfile
    └── waecm-2018-group-16-bsp-1-revproxy
      └── Dockerfile
    
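    Given the layout above, collecting every Dockerfile reduces to a recursive directory walk. A minimal sketch in Python (the `"data"` root path is illustrative):

```python
import os

def find_dockerfiles(root):
    """Walk root/<prefix>/<username>/<container>/ and yield each Dockerfile path."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name == "Dockerfile":
                yield os.path.join(dirpath, name)

# Counting the results mirrors the `find data ... | wc -l` command above:
# total = sum(1 for _ in find_dockerfiles("data"))
```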

    Download

    Since this dataset (despite the huge number of files!) still fits in a GitHub repository, the files are provided as-is under version control and don't require any special downloading aside from cloning the repo or downloading an archive.

    git clone https://www.github.com/vsoch/datasets
    wget https://github.com/vsoch/dockerfiles/archive/1.0.0.zip
    wget https://github.com/vsoch/dockerfiles/archive/1.0.0.tar.gz
    

    Acknowledgements

    Thanks for reading! If you have other questions, or want help for your project, please don't hesitate to reach out. If the dataset is useful to you, we have a Zenodo reference:

    DOI

    Inspiration

    Many of the same questions about signatures of software can be tested or generally relevant for this dataset. Additionally, we might ask the following:

    • How do containers relate (or inherit) from one another? For example, if we use the FROM statements to build a graph, what interesting things do we find?
    • What are signatures (of installation routines?) common across different containers?
    • Can we classify different operating systems, domains of science, or package managers?
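    The first question, inheritance via FROM statements, can be prototyped by parsing each Dockerfile's FROM lines into parent-to-child graph edges. A sketch of that parse (deliberately simplified: multi-stage builds, ARG substitution, and platform flags would need more care):

```python
import re

# Capture the image reference that follows FROM; case-insensitive,
# one match per line.
FROM_RE = re.compile(r"^\s*FROM\s+(\S+)", re.IGNORECASE | re.MULTILINE)

def from_edges(image_name, dockerfile_text):
    """Yield (base_image, image_name) edges, one per FROM statement."""
    for match in FROM_RE.finditer(dockerfile_text):
        yield (match.group(1), image_name)

edges = list(from_edges("a13r/backend", "FROM python:3.6\nRUN pip install flask\n"))
```

    Accumulating these edges over the whole dataset yields a directed graph whose connected components reveal families of related containers.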

    Resources

  6. US ZIP codes to CBSA

    • redivis.com
    application/jsonl +7
    Updated Dec 2, 2019
    Cite
    Stanford Center for Population Health Sciences (2019). US ZIP codes to CBSA [Dataset]. http://doi.org/10.57761/mk9y-ty94
    Explore at:
    Available download formats: arrow, application/jsonl, stata, parquet, avro, spss, csv, sas
    Dataset updated
    Dec 2, 2019
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Center for Population Health Sciences
    Time period covered
    Jan 1, 2010 - Apr 1, 2019
    Description

    Abstract

    A crosswalk matching US ZIP codes to corresponding CBSAs (core-based statistical areas).

    Documentation

    The denominators used to calculate the address ratios are the ZIP code totals. When a ZIP is split by any of the other geographies, that ZIP code is duplicated in the crosswalk file.

    **Example:** ZIP code 03870 is split by two different Census tracts, 33015066000 and 33015071000, which appear in the tract column. The ratio of residential addresses in the first ZIP-Tract record to the total number of residential addresses in the ZIP code is .0042 (0.42%). The remaining residential addresses in that ZIP (99.58%) fall into the second ZIP-Tract record.

    So, for example, if one wanted to allocate data from ZIP code 03870 to each Census tract located in that ZIP code, one would multiply the number of observations in the ZIP code by the residential ratio for each tract associated with that ZIP code.
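    The allocation rule described above is a ratio-weighted multiply. A minimal sketch in Python, using the 03870 example (the 1,000 observations are a hypothetical count; the ratios are from the crosswalk example above):

```python
def allocate_by_ratio(zip_total, tract_ratios):
    """Split a ZIP-level observation count across tracts by residential-address ratio."""
    return {tract: zip_total * ratio for tract, ratio in tract_ratios.items()}

# ZIP 03870 split across its two tracts
ratios = {"33015066000": 0.0042, "33015071000": 0.9958}
allocated = allocate_by_ratio(1000, ratios)
```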

    (Note that the sum of each ratio column for each distinct ZIP code may not always equal 1.00 (or 100%) due to rounding.)

    CBSA definition

    A core-based statistical area (CBSA) is a U.S. geographic area defined by the Office of Management and Budget (OMB) that consists of one or more counties (or equivalents) anchored by an urban center of at least 10,000 people, plus adjacent counties that are socioeconomically tied to the urban center by commuting. The areas defined by applying these standards to Census 2000 data were announced by OMB in June 2003, replacing the metropolitan-area definitions issued in 1990. OMB released new standards based on the 2010 Census on July 15, 2015.

    Further reading

    The following article demonstrates how to more effectively use the U.S. Department of Housing and Urban Development (HUD) United States Postal Service ZIP Code Crosswalk Files when working with disparate geographies.

    Wilson, Ron and Din, Alexander, 2018. “Understanding and Enhancing the U.S. Department of Housing and Urban Development’s ZIP Code Crosswalk Files,” Cityscape: A Journal of Policy Development and Research, Volume 20 Number 2, 277 – 294. URL: https://www.huduser.gov/portal/periodicals/cityscpe/vol20num2/ch16.pdf

    Contact authors

    Questions regarding these crosswalk files can be directed to Alex Din with the subject line HUD-Crosswalks.

    Acknowledgement

    This dataset is taken from the U.S. Department of Housing and Urban Development (HUD) office: https://www.huduser.gov/portal/datasets/usps_crosswalk.html#codebook

