This represents Harvard's responses to the Common Data Initiative. The Common Data Set (CDS) initiative is a collaborative effort among data providers in the higher education community and publishers as represented by the College Board, Peterson's, and U.S. News & World Report. The combined goal of this collaboration is to improve the quality and accuracy of information provided to all involved in a student's transition into higher education, as well as to reduce the reporting burden on data providers. This goal is attained by the development of clear, standard data items and definitions in order to determine a specific cohort relevant to each item. Data items and definitions used by the U.S. Department of Education in its higher education surveys often serve as a guide in the continued development of the CDS. Common Data Set items undergo broad review by the CDS Advisory Board as well as by data providers representing secondary schools and two- and four-year colleges. Feedback from those who utilize the CDS also is considered throughout the annual review process.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is the final release of the 2020 CES Common Content Dataset. The data includes a nationally representative sample of 61,000 American adults. This release includes the data from the survey, a full guide to the data, and the questionnaires. The dataset includes vote validation performed by Catalist. Please consult the guide and the study website (https://cces.gov.harvard.edu/frequently-asked-questions) if you have questions about the study. Special thanks to Marissa Shih and Rebecca Phillips for their work in preparing this data for release.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/4.0/customlicense?persistentId=doi:10.7910/DVN/DBW86T
Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We tackle this problem by releasing the HAM10000 ("Human Against Machine with 10000 training images") dataset. We collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc). More than 50% of lesions are confirmed through histopathology (histo); the ground truth for the rest of the cases is either follow-up examination (follow_up), expert consensus (consensus), or confirmation by in-vivo confocal microscopy (confocal). The dataset includes lesions with multiple images, which can be tracked by the lesion_id column within the HAM10000_metadata file.
Due to upload size limitations, images are stored in two files:
HAM10000_images_part1.zip (5000 JPEG files)
HAM10000_images_part2.zip (5015 JPEG files)
Additional data for evaluation purposes
The HAM10000 dataset served as the training set for the ISIC 2018 challenge (Task 3), with the same sources contributing the majority of the validation and test sets as well. The test-set images are available herein as ISIC2018_Task3_Test_Images.zip (1511 images); the ground truth, in the same format as the HAM10000 data (public since 2023), is available as ISIC2018_Task3_Test_GroundTruth.csv. The ISIC-Archive also provides the challenge images and metadata (training, validation, test) at their "ISIC Challenge Datasets" page.
Comparison to physicians
Test-set evaluations of the ISIC 2018 challenge were compared to physicians on an international scale, where the majority of challenge participants outperformed expert readers: Tschandl P. et al., Lancet Oncol 2019
Human-computer collaboration
The test-set images were also used in a study comparing different methods and scenarios of human-computer collaboration: Tschandl P. et al., Nature Medicine 2020
The following corresponding metadata is available herein:
ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.csv: Human ratings for test images with and without interaction with a ResNet34 CNN (Malignancy Probability, Multi-Class probability, CBIR) or Human-Crowd Multi-Class probabilities. These data were collected for and analyzed in Tschandl P. et al., Nature Medicine 2020; please refer to this publication when using the data. Some details on the abbreviated column headings:
image_id: This is the ISIC image_id of an image at the time of the study. There should be no duplications in the combination image_id & interaction_modality. As not every image was shown with every interaction modality, not every combination is present.
prob_m_dx_akiec, ...: m is "machine probabilities". Values are values after softmax, and "_mal" is all malignant classes summed.
prob_h_dx_akiec, ...: h is "human probabilities". Values are aggregated percentages of human ratings from past studies distinguishing between seven classes. Note there is no "prob_h_mal" as this was not one of the tested interaction modalities.
user_dx_without_interaction_akiec, ...: Number of participants choosing this diagnosis without interaction.
user_dx_with_interaction_akiec, ...: Number of participants choosing this diagnosis with interaction.
HAM10000_segmentations_lesion_tschandl.zip: To evaluate regions of CNN activations in Tschandl P. et al., Nature Medicine 2020 (please refer to this publication when using the data), a single dermatologist (Tschandl P) created binary segmentation masks for all 10015 images from the HAM10000 dataset. Masks were initialized with the segmentation network as described by Tschandl et al., Computers in Biology and Medicine 2019, and subsequently verified, corrected or replaced via the free-hand selection tool in FIJI.
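The description of the prob_m_dx_* columns (softmax outputs, with "_mal" summing the malignant classes) can be illustrated with a small sketch. This is not the study's code: the column name prob_m_mal is inferred from the "_mal" suffix mentioned above, and the choice of akiec, bcc and mel as the malignant classes is an assumption that the text does not state explicitly.

```python
import math

# The seven diagnostic class codes listed in the dataset description.
CLASSES = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]

# ASSUMPTION: the description does not list which classes count as malignant;
# akiec, bcc and mel are used here purely for illustration.
MALIGNANT = {"akiec", "bcc", "mel"}


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def machine_probability_row(logits):
    """Build prob_m_dx_* values plus a summed malignancy probability.

    The key "prob_m_mal" is a hypothetical name inferred from the "_mal"
    suffix in the column description.
    """
    probs = softmax(logits)
    row = {f"prob_m_dx_{name}": p for name, p in zip(CLASSES, probs)}
    row["prob_m_mal"] = sum(p for name, p in zip(CLASSES, probs) if name in MALIGNANT)
    return row


# Equal raw scores give equal class probabilities.
row = machine_probability_row([0.0] * 7)
```

With equal scores every class receives probability 1/7, so the summed value covers three of the seven classes under the assumption above.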
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The source dataset and its full description may be accessed through the Harvard Dataverse, and should be cited as
Tschandl, Philipp, 2018, "The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions", https://doi.org/10.7910/DVN/DBW86T, Harvard Dataverse, V4, UNF:6:KCZFcBLiFE5ObWcTc2ZBOA== [fileUNF]
Note that the dataset uploaded here does not contain all of the source material. It omits the file ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.tab, which contains data on a study involving human-computer collaboration, and the folder HAM10000_segmentations_lesion_tschandl, which contains binary segmentation masks of the training images. Still, in contrast to most of the HAM10000 datasets published on Kaggle, the current one includes the test dataset that was curated for the ISIC 2018 challenge (Task 3).
The uploaded dataset comprises 3 folders and 2 files, described in the table below.
| Content | Type | Description |
|---|---|---|
| HAM10000_images_part_1 | folder | Part 1 of a set of training pictures |
| HAM10000_images_part_2 | folder | Part 2 of a set of training pictures |
| ISIC2018_Task3_Test_Images | folder | Set of test pictures |
| HAM10000_metadata.csv | file | Metadata associated with the training data |
| ISIC2018_Task3_Test_GroundTruth.csv | file | Metadata associated with the test data |
The training dataset (HAM10000_images_part_1 and HAM10000_images_part_2) is called "HAM10000", meaning "Human Against Machine with 10000 training images" (actually 10015 images), and it is a large collection of multi-source dermatoscopic RGB images (JPG) of common pigmented skin lesions. The test dataset (ISIC2018_Task3_Test_Images) contains 1511 images. The files HAM10000_metadata.csv and ISIC2018_Task3_Test_GroundTruth.csv contain the respective metadata (data about the data), which include the labels and other features.
The structure of the metadata files follows the template presented in the table below.
| Column | Type | Description |
|---|---|---|
| lesion_id | String | ID of the lesion case |
| image_id | String | ID of an image (also the name of the respective JPG file) associated with that case |
| dx | String | Label of that case |
| dx_type | String | Method used for diagnosing that case |
| age | Float | Age of the person associated with that case |
| sex | String | Sex of the person associated with that case |
| localization | String | Location of the lesion on the person's body |
| dataset | String | Reference from which the data was taken |
dx column (the classes)
The values that the column dx may take are tabulated below.
| Value | Description |
|---|---|
| akiec | Actinic keratoses and intraepithelial carcinoma (also called "Bowen's disease") - an early form of skin cancer |
| bcc | Basal cell carcinoma - the most common type of skin cancer |
| bkl | Benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses) - common and benign |
| df | Dermatofibroma - common and benign |
| mel | Melanoma - a type of skin cancer involving the melanin cells |
| nv | Melanocytic nevus - the medical term for a mole (benign) |
| vasc | Vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage) (benign) |
dx_type column (the diagnosis methods)
The table below presents the values of the column dx_type.
| Value | Description |
|---|---|
| histo | Histopathology |
| follow_up | Follow-up examination |
| consensus | Expert consensus |
| confocal | In-vivo confocal microscopy |
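As a rough illustration of how the metadata layout above might be consumed, the sketch below groups image IDs by lesion_id and tallies the dx and dx_type columns using only the Python standard library. The sample rows (IDs, ages, localizations, dataset names) are hypothetical placeholders in the documented column layout, not real records; a real run would open HAM10000_metadata.csv instead of the in-memory string.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample rows in the documented column layout (not real records).
SAMPLE_CSV = """lesion_id,image_id,dx,dx_type,age,sex,localization,dataset
L0001,IMG_0001,bkl,histo,80.0,male,scalp,source_a
L0001,IMG_0002,bkl,histo,80.0,male,scalp,source_a
L0002,IMG_0003,nv,follow_up,45.0,female,back,source_b
L0003,IMG_0004,mel,histo,,female,trunk,source_a
"""


def summarize(handle):
    """Group image IDs by lesion and count the dx and dx_type values."""
    images_by_lesion = defaultdict(list)
    dx_counts = Counter()
    dx_type_counts = Counter()
    for row in csv.DictReader(handle):
        images_by_lesion[row["lesion_id"]].append(row["image_id"])
        dx_counts[row["dx"]] += 1
        dx_type_counts[row["dx_type"]] += 1
    return dict(images_by_lesion), dx_counts, dx_type_counts


images_by_lesion, dx_counts, dx_type_counts = summarize(io.StringIO(SAMPLE_CSV))

# Lesions photographed more than once share a lesion_id across several rows.
multi_image_lesions = {
    lesion: images for lesion, images in images_by_lesion.items() if len(images) > 1
}
```

Because one lesion can appear in several images, it may be worth splitting training and validation data by lesion_id rather than by image_id, so that images of the same lesion do not land on both sides of the split.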
We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables. We provide methods that link these two results, enabling both types of corrections to work simultaneously, and software that implements the methods developed.
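To make the sampling idea concrete, the sketch below shows one standard adjustment for logistic regression fitted on a sample that keeps all events and only a fraction of nonevents, often called prior correction: the slope estimates are left unchanged and only the intercept is shifted. Whether this matches the software mentioned above is an assumption; the function names, and the inputs tau (population event fraction) and ybar (sample event fraction), are illustrative choices.

```python
import math


def prior_corrected_intercept(b0_hat, tau, ybar):
    """Shift a logit intercept estimated on an event-enriched sample.

    b0_hat: intercept estimated on the sample
    tau:    assumed fraction of events in the population
    ybar:   fraction of events in the sample
    """
    if not (0.0 < tau < 1.0 and 0.0 < ybar < 1.0):
        raise ValueError("tau and ybar must lie strictly between 0 and 1")
    return b0_hat - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))


def event_probability(intercept, slopes, x):
    """Logistic probability for one observation with covariate values x."""
    z = intercept + sum(b * v for b, v in zip(slopes, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical numbers: a balanced sample (ybar = 0.5) drawn from a population
# in which 1% of observations are events (tau = 0.01).
corrected = prior_corrected_intercept(0.0, tau=0.01, ybar=0.5)
p_sample_scale = event_probability(0.0, [0.8], [1.0])
p_population_scale = event_probability(corrected, [0.8], [1.0])
```

When the sample event fraction equals the population fraction, the correction term is zero and the intercept is unchanged; the more the sample over-represents events, the further the intercept (and hence every predicted probability) is pulled down.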
The Global Roads Open Access Data Set, Version 1 (gROADSv1) was developed under the auspices of the CODATA Global Roads Data Development Task Group. The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. All country road networks have been joined topologically at the borders, and many countries have been edited for internal topology. Source data for each country are provided in the documentation, and users are encouraged to refer to the readme file for use constraints that apply to a small number of countries. Because the data are compiled from multiple sources, the date range for road network representations ranges from the 1980s to 2010 depending on the country (most countries have no confirmed date), and spatial accuracy varies. The baseline global data set was compiled by the Information Technology Outreach Services (ITOS) of the University of Georgia. Updated data for 27 countries and 6 smaller geographic entities were assembled by Columbia University's Center for International Earth Science Information Network (CIESIN), with a focus largely on developing countries with the poorest data coverage.
https://choosealicense.com/licenses/cc0-1.0/
The Caselaw Access Project
In collaboration with Ravel Law, Harvard Law Library digitized over 40 million U.S. court decisions consisting of 6.7 million cases from the last 360 years into a dataset that is widely accessible to use. Access a bulk download of the data through the Caselaw Access Project API (CAPAPI): https://case.law/caselaw/ Find more information about accessing state and federal written court decisions of common law through the bulk data service documentation here:… See the full description on the dataset page: https://huggingface.co/datasets/free-law/Caselaw_Access_Project.
We introduce a method for scaling two data sets from different sources. The proposed method estimates a latent factor common to both datasets as well as an idiosyncratic factor unique to each. In addition, it offers a flexible modeling strategy that permits the scaled locations to be a function of covariates, and efficient implementation allows for inference through resampling. A simulation study shows that our proposed method improves over existing alternatives in capturing the variation common to both datasets, as well as the latent factors specific to each. We apply our proposed method to vote and speech data from the 112th U.S. Senate. We recover a shared subspace that aligns with a standard ideological dimension running from liberals to conservatives while recovering the words most associated with each senator's location. In addition, we estimate a word-specific subspace that ranges from national security to budget concerns, and a vote-specific subspace with Tea Party senators on one extreme and senior committee leaders on the other.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is an iris dataset commonly used in machine learning. Accessed on 10-19-2020 from the following URL: http://faculty.smu.edu/tfomby/eco5385_eco6380/data/Iris.xls
This replication archive contains all data and code to replicate the results in "A Common-Space Scaling of the American Judiciary and Legal Profession" by Maya Sen and Adam Bonica. Abstract: We extend the scaling methodology previously used in Bonica (2014) to jointly scale the American federal judiciary and legal profession in a common-space with other political actors. The end result is the first data set of consistently measured ideological scores across all tiers of the federal judiciary and the legal profession, including 840 federal judges and 380,307 attorneys. To illustrate these measures, we present two examples involving the U.S. Supreme Court. These data open up significant areas of scholarly inquiry.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/2.2/customlicense?persistentId=doi:10.7910/DVN/RPATZA
The Pan Africa Bean Research Alliance (PABRA) is a network of national agricultural research centers (NARS) and private and public sector institutions that work to deliver better beans with consumer- and market-preferred traits to farmers. The datasets presented here draw from 17 Sub-Saharan countries that are members of PABRA. The dataset on released bean varieties is a collection of 513 bean varieties released by NARS and their characteristics. The dataset on bean varieties and the relationship to constraints classifies the 513 bean varieties on the basis of resistance to constraints such as fungal, bacterial, and viral diseases and tolerance to abiotic stresses. There is also a dataset of bean varieties that have been released in more than one country, useful for moving seed from one country to another and facilitating regional trade. The dataset on niche market traits provides the market-defined classifications for bean trade in Sub-Saharan Africa as well as varieties that fall into these classifications. The datasets are an update to the 2011 discussion of PABRA's achievement in breeding and delivery of bean varieties in Buruchara et al. 2011, pages 236 and 237, here: http://www.ajol.info/index.php/acsj/article/view/74168 . They are also an update to a follow-up to this discussion in Muthoni, R. A., Andrade, R. 2015 on the performance of bean improvement programmes in sub-Saharan Africa from the perspectives of varietal output and adoption, in chapter 8 here: http://dx.doi.org/10.1079/9781780644011.0148. The data is extracted from the PABRA M&E database available here (http://database.pabra-africa.org/?location=breeding).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
What explains the substantial variation in the International Monetary Fund's (IMF) lending policies over time and across cases? Some scholars argue that the IMF is the servant of the United States and other powerful member-states, while others contend that the Fund's professional staff acts independently in pursuit of its own bureaucratic interests. I argue that neither of these perspectives, on its own, fully and accurately explains IMF lending behavior. Rather, I propose a “common agency” theory of IMF policymaking, in which the Fund's largest shareholders—the G5 countries that exercise de facto control over the Executive Board (EB)—act collectively as its political principal. Using this framework, I argue that preference heterogeneity among G5 governments is a key determinant of variation in IMF loan size and conditionality. Under certain conditions, preference heterogeneity leads to either conflict or “logrolling” within the EB among the Fund's largest shareholders, while in others it creates scope for the IMF staff to exploit “agency slack” and increase its autonomy. Statistical analysis of an original data set of 197 nonconcessional IMF loans to 47 countries from 1984 to 2003 yields strong support for this framework and its empirical predictions. In clarifying the politics of IMF lending, the article sheds light on the merits of recent policy proposals to reform the Fund and its decision-making rules. More broadly, it furthers our understanding of delegation, agency, and the dynamics of policymaking within international organizations.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This study investigates the impact of consistency in supervisor behaviour styles on graduate students' perceived objectification and explores the mediating mechanisms involved. This study intends to use a multi-round data collection method to reduce common methodological bias.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Why do some conspiracy theories (CTs) remain popular and continue to spread on social media while others quickly fade away? Situating conspiracy theories within the literature on social movements, we propose and test a new theory of how enduring CTs maintain and regain popularity online. We test our theory using an original, hand-coded dataset of 5,794 tweets surrounding a divisive and regularly commemorated set of CTs in Poland. We find that CT tweets that cue in- and out-group threats garner more retweets and likes than CT tweets lacking this rhetoric. Surprisingly, given the extant literature on party leaders’ ability to shape political attitudes and behaviors, we find that ruling party tweets endorsing CTs gain less engagement than CT tweets from non-officials. Finally, when a CT’s main threat frames are referenced in current events, CTs regain popularity on social media. Given the centrality of CTs to populist rule, these results offer a new explanation for CT popularity, one focused on the conditions under which salient threat frames strongly resonate.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These are two pedigree-based data sets that were used to write a collaborative paper titled "Is Craniofacial Morphology and Body Composition Related by Common Genes: Comparative Analysis of Two Ethnically Diverse Populations".