10 datasets found
  1. Data from: Course-Skill Atlas: A national longitudinal dataset of skills...

    • figshare.com
    application/gzip
    Updated Oct 8, 2024
    + more versions
    Cite
    Alireza Javadian Sabet; Sarah H. Bana; Renzhe Yu; Morgan Frank (2024). Course-Skill Atlas: A national longitudinal dataset of skills taught in U.S. higher education curricula [Dataset]. http://doi.org/10.6084/m9.figshare.25632429.v7
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Oct 8, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alireza Javadian Sabet; Sarah H. Bana; Renzhe Yu; Morgan Frank
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Higher education plays a critical role in driving an innovative economy by equipping students with knowledge and skills demanded by the workforce. While researchers and practitioners have developed data systems to track detailed occupational skills, such as those established by the U.S. Department of Labor (DOL), much less effort has been made to document which of these skills are being developed in higher education at a similar granularity. Here, we fill this gap by presenting Course-Skill Atlas -- a longitudinal dataset of skills inferred from over three million course syllabi taught at nearly three thousand U.S. higher education institutions. To construct Course-Skill Atlas, we apply natural language processing to quantify the alignment between course syllabi and detailed workplace activities (DWAs) used by the DOL to describe occupations. We then aggregate these alignment scores to create skill profiles for institutions and academic majors. Our dataset offers a large-scale representation of college education's role in preparing students for the labor market. Overall, Course-Skill Atlas can enable new research on the source of skills in the context of workforce development and provide actionable insights for shaping the future of higher education to meet evolving labor demands, especially in the face of new technologies.
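    The syllabus-to-DWA alignment scoring described above can be sketched with a minimal example: represent a syllabus and each DWA as term-frequency vectors and score alignment with cosine similarity. This is an illustration only, not the authors' actual pipeline (which uses more sophisticated NLP); the syllabus text and DWA phrasings below are invented.

```python
import math
from collections import Counter

def tf_vector(text: str) -> Counter:
    """Lower-case, whitespace-tokenize, and count term frequencies."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical syllabus snippet and two invented DWA-style activity descriptions.
syllabus = "students will write python programs and analyze data with statistical methods"
dwa_programming = "write computer programs and analyze data"
dwa_welding = "operate welding equipment to join metal components"

score_prog = cosine_similarity(tf_vector(syllabus), tf_vector(dwa_programming))
score_weld = cosine_similarity(tf_vector(syllabus), tf_vector(dwa_welding))
# The programming activity aligns far better with this syllabus than the welding one.
```

    Aggregating such scores over all courses at an institution or within a major would then yield the skill profiles the dataset describes.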

  2. 2019 Kaggle Machine Learning & Data Science Survey

    • kaggle.com
    Updated Dec 22, 2020
    Cite
    EK (2020). 2019 Kaggle Machine Learning & Data Science Survey [Dataset]. https://www.kaggle.com/eswarankrishnasamy/2019-kaggle-machine-learning-data-science-survey/notebooks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 22, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    EK
    Description

    Overview Welcome to Kaggle's third annual Machine Learning and Data Science Survey ― and our second-ever survey data challenge. You can read our executive summary here.

    This year, as in 2017 and 2018, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for three weeks in October, and after cleaning the data we finished with 19,717 responses!

    There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.

    Challenge This year Kaggle is launching the second annual Data Science Survey Challenge, where we will be awarding a prize pool of $30,000 to notebook authors who tell a rich story about a subset of the data science and machine learning community.

    In our third year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities comprised within the survey. For that reason, we’re inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.

    The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

    Submissions will be evaluated on the following:

    • Composition: Is there a clear narrative thread to the story that's articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations.
    • Originality: Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought provoking, and fresh all at the same time.
    • Documentation: Are your code, notebook, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high-quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.

    To be valid, a submission must be contained in one notebook, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.

    How to Participate To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.

    No submission is necessary for the Weekly Notebook Award. To be eligible, a notebook must be public and use the 2019 Data Science Survey as a data source.

    Submission deadline: 11:59PM UTC, December 2nd, 2019.

    Survey Methodology This survey received 19,717 usable responses from 171 countries and territories. If a country or territory received fewer than 50 respondents, we grouped them into a group named “Other” for anonymity.

    We excluded respondents who were flagged by our survey system as “Spam”.

    Most of our respondents were found primarily through Kaggle channels, like our email list, discussion forums and social media channels.

    The survey was live from October 8th to October 28th. We allowed respondents to complete the survey at any time during that window. The median response time for those who participated in the survey was approximately 10 minutes.

    Not every question was shown to every respondent. You can learn more about the different segments we used in the survey_schema.csv file. In general, respondents with more experience were asked more questions and respondents with less experience were asked fewer questions.

    To protect the respondents’ identity, the answers to multiple choice questions have been separated into a separate data file from the open-ended responses. We do not provide a key to match up the multiple choice and free form responses. Further, the free form responses have been randomized column-wise such that the responses that appear on the same row did not necessarily come from the same survey-taker.

    Multiple choice single response questions fit into individual columns whereas multiple choice multiple response questions were split into multiple columns. Text responses were encoded to protect user privacy and countries with fewer than 50 respondents were grouped into the category "other".
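    The column layout described above can be worked with directly. The sketch below tallies a multiple-response question from a miniature CSV laid out in the survey's QN_Part_M style; the question number, options, and rows here are invented for illustration, not taken from the actual survey files.

```python
import csv, io
from collections import Counter

# Invented miniature of the multiple-choice file: the single-response question (Q1)
# occupies one column, while the multi-response question is split across
# Q18_Part_1..Q18_Part_3, each holding the selected option or an empty string.
raw = """Q1,Q18_Part_1,Q18_Part_2,Q18_Part_3
25-29,Python,,SQL
30-34,Python,R,
22-24,Python,,SQL
"""

counts = Counter()
for row in csv.DictReader(io.StringIO(raw)):
    for col, val in row.items():
        if col.startswith("Q18_Part_") and val:
            counts[val] += 1

# counts now tallies how often each option of the multi-response question was selected.
```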

    Data has been released under a CC BY 2.0 license: https://creativecommons.org/licenses/by/2.0/

  3. US Dept of Education: College Scorecard

    • kaggle.com
    zip
    Updated Nov 9, 2017
    Cite
    Kaggle (2017). US Dept of Education: College Scorecard [Dataset]. https://www.kaggle.com/kaggle/college-scorecard
    Explore at:
    Available download formats: zip (589617678 bytes)
    Dataset updated
    Nov 9, 2017
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    It's no secret that US university students often graduate with debt repayment obligations that far outstrip their employment and income prospects. While it's understood that students from elite colleges tend to earn more than graduates from less prestigious universities, the finer relationships between future income and university attendance are quite murky. In an effort to make educational investments less speculative, the US Department of Education has matched information from the student financial aid system with federal tax returns to create the College Scorecard dataset.

    Kaggle is hosting the College Scorecard dataset in order to facilitate shared learning and collaboration. Insights from this dataset can help make the returns on higher education more transparent and, in turn, more fair.

    Data Description

    Here's a script showing an exploratory overview of some of the data.

    college-scorecard-release-*.zip contains a compressed version of the same data available through Kaggle Scripts.

    It consists of three components:

    • All the raw data files released in version 1.40 of the college scorecard data
    • Scorecard.csv, a single CSV file with all years' data combined. In it, we've converted categorical variables represented by integer keys in the original data to their labels and added a Year column
    • database.sqlite, a SQLite database containing a single Scorecard table that contains the same information as Scorecard.csv
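    As a sketch of how the SQLite component can be used, the snippet below builds a tiny in-memory stand-in for the Scorecard table with Python's standard sqlite3 module. The rows and the UGDS enrollment column are illustrative assumptions; consult the dataset's documentation for actual column names.

```python
import sqlite3

# In-memory stand-in for database.sqlite; the real file ships a single
# Scorecard table combining all years, with a Year column added.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Scorecard (INSTNM TEXT, Year INTEGER, UGDS INTEGER)")
conn.executemany(
    "INSERT INTO Scorecard VALUES (?, ?, ?)",
    [
        ("Example State University", 2012, 18000),  # invented rows
        ("Example State University", 2013, 18500),
        ("Example College", 2013, 2400),
    ],
)

# Typical use: pull one year's slice out of the combined multi-year table.
rows = conn.execute(
    "SELECT INSTNM, UGDS FROM Scorecard WHERE Year = ? ORDER BY UGDS DESC", (2013,)
).fetchall()
```

    The same query runs unchanged against the shipped database.sqlite once the connection points at that file.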

    New to data exploration in R? Take the free, interactive DataCamp course, "Data Exploration With Kaggle Scripts," to learn the basics of visualizing data with ggplot. You'll also create your first Kaggle Scripts along the way.

  4. INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

    • zenodo.org
    • data.niaid.nih.gov
    pdf
    Updated Jul 19, 2024
    Cite
    Nafiz Sadman; Nishat Anjum; Kishor Datta Gupta (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. http://doi.org/10.5281/zenodo.4047648
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nafiz Sadman; Nishat Anjum; Kishor Datta Gupta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh, United States
    Description

    Introduction

    There are several works based on Natural Language Processing on newspaper reports. Mining opinions from headlines [ 1 ] using Stanford NLP and SVM by Rameshbhai et al. compared several algorithms on a small and a large dataset. Rubin et al., in their paper [ 2 ], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types. The purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. in [ 3 ] implemented LDA, a topic modeling approach, to study bias present in online news media.

    However, there is not much NLP research on COVID-19. Most applications involve classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [ 4 ], a consequence of the virus. Other research areas include studying the genome sequence of the virus [ 5 ][ 6 ][ 7 ] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [ 8 ] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [ 9 ]. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.

    2 Data-set Introduction

    2.1 Data Collection

    We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We have named this collection “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named it “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are among the top providers and most read in their respective countries. The collection was done manually by 10 human data collectors of age group 23- with university degrees. This approach was preferable to automation to ensure the news was highly relevant to the subject: the newspaper websites had dynamic content with advertisements in no particular order, so automated scrapers risked collecting inaccurate news reports. One challenge while collecting the data was the requirement of subscriptions; each newspaper charged $1 per subscription. Some criteria for collecting the news reports, provided as guidelines to the human data collectors, were as follows:

    • The headline must have one or more words directly or indirectly related to COVID-19.
    • The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.
    • The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are prioritized.
    • Avoid taking duplicate reports.
    • Maintain a time frame for the above mentioned newspapers.

    To collect these data we used a Google Form for the USA and BD. Two human editors went through each entry to check for any spam or troll entries.

    2.2 Data Pre-processing and Statistics

    Some pre-processing steps performed on the newspaper report dataset are as follows:

    • Remove hyperlinks.
    • Remove non-English alphanumeric characters.
    • Remove stop words.
    • Lemmatize text.

    While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could cause the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for the presence of the above-mentioned criteria.
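    A minimal version of the pre-processing steps above might look like the following sketch. The stop-word list and the suffix-stripping "lemmatizer" are simplified stand-ins, since the dataset description does not specify the exact tooling used.

```python
import re

# Abbreviated stand-in stop-word list; a real pipeline would use a full one.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "to", "in", "and"}

def strip_suffix(word: str) -> str:
    """Crude stand-in for lemmatization: strip common inflectional suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    text = re.sub(r"https?://\S+", " ", text)             # remove hyperlinks
    text = re.sub(r"[^A-Za-z\s]", " ", text)              # remove non-English alphanumeric characters
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]  # remove stop words
    return [strip_suffix(t) for t in tokens]              # "lemmatize"

cleaned = preprocess("The hospitals were reporting 120 deaths; see https://example.com/report")
```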

    The primary data statistics of the two datasets are shown in Tables 1 and 2.

    Table 1: Covid-News-USA-NNK data statistics

    • No. of words per headline: 7 to 20
    • No. of words per body content: 150 to 2100

    Table 2: Covid-News-BD-NNK data statistics

    • No. of words per headline: 10 to 20
    • No. of words per body content: 100 to 1500

    2.3 Dataset Repository

    We used GitHub as our primary data repository, under the account name NKK^1. There, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We also provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.

    3 Literature Review

    Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods, like one-hot encoding and word embeddings, that transform text into a machine-readable representation which can be fed to machine learning and deep learning algorithms.

    Some well-known applications of NLP include fraud detection on online media sites [ 10 ], authorship attribution in fallback authentication systems [ 11 ], intelligent conversational agents or chatbots [ 12 ], and the machine translation used by Google Translate [ 13 ]. While these are all downstream tasks, several exciting developments have been made in algorithms solely for Natural Language Processing itself. The two most trending are BERT [ 14 ], which uses the bidirectional encoder of the Transformer architecture and achieves near-perfect results on classification and masked-word prediction tasks, and the GPT-3 models released by OpenAI [ 15 ], which can generate almost human-like text. However, these are all used as pre-trained models, since they carry a huge computation cost. Information extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could mean retrieving vital feature spaces or targeted portions of an image; information extraction from speech could mean retrieving information about names, places, etc. [ 16 ]. Information extraction from text could mean identifying named entities, locations, or other essential data. Topic modeling is a sub-task of NLP and also a process of information extraction: it clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic model is Latent Dirichlet Allocation, or LDA [17].
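    As an illustration of the LDA topic modeling mentioned above, here is a minimal sketch using scikit-learn, one possible implementation (not necessarily the one used in the cited work); the toy corpus is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented toy corpus with two rough themes: health news vs. market news.
docs = [
    "virus outbreak hospital patients vaccine",
    "hospital patients virus quarantine vaccine",
    "stock market bank economy prices",
    "bank economy market prices trading",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)                # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                 # per-document topic mixtures

# Each row of doc_topics sums to 1; each row of lda.components_ gives
# (unnormalized) word weights for one topic over the learned vocabulary.
```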

    Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [ 18 ] is an efficient keyword extraction technique that uses a graph to calculate a weight for each word and picks the words with the highest weights.
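    The graph-based weighting TextRank performs can be sketched in plain Python: connect words that co-occur within a small window and run power-iteration PageRank on the resulting graph. This is a simplified illustration of the idea in [18], not the full algorithm (which also filters by part of speech and collapses adjacent keywords into phrases); the token stream is invented.

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iters=50):
    """Simplified TextRank: words co-occurring within `window` positions share
    an edge; scores come from power-iteration PageRank on that undirected graph."""
    neighbors = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[i] != words[j]:
                neighbors[words[i]].add(words[j])
                neighbors[words[j]].add(words[i])
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        scores = {
            w: (1 - damping) + damping * sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            for w in scores
        }
    return sorted(scores, key=scores.get, reverse=True)

# Invented headline-like token stream; "virus" co-occurs with the most words.
tokens = "virus outbreak virus spread city virus vaccine".split()
ranked = textrank_keywords(tokens)
```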

    Word clouds are a great visualization technique for understanding the overall 'talk of the topic'. The clustered words give us a quick understanding of the content.

    4 Our experiments and Result analysis

    We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month, from February to May. From Figures 1, 2, and 3, we can note the following:

    • In February, both newspapers talked about China and the source of the outbreak.
    • StarTribune emphasized Minnesota as the most concerned state; in April, it seemed even more concerned.
    • Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, and markets.
    • The Washington Post discussed global issues more than StarTribune.
    • StarTribune in February mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.
    • While both newspapers mentioned the outbreak in China in February, the spread in the United States is more heavily highlighted from March through May, displaying the critical impact caused by the virus.

    We used a script to extract all numbers related to certain keywords like ’Deaths’, ’Infected’, ’Died’, ’Infections’, ’Quarantined’, ’Lock-down’, ’Diagnosed’, etc. from the news reports and created a case-count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction, we can observe that April was the peak month for COVID cases, as counts gradually rose from February. Both newspapers clearly show that the rise in cases from February to March was slower than the rise from March to April, an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack.

    We used VADER sentiment analysis to extract the sentiment of the headlines and bodies. On average, the sentiments ranged from -0.5 to -0.9. The VADER sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive, from which we can learn more about the indicators related to COVID-19 and its serious impact. Moreover, sentiment analysis can also provide information about how a state or country is reacting to the pandemic. We used the PageRank algorithm to extract

  5. English Poor Law Cases, 1690-1815

    • datacatalogue.cessda.eu
    • beta.ukdataservice.ac.uk
    Updated Mar 26, 2025
    Cite
    Deakin, S; Shuku, L; Cheok, V (2025). English Poor Law Cases, 1690-1815 [Dataset]. http://doi.org/10.5255/UKDA-SN-856924
    Explore at:
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    University of Cambridge
    Authors
    Deakin, S; Shuku, L; Cheok, V
    Time period covered
    Jan 1, 2020 - Jan 31, 2023
    Area covered
    United Kingdom
    Variables measured
    Text unit
    Measurement technique
    The cases were sourced from original texts of legal judgments. A text file was first created for each judgment, and a separate Word file was then created. The Word files were annotated for subsequent use in computational analysis. In the current dataset the cases are ordered alphabetically in a single Word document. The annotations (colour coding: yellow for words of interest and green for certain longer phrases) have been retained.
    Description

    This dataset of historical poor law cases was created as part of a project aiming to assess the implications of the introduction of Artificial Intelligence (AI) into legal systems in Japan and the United Kingdom. The project was jointly funded by the UK’s Economic and Social Research Council, part of UKRI, and the Japanese Society and Technology Agency (JST), and involved collaboration between Cambridge University (the Centre for Business Research, Department of Computer Science and Faculty of Law) and Hitotsubashi University, Tokyo (the Graduate Schools of Law and Business Administration). As part of the project, a dataset of historic poor law cases was created to facilitate the analysis of legal texts using natural language processing methods. The dataset contains judgments of cases which have been annotated to facilitate computational analysis. Specifically, they make it possible to see how legal terms have evolved over time in the area of disputes over the law governing settlement by hiring.

    A World Economic Forum meeting at Davos 2019 heralded the dawn of 'Society 5.0' in Japan. Its goal: creating a 'human-centred society that balances economic advancement with the resolution of social problems by a system that highly integrates cyberspace and physical space.' Using Artificial Intelligence (AI), robotics and data, 'Society 5.0' proposes to '...enable the provision of only those products and services that are needed to the people that need them at the time they are needed, thereby optimizing the entire social and organizational system.' The Japanese government accepts that realising this vision 'will not be without its difficulties,' but intends 'to face them head-on with the aim of being the first in the world as a country facing challenging issues to present a model future society.' The UK government is similarly committed to investing in AI and likewise views the AI as central to engineering a more profitable economy and prosperous society.

    This vision is, however, starting to crystallise in the rhetoric of LegalTech developers who have the data-intensive, and thus target-rich, environment of law in their sights. Buoyed by investment and claims of superior decision-making capabilities over human lawyers and judges, LegalTech is now being deputised to usher in a new era of 'smart' law built on AI and Big Data. While there are a number of bold claims made about the capabilities of these technologies, comparatively little attention has been directed to more fundamental questions about how we might assess the feasibility of using them to replicate core aspects of legal process, and to ensuring that the public has a meaningful say in their development and implementation.

    This innovative and timely research project intends to approach these questions from a number of vectors. At a theoretical level, we consider the likely consequences of this step using a Horizon Scanning methodology developed in collaboration with our Japanese partners and an innovative systemic-evolutionary model of law. Many aspects of legal reasoning have algorithmic features which could lend themselves to automation. However, an evolutionary perspective also points to features of legal reasoning which are inconsistent with ML: including the reflexivity of legal knowledge and the incompleteness of legal rules at the point where they encounter the 'chaotic' and unstructured data generated by other social sub-systems. We will test our theory by developing a hierarchical model (or ontology), derived from our legal expertise and publicly available datasets, for classifying employment relationships under UK law. This will let us probe the extent to which legal reasoning can be modelled using less computationally intensive methods such as Markov models and Monte Carlo trees.

    Building upon these theoretical innovations, we will then turn our attention from modelling a legal domain using historical data to exploring whether the outcome of legal cases can be reliably predicted using various techniques for optimising datasets. For this we will use a dataset comprising 24,179 cases from the High Court of England and Wales. This will allow us to harness Natural Language Processing (NLP) techniques such as named entity recognition (to identify relevant parties) and sentiment analysis (to analyse opinions and determine the disposition of a party), in addition to identifying the main legal and factual points of the dispute, remedies, costs, and trial durations. By trialling various predictive heuristics and ML techniques against this dataset we hope to develop a more granular understanding of the feasibility of predicting dispute outcomes and insight into what factors are relevant for legal decision-making. This will allow us to then undertake a comparative analysis with the results of existing studies and shed light on the legal contexts and questions where AI can and cannot be used to produce accurate and repeatable results.

  6. Awards -- Master's Degree Programs at Predominantly Black Institutions

    • data.amerigeoss.org
    • catalog.data.gov
    doc
    Updated Oct 19, 2021
    + more versions
    Cite
    AmeriGEOSS Dev (2021). Awards -- Master's Degree Programs at Predominantly Black Institutions [Dataset]. https://data.amerigeoss.org/ro/dataset/awards-masters-degree-programs-at-predominantly-black-institutions
    Explore at:
    docAvailable download formats
    Dataset updated
    Oct 19, 2021
    Dataset provided by
    AmeriGEOSS Dev
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the enactment of the Higher Education Opportunity Act (HEOA) of 2008, five Predominantly Black Institutions are eligible to receive funding to improve graduate education opportunities at the master’s level in mathematics, engineering, physical or natural sciences, computer science, information technology, nursing, allied health or other scientific disciplines where African American students are underrepresented.

    Types of Projects Institutions may use federal funds for activities that include:

    • Purchase, rental, or lease of scientific or laboratory equipment for educational purposes, including instructional and research purposes;
    • Construction, maintenance, renovation, and improvement of classroom, library, laboratory, and other instructional facilities, including purchase or rental of telecommunications technology equipment or services;
    • Purchase of library books, periodicals, technical and other scientific journals, microfilm, microfiche, and other educational materials, including telecommunications program materials;
    • Scholarships, fellowships, and other financial assistance for needy graduate students to permit the enrollment of students in, and completion of, a master’s degree in mathematics, engineering, physical or natural sciences, computer science, information technology, nursing, allied health, or other scientific disciplines in which African Americans are underrepresented;
    • Establishing or improving a development office to strengthen and increase contributions from alumni and the private sector;
    • Assisting in the establishment or maintenance of an institutional endowment to facilitate financial independence pursuant to Section 331;
    • Funds and administrative management, and the acquisition of equipment, including software, for use in strengthening funds management and management information systems;
    • Acquisition of real property adjacent to the campus in connection with the construction, renovation, or improvement of, or an addition to, campus facilities;
    • Education or financial information designed to improve the financial literacy and economic literacy of students or the students’ families, especially with regard to student indebtedness and student assistance programs under Title IV;
    • Tutoring, counseling, and student service programs designed to improve academic success;
    • Faculty professional development, faculty exchanges, and faculty participation in professional conferences and meetings; and
    • Other activities proposed in the application that are approved by the Secretary as part of the review and acceptance of such application.

  7. #RoeOverturned: Twitter Dataset on the Abortion Rights Controversy

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Rao, Ashwin; Rong-Ching Chang; Qiankun Zhong; Magdalena Wojcieszak; Kristina Lerman (2023). #RoeOverturned: Twitter Dataset on the Abortion Rights Controversy [Dataset]. https://search.dataone.org/view/sha256%3A7705ec5fc680ea4ffc251c29835603dd922a3be5acef45c560d0ab5c5def4a59
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Rao, Ashwin; Rong-Ching Chang; Qiankun Zhong; Magdalena Wojcieszak; Kristina Lerman
    Description

    On June 24, 2022, the United States Supreme Court overturned landmark rulings made in its 1973 verdict in Roe v. Wade. The justices, by way of a majority vote in Dobbs v. Jackson Women's Health Organization, decided that abortion wasn't a constitutional right and returned the issue of abortion to the elected representatives. This decision triggered multiple protests and debates across the US, especially in the context of the midterm elections in November 2022. Given that many citizens use social media platforms to express their views and mobilize for collective action, and given that online debate has tangible effects on public opinion, political participation, news media coverage, and political decision-making, it is crucial to understand online discussions surrounding this topic. Toward this end, we present the first large-scale Twitter dataset collected on the abortion rights debate in the United States: a set of 74M tweets systematically collected over the course of one year, from January 1, 2022 to January 6, 2023.

  8.

    GraphXAI

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Queen, Owen (2023). GraphXAI [Dataset]. http://doi.org/10.7910/DVN/KULOS8
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Queen, Owen
    Description

    As post hoc explanations are increasingly used to understand the behavior of Graph Neural Networks (GNNs), it becomes crucial to evaluate the quality and reliability of GNN explanations. However, assessing the quality of GNN explanations is challenging because existing graph datasets have no, or unreliable, ground-truth explanations for a given task. Here, we introduce a synthetic graph data generator, ShapeGGen, which can generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. Further, the flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows us to mimic the data generated by various real-world applications. We include ShapeGGen and additional XAI-ready real-world graph datasets in an open-source graph explainability library, GraphXAI. In addition, GraphXAI provides a broader ecosystem of data loaders, data processing functions, synthetic and real-world graph datasets with ground-truth explanations, visualizers, GNN model implementations, and a set of evaluation metrics to benchmark the performance of any given GNN explainer.
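
    Ground-truth explanations make quantitative benchmarking possible: an explainer's predicted important-node mask can be compared directly against the known ground-truth mask. The sketch below shows one common set-overlap (Jaccard) comparison as a minimal illustration of that idea; it is not GraphXAI's actual metric implementation.

```python
def explanation_jaccard(pred_mask, true_mask):
    """Jaccard overlap between a predicted and a ground-truth node-importance mask.

    Masks are iterables of 0/1 flags, one per node. This is a minimal sketch of
    the kind of ground-truth comparison such benchmarks enable, not GraphXAI's
    actual evaluation code.
    """
    pred = {i for i, v in enumerate(pred_mask) if v}
    true = {i for i, v in enumerate(true_mask) if v}
    if not pred and not true:
        return 1.0  # both masks empty: perfect (vacuous) agreement
    return len(pred & true) / len(pred | true)
```

    A score of 1.0 means the explainer highlighted exactly the ground-truth nodes; 0.0 means no overlap at all.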

  9.

    Event construal and temporal distance in natural language - Dataset - B2FIND...

    • b2find.dkrz.de
    Updated Apr 12, 2023
    + more versions
    Cite
    (2023). Event construal and temporal distance in natural language - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/bf45ce3f-b925-5b81-b0a3-cba9f1bee2b3
    Explore at:
    Dataset updated
    Apr 12, 2023
    Description

    Construal level theory (CLT) proposes that events that are temporally proximate are represented more concretely than events that are temporally distant. We tested this prediction using two large natural language text corpora. In study 1 we examined posts on Twitter that referenced the future, and found that tweets mentioning temporally proximate dates used more concrete words than those mentioning distant dates. In study 2 we obtained all New York Times articles that referenced U.S. presidential elections between 1987 and 2007. We found that the concreteness of the words in these articles increased with temporal proximity to the corresponding election. Additionally, the reduction in concreteness after the election was much greater than the increase in concreteness leading up to the election, though both changes in concreteness were well described by an exponential function. We replicated this finding with New York Times articles referencing US public holidays. Overall, our results provide strong support for the predictions of construal level theory, and additionally illustrate how large natural language datasets can be used to inform psychological theory.

    This network project brings together economists, psychologists, computer and complexity scientists from three leading centres for behavioural social science at Nottingham, Warwick and UEA. This group will lead a research programme with two broad objectives: to develop and test cross-disciplinary models of human behaviour and behaviour change, and to draw out their implications for the formulation and evaluation of public policy. Foundational research will focus on three inter-related themes: understanding individual behaviour and behaviour change; understanding social and interactive behaviour; rethinking the foundations of policy analysis. The project will explore implications of the basic science for policy via a series of applied projects connecting naturally with the three themes. These will include: the determinants of consumer credit behaviour; the formation of social values; strategies for evaluation of policies affecting health and safety. The research will integrate theoretical perspectives from multiple disciplines and utilise a wide range of complementary methodologies including: theoretical modeling of individuals, groups and complex systems; conceptual analysis; lab and field experiments; analysis of large data sets. The Network will promote high quality cross-disciplinary research and serve as a policy forum for understanding behaviour and behaviour change.

    Experimental data. In study 1, we collected and analyzed millions of time-indexed posts on Twitter. We obtained a large number of tweets that referenced dates in the future, and used these tweets to determine the concreteness of the language describing events at those dates. This allowed us to observe how psychological distance influences everyday discourse, and to put the key assumptions of CLT to a real-world test. In study 2, we analyzed word concreteness in news articles using the New York Times (NYT) Annotated Corpus (Sandhaus, 2008). This corpus contains over 1.8 million NYT articles written between 1987 and 2007. Importantly for our purposes, these articles are tagged with keywords describing their topics. We obtained all NYT articles written before and after the 1988, 1992, 1996, 2000, and 2004 US presidential elections that were tagged as pertaining to these elections, and tested how the concreteness of the words used in the articles varied as a function of temporal distance to the election they reference. We also performed this analysis with NYT articles referencing three popular public holidays. Unlike study 1 and prior work (such as Snefjella & Kuperman, 2015), study 2 allowed us to examine the influence of temporal distance in the past and in the future, while controlling for the exact time when specific events occurred.
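
    Analyses like these typically score a text by averaging per-word concreteness ratings from a published lexicon of human norms. The sketch below shows that scoring step with a tiny made-up lexicon; the words and ratings are illustrative placeholders, not the norms the study actually used.

```python
# Toy concreteness lexicon; real studies use large published norms
# (typically human ratings on a 1-5 scale), which we only mimic here.
CONCRETENESS = {"table": 5.0, "vote": 3.1, "freedom": 1.8, "ballot": 4.6}

def mean_concreteness(text):
    """Average concreteness of the rated words in a text; None if no word is rated."""
    scores = [CONCRETENESS[w] for w in text.lower().split() if w in CONCRETENESS]
    return sum(scores) / len(scores) if scores else None
```

    Comparing this average across texts binned by temporal distance to the referenced event is the core measurement behind both studies.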

  10.

    Cross-language corpora of privacy policies

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 17, 2023
    Cite
    Ciclosi, Francesco (2023). Cross-language corpora of privacy policies [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7729545
    Explore at:
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Vidor, Silvia
    Massacci, Fabio
    Ciclosi, Francesco
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of three privacy policy corpora (in English and Italian) comprising 81 unique privacy policy texts spanning the period 2018-2021. The first corpus is the original English-language corpus used in the study by Tang et al. [2]. The other two are cross-language corpora built from the first: a source corpus in English, and a replication corpus in Italian, the language of a potential replication study.

    The policies were collected from:

    the Alexa top 10 websites for Italy and the U.S.;

    the top-ranked apps in the "most profitable games" category of the Play Store for Italy and the U.S.

    We manually analyzed the Alexa top 10 Italy websites as of November 2021. Analogously, we analyzed the apps that ranked highest in the "most profitable games" category of the Play Store for Italy in the same period.

    All the privacy policies are ANSI-encoded text files and have been manually read and verified. The dataset is helpful as a starting point for building comparable cross-language privacy policy corpora, whose availability helps researchers replicate studies in different languages. Details on the methodology can be found in the accompanying paper.

    The available files are as follows:

    policies-texts.zip --> contains a directory of text files with the policy texts. File names are the SHA1 hashes of the policy text.

    policy-metadata.csv --> contains a CSV file with the metadata for each privacy policy.
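
    Content-addressed naming of this kind is easy to reproduce: each file is named by the SHA1 hex digest of its text. The sketch below mirrors that convention; the exact normalization and encoding the authors hashed (assumed here to be the UTF-8 bytes of the raw text) is an assumption.

```python
import hashlib

def policy_filename(policy_text: str) -> str:
    """Name a policy file by the SHA1 hex digest of its text.

    Mirrors the corpus convention; hashing the UTF-8 bytes of the raw
    text is an assumption, not confirmed by the dataset description.
    """
    return hashlib.sha1(policy_text.encode("utf-8")).hexdigest() + ".txt"
```

    Hash-based names make deduplication trivial: identical policy texts collapse to the same filename regardless of where they were collected.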

    This dataset is the original dataset used in the publication [1]. The original English U.S. corpus is described in the publication [2].

    [1] F. Ciclosi, S. Vidor and F. Massacci. "Building cross-language corpora for human understanding of privacy policies." Workshop on Digital Sovereignty in Cyber Security: New Challenges in Future Vision. Communications in Computer and Information Science. Springer International Publishing, 2023, In press.

    [2] J. Tang, H. Shoemaker, A. Lerner, and E. Birrell. Defining Privacy: How Users Interpret Technical Terms in Privacy Policies. Proceedings on Privacy Enhancing Technologies, 3:70–94, 2021.

