https://datacatalog.worldbank.org/public-licenses?fragment=cc
This dataset contains metadata (title, abstract, date of publication, field, etc.) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.
Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF or LaTeX file is needed to extract key information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the years 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. The included fields are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.
The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.
To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.
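A minimal sketch of the regular-expression approach, assuming a small illustrative country list (the real search covers the full ISO 3166 set of country names):

```python
import re

# Illustrative subset of ISO 3166 country names; the actual search
# compiles a pattern for every country.
COUNTRIES = ["Kenya", "Brazil", "Viet Nam", "United States"]

# Word-boundary patterns avoid false hits inside longer words.
PATTERNS = {c: re.compile(r"\b" + re.escape(c) + r"\b", re.IGNORECASE)
            for c in COUNTRIES}

def countries_mentioned(text):
    """Return the set of country names found in a title/abstract string."""
    return {c for c, pat in PATTERNS.items() if pat.search(text)}

print(countries_mentioned("Maize yields and rainfall variability in Kenya"))
```

As the text notes, this approach misses non-standard spellings ("Viet Nam" vs. "Vietnam"), which motivates the second approach.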
The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
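The either-or linking rule amounts to a set union over the two detectors. The detections below are hypothetical stand-ins for the regex and spaCy NER outputs:

```python
# Hypothetical detections for one article: one set from the regex search,
# one from spaCy NER (GPE entities mapped to country names).
regex_hits = {"Kenya"}
ner_hits = {"Kenya", "Tanzania"}  # NER catches a spelling the regex list missed

# A country is linked to the article if either approach identifies it,
# so one article can be linked to more than one country.
linked_countries = regex_hits | ner_hits
print(sorted(linked_countries))  # ['Kenya', 'Tanzania']
```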
The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, in which 3,500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:
Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. Identifying whether an academic article is using data from any country.
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. For example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.
The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming $3 per article, the rate paid to MTurk workers).
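The stated cost and time figures follow from simple arithmetic (the "50 years" is continuous elapsed time, not 40-hour work weeks):

```python
ARTICLES = 1_037_748
MINUTES_PER_ARTICLE = 25.4  # median MTurk review time
COST_PER_ARTICLE = 3        # dollars paid per article

total_cost = ARTICLES * COST_PER_ARTICLE       # dollars
total_minutes = ARTICLES * MINUTES_PER_ARTICLE
years = total_minutes / (60 * 24 * 365.25)     # continuous (calendar) time

print(total_cost, round(years, 1))  # 3113244 50.1
```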
A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model that classifies articles based on the labeled data. Of the 3,500 articles hand-coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of “uses data” was assigned if the model predicted an article used data with at least 90% confidence.
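The final decision rule is a simple probability threshold. In the sketch below the predicted probabilities are hypothetical stand-ins for the classifier's outputs:

```python
# Hypothetical predicted probabilities of "uses data" for five articles.
probs = [0.97, 0.42, 0.91, 0.89, 0.12]

THRESHOLD = 0.90  # "uses data" is assigned only at >= 90% confidence

labels = ["uses data" if p >= THRESHOLD else "no data" for p in probs]
print(labels)  # ['uses data', 'no data', 'uses data', 'no data', 'no data']
```

Note that an article at 89% predicted confidence is not labeled as using data, so the threshold trades recall for precision.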
The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.
The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of
The Michigan Public Policy Survey (MPPS) is a program of state-wide surveys of local government leaders in Michigan. The MPPS is designed to fill an important information gap in the policymaking process. While there are ongoing surveys of the business community and of the citizens of Michigan, before the MPPS there were no ongoing surveys of local government officials that were representative of all general purpose local governments in the state. Therefore, while we knew the policy priorities and views of the state's businesses and citizens, we knew very little about the views of the local officials who are so important to the economies and community life throughout Michigan. The MPPS was launched in 2009 by the Center for Local, State, and Urban Policy (CLOSUP) at the University of Michigan and is conducted in partnership with the Michigan Association of Counties, Michigan Municipal League, and Michigan Townships Association. The associations provide CLOSUP with contact information for the survey's respondents, and consult on survey topics. CLOSUP makes all decisions on survey design, data analysis, and reporting, and receives no funding support from the associations. The surveys investigate local officials' opinions and perspectives on a variety of important public policy issues and solicit factual information about their localities relevant to policymaking. Over time, the program has covered issues such as fiscal, budgetary and operational policy, fiscal health, public sector compensation, workforce development, local-state governmental relations, intergovernmental collaboration, economic development strategies and initiatives such as placemaking and economic gardening, the role of local government in environmental sustainability, energy topics such as hydraulic fracturing ("fracking") and wind power, trust in government, views on state policymaker performance, opinions on the impacts of the Federal Stimulus Program (ARRA), and more.
The program will investigate many other issues relevant to local and state policy in the future. A searchable database of every question the MPPS has asked is available on CLOSUP's website. Results of MPPS surveys are currently available as reports, and via online data tables. The MPPS datasets are being released in two forms: public-use datasets and restricted-use datasets. Unlike the public-use datasets, the restricted-use datasets represent full MPPS survey waves, and include all of the survey questions from a wave. Restricted-use datasets also allow for multiple waves to be linked together for longitudinal analysis. The MPPS staff do still modify these restricted-use datasets to remove jurisdiction and respondent identifiers and to recode other variables in order to protect confidentiality. However, it is theoretically possible that a researcher might be able, in some rare cases, to use enough variables from a full dataset to identify a unique jurisdiction, so access to these datasets is restricted and approved on a case-by-case basis. CLOSUP encourages researchers interested in the MPPS to review the codebooks included in this data collection to see the full list of variables including those not found in the public-use datasets, and to explore the MPPS data using the public-use datasets. On 2016-08-20, the openICPSR web site was moved to new software. In the migration process, some projects were not published in the new system because the decisions made in the old site did not map easily to the new setup. This project is temporarily available as restricted data while ICPSR verifies that all files were migrated correctly.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains trace data describing user interactions with the Inter-university Consortium for Political and Social Research website (ICPSR). We gathered site usage data from Google Analytics. We focused our analysis on user sessions, which are groups of interactions with resources (e.g., website pages) and events initiated by users. ICPSR tracks a subset of user interactions (i.e., other than page views) through event triggers. We analyzed sequences of interactions with resources, including the ICPSR data catalog, variable index, data citations collected in the ICPSR Bibliography of Data-related Literature, and topical information about project archives. As part of our analysis, we calculated the total number of unique sessions and page views in the study period. Data in our study period fell between September 1, 2012, and 2016. ICPSR's website was updated and relaunched in September 2012 with new search functionality, including a Social Science Variables Database (SSVD) tool. ICPSR then reorganized its website and changed its analytics collection procedures in 2016, marking this as the cutoff date for our analysis. Data are relevant for two reasons. First, updates to the ICPSR website during the study period focused only on front-end design rather than the website's search functionality. Second, the core features of the website over the period we examined (e.g., faceted and variable search, standardized metadata, the use of controlled vocabularies, and restricted data applications) are shared with other major data archives, making it likely that the trends in user behavior we report are generalizable.
https://www.icpsr.umich.edu/web/ICPSR/studies/35519/terms
The National Survey of Early Care and Education (NSECE) is a set of four integrated, nationally representative surveys conducted in 2012. These were surveys of (1) households with children under 13, (2) home-based providers, (3) center-based providers, and (4) the center-based provider workforce.
The NSECE documents the nation's current utilization and availability of early care and education (including school-age care), in order to deepen the understanding of the extent to which families' needs and preferences coordinate well with providers' offerings and constraints. The experiences of low-income families are of special interest as they are the focus of a significant component of early care and education/school-age (ECE/SA) public policy. The NSECE calls for nationally-representative samples including interviews in all fifty states and Washington, DC.
The study is funded by the Office of Planning, Research and Evaluation (OPRE) in the Administration for Children and Families (ACF), United States Department of Health and Human Services. The project team is led by the National Opinion Research Center (NORC) at the University of Chicago, in partnership with Chapin Hall at the University of Chicago and Child Trends.
The Quick Tabulation and Public-Use Files are currently available via this site. Restricted-Use Files are also available at three different access levels; to determine which level of file access will best meet your needs, please see the NSECE Data Files Overview for more information.
Restricted-Use Files are available via the Child and Family Data Archive. To obtain the Level 1 files, researchers must agree to the terms and conditions of the Restricted Data Use Agreement and complete an application via ICPSR's online Data Access Request System.
Level 2 and 3 Restricted-Use Files are available via the National Opinion Research Center (NORC). For more information, please see the access instructions for NSECE Levels 2/3 Restricted-Use Data.
NORC is also beginning to release preliminary 2019 NSECE Quick Tabulation data files in summer 2020. These preliminary files and documentation are available for download from the DATA FILES box on the NORC website.
For more information, tutorials, and reports related to the National Survey of Early Care and Education, please visit the Child and Family Data Archive's Data Training Resources from the NSECE page.
https://www.icpsr.umich.edu/web/ICPSR/studies/7374/terms
This study represents one of four research projects on service delivery systems in metropolitan areas, covering fire protection (DECISION-RELATED RESEARCH ON THE ORGANIZATION OF SERVICE DELIVERY SYSTEMS IN METROPOLITAN AREAS: FIRE PROTECTION [ICPSR 7409]), police protection (DECISION-RELATED RESEARCH ON THE ORGANIZATION OF SERVICE DELIVERY SYSTEMS IN METROPOLITAN AREAS: POLICE PROTECTION [ICPSR 7427]), solid waste management (DECISION-RELATED RESEARCH ON THE ORGANIZATION OF SERVICE DELIVERY SYSTEMS IN METROPOLITAN AREAS: SOLID WASTE MANAGEMENT [ICPSR 7487]), and public health (the present study). All four projects used a common unit of analysis, namely all 200 Standard Metropolitan Statistical Areas (SMSAs) that, according to the 1970 Census, had a population of less than 1,500,000 and were entirely located within a single state. In each project, a limited amount of information was collected for all 200 SMSAs. More extensive data were gathered within independently drawn samples of these SMSAs, for all local geographical units and each administrative jurisdiction or agency in the service delivery areas. Two standardized systems of geocoding -- the Federal Information Processing Standard (FIPS) codes and the Office of Revenue Sharing (ORS) codes -- were used, so that data from various sources could be combined. The use of these two coding schemes also allows users to combine data from two or more of the research projects conducted in conjunction with the present one, or to add data from a wide variety of public data files. The delivery of public health services was investigated in 200 SMSAs plus Minneapolis and St. Paul. The basic data collection effort involved the use of public data sources as well as proprietary data from the American Medical Association (AMA) and the Commission on Professional and Hospital Activities (CPHA). 
Because of the proprietary nature of some of the data and for the preservation of confidentiality, all analyses were performed at the SMSA level. Unlike the other three related research projects, the present study does not provide disaggregated units of analysis such as the administrative jurisdiction, the individual hospital, or other facilities. Variables describe the characteristics of available professionals and facilities, regulatory factors reflecting the impact of federal and state programs available in the area, and financing factors, including the coverage of state Medicaid programs, Blue Cross and Blue Shield, and Medicare programs. Information is also provided regarding the demographic and socioeconomic characteristics of the population served in each SMSA.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
The Development Log provides a record of large-scale development projects occurring in the City of Cambridge. The Log, updated on a quarterly basis, is distributed to City departments and the public to keep them posted about development progress, from permitting through construction to completion. The Historical Projects table includes information about projects completed through 2023. The table includes general project information, such as development status and statistics related to the entire project. Limited information is available for projects completed prior to 2011.
Since a project may include more than one use, data on each specific use found within a project is found in the associated Project Use table found here: https://data.cambridgema.gov/Planning/Development-Log-Historical-Projects-Use-Data/r5mv-isth.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Earlier this year, Dr. Hoffman and Dr. Fafard published a book chapter on the efficacy and legality of border closures enacted by governments in response to changing COVID-19 conditions. The authors concluded that border closures are, at best, powerful symbolic acts taken by governments to show they are acting forcefully, even if the actions lack an epidemiological impact and breach international law. This COVID-19 travel restriction project was developed out of a necessity and desire to further examine the empirical implications of border closures. The current dataset contains bilateral travel restriction information on the status of 179 countries between 1 January 2020 and 8 June 2020. The data were extracted from the ‘international controls’ column of the Oxford COVID-19 Government Response Tracker (OxCGRT), which records a country’s change in border control status as a response to COVID-19 conditions. Accompanying source links were further verified through random selection and comparison with external news sources. Greater weight is given to official national government sources, followed by provincial and municipal news-affiliated agencies. The database is presented in matrix form for each country-pair and date: each cell contains a datum X_dmn indicating the border closure status on date d imposed by country m on country n. The coding is as follows: no border closure (code = 0), targeted border closure (= 1), and total border closure (= 99). The dataset provides further details in the ‘notes’ column if the type of closure is a modified form of a targeted closure, such as a land or port closure, flight or visa suspension, or a re-opening of borders to select countries. Visa suspensions and closures of land borders were coded separately as de facto border closures and analyzed as targeted border closures in quantitative analyses.
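A minimal sketch of the X_dmn coding scheme, using hypothetical country pairs and values (the real dataset covers 179 countries):

```python
# Codes used in the dataset for border closure status.
NO_CLOSURE, TARGETED, TOTAL = 0, 1, 99

# X[d][m][n]: closure status on date d imposed by country m on country n.
# Values below are illustrative, not taken from the dataset.
X = {
    "2020-03-17": {
        "CAN": {"USA": TARGETED, "FRA": NO_CLOSURE},
        "USA": {"CAN": TARGETED, "FRA": TOTAL},
    },
}

def closure_status(date, imposer, target):
    return X[date][imposer][target]

print(closure_status("2020-03-17", "USA", "FRA"))  # 99
```

Because the matrix is directional, X_dmn and X_dnm need not match: a closure imposed by m on n does not imply the reverse.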
The file titled ‘BTR Supplementary Information’ covers supplemental details for the database. The various tabs cover the following: 1) Codebook: variable name, format, source links, and description; 2) Sources, Access dates: dates of access for the individual source links, with additional notes; 3) Country groups: breakdown of the EEA, EU, SADC, and Schengen groups, with source links; 4) Newly added sources: for missing countries with a population greater than 1 million (meeting the inclusion criteria), relevant news sources were added for analysis; 5) Corrections: external news sources correcting errors in the coding of international controls retrieved from the OxCGRT dataset. At the time of our study's inception, there was no existing dataset that recorded bilateral travel restriction decisions between countries. We hope this dataset will be useful for studying the impact of border closures in the COVID-19 pandemic, widening analysis to a global scale rather than limiting it to a single country or region. Statement of contributions: data entry and verification were performed mainly by GL, with assistance from MJP and RN. MP and IW provided further data verification on the nine countries purposively selected for the exploratory analysis of political decision-making.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Taken verbatim from the source: Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The dataset includes a comprehensive list of asbestos certificates that have been issued to individuals within New York State for roles including air sampling, handlers, inspectors, management planners, operations and maintenance, project designers, project monitors, restricted allied trades, and supervisors. The dataset includes the individual’s name and license information such as certificate number, category certificate number, certificate type, issued date, and expiration date.
Controlled CO2 release experiments and studies of natural CO2 seeps have been undertaken at sites across the globe for CCS applications. The scientific motivation, experimental design, baseline assessment, and CO2 detection and monitoring equipment deployed vary significantly between these study sites, addressing questions including impacts on benthic communities, testing of novel monitoring technologies, quantifying seep formation/style, and determining CO2 flux rates. A review and synthesis of these sites studied for CCS will provide valuable information to: i. enable the design of effective monitoring and survey strategies; ii. identify realistic site-specific environmental and ecosystem impact scenarios; iii. rationalise regulatory definitions with what is scientifically likely or achievable; iv. guide novel future scientific studies at natural or artificial release sites. Two global databases were constructed in Spring 2013, informed by a wide literature review and, where appropriate, contact with the research project leader: i. artificial CO2 release sites; ii. natural CO2 seeps studied for CCS purposes. The location and select information from each of these datasets are intended to be displayed as separate GoogleMap files which can be embedded in the QICS or UKCCSRC web server. These databases are not expected to be complete. Information should be added as more publications become available or more case studies emerge or are set up. To facilitate this process, a contact email should be included beneath the map to allow viewers to recommend new or overlooked study sites for the dataset. Grant number: UKCCSRC-C1-31. These data are currently restricted.
This Religion and State-Minorities (RASM) dataset is supplemental to the Religion and State Round 2 (RAS2) dataset. It codes the RAS religious discrimination variable using the minority as the unit of analysis (RAS2 uses the country as the unit of analysis and is a general measure of all discrimination in the country). RASM codes religious discrimination by governments against all 566 minorities in 175 countries that meet a minimum population cutoff. Any religious minority that is at least 0.25 percent of the population, or has a population of at least 500,000 (in countries with populations of 200 million or more), is included. The dataset also includes all Christian minorities in Muslim countries and all Muslim minorities in Christian countries, for a total of 597 minorities. The data cover 1990 to 2008 with yearly codings.
These religious discrimination variables are designed to examine restrictions the government places on the practice of religion by minority religious groups. It is important to clarify two points. First, these variables focus on restrictions on minority religions. Restrictions that apply to all religions are not coded in this set of variables. This is because the act of restricting or regulating the religious practices of minorities is qualitatively different from restricting or regulating all religions. Second, this set of variables focuses only on restrictions of the practice of religion itself or on religious institutions and does not include other types of restrictions on religious minorities. The reasoning behind this is that there is much more likely to be a religious motivation for restrictions on the practice of religion than there is for political, economic, or cultural restrictions on a religious minority. These secular types of restrictions, while potentially motivated by religion, can also be due to other reasons. That political, economic, and cultural restrictions are often placed on ethnic minorities who share the same religion as the majority group in their state is proof of this.
This set of variables is essentially a list of specific types of religious restrictions which a government may place on some or all minority religions. These variables are identical to those included in the RAS2 dataset, save that one is not included because it focuses on foreign missionaries and this set of variables focuses on minorities living in the country. Each of the items in this category is coded on the following scale:
0. The activity is not restricted or the government does not engage in this practice.
1. The activity is restricted slightly or sporadically or the government engages in a mild form of this practice or a severe form sporadically.
2. The activity is significantly restricted or the government engages in this activity often and on a large scale.
A composite version combining the variables to create a measure of religious discrimination against minority religions which ranges from 0 to 48 also is included.
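A hedged sketch of the composite measure, assuming 24 items each coded on the 0-2 scale above (implied by the 0-48 range, though the exact item count is not stated here):

```python
# Assumed item count: a 0-48 composite from items coded 0-2 implies 24 items.
N_ITEMS = 24

def composite(item_scores):
    """Sum per-item discrimination codes (each 0, 1, or 2) into a 0-48 score."""
    assert len(item_scores) == N_ITEMS
    assert all(s in (0, 1, 2) for s in item_scores)
    return sum(item_scores)

print(composite([0] * N_ITEMS))  # 0  (no discrimination on any item)
print(composite([2] * N_ITEMS))  # 48 (severe, large-scale on every item)
```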
ARDA Note: This file was revised on October 6, 2017. At the PI's request, we removed the variable reporting on the minority's percentage of a country's population after finding inconsistencies with the reported values. For detailed data on religious demographics, see the Religious Characteristics of States Dataset Project (/data-archive?fid=RCSREG2).
WILIS 2 was a follow-up study to develop an alumni tracking system aimed at recent graduates that could potentially be used by all LIS programs. The project built on WILIS 1, a comprehensive IMLS-funded study of career patterns of graduates of LIS programs in North Carolina, by fully developing and testing the career tracking model on a national level. The WILIS 2 survey was designed for recent graduates of LIS programs in North America. The survey gathered data on demographics, employment, LIS master's program experience and evaluation, and knowledge and skills provided by the LIS program. 39 LIS programs participated in the study. Programs were asked to select a random sample of 250 of their master's degree graduates from the previous five years; however, several programs included a few graduates from earlier years. Fewer than four percent of these respondents graduated prior to 2003. Programs with multiple degrees were able to select the degree programs included in their sample. The graduates received an email invitation and three email reminders. A few programs mailed paper invitations to encourage better response rates. The response rate for the survey was 41% (n=3,507). Response rates for individual programs ranged from 16% to 80%. The dataset of the 39 LIS programs includes alumni who graduated between 2000 and 2009.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation

Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset containing a plethora of anthropological data, collected unobtrusively over the course of more than 4 months by n=71 participants under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
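As a minimal sketch of the pandas route described above — note that the file contents and column names here are illustrative stand-ins, not the actual LifeSnaps schema:

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for one of the released daily-granularity
# CSV files; the real LifeSnaps column names may differ.
sample = io.StringIO(
    "id,date,steps,sleep_minutes\n"
    "p01,2021-05-24,8432,412\n"
    "p01,2021-05-25,10211,389\n"
)

# In practice, pass the path of a downloaded LifeSnaps CSV file instead.
df = pd.read_csv(sample, parse_dates=["date"])
print(df.shape)  # (2, 4)
```

Parsing the date column at read time (`parse_dates`) makes later resampling and time-window queries straightforward.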
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend using the raw, complete data by importing the LifeSnaps MongoDB database. To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed. For the Fitbit data, run the following: mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de449913
Abstract (en): The National Longitudinal Study of Adolescent Health (Add Health) is a longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-1995 school year. The Add Health cohort has been followed into young adulthood with four in-home interviews, the most recent in 2008, when the sample was aged 24-32. The additional files contained in this component of the Add Health project are from the Adolescent Health and Academic Achievement (AHAA) study and provide an opportunity to examine the effects of education on adolescent behavior, academic achievement, and cognitive and psychosocial development in the 1990s. The AHAA study contributes to Add Health by providing the high school transcripts of Add Health Wave III sample members. The AHAA data provide indicators of (1) educational achievement, (2) course-taking patterns, (3) curricular exposure, and (4) educational contexts within and between schools, all of which can be linked to the Add Health survey data. ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection: performed consistency checks; standardized missing values; and checked for undocumented or out-of-range codes. The universe is adolescents in grades 7-12 and their families. Wave I, Stage 1 school sample: a stratified, random sample of all high schools in the United States.
A school was eligible for the sample if it included an 11th grade and had a minimum enrollment of 30 students. A feeder school, a school that sent graduates to the high school and that included a 7th grade, was also recruited from the community. Wave I, Stage 2: an in-home sample of 20,745 adolescents, consisting of a core sample from each community plus selected special oversamples, was interviewed in 1995. Eligibility for the oversamples was determined by the adolescent's responses on the In-School Questionnaire. Adolescents could qualify for more than one sample. At Wave II, respondents who were in grades 7-11 at Wave I were re-interviewed. Wave III: the in-home Wave III sample consists of Wave I respondents who could be located and re-interviewed six years later. Wave III also collected High School Transcript Release Forms to be used for the AHAA study. At Wave IV, 15,701 Wave I respondents were re-interviewed in 2008. 2012-09-10: The following three pages have been added to the Restricted Data Use Agreement for this study: a "General Information and Checklists" page, a "Using the Add Health Transcript Data in the ICPSR-DSDR Secure Data Enclave" page, and a page titled "ATTACHMENT A: Output Disclosure Risk Checks." Funding institution(s): United States Department of Health and Human Services. National Institutes of Health. Eunice Kennedy Shriver National Institute of Child Health and Human Development (P01-HD31921). United States Department of Health and Human Services. National Institutes of Health. National Cancer Institute. United States Department of Health and Human Services. National Institutes of Health. National Institute on Alcohol Abuse and Alcoholism. United States Department of Health and Human Services. National Institutes of Health. National Institute on Deafness and Other Communication Disorders. United States Department of Health and Human Services. National Institutes of Health. National Institute on Drug Abuse. 
United States Department of Health and Human Services. National Institutes of Health. National Institute of General Medical Sciences. United States Department of Health and Human Services. National Institutes of Health. National Institute of Mental Health. United States Department of Health and Human Services. National Institutes of Health. National Institute of Nursing Research. United States Department of Health and Human Services. National Institutes of Health. Office of AIDS Research. United States Department of Health and Human Services. National Institutes of Health. Office of Behavioral and Social Sciences Research. United ...
The Research Data Collections Project led by Monash University Library aimed to identify and describe research data collections arising from publicly funded research, and to showcase these collections by contributing information about them to Research Data Australia (RDA). As a result of this project, more than 60 Monash researchers have been provided with an additional channel to promote their research, including showcasing their work to new generalist and cross-disciplinary audiences. For those researchers with data available for re-use, an increase in research impact (e.g. citation of publications) may result. For those researchers with data that is restricted or only available by negotiation, appearing in RDA may still increase opportunities for collaboration and highlight data collection and analysis techniques that may be of interest to other researchers. The in-depth nature of the interviews has also enabled the Library to learn more about researchers' data management practices and needs. This information will inform the continuous improvement of the Library's data management advisory services and technical infrastructure. Finally, and perhaps most importantly, the project has led to a dramatic increase in the participation of Library staff in data management activities. By building on the knowledge and networks of our subject librarians, and carefully structuring the project around a buddy system that provided staff with a safe environment and a staged process in which to learn, the project successfully involved more than thirty Library staff in addition to the staff formally seconded to the project team.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Geriatric co-management is known to improve the treatment of older adults in various clinical settings; however, widespread application of the concept is limited due to restricted resources. Digitalization may offer options to overcome these shortages by providing structured, relevant information and decision support tools for medical professionals. We present the SURGE-Ahead project (Supporting SURgery with GEriatric co-management and Artificial Intelligence) addressing this challenge. Methods: A digital application with a dashboard-style user interface will be developed, displaying 1) evidence-based recommendations for geriatric co-management and 2) artificial intelligence-enhanced suggestions for continuity of care (COC) decisions. The development and implementation of the SURGE-Ahead application (SAA) will follow the Medical Research Council framework for complex medical interventions. In the development phase, a minimum geriatric data set (MGDS) will be defined that combines parametrized information from the hospital information system with a concise assessment battery and sensor data. Two literature reviews will be conducted to create an evidence base for co-management and COC suggestions that will be used to display guideline-compliant recommendations. Principles of machine learning will be used for further data processing and COC proposals for the postoperative course. In an observational and AI-development study, data will be collected in three surgical departments of a University Hospital (trauma surgery, general and visceral surgery, urology) for AI training, feasibility testing of the MGDS, and identification of co-management needs. Usability will be tested in a workshop with potential users. 
During a subsequent project phase, the SAA will be tested and evaluated in clinical routine, allowing its further improvement through an iterative process. Discussion: The outline offers insights into a novel and comprehensive project that combines geriatric co-management with digital support tools to improve inpatient surgical care and continuity of care of older adults. Trial registration: German clinical trials registry (Deutsches Register für klinische Studien, DRKS00030684), registered on 21st November 2022.
This dataset contains electronic health records used to study associations between PFAS occurrence and multimorbidity in a random sample of UNC Healthcare system patients. The dataset contains the medical record number to uniquely identify each individual as well as information on PFAS occurrence at the zip code level, the zip code of residence for each individual, chronic disease diagnoses, patient demographics, and neighborhood socioeconomic information from the 2010 US Census. This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. Because these data contain PII from electronic health records, the data can only be accessed with an approved IRB application. Project analytic code is available at L:/PRIV/EPHD_CRB/Cavin/CARES/Project Analytic Code/Cavin Ward/PFAS Chronic Disease and Multimorbidity. Format: the data are formatted as an R data frame and an associated comma-delimited flat text file. The data include the medical record number to uniquely identify each individual (which also serves as the primary key for the dataset), as well as information on the occurrence of PFAS contamination at the zip code level, socioeconomic data at the census tract level from the 2010 US Census, demographics, and the presence of chronic disease as well as multimorbidity (the presence of two or more chronic diseases). This dataset is associated with the following publication: Ward-Caviness, C., J. Moyer, A. Weaver, R. 
Devlin, and D. Diazsanchez. Associations between PFAS occurrence and multimorbidity as observed in an electronic health record cohort. Environmental Epidemiology. Wolters Kluwer, Alphen aan den Rijn, NETHERLANDS, 6(4): p e217, (2022).
A database based on a random sample of the noninstitutionalized population of the United States, developed for the purpose of studying the effects of demographic and socio-economic characteristics on differentials in mortality rates. It consists of data from 26 U.S. Current Population Survey (CPS) cohorts, annual Social and Economic Supplements, and the 1980 Census cohort, combined with death certificate information to identify mortality status and cause of death over the interval 1979 to 1998. The Current Population Surveys are March Supplements selected from the period March 1973 to March 1998. The NLMS routinely links geographical and demographic information from Census Bureau surveys and censuses to the NLMS database, and other available sources upon request. The Census Bureau and CMS have approved the linkage protocol, and data acquisition is currently underway. The plan for the NLMS is to link information on mortality to the NLMS every two years from 1998 through 2006, with research on the resulting database continuing at least through 2009. The NLMS will continue to incorporate data from the yearly Annual Social and Economic Supplement as the data become available. Based on the expected size of the Annual Social and Economic Supplements to be conducted, the updating process is expected to increase the mortality content of the study to nearly 500,000 cases out of a total of approximately 3.3 million records. This effort also includes expanding the NLMS population base by incorporating new March Supplement Current Population Survey data as they become available. Linkages to the SEER and CMS datasets are also available. Data Availability: Due to the confidential nature of the data used in the NLMS, the public use dataset consists of a reduced number of CPS cohorts with a fixed follow-up period of five years. 
NIA does not make the data available directly. Research access to the entire NLMS database can be obtained through the NIA program contact listed. Interested investigators should email the NIA contact a one-page prospectus of the proposed project. NIA will approve projects based on their relevance to NIA/BSR's areas of emphasis. Approved projects are then assigned to NLMS statisticians at the Census Bureau who work directly with the researcher to interface with the database. A modified version of the public use data files is also available through the Census restricted Data Centers. However, since the database is quite complex, many investigators have found that the most efficient way to access it is through the Census programmers. * Dates of Study: 1973-2009 * Study Features: Longitudinal * Sample Size: ~3.3 million * Link: ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/00134
Spatial analysis and statistical summaries of the Protected Areas Database of the United States (PAD-US) provide land managers and decision makers with a general assessment of management intent for biodiversity protection, natural resource management, and recreation access across the nation. The PAD-US 3.0 Combined Fee, Designation, Easement feature class (with Military Lands and Tribal Areas from the Proclamation and Other Planning Boundaries feature class) was modified to remove overlaps, avoiding overestimation in protected area statistics and supporting user needs. A Python scripted process ("PADUS3_0_CreateVectorAnalysisFileScript.zip") associated with this data release prioritized overlapping designations (e.g. Wilderness within a National Forest) based upon their relative biodiversity conservation status (e.g. GAP Status Code 1 over 2), public access values (in the order Closed, Restricted, Open, Unknown), and geodatabase load order (records are deliberately organized in the PAD-US full inventory with fee-owned lands loaded before overlapping management designations and easements). The Vector Analysis File ("PADUS3_0VectorAnalysisFile_ClipCensus.zip"), an associated item of PAD-US 3.0 Spatial Analysis and Statistics ( https://doi.org/10.5066/P9KLBB5D ), was clipped to the Census state boundary file to define the extent and serve as a common denominator for statistical summaries. 
Boundaries of interest to stakeholders (State, Department of the Interior Region, Congressional District, County, EcoRegions I-IV, Urban Areas, Landscape Conservation Cooperative) were incorporated into separate geodatabase feature classes to support various data summaries ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip"). Comma-separated value (CSV) tables ("PADUS3_0SummaryStatistics_TabularData_CSV.zip") summarizing "PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip" are provided as an alternative format and enable users to explore and download summary statistics of interest (comma-separated table [CSV], Microsoft Excel workbook [.XLSX], Portable Document Format [.PDF] report) from the PAD-US Lands and Inland Water Statistics Dashboard ( https://www.usgs.gov/programs/gap-analysis-project/science/pad-us-statistics ). In addition, a "flattened" version of the PAD-US 3.0 combined file without other extent boundaries ("PADUS3_0VectorAnalysisFile_ClipCensus.zip") allows for other applications that require a representation of overall protection status without overlapping designation boundaries. The "PADUS3_0VectorAnalysis_State_Clip_CENSUS2020" feature class ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.gdb") is the source of the PAD-US 3.0 raster files (an associated item of PAD-US 3.0 Spatial Analysis and Statistics, https://doi.org/10.5066/P9KLBB5D ). Note that the PAD-US inventory is now considered functionally complete, with the vast majority of land protection types represented in some manner, while work continues to maintain updates and improve data quality (see inventory completeness estimates at: http://www.protectedlands.net/data-stewards/ ). In addition, changes in protected area status between versions of the PAD-US may be attributed more to improving the completeness and accuracy of the spatial data than to actual management actions or new acquisitions. USGS provides no legal warranty for the use of this data. 
While PAD-US is the official aggregation of protected areas ( https://www.fgdc.gov/ngda-reports/NGDA_Datasets.html ), agencies are the best source of their lands data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Tycho datasets contain case counts for reported disease conditions for countries around the world. The Project Tycho data curation team extracts these case counts from various reputable sources, typically national or international health authorities such as the US Centers for Disease Control and Prevention or the World Health Organization. These original data sources include both open- and restricted-access sources. For restricted-access sources, the Project Tycho team has obtained permission for redistribution from data contributors. All datasets contain case count data that are identical to counts published in the original source; no counts have been modified in any way by the Project Tycho team. The Project Tycho team has pre-processed datasets by adding new variables, such as standard disease and location identifiers, that improve data interpretability, and has converted the data into a standard format.
Each Project Tycho dataset contains case counts for a specific condition (e.g. measles) and for a specific country (e.g. The United States). Case counts are reported per time interval. In addition to case counts, datasets include information about these counts (attributes), such as the location, age group, subpopulation, diagnostic certainty, place of acquisition, and the source from which we extracted case counts. One dataset can include many series of case count time intervals, such as "US measles cases as reported by CDC", or "US measles cases reported by WHO", or "US measles cases that originated abroad", etc.
Depending on the intended use of a dataset, we recommend a few data processing steps before analysis:
- Analyze missing data: Project Tycho datasets do not include time intervals for which no case count was reported (for many datasets, time series of case counts are incomplete due to incompleteness of source documents), so users will need to add time intervals for which no count value is available. Project Tycho datasets do include time intervals for which a case count value of zero was reported.
- Separate cumulative from non-cumulative time interval series: case count time series in Project Tycho datasets can be "cumulative" or "fixed-interval". Cumulative case count time series consist of overlapping case count intervals starting on the same date but ending on different dates. For example, each interval in a cumulative count time series can start on January 1st but end on January 7th, 14th, 21st, etc. It is common practice among public health agencies to report cases for cumulative time intervals. Case count series with fixed time intervals consist of mutually exclusive intervals that all start and end on different dates and have identical length (day, week, month, year). Given the different nature of these two types of case count data, we indicate the type with an attribute for each count value, named "PartOfCumulativeCountSeries".
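The two recommended steps can be sketched in pandas. Only the "PartOfCumulativeCountSeries" attribute name comes from the description above; the other column names and the toy values are illustrative assumptions:

```python
import pandas as pd

# Toy stand-in for a Project Tycho extract: three fixed-interval weekly
# counts and one cumulative count (hypothetical column names).
df = pd.DataFrame({
    "PeriodStartDate": pd.to_datetime(
        ["2010-01-03", "2010-01-10", "2010-01-24", "2010-01-03"]),
    "CountValue": [5, 2, 7, 14],
    "PartOfCumulativeCountSeries": [0, 0, 0, 1],
})

# Step 1: separate fixed-interval counts from cumulative series.
fixed = df[df["PartOfCumulativeCountSeries"] == 0]

# Step 2: re-insert reporting weeks with no reported count as missing
# (NaN) rather than implicitly treating them as zero.
weekly = (fixed.set_index("PeriodStartDate")["CountValue"]
               .resample("W-SUN").asfreq())
print(weekly)
```

The resampled series now shows the unreported week (2010-01-17) as NaN, which keeps "no report" distinct from an explicit zero count.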
https://datacatalog.worldbank.org/public-licenses?fragment=cc
This dataset contains metadata (title, abstract, date of publication, field, etc.) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.
Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data, while the parsed PDF or LaTeX file is needed to extract information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus: around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles published from 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries' national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. The fields included are Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
Due to the intensive computational resources required, a set of 1,037,748 articles was randomly selected from the 10 million articles in our restricted corpus as a convenience sample.
The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.
To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches for ISO 3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country's name is spelled in a non-standard way.
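A minimal sketch of this first approach — the country list here is a small illustrative subset, not the full ISO 3166 list used in the project:

```python
import re

# Illustrative subset of the ISO 3166 country-name list; the actual
# matching list covers all countries.
COUNTRIES = ["Kenya", "Brazil", "Viet Nam", "India"]

# Word boundaries prevent partial matches, e.g. "India" inside "Indiana".
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, COUNTRIES)) + r")\b",
    re.IGNORECASE,
)

def countries_of_study(text):
    """Return the set of country names found in a title or abstract."""
    return {m.group(1).title() for m in pattern.finditer(text)}

abstract = "We study maize markets in Kenya and compare them with Brazil."
print(countries_of_study(abstract))  # {'Kenya', 'Brazil'} (order may vary)
```

The word-boundary anchors illustrate the exclusion-error trade-off noted above: a misspelled or abbreviated country name simply fails to match.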
The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify named entities in text, implemented with the spaCy Python library. The NER algorithm segments text into named entities and is used in this project to identify countries of study in the academic articles. spaCy supports multiple languages and has been trained on multiple spellings of country names, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
The second task is to classify whether the paper uses data. A supervised machine learning approach is employed: 3,500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:
Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. Identifying whether an academic article is using data from any country.
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated, or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please select multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable, for instance if the research is theoretical and has no specific country application. In other cases, the research article may involve multiple countries; select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.
The median amount of time a worker spent on an article, measured as the time between when the worker accepted the article for classification and when the classification was submitted, was 25.4 minutes. If human raters had been used exclusively rather than machine learning tools, reviewing the corpus of 1,037,748 articles examined in this study would have taken around 50 years of human work time at a cost of $3,113,244, assuming a cost of $3 per article as was paid to the MTurk workers.
A model is next trained on the 3,500 labeled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model that classifies articles based on the labeled data. Of the 3,500 articles that were hand coded by the MTurk workers, 900 were fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of "uses data" was assigned if the model predicted an article used data with at least 90% confidence.
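The 90%-confidence decision rule can be illustrated in isolation. This is a sketch, not the authors' code: the two-logit ordering ["does not use data", "uses data"] and the logit values are assumptions for illustration:

```python
import math

def softmax(logits):
    """Convert raw model logits to class probabilities."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_uses_data(logits, threshold=0.90):
    """Assign the 'uses data' label only when the predicted probability
    reaches the 90% confidence threshold described above. The logit
    order ['does not use data', 'uses data'] is an assumption."""
    prob_uses_data = softmax(logits)[1]
    return prob_uses_data >= threshold

print(label_uses_data([-1.2, 1.8]))  # confident prediction -> True
print(label_uses_data([0.1, 0.4]))   # uncertain prediction -> False
```

With a threshold this high, borderline articles are left unlabeled as "uses data", trading recall for precision.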
The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate model performance if the workers at times made errors that would not apply to the model; for instance, a human rater could mistake the Republic of Korea for the Democratic People's Republic of Korea. If both the humans and the model make the same kinds of errors, then the performance reported here will be overestimated.
The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of