https://datacatalog.worldbank.org/public-licenses?fragment=cc
This dataset contains metadata (title, abstract, date of publication, field, etc.) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.
Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF or LaTeX file is important for extracting key information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.
The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.
To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.
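A minimal sketch of this first approach, assuming an illustrative subset of ISO 3166 country names (the full list would be used in practice):

```python
import re

# Illustrative subset of ISO 3166 country names; the full list would be used in practice.
COUNTRIES = ["Kenya", "Brazil", "Viet Nam", "United States"]

# One word-bounded, case-insensitive pattern per country name.
PATTERNS = {c: re.compile(r"\b" + re.escape(c) + r"\b", re.IGNORECASE) for c in COUNTRIES}

def countries_mentioned(text):
    """Return the set of country names found in a title/abstract string."""
    return {c for c, p in PATTERNS.items() if p.search(text)}
```

For example, `countries_mentioned("Maize yields in Kenya and Brazil")` returns both country names, while a purely theoretical abstract returns an empty set. As noted above, non-standard spellings would be missed by this approach.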
The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3,500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service.[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:
Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. Identifying whether an academic article is using data from any country.
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated, or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable, for instance if the research is theoretical and has no specific country application. In other cases, the research article may involve multiple countries; in these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.
The median amount of time that a worker spent on an article, measured as the time between when the worker accepted the article for classification and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming $3 per article, as was paid to the MTurk workers).
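These time and cost figures can be checked with simple arithmetic (a sketch; the $3 and 25.4-minute figures are those reported above, and "years" here means continuous person-time):

```python
articles = 1_037_748
cost_per_article = 3      # USD paid per MTurk classification
median_minutes = 25.4     # median rater time per article

total_cost = articles * cost_per_article                  # 3,113,244 USD
total_years = articles * median_minutes / 60 / 24 / 365   # continuous person-years

print(total_cost, round(total_years, 1))
```

This reproduces the $3,113,244 cost and roughly 50 years of review time quoted above.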
A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh et al. 2019). We use PyTorch (Paszke et al. 2019) to produce a model that classifies articles based on the labeled data. Of the 3,500 articles hand coded by the MTurk workers, 900 were fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of “uses data” was assigned if the model predicted an article used data with at least 90% confidence.
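The final decision rule, assigning "uses data" only at 90%-or-higher confidence, can be sketched as below; the function name and the probability input are illustrative, with the actual probabilities coming from the fine-tuned DistilBERT classifier:

```python
def label_from_probability(p_uses_data, threshold=0.90):
    """Assign the 'uses data' label only when the model's predicted
    probability meets the 90% confidence threshold."""
    return "uses data" if p_uses_data >= threshold else "no data"

# Example: applying the rule to a batch of hypothetical model outputs.
labels = [label_from_probability(p) for p in (0.97, 0.55, 0.90)]
```

A high threshold like this trades recall for precision: borderline articles are labeled "no data" rather than risking false positives.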
The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.
The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of
What is the National Child Abuse and Neglect Data System (NCANDS)?
The National Child Abuse and Neglect Data System (NCANDS) is a federally sponsored effort that annually collects and analyzes data on child abuse and neglect known to child protective services (CPS) agencies in the United States. The mandate for NCANDS is based on the 1988 amendments to the Child Abuse Prevention and Treatment Act (CAPTA), which directed the Secretary of the U.S. Department of Health and Human Services to create a national data collection and analysis program for state-level child abuse and neglect information. Subsequent amendments to CAPTA have led to new data collection requirements, many of which are incorporated into NCANDS.
A successful federal-state partnership is the core component of NCANDS. Each state designates one person to be the NCANDS state contact, who works closely with the Children’s Bureau and the NCANDS Technical Team to uphold the high-quality standards associated with NCANDS data. Webinars, technical bulletins, virtual meetings, email, and phone conferences are used regularly to facilitate information sharing and provision of technical assistance.
Annual Data Collection Process
Every year, NCANDS data are submitted voluntarily by the 50 states, the District of Columbia, and the Commonwealth of Puerto Rico. The NCANDS reporting year is based on the FFY calendar, which spans October 1 to September 30. States submit case-level data, called a Child File, by constructing an electronic file of child-specific records for each report of alleged child abuse and neglect that received a CPS response in the form of an investigation or alternative response. Case-level data include information about the characteristics of the reports of abuse and neglect, the children involved, the types of maltreatment, the CPS findings, the risk factors of the child and the caregivers, the services provided, and the perpetrators.
The Child File is supplemented by agency-level aggregate statistics in a separate data submission called the Agency File. The Agency File contains data that are not reportable at the child-specific level and are often gathered from agencies external to CPS. Information collected in the Agency File includes receipt of prevention and post-response services and caseload and workforce data. States are asked to submit both the Child File and the Agency File each year.
How are the data used?
The NCANDS data are a critical source of information for many publications, reports, child welfare personnel, researchers, and others. NCANDS data are used to measure the performance of several federal programs, and are an integral part of the Child and Family Services Reviews (CFSRs) and the Child Welfare Outcomes: Report to Congress. NCANDS data are also used for the annual Child Maltreatment report series. Each report summarizes the major national and state-by-state findings for the given fiscal year, and is a key resource for thousands of people and organizations across the world. The Children’s Bureau has published an annual Child Maltreatment report every year since 1992.
Where are the data available?
The Child Maltreatment reports are available on the Children’s Bureau website at /programs/cb/research-data-technology/statisti.... Restricted use files of the NCANDS data are archived at the National Data Archive on Child Abuse and Neglect (NDACAN) at Cornell University and available to researchers who are interested in using these data for statistical analyses.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This note describes the data sets used for all analyses contained in the manuscript ‘Oxytocin – a social peptide?’ [1] that is currently under review.
Data Collection
The data sets described here were originally retrieved from Web of Science (WoS) Core Collection via the University of Edinburgh’s library subscription [2]. The aim of the original study for which these data were gathered was to survey peer-reviewed primary studies on oxytocin and social behaviour. To capture relevant papers, we used the following query:
TI = (“oxytocin” OR “pitocin” OR “syntocinon”) AND TS = (“social*” OR “pro$social” OR “anti$social”)
The final search was performed on 13 September 2021. This returned a total of 2,747 records, of which 2,049 were classified by WoS as ‘articles’. Given our interest in primary studies only – articles reporting original data – we excluded all other document types. We further excluded all articles sub-classified as ‘book chapters’ or as ‘proceeding papers’ in order to limit our analysis to primary studies published in peer-reviewed academic journals. This reduced the set to 1,977 articles. All of these were published in the English language, and no further language refinements were necessary.
All available metadata on these 1,977 articles was exported as plain text ‘flat’ format files in four batches, which we later merged together via Notepad++. Upon manual examination, we discovered examples of papers classified as ‘articles’ by WoS that were, in fact, reviews. To further filter our results, we searched all available PMIDs in PubMed (1,903 had associated PMIDs, ~96% of the set). We then filtered the results to identify all records classified as ‘review’, ‘systematic review’, or ‘meta-analysis’, identifying 75 records [3]. After examining a sample and agreeing with the PubMed classification, these were removed from our dataset, leaving a total of 1,902 articles.
From these data, we constructed two datasets by parsing out relevant reference data via the Sci2 Tool [4]. First, we constructed a ‘node-attribute-list’ by linking unique reference strings (‘Cite Me As’ column in WoS data files) to unique identifiers; we then parsed into this dataset information on the identity of a paper, including the title of the article, all authors, journal of publication, year of publication, total citations as recorded by WoS, and WoS accession number. Second, we constructed an ‘edge-list’ that records the citations from a citing paper in the ‘Source’ column and identifies the cited paper in the ‘Target’ column, using the unique identifiers described previously to link these data to the node-attribute-list.
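The construction of the two lists can be sketched as follows; the reference strings and citation structure here are hypothetical stand-ins for the ‘Cite Me As’ data:

```python
# Hypothetical minimal records: each paper is a ('Cite Me As' string, list of cited strings) pair.
papers = [
    ("SMITH 2010, J NEURO, V1, P1", ["JONES 2005, J ENDO, V9, P7"]),
    ("JONES 2005, J ENDO, V9, P7", []),
]

# Node-attribute-list key: map each unique reference string to an integer Id.
ids = {}
for ref, cited in papers:
    for s in [ref] + cited:
        ids.setdefault(s, len(ids) + 1)

# Edge-list: Source = citing paper Id, Target = cited paper Id.
edges = [(ids[ref], ids[c]) for ref, cited in papers for c in cited]
```

Cited references that were never retrieved as full records still receive Ids, which is why the full network described below has far more nodes (46,633) than retrieved articles (1,902).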
We then constructed a network in which papers are nodes, and citation links between papers are directed edges. We used Gephi Version 0.9.2 [5] to manually clean these data by merging duplicate references caused by different reference formats or by referencing errors. To do this, we needed to retain all retrieved records (1,902) as well as all of their references, whether or not the referenced papers were included in our original search. In total, this produced a network of 46,633 nodes (unique reference strings) and 112,520 edges (citation links). Thus, the average reference list size of these articles is ~59 references. The mean indegree (within-network citations) is 2.4 (median is 1) for the entire network, reflecting a great diversity in referencing choices among our 1,902 articles.
After merging duplicates, we then restricted the network to include only articles fully retrieved (1,902), and retained only those that were connected together by citation links in a large interconnected network (i.e. the largest component). In total, 1,892 (99.5%) of our initial set were connected together via citation links, meaning a total of ten papers were removed from the following analysis – these were neither connected to the largest component, nor did they form connections with one another (i.e. they were ‘isolates’).
This left us with a network of 1,892 nodes connected together by 26,019 edges. It is this network that is described by the ‘node-attribute-list’ and ‘edge-list’ provided here. This network has a mean in-degree of 13.76 (median in-degree of 4). By restricting our analysis in this way, we lose 44,741 unique references (96%) and 86,501 citations (77%) from the full network, but retain a set of articles tightly knitted together, all of which have been fully retrieved due to possessing certain terms related to oxytocin AND social behaviour in their title, abstract, or associated keywords.
Before moving on, we calculated indegree for all nodes in this network – this counts the number of citations to a given paper from other papers within this network – and have included this in the node-attribute-list. We further clustered this network via modularity maximisation via the Leiden algorithm [6]. We set the algorithm to resolution 1, and allowed the algorithm to run over 100 iterations and 100 restarts. This gave Q=0.43 and identified seven clusters, which we describe in detail within the body of the paper. We have included cluster membership as an attribute in the node-attribute-list.
Data description
We include here two datasets: (i) ‘OTSOC-node-attribute-list.csv’ consists of the attributes of 1,892 primary articles retrieved from WoS that include terms indicating a focus on oxytocin and social behaviour; (ii) ‘OTSOC-edge-list.csv’ records the citations between these papers. Together, these can be imported into a range of different software for network analysis; however, we have formatted these for ease of upload into Gephi 0.9.2. Below, we detail their contents:
Id, the unique identifier
Label, the reference string of the paper to which the attributes in this row correspond. This is taken from the ‘Cite Me As’ column from the original WoS download. The reference string is in the following format: last name of first author, publication year, journal, volume, start page, and DOI (if available).
Wos_id, unique Web of Science (WoS) accession number. These can be used to query WoS to find further data on all papers via the ‘UT= ’ field tag.
Title, paper title.
Authors, all named authors.
Journal, journal of publication.
Pub_year, year of publication.
Wos_citations, total number of citations recorded by WoS Core Collection to a given paper as of 13 September 2021
Indegree, the number of within network citations to a given paper, calculated for the network shown in Figure 1 of the manuscript.
Cluster, provides the cluster membership number as discussed within the manuscript (Figure 1). This was established via modularity maximisation via the Leiden algorithm (Res 1; Q=0.43|7 clusters)
Source, the unique identifier of the citing paper.
Target, the unique identifier of the cited paper.
Type, edges are ‘Directed’, and this column tells Gephi to regard all edges as such.
Syr_date, this contains the date of publication of the citing paper.
Tyr_date, this contains the date of publication of the cited paper.
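Outside Gephi, the two files can be read and joined with ordinary CSV tooling; a minimal sketch, using miniature in-memory stand-ins for the CSVs (the column subset and values are illustrative):

```python
import csv
import io

# Hypothetical miniature versions of the node-attribute-list and edge-list files.
node_csv = "Id,Label,Pub_year\n1,SMITH 2010,2010\n2,JONES 2005,2005\n"
edge_csv = "Source,Target,Type\n1,2,Directed\n"

# Index node attributes by Id, then resolve each edge to citing/cited labels.
nodes = {row["Id"]: row for row in csv.DictReader(io.StringIO(node_csv))}
citations = [(nodes[r["Source"]]["Label"], nodes[r["Target"]]["Label"])
             for r in csv.DictReader(io.StringIO(edge_csv))]
```

With the real files, the `io.StringIO` stand-ins would be replaced by `open('OTSOC-node-attribute-list.csv')` and `open('OTSOC-edge-list.csv')`.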
Software recommended for analysis
Gephi version 0.9.2 was used for the visualisations within the manuscript, and both files can be read into Gephi without modification.
Notes
[1] Leng, G., Leng, R. I., Ludwig, M. (Submitted). Oxytocin – a social peptide? Deconstructing the evidence.
[2] Edinburgh University’s subscription to Web of Science covers the following databases: (i) Science Citation Index Expanded, 1900-present; (ii) Social Sciences Citation Index, 1900-present; (iii) Arts & Humanities Citation Index, 1975-present; (iv) Conference Proceedings Citation Index- Science, 1990-present; (v) Conference Proceedings Citation Index- Social Science & Humanities, 1990-present; (vi) Book Citation Index– Science, 2005-present; (vii) Book Citation Index– Social Sciences & Humanities, 2005-present; (viii) Emerging Sources Citation Index, 2015-present.
[3] For those interested, the following PMIDs were identified as ‘articles’ by WoS, but as ‘reviews’ by PubMed: ‘34502097’ ‘33400920’ ‘32060678’ ‘31925983’ ‘31734142’ ‘30496762’ ‘30253045’ ‘29660735’ ‘29518698’ ‘29065361’ ‘29048602’ ‘28867943’ ‘28586471’ ‘28301323’ ‘27974283’ ‘27626613’ ‘27603523’ ‘27603327’ ‘27513442’ ‘27273834’ ‘27071789’ ‘26940141’ ‘26932552’ ‘26895254’ ‘26869847’ ‘26788924’ ‘26581735’ ‘26548910’ ‘26317636’ ‘26121678’ ‘26094200’ ‘25997760’ ‘25631363’ ‘25526824’ ‘25446893’ ‘25153535’ ‘25092245’ ‘25086828’ ‘24946432’ ‘24637261’ ‘24588761’ ‘24508579’ ‘24486356’ ‘24462936’ ‘24239932’ ‘24239931’ ‘24231551’ ‘24216134’ ‘23955310’ ‘23856187’ ‘23686025’ ‘23589638’ ‘23575742’ ‘23469841’ ‘23055480’ ‘22981649’ ‘22406388’ ‘22373652’ ‘22141469’ ‘21960250’ ‘21881219’ ‘21802859’ ‘21714746’ ‘21618004’ ‘21150165’ ‘20435805’ ‘20173685’ ‘19840865’ ‘19546570’ ‘19309413’ ‘15288368’ ‘12359512’ ‘9401603’ ‘9213136’ ‘7630585’
[4] Sci2 Team. (2009). Science of Science (Sci2) Tool. Indiana University and SciTech Strategies. Stable URL: https://sci2.cns.iu.edu
[5] Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: An open source software for exploring and manipulating networks. Proceedings of the Third International AAAI Conference on Weblogs and Social Media.
By Noah Rippner
This dataset provides an in-depth look at the data elements for the US College Scorecard Graduation and Opportunity Project use case. It contains information on the variables used to create a comprehensive report, including Year, dev-category, developer-friendly name, VARIABLE NAME, API data type, label, VALUE, LABEL, SCORECARD? Y/N, SOURCE, and NOTES. The data is provided by the U.S. Department of Education and allows parents, students, and policymakers to take meaningful action to improve outcomes. The dataset contains more than enough information to allow someone like Maria, a 25-year-old recent US Army veteran who wants a degree in Management Systems and Information Technology, to distinguish between her school options, access services, and find affordable housing near high-quality schools in safe neighborhoods with access to transport links and nearby employment opportunities. This detailed coverage lets users make an informed decision about which school is best for them.
This dataset contains data related to college students, including their college graduation rates, access-to-opportunity indicators such as geographic mobility and career readiness, and other important indicators of the overall learning experience in the United States. This guide will show you how to use this dataset to draw meaningful conclusions about higher education in America.
First, you will need to be familiar with the different fields included in this College Scorecard US College Graduation and Opportunity dataset. Each record comprises several data elements, defined by concise labels: Name of Data Element, Year, dev-category (i.e., developmental category), Variable Name, API data type (i.e., type information for the programmatic interface), Label (i.e., descriptive content labeling for visual reporting), Value, and LABEL (i.e., descriptive value labeling for visual reporting). SCORECARD? Y/N indicates whether or not a field pertains to the U.S. Department of Education’s College Scorecard program, SOURCE indicates where the source of the variable can be found, and the NOTES column beneath each row entry holds other minor details for further analysis or comparison across observations.
Now that you understand the components of each record, here are some key steps you can take when working with this dataset:
- Utilize year-specific filters on specified fields if needed, e.g., Year = 2020 & API Data Type = Character.
- Look up any ‘NCalPlaceHolder’ values if applicable; these are placeholders indicating values omitted from the Scorecard display due to conflicting formatting requirements, or values not yet updated, so re-check via the API for the latest returned results.
- Pivot data points into more digestible tabular outputs, distilling the complex raw delimited-text exports into medium-level datasets suitable for tools such as Power BI or Tableau.
- Explore correlations between education metrics and outcomes such as return on investment and growth potential, looking beyond campus recognition metrics.
- Creating an interactive dashboard to compare school performance in terms of safety, entrepreneurship, and other criteria.
- Using the data to create a heat map visualization that shows which cities are most conducive to a successful educational experience for students like Maria.
- Gathering information about average course costs at different universities and mapping them against US unemployment rates to indicate which states might offer the best value for money in higher education.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
A. SUMMARY This table contains all victims (parties who are injured) involved in a traffic crash resulting in an injury in the City of San Francisco. Fatality year-to-date crash data is obtained from the Office of the Chief Medical Examiner (OME) death records, and only includes those cases that meet the San Francisco Vision Zero Fatality Protocol maintained by the San Francisco Department of Public Health (SFDPH), San Francisco Police Department (SFPD), and San Francisco Municipal Transportation Agency (SFMTA). Injury crash data is obtained from SFPD’s Interim Collision System for 2018 to YTD, Crossroads Software Traffic Collision Database (CR) for years 2013-2017 and the Statewide Integrated Transportation Record System (SWITRS) maintained by the California Highway Patrol for all years prior to 2013. Only crashes with valid geographic information are mapped. All geocodable crash data is represented on the simplified San Francisco street centerline model maintained by the Department of Public Works (SFDPW). Collision injury data is queried and aggregated on a quarterly basis. Crashes occurring at complex intersections with multiple roadways are mapped onto a single point and injury and fatality crashes occurring on highways are excluded.
The crash, party, and victim tables have a relational structure. The traffic crashes table contains information on each crash, one record per crash. The party table contains information from all parties involved in the crashes, one record per party. Parties are individuals involved in a traffic crash including drivers, pedestrians, bicyclists, and parked vehicles. The victim table contains information about each party injured in the collision, including any passengers. Injury severity is included in the victim table.
For example, a crash occurs (1 record in the crash table) that involves a driver party and a pedestrian party (2 records in the party table). Only the pedestrian is injured and thus is the only victim (1 record in the victim table).
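The relational structure in this example can be sketched as a join across the three tables (toy records; the real tables carry many more fields):

```python
# Toy rows mirroring the example above: one crash, two parties, one injured victim.
crashes = [{"crash_id": "C1"}]
parties = [
    {"party_id": "P1", "crash_id": "C1", "type": "driver"},
    {"party_id": "P2", "crash_id": "C1", "type": "pedestrian"},
]
victims = [{"victim_id": "V1", "party_id": "P2", "injury": "severe"}]

# Join each victim to its party, and through the party to the crash.
party_by_id = {p["party_id"]: p for p in parties}
joined = [(party_by_id[v["party_id"]]["crash_id"],
           party_by_id[v["party_id"]]["type"],
           v["injury"]) for v in victims]
```

The join direction matters: every victim row has a party, but a party (such as the uninjured driver) need not have a victim row.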
B. HOW THE DATASET IS CREATED Traffic crash injury data is collected from the California Highway Patrol 555 Crash Report as submitted by the police officer within 30 days after the crash occurred. All fields that match the SWITRS data schema are programmatically extracted, de-identified, geocoded, and loaded into TransBASE. See Section D below for details regarding TransBASE.
C. UPDATE PROCESS After review by SFPD and SFDPH staff, the data is made publicly available approximately a month after the end of the previous quarter (May for Q1, August for Q2, November for Q3, and February for Q4).
D. HOW TO USE THIS DATASET This data is being provided as public information as defined under San Francisco and California public records laws. SFDPH, SFMTA, and SFPD cannot limit or restrict the use of this data or its interpretation by other parties in any way. Where the data is communicated, distributed, reproduced, mapped, or used in any other way, the user should acknowledge TransBASE.sfgov.org as the source of the data, provide a reference to the original data source where also applicable, include the date the data was pulled, and note any caveats specified in the associated metadata documentation provided. However, users should not attribute their analysis or interpretation of this data to the City of San Francisco. While the data has been collected and/or produced for the use of the City of San Francisco, it cannot guarantee its accuracy or completeness. Accordingly, the City of San Francisco, including SFDPH, SFMTA, and SFPD make no representation as to the accuracy of the information or its suitability for any purpose and disclaim any liability for omissions or errors that may be contained therein. As all data is associated with methodological assumptions and limitations, the City recommends that users review methodological documentation associated with the data prior to its analysis, interpretation, or communication.
This dataset can also be queried on the TransBASE Dashboard. TransBASE is a geospatially enabled database maintained by SFDPH that currently includes over 200 spatially referenced variables from multiple agencies and across a range of geographic scales, including infrastructure, transportation, zoning, sociodemographic, and collision data, all linked to an intersection or street segment. TransBASE facilitates a data-driven approach to understanding and addressing transportation-related health issues, informed by a large and growing evidence base regarding the importance of transportation system design and land use decisions for health. TransBASE’s purpose is to inform public and private efforts to improve transportation system safety, sustainability, community health and equity in San Francisco.
E. RELATED DATASETS
- Traffic Crashes Resulting in Injury
- Traffic Crashes Resulting in Injury: Parties Involved
- TransBASE Dashboard
- iSWITRS
- TIMS
https://spdx.org/licenses/CC0-1.0.html
Climate warming changes the phenology of many species. When interacting organisms respond differently, climate change may disrupt their interactions and affect the stability of ecosystems. Here, we used GBIF occurrence records to examine phenology trends in plants and their associated insect pollinators in Germany since the 1980s. We found strong phenological advances in plants, but differences in the extent of shifts among pollinator groups. The temporal trends in plant and insect phenologies were generally associated with interannual temperature variation, and thus likely driven by climate change. When examining the synchrony of species-level plant-pollinator interactions, their temporal trends differed among pollinator groups. Overall, plant-pollinator interactions became more synchronized, mainly because the phenology of plants, which historically lagged behind that of the pollinators, responded more strongly to climate change. However, if the observed trends continue, many interactions may become more asynchronous again in the future. Our study suggests that climate change affects the phenologies of both plants and insects, and that it also influences the synchrony of plant-pollinator interactions.
Methods
The given datasets contain estimated species phenology shifts over time and with temperature. The way the slopes were estimated is described below. We worked with occurrence records of plants and insects available from the GBIF database (GBIF.org 2021c, 2021a, 2021b, 2021d). For the plants, we restricted ourselves to species covered by the BiolFlor database of plant traits (Klotz et al. 2002), because we originally intended to classify plants by their level of pollinator dependence – an idea we later abandoned. For the insects, we restricted ourselves to beetles (Coleoptera), flies (Diptera), bees (Hymenoptera) as well as butterflies and moths (Lepidoptera), as these groups contain most insect pollinators (Kevan and Baker 1983). We used the R package rgbif (Chamberlain and Boettiger 2017) to download all available records of the above taxa from GBIF. Our basic criteria for including records were that they originated from Germany, and that their basis of record (as defined in GBIF) was either a living specimen (e.g., a captured insect), a human observation (i.e., an observation of a species made without collecting it), just an observation (i.e., when the exact type of observation was not clear), or a preserved specimen (e.g., an herbarium record or a collected specimen). If names of plant species were not accepted names, we used the R package taxize (Chamberlain and Szöcs 2013) to check the names against the GBIF backbone taxonomy and determine the actual accepted name. Prior to the data analyses, we subjected the data to several steps of quality control. First, we removed all records from before 1980, as these turned out to be too inconsistent, with few records per year and large gaps due to consecutive years without records. We also removed the records from 2021, as the year was not yet complete at the time of our analysis.
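The rgbif download described above can be mirrored against GBIF's public occurrence REST API. The sketch below (in Python rather than the R used in the study) only builds the query URL; `country`, `basisOfRecord`, and `limit` are documented GBIF search parameters, and the taxon filter is omitted for brevity.

```python
from urllib.parse import urlencode

# Equivalent of the study's rgbif query, expressed as a GBIF REST call.
# This only constructs the URL; fetching and paging are left out.
GBIF_OCCURRENCE_API = "https://api.gbif.org/v1/occurrence/search"

def build_gbif_query(country="DE", basis_of_record="HUMAN_OBSERVATION",
                     limit=300):
    """Occurrence-search URL restricted to one country and one basis
    of record, mirroring the study's inclusion criteria."""
    params = {"country": country,
              "basisOfRecord": basis_of_record,
              "limit": limit}
    return GBIF_OCCURRENCE_API + "?" + urlencode(params)

url = build_gbif_query()
```

The same endpoint accepts the other basis-of-record values named in the text (PRESERVED_SPECIMEN, LIVING_SPECIMEN, OBSERVATION), one query per value.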
Second, we removed all records from the days of year (DOY) 1, 95, 121, 163, 164, 166 and 181, and in particular DOY 365 and 366 from the National Museum of Natural History in Luxembourg, because the high numbers of records on these days indicated that either records without a collecting date had been assigned these dates by default, or the dates corresponded to BioBlitz events, where very large numbers of records are taken on a specific day of the year. Including these data would have strongly biased the intra-annual distributions of our records. Finally, we removed some records for which no elevation or temperature data could be obtained (see below). To ensure reasonable coverage of the studied time interval, we then restricted the records to species which had at least 10 occurrence records in every decade covered (with the year 2020 included in the last decade). After these data curation steps, we retained just over 12 million occurrence records covering altogether 1,764 species: 11.4 million records of 1,438 plant species, around 590,000 records of 207 species of butterflies and moths, some 76,000 records of 20 bee species, 30,000 records of 22 fly species, and almost 25,000 records of 77 species of beetles (Table 1). There were large differences between plants and insects not only in the numbers of records but also in their temporal distribution across the studied period (Figure S 1). While plants had relatively even record numbers across years, the insect groups, in particular flies and bees, were strongly underrepresented in the earlier decades, and record numbers increased rapidly in the last 20 years, probably due to the advent of platforms like iNaturalist.org and naturgucker.de, which allow recording of species occurrences by citizen naturalists and which made up most of the insect occurrence data for Germany in GBIF.
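As a rough illustration, the quality-control steps above can be sketched as follows. The record fields (`species`, `year`, `doy`) are toy assumptions for the example, not GBIF's schema, and the decade binning follows the text (2020 folded into the last decade).

```python
from collections import defaultdict

# Days-of-year flagged in the text as default-assigned or BioBlitz dates.
FLAGGED_DOYS = {1, 95, 121, 163, 164, 166, 181, 365, 366}

def decade_index(year):
    # 1980s -> 0, 1990s -> 1, 2000s -> 2, 2010s -> 3 (2020 folded in)
    return min((year - 1980) // 10, 3)

def quality_filter(records, min_per_decade=10):
    # 1) restrict to 1980-2020 (2021 was incomplete at analysis time)
    recs = [r for r in records if 1980 <= r["year"] <= 2020]
    # 2) drop records falling on flagged days of year
    recs = [r for r in recs if r["doy"] not in FLAGGED_DOYS]
    # 3) keep species with >= min_per_decade records in every decade
    counts = defaultdict(int)
    for r in recs:
        counts[(r["species"], decade_index(r["year"]))] += 1
    kept = {s for s in {r["species"] for r in recs}
            if all(counts[(s, d)] >= min_per_decade for d in range(4))}
    return [r for r in recs if r["species"] in kept]
```

A species with ten records in each of the four decades survives the filter; one recorded only in the 2000s, however numerous, does not.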
Temperature and elevation data, and individual interactions
Besides the main phenology data from GBIF records, we obtained several other data sets required for our analyses. To test for associations with climate, we used temperature data from the Climatic Research Unit (CRU, https://crudata.uea.ac.uk) at the University of East Anglia, specifically the Time-Series dataset version 4.05 (University of East Anglia Climatic Research Unit et al. 2021), which contains gridded temperature data from 1901-2020 at a resolution of 0.5° longitude by 0.5° latitude. From this dataset we extracted the monthly mean temperatures and averaged them to obtain the annual mean temperatures at the sites of occurrence records. To be able to control for elevation at the locations of occurrence records, we used elevation data at a 90 m resolution from the NASA Shuttle Radar Topography Mission (SRTM), obtained from the SRTM 90m DEM Digital Elevation Database (Jarvis et al. 2021) and accessed through the raster package (Hijmans 2020) in R.
Data analysis
All data wrangling and analysis was done in R (R Core Team 2008). Before analysing phenology data, we examined patterns of climate change in Germany through a linear model that regressed the annual mean temperature values at the collection sites (= the corresponding 0.5° × 0.5° grid cells) over time. To understand phenology changes in plants and insects, we first estimated the phenological shifts in each taxonomic group (i.e., plants, beetles, flies, bees, and butterflies/moths) as the slopes of linear regressions linking activity, i.e., the DOY of a record, to its year of observation. We estimated taxonomic group-specific phenological shifts in two linear mixed-effect models: one that estimated shifts over time and one that related phenology variation to temperature. Both models included the DOY of a record as the response variable, and the latitude, longitude, and elevation of a record as fixed effects.
The temporal-change model additionally included the year of a record as a fixed effect and as a random effect across species; the temperature-change model instead included the annual mean temperature at the site of a record as a fixed effect and as a random effect across species. We used the lme4 package (Bates et al. 2015) in R to fit these models, and assessed model fits by visually inspecting the relationships between residuals and fitted values, and between residuals and covariates (Supplementary diagnostic plots). As the random effects from the models agreed well enough with more complex generalized additive models, we considered our linear model reasonably robust.
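As a simplified stand-in for the lme4 random-slope models (which additionally adjust for latitude, longitude, and elevation), the core idea — a per-species slope of DOY on year, summarized at the group level — can be sketched as:

```python
from statistics import mean

# Toy sketch only: the study fitted random-slope mixed models in lme4.
# Here a species' phenological shift is just the OLS slope of
# day-of-year (DOY) on year, and the group shift is the mean slope,
# with no spatial covariates.
def ols_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    xbar, ybar = mean(xs), mean(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return num / sum((x - xbar) ** 2 for x in xs)

def group_shift(obs_by_species):
    """obs_by_species: {species: [(year, doy), ...]} -> mean days/year."""
    return mean(ols_slope([y for y, _ in obs], [d for _, d in obs])
                for obs in obs_by_species.values())
```

A negative group shift means the taxonomic group is, on average, active earlier in the year as years go by, matching the phenological advances reported above.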
Literature
Bates, Douglas; Mächler, Martin; Bolker, Ben; Walker, Steve (2015): Fitting Linear Mixed-Effects Models Using lme4. In J. Stat. Soft. 67 (1). DOI: 10.18637/jss.v067.i01.
Chamberlain, Scott A.; Boettiger, Carl (2017): R, Python, and Ruby clients for GBIF species occurrence data. DOI: 10.7287/peerj.preprints.3304v1.
Chamberlain, Scott A.; Szöcs, Eduard (2013): taxize: taxonomic search and retrieval in R. In F1000Research 2, p. 191. DOI: 10.12688/f1000research.2-191.v2.
GBIF.org (2021a): Occurrence Download. DOI: 10.15468/dl.z9wz76.
GBIF.org (2021b): Occurrence Download. DOI: 10.15468/dl.3anj3a.
GBIF.org (2021c): Occurrence Download. DOI: 10.15468/dl.t2x9cj.
GBIF.org (2021d): Occurrence Download. DOI: 10.15468/dl.hb2fgs.
Hijmans, Robert J. (2020): raster: Geographic Data Analysis and Modeling. Available online at https://CRAN.R-project.org/package=raster.
Jarvis, Andy; Reuter, Hannes I.; Nelson, Andy; Guevara, E. (2021): Hole-filled seamless SRTM data V4. International Centre for Tropical Agriculture (CIAT). Available online at https://srtm.csi.cgiar.org, updated on 9/7/2021.
Kevan, P. G.; Baker, H. G. (1983): Insects as Flower Visitors and Pollinators. In Annu. Rev. Entomol. 28 (1), pp. 407–453. DOI: 10.1146/annurev.en.28.010183.002203.
Klotz, Stefan; Kühn, Ingolf; Durka, Walter (2002): BIOLFLOR – Eine Datenbank zu Biologisch-Ökologischen Merkmalen der Gefäßpflanzen in Deutschland. In Schriftenreihe für Vegetationskunde, vol. 38, pp. 1–333.
R Core Team (2008): R: A language and environment for statistical computing. Vienna, Austria. Available online at https://www.r-project.org/.
University of East Anglia Climatic Research Unit; Harris, I. C.; Jones, P. D.; Osborn, T. (2021): CRU TS4.05. Climatic Research Unit (CRU) Time-Series (TS) version 4.05 of high-resolution gridded data of month-by-month variation in climate (Jan. 1901- Dec. 2020). NERC EDS Centre for Environmental Data Analysis. Available online at https://catalogue.ceda.ac.uk/uuid/c26a65020a5e4b80b20018f148556681, checked on 9/1/2021.
A. SUMMARY
This table contains all fatalities resulting from a traffic crash in the City of San Francisco. Fatality year-to-date crash data is obtained from the Office of the Chief Medical Examiner (OME) death records, and only includes those cases that meet the San Francisco Vision Zero Fatality Protocol maintained by the San Francisco Department of Public Health (SFDPH), San Francisco Police Department (SFPD), and San Francisco Municipal Transportation Agency (SFMTA). Injury crash data is obtained from SFPD's Interim Collision System for 2018 to year-to-date, the Crossroads Software Traffic Collision Database (CR) for years 2013-2017, and the Statewide Integrated Transportation Record System (SWITRS) maintained by the California Highway Patrol for all years prior to 2013. Only crashes with valid geographic information are mapped. All geocodable crash data is represented on the simplified San Francisco street centerline model maintained by the Department of Public Works (SFDPW). Collision injury data is queried and aggregated on a quarterly basis. Crashes occurring at complex intersections with multiple roadways are mapped onto a single point, and injury and fatality crashes occurring on highways are excluded. The fatality table contains information about each party injured or killed in the collision, including any passengers.
B. HOW THE DATASET IS CREATED
Traffic crash injury data is collected from the California Highway Patrol 555 Crash Report as submitted by the police officer within 30 days after the crash occurred. All fields that match the SWITRS data schema are programmatically extracted, de-identified, geocoded, and loaded into TransBASE. See Section D below for details regarding TransBASE. This table is filtered for fatal traffic crashes.
C. UPDATE PROCESS
After review by SFPD and SFDPH staff, the data is made publicly available approximately a month after the end of the previous quarter (May for Q1, August for Q2, November for Q3, and February for Q4).
D.
HOW TO USE THIS DATASET
This data is being provided as public information as defined under San Francisco and California public records laws. SFDPH, SFMTA, and SFPD cannot limit or restrict the use of this data or its interpretation by other parties in any way. Where the data is communicated, distributed, reproduced, mapped, or used in any other way, the user should acknowledge the Vision Zero initiative and the TransBASE database as the source of the data, provide a reference to the original data source where applicable, include the date the data was pulled, and note any caveats specified in the associated metadata documentation. However, users should not attribute their analysis or interpretation of this data to the City of San Francisco. While the data has been collected and/or produced for the use of the City of San Francisco, the City cannot guarantee its accuracy or completeness. Accordingly, the City of San Francisco, including SFDPH, SFMTA, and SFPD, makes no representation as to the accuracy of the information or its suitability for any purpose and disclaims any liability for omissions or errors that may be contained therein. As all data is associated with methodological assumptions and limitations, the City recommends that users review the methodological documentation associated with the data prior to its analysis, interpretation, or communication. TransBASE is a geospatially enabled database maintained by SFDPH that currently includes over 200 spatially referenced variables from multiple agencies and across a range of geographic scales, including infrastructure, transportation, zoning, sociodemographic, and collision data, all linked to an intersection or street segment. TransBASE facilitates a data-driven approach to understanding and addressing transportation-related health issues, informed by a large and growing evidence base regarding the importance of transportation system design and land use decisions for health.
The global market research industry reached a record high market size of approximately ** billion U.S. dollars in 2023. Over the last decade, the global market research industry has performed contrary to broader economic trends as the industry has continued to grow. Figures for 2023 signaled an increase of about *** billion U.S. dollars compared to the previous year.
Market research industry
Market research is the activity of gathering information about the markets in which an organization sells its products and/or services. This often includes detailed qualitative understanding of consumer attitudes and preferences through tools such as interviews, surveys, and, increasingly, big-data analytics. The leading market research company worldwide in 2022 was U.S.-based Gartner.
Slow growth in Europe
While growth in the United States has been significant, the revenue of the market research industry in Europe has grown only slightly since 2014. Some analysts expect this poor performance to continue into the near future for *** reasons. First is the short- and mid-term uncertainty created by Brexit, impacting the reliability of any market research conducted prior to the issue being resolved. Second is the implementation of the EU General Data Protection Regulation (GDPR) in May 2018, which limits what companies are able to do with personal data. A majority of IT professionals in France, Germany, and the UK agree the GDPR will prevent personal data from being passed on to third parties, reducing the amount of data available to researchers in Europe compared to other regions.
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean), and Andrea Miletič (University of Zagreb). It is being made public both to act as supplementary data for the "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (the pre-print is available in open access at https://arxiv.org/abs/2305.10234) and so that other researchers may use these data in their own work.
The protocol is intended for the systematic literature review on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.
Methodology
To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).
These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), applied to the article title, keywords, and abstract to limit the results to papers in which these objects were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles were found and further checked for relevance. As a result, a total of 9 articles were examined in depth. Each study was independently examined by at least two authors.
To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.
Test procedure
Each study was independently examined by at least two authors: after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the protocol is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by a third researcher.
Description of the data in this data set
Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for the relevant studies. Spreadsheet #2 provides the list of results of the search over the three indexing databases, i.e., before filtering out irrelevant studies.
The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information
Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
2) Complete reference - the complete source information to refer to the study
3) Year of publication - the year in which the study was published
4) Journal article / conference paper / book chapter - the type of the paper -{journal article, conference paper, book chapter}
5) DOI / Website- a link to the website where the study can be found
6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
7) Availability in OA - availability of an article in the Open Access
8) Keywords - keywords of the paper as indicated by the authors
9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}
Approach- and research design-related information
10) Objective / RQ - the research objective / aim, established research questions
11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR etc.)
12) Contributions - the contributions of the study
13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach?
14) Availability of the underlying research data - whether there is a reference to the publicly available underlying research data, e.g., transcriptions of interviews, collected data, or an explanation why these data are not shared?
15) Period under investigation - period (or moment) in which the study was conducted
16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?
Quality- and relevance- related information
17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)?
18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, secondary - mentioned but not studied (e.g., as part of discussion, future work etc.))
HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
23) Data - what data do HVD cover?
24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)
Format of the files: .xls, .csv (for the first spreadsheet only), .odt, .docx
Licenses or restrictions CC-BY
For more info, see README.txt
The Ontario government generates and maintains thousands of datasets. Since 2012, we have shared data with Ontarians via a data catalogue. Open data is data that is shared with the public. Learn more about open data and why Ontario releases it. Ontario's Open Data Directive states that all data must be open, unless there is good reason for it to remain confidential. Ontario's Chief Digital and Data Officer also has the authority to make certain datasets available publicly. Datasets listed in the catalogue that are not open will have one of the following labels: If you want to use data you find in the catalogue, that data must have a licence – a set of rules that describes how you can use it. A licence: Most of the data available in the catalogue is released under Ontario's Open Government Licence. However, each dataset may be shared with the public under other kinds of licences or no licence at all. If a dataset doesn't have a licence, you don't have the right to use the data. If you have questions about how you can use a specific dataset, please contact us. The Ontario Data Catalogue endeavours to publish open data in a machine-readable format. For machine-readable datasets, you can simply retrieve the file you need using the file URL. The Ontario Data Catalogue is built on CKAN, which means the catalogue has the following features you can use when building applications. APIs (application programming interfaces) let software applications communicate directly with each other. If you are using the catalogue in a software application, you might want to extract data from the catalogue through the catalogue API. Note: All Datastore API requests to the Ontario Data Catalogue must be made server-side. The catalogue's collection of dataset metadata (and dataset files) is searchable through the CKAN API. The Ontario Data Catalogue has more than just CKAN's documented search fields. You can also search these custom fields.
You can also use the CKAN API to retrieve metadata about a particular dataset and check for updated files. Read the complete documentation for CKAN's API. Some of the open data in the Ontario Data Catalogue is available through the Datastore API. You can also search and access the machine-readable open data that is available in the catalogue. How to use the API feature: Read the complete documentation for CKAN's Datastore API. The Ontario Data Catalogue contains a record for each dataset that the Government of Ontario possesses. Some of these datasets will be available to you as open data. Others will not be available to you. This is because the Government of Ontario is unable to share data that would break the law or put someone's safety at risk. You can search for a dataset with a word that might describe a dataset or topic. Use words like “taxes” or “hospital locations” to discover what datasets the catalogue contains. You can search for a dataset from 3 spots on the catalogue: the homepage, the dataset search page, or the menu bar available across the catalogue. On the dataset search page, you can also filter your search results. You can select filters on the left hand side of the page to limit your search for datasets with your favourite file format, datasets that are updated weekly, datasets released by a particular organization, or datasets that are released under a specific licence. Go to the dataset search page to see the filters that are available to make your search easier. You can also do a quick search by selecting one of the catalogue’s categories on the homepage. These categories can help you see the types of data we have on key topic areas. When you find the dataset you are looking for, click on it to go to the dataset record. Each dataset record will tell you whether the data is available, and, if so, tell you about the data available. An open dataset might contain several data files. 
These files might represent different periods of time, different sub-sets of the dataset, different regions, language translations, or other breakdowns. You can select a file and either download it or preview it. Read the licence agreement to make sure you have permission to use it the way you want. Read more about previewing data. A non-open dataset may not be available for many reasons. Read more about non-open data. Read more about restricted data. Data that is non-open may still be subject to freedom of information requests. The catalogue has tools that enable all users to visualize the data in the catalogue without leaving the catalogue – no additional software needed. Have a look at our walk-through of how to make a chart in the catalogue. Get automatic notifications when datasets are updated. You can choose to get notifications for individual datasets, an organization's datasets or the full catalogue. You don't have to provide any personal information – just subscribe to our feeds using any feed reader you like using the corresponding notification web addresses. Copy those addresses and paste them into your reader. Your feed reader will let you know when the catalogue has been updated. The catalogue provides open data in several file formats (e.g., spreadsheets, geospatial data, etc). Learn about each format and how you can access and use the data each file contains. A file that has a list of items and values separated by commas without formatting (e.g. colours, italics, etc.) or extra visual features. This format provides just the data that you would display in a table. XLSX (Excel) files may be converted to CSV so they can be opened in a text editor. How to access the data: Open with any spreadsheet software application (e.g., Open Office Calc, Microsoft Excel) or text editor. Note: This format is considered machine-readable: it can be easily processed and used by a computer. Files that have visual formatting (e.g.
bolded headers and colour-coded rows) can be hard for machines to understand; these elements make a file more human-readable and less machine-readable. A file that provides information without formatted text or extra visual features that may not follow a pattern of separated values like a CSV. How to access the data: Open with any word processor or text editor available on your device (e.g., Microsoft Word, Notepad). A spreadsheet file that may also include charts, graphs, and formatting. How to access the data: Open with a spreadsheet software application that supports this format (e.g., Open Office Calc, Microsoft Excel). Data can be converted to a CSV for a non-proprietary format of the same data without formatted text or extra visual features. A shapefile provides geographic information that can be used to create a map or perform geospatial analysis based on location, points/lines and other data about the shape and features of the area. It includes required files (.shp, .shx, .dbf) and might include corresponding files (e.g., .prj). How to access the data: Open with a geographic information system (GIS) software program (e.g., QGIS). A package of files and folders. The package can contain any number of different file types. How to access the data: Open with an unzipping software application (e.g., WinZIP, 7Zip). Note: If a ZIP file contains .shp, .shx, and .dbf file types, it is an ArcGIS ZIP: a package of shapefiles which provide information to create maps or perform geospatial analysis that can be opened with ArcGIS (a geographic information system software program). A file that provides information related to a geographic area (e.g., phone number, address, average rainfall, number of owl sightings in 2011 etc.) and its geospatial location (i.e., points/lines). How to access the data: Open using a GIS software application to create a map or do geospatial analysis. It can also be opened with a text editor to view raw information.
Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand. A text-based format for sharing data in a machine-readable way that can store data with more unconventional structures such as complex lists. How to access the data: Open with any text editor (e.g., Notepad) or access through a browser. Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand. A text-based format to store and organize data in a machine-readable way that can store data with more unconventional structures (not just data organized in tables). How to access the data: Open with any text editor (e.g., Notepad). Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand. A file that provides information related to an area (e.g., phone number, address, average rainfall, number of owl sightings in 2011 etc.) and its geospatial location (i.e., points/lines). How to access the data: Open with a geospatial software application that supports the KML format (e.g., Google Earth). Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand. This format contains files with data from tables used for statistical analysis and data visualization of Statistics Canada census data. How to access the data: Open with the Beyond 20/20 application. A database which links and combines data from different files or applications (including HTML, XML, Excel, etc.). The database file can be converted to a CSV/TXT to make the data machine-readable, but human-readable formatting will be lost. 
How to access the data: Open with Microsoft Office Access (a database management system used to develop application software). A file that keeps the original layout and
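The CKAN features described above can be reached through CKAN's standard action API. A minimal sketch, assuming the Ontario catalogue's CKAN instance at data.ontario.ca (the `package_search` and `datastore_search` actions are part of CKAN itself); only the request URLs are built here:

```python
from urllib.parse import urlencode

# Base action-API path assumed for the Ontario Data Catalogue's CKAN.
BASE = "https://data.ontario.ca/api/3/action"

def search_datasets(query, rows=10):
    """URL for a full-text search over dataset (package) metadata."""
    return f"{BASE}/package_search?{urlencode({'q': query, 'rows': rows})}"

def datastore_records(resource_id, limit=100):
    """URL for rows of one Datastore-enabled resource."""
    qs = urlencode({"resource_id": resource_id, "limit": limit})
    return f"{BASE}/datastore_search?{qs}"
```

Fetching either URL with any HTTP client (server-side, per the note above, for Datastore requests) returns CKAN's standard JSON envelope, whose `result` field holds the matching datasets or rows.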
List of the data tables published as part of the Home Office's Immigration system statistics release. Summary and detailed data tables covering the immigration system, including out-of-country and in-country visas, asylum, detention, and returns.
If you have any feedback, please email MigrationStatsEnquiries@homeoffice.gov.uk.
The Microsoft Excel .xlsx files may not be suitable for users of assistive technology.
If you use assistive technology (such as a screen reader) and need a version of these documents in a more accessible format, please email MigrationStatsEnquiries@homeoffice.gov.uk
Please tell us what format you need. It will help us if you say what assistive technology you use.
Immigration system statistics, year ending September 2025
Immigration system statistics quarterly release
Immigration system statistics user guide
Publishing detailed data tables in migration statistics
Policy and legislative changes affecting migration to the UK: timeline
Immigration statistics data archives
Passenger arrivals summary tables, year ending September 2025 (ODS, 31.5 KB): https://assets.publishing.service.gov.uk/media/691afc82e39a085bda43edd8/passenger-arrivals-summary-sep-2025-tables.ods
‘Passengers refused entry at the border summary tables’ and ‘Passengers refused entry at the border detailed datasets’ have been discontinued. The latest published versions of these tables are from February 2025 and are available in the ‘Passenger refusals – release discontinued’ section. A similar data series, ‘Refused entry at port and subsequently departed’, is available within the Returns detailed and summary tables.
Electronic travel authorisation detailed datasets, year ending September 2025 (MS Excel Spreadsheet, 58.6 KB): https://assets.publishing.service.gov.uk/media/691b03595a253e2c40d705b9/electronic-travel-authorisation-datasets-sep-2025.xlsx
ETA_D01: Applications for electronic travel authorisations, by nationality
ETA_D02: Outcomes of applications for electronic travel authorisations, by nationality
Entry clearance visas summary tables, year ending September 2025 (ODS, 53.3 KB): https://assets.publishing.service.gov.uk/media/6924812a367485ea116a56bd/visas-summary-sep-2025-tables.ods
Entry clearance visa applications and outcomes detailed datasets, year ending September 2025 (MS Excel Spreadsheet, 30.2 MB): https://assets.publishing.service.gov.uk/media/691aebbf5a253e2c40d70598/entry-clearance-visa-outcomes-datasets-sep-2025.xlsx
Vis_D01: Entry clearance visa applications, by nationality and visa type
Vis_D02: Outcomes of entry clearance visa applications, by nationality, visa type, and outcome
Additional data relating to in-country and overseas
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
The UK hourly solar radiation data contain the amount of solar irradiance received during the hour ending at the specified time. All sites report 'global' radiation amounts. This is also known as 'total sky radiation', as it includes both direct solar irradiance and 'diffuse' irradiance resulting from light scattering. Some sites also provide separate diffuse and direct irradiation amounts, depending on the instrumentation at the site. At these sites the sun's path is tracked with two instruments: a pyranometer whose view of the sun is blocked by a shading disc, so that only scattered sunlight is measured (the diffuse measurement), and a pyrheliometer with a tube pointed at the sun, which measures direct solar irradiance while excluding scattered light.
For details about the different measurements made, and the limited number of sites making them, please see the MIDAS Solar Irradiance table linked in the online resources section of this record.
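The 'global' amount described above is, in principle, the diffuse component plus the horizontal projection of the direct component. A minimal sketch of that relationship follows; all numeric values are purely illustrative, not MIDAS data:

```python
import math

def global_horizontal(direct_normal, diffuse_horizontal, solar_zenith_deg):
    """Global ('total sky') horizontal irradiance as the diffuse component
    plus the direct-normal component projected onto the horizontal plane.
    Irradiances in W/m^2; solar zenith angle in degrees."""
    return diffuse_horizontal + direct_normal * math.cos(math.radians(solar_zenith_deg))

# Illustrative mid-morning values: 800 W/m^2 direct-normal, 120 W/m^2 diffuse,
# sun 60 degrees from the vertical.
ghi = global_horizontal(direct_normal=800.0, diffuse_horizontal=120.0, solar_zenith_deg=60.0)
print(round(ghi, 1))  # 800*cos(60 deg) + 120 = 520.0
```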
This version supersedes the previous version of this dataset. A change log, detailing the differences between this version and the previous one, is available in the archive and in the linked documentation for this record. The change logs detail new, replaced and removed data, including the addition of data for calendar year 2023.
The data were collected by observation stations operated by the Met Office across the UK and transmitted within the following message types: SYNOP, HCM, AWSHRLY, MODLERAD, ESAWRADT and DRADR35. The data span from 1947 to 2023.
This dataset is part of the Midas-open dataset collection made available by the Met Office under the UK Open Government Licence, containing only UK mainland land surface observations owned or operated by the Met Office. It is a subset of the fuller, restricted Met Office Integrated Data Archive System (MIDAS) Land and Marine Surface Stations dataset, also available through the Centre for Environmental Data Analysis - see the related dataset section on this record.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GENERAL INFORMATION
SHARING/ACCESS INFORMATION
METHODOLOGICAL INFORMATION
DATA & FILE OVERVIEW
DATA-SPECIFIC INFORMATION FOR: 002-interviewtranscript_p02_cristiananascimento_v01.txt
DATA-SPECIFIC INFORMATION FOR: 001-interviewaudio_p01_cristiananascimento_v01.mp3
CITATION TOPICS (Web of Science)
Macro level: Social Sciences
Meso labels: 6.86 Human Geography; 6.153 Climate Change; 6.263 Agricultural Policy; 6.303 Sociology
Micro labels: 6.86.149 Gentrification; 6.153.558 Climate Change Adaptation; 6.153.742 Science Communication; 6.153.2227 Strategic Environmental Assessment; 6.263.898 Farmers; 6.263.1407 Urban Agriculture; 6.303.1915 Public Sociology; 6.303.2393 Social Policies
--END---
Introduction:
This dataset analysis aims to explore and analyze a Credit Score dataset to gain insights into customer creditworthiness and segmentation. The dataset contains information on various factors that influence credit scores, such as payment history, credit utilization ratio, number of credit accounts, education level, and employment status. The analysis will utilize the k-means algorithm to perform clustering and identify distinct groups of customers based on their credit scores.
The Credit Score dataset comprises a collection of records, each representing an individual's credit profile. The dataset contains the following features:
(1). Age: This feature represents the age of the individual.
(2). Gender: This feature captures the gender of the individual.
(3). Marital Status: This feature denotes the marital status of the individual.
(4). Education Level: This feature represents the highest level of education attained by the individual.
(5). Employment Status: This feature indicates the current employment status of the individual.
(6). Credit Utilization Ratio: This feature reflects the ratio of credit used by the individual compared to their total available credit limit.
(7). Payment History: It represents the monthly net payment behaviour of each customer, taking into account factors such as on-time payments, late payments, missed payments, and defaults.
(8). Number of Credit Accounts: It represents the count of active credit accounts the person holds.
(9). Loan Amount: It indicates the monetary value of the loan.
(10). Interest Rate: This feature represents the interest rate associated with the loan.
(11). Loan Term: This feature denotes the duration or term of the loan.
(12). Type of Loan: It includes categories like “Personal Loan,” “Auto Loan,” or potentially other types of loans.
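A minimal sketch of the k-means segmentation described in the introduction, using scikit-learn on a handful of the listed features. The records, the feature subset, and k = 3 are illustrative assumptions, not values from the actual dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical records: [age, credit utilization ratio, payment history score,
# number of credit accounts] -- invented for illustration only.
X = np.array([
    [25, 0.80, 0.40, 2],
    [32, 0.75, 0.50, 3],
    [45, 0.20, 0.90, 5],
    [51, 0.15, 0.95, 6],
    [38, 0.50, 0.70, 4],
    [29, 0.85, 0.30, 2],
])

# Standardize first so age (tens) does not dominate the ratio features
# (fractions) in the Euclidean distances k-means uses.
X_scaled = StandardScaler().fit_transform(X)

# Partition the customers into k = 3 creditworthiness segments.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)
```

Each label identifies the segment a customer falls into; cluster centres (in `kmeans.cluster_centers_`) can then be inspected to characterise each segment.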
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Most estimates of HIV retention are derived at the clinic level through antiretroviral therapy (ART) patient management systems, which capture ART clinic visit data, yet these cannot account for silent transfers across HIV treatment sites. Patient laboratory monitoring visits may also be observed in routinely collected laboratory data, which include ART monitoring tests such as CD4 count and HIV viral load, key to our work here.
Methods: In this analysis, we utilized the NHLS National HIV Cohort (a system-wide viewpoint) to investigate the accuracy of facility-level estimates of retention in care for adult patients accessing care (defined using clinic visit data on patients under ART recorded in an electronic patient management system) at Themba Lethu Clinic (TLC). We also describe patterns of facility switching among all patients and among those classified as lost to follow-up (LTFU) at the facility level.
Results: Of the 43,538 unique patients in the TLC dataset, we included 20,093 of 25,514 possible patient records (78.8%) that were linked with the NHLS National Cohort, restricting the analytic sample to patients initiating ART between 1 January 2007 and 31 December 2017. Most (60%) patients were female, and the median age (IQR) at ART initiation was 37 (31–45) years. We found the laboratory records augmented retention estimates by a median of 860 additional active records (about 8% of all median active records across all years) from the facility viewpoint; the augmentation was more noticeable from the system-wide viewpoint, which added evidence of activity for about one-third of total active records in 2017. In 2017, we found 7.0% misclassification at the facility-level viewpoint, a gap which is potentially solvable through data integration/triangulation. We observed 1,134/20,093 (5.6%) silent transfers; these patients were noticeably more likely to be female and younger than the dataset as a whole. We also report the most common locations for clinic switching at a provincial level.
Discussion: Integration of multiple data sources has the potential to reduce the misclassification of patients as lost to care and to help understand situations where clinic switching is common. This may help in prioritizing interventions to assist patients moving between clinics and hopefully contribute to services that normalize formal transfers and reduce silent transfers.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This release provides the results of questions on animal health and welfare practices adopted by farmers. Link to main notice: https://www.gov.uk/government/organisations/department-for-environment-food-rural-affairs/series/farm-business-survey#publications

Survey methodology
This release includes the results for the questions asked on business management practices. Comparisons to results from the previous business management practices module, conducted in 2007/08, have been included in this publication where possible. Results from the IT usage questions were released on 20 March 2013; for the detailed results please see: https://www.gov.uk/government/publications/farm-practices-survey-october-2012-computer-usage
The Farm Business Survey (FBS) is an annual survey providing information on the financial position and physical and economic performance of farm businesses in England. The sample of around 1,900 farm businesses covers all regions of England and all types of farming, with the data collected by face-to-face interview with the farmer. Results are weighted to represent the whole population of farm businesses that have at least 25,000 Euros of standard output, as recorded in the annual June Survey of Agriculture and Horticulture. In 2011 there were just over 56,000 farm businesses meeting this criterion. In the 2011/12 survey, an additional module was included to collect information on business management practices from a sub-sample of farm businesses. The information collected covered (i) business management practices such as benchmarking, risk management, IT usage and management accounting; (ii) practices specific to animal health and welfare, e.g. biosecurity, veterinary strategy, animal health plans; (iii) the environmental footprint of farming, GHG abatement and energy use; and (iv) climate change adaptation.
When combined with other data from the survey, this helps to explain farm businesses' behaviour and how it varies with parameters such as farm type, farm size and performance. Completion of the business management practices module was voluntary, with a response rate of 71% in 2011/12. The farms that responded to the module had similar characteristics to those in the main FBS in terms of farm type and geographical location, although there is a smaller proportion of large and very large farms in the module subset than in the main FBS. For further information about the Farm Business Survey please see: https://www.gov.uk/government/organisations/department-for-environment-food-rural-affairs/series/farm-business-survey

Data analysis
The results from the FBS relate to farms which have a standard output of at least 25,000 Euros. Initial weights are applied to the FBS records based on the inverse sampling fraction for each design stratum (farm type by farm size). These weights are then adjusted (calibration weighting) so that they produce unbiased estimators of a number of different target variables. Completion of the business management practices module was voluntary, and a sample of around 1,350 farms was achieved. To take account of non-response, the results have been reweighted using a method that preserves marginal totals for populations according to farm type and farm size groups. As such, farm population totals for other classifications (e.g. regions) will not be in line with results using the main FBS weights, nor will any results produced for variables derived from the rest of the FBS (e.g. farm business income).

Comparisons between 2007/08 and 2011/12
Results from the 2007/08 and 2011/12 business management practices modules are not directly comparable due to changes in the coverage of the survey and changes in the classification of farms for the 2010/11 campaign. In 2010/11 the survey was restricted to farms with at least 25,000 Euros of standard output; prior to this it was restricted to farms with at least half a Standard Labour Requirement. The classification of farms into farm types was also revised for the 2010/11 Farm Business Survey, to bring it in line with European guidelines. Equivalent results from 2007/08 have been presented alongside 2011/12 results in many of the charts and tables; however, comparisons should be treated with extreme caution for the reasons given above. To enable more robust comparisons between the 2007/08 and 2011/12 modules, we have examined the subset of farms that participated in both years (approximately 770 farms). For this subset we have carried out significance testing using McNemar's test to determine whether the differences observed between the two time periods are statistically significant. McNemar's test is applied to 2x2 contingency tables, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal. Where a statistically significant difference has been observed, this is indicated on the tables and charts for the full module results with a *. Commentary alongside the charts and tables refers to this analysis rather than making comparisons with the 2007/08 data displayed.

Accuracy and reliability of the results
Where possible, we have shown 95% confidence intervals against the figures. These show the range of values that may apply to the figures, meaning we are 95% confident that this range contains the true value. They are calculated as the standard error (se) multiplied by 1.96 to give the 95% confidence interval (95% CI). The standard errors only give an indication of the sampling error; they do not reflect other sources of survey error, such as non-response bias. The confidence limits shown are appropriate for comparing groups within the same year; they should not be used for comparing different years' results from the Farm Business Survey, since they do not allow for the fact that many of the same farms contributed to the FBS in both years. We have also shown error bars on the figures in this notice; these represent the 95% confidence intervals as defined above. Estimates based on fewer than 5 observations have been suppressed to prevent disclosure of the identity of the contributing farms. Estimates based on fewer than 15 observations have been highlighted in italics in the tables and should be treated with caution, as they are likely to be less precise.

Definitions
Economic performance for each farm is measured as the ratio between economic output (mainly sales revenue) and inputs (costs plus unpaid labour). The higher the ratio, the higher the economic efficiency and performance. Performance bands based on economic performance percentiles are as follows: low performers are farmers who took part in the Business Management Practices survey and were in the bottom 25% of economic performers in this sample; medium performers were in the middle 50%; and high performers were in the top 25%. These bands are based on economic performance in 2011/12.

Availability of results
Defra statistical notices can be viewed on the Food and Farming Statistics pages on the Defra website at https://www.gov.uk/government/organisations/department-for-environment-food-rural-affairs/about/statistics. This site also shows details of future publications, with pre-announced dates.
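The matched-pairs McNemar procedure described above can be sketched with the standard library alone. The discordant counts below are invented for illustration, not FBS results:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact (binomial) two-sided McNemar test for a 2x2 table of matched
    pairs. b and c are the discordant counts: e.g. farms that adopted a
    practice between the two years, and farms that dropped it. Under the
    null hypothesis the discordant pairs split 50:50, so the p-value is a
    binomial tail probability, doubled and capped at 1."""
    n = b + c
    p = 2 * sum(comb(n, i) * 0.5 ** n for i in range(min(b, c) + 1))
    return min(p, 1.0)

# Hypothetical: of 770 matched farms, 60 adopted benchmarking and 35 dropped it.
print(round(mcnemar_exact(60, 35), 4))
```

The concordant cells of the 2x2 table (farms answering the same in both years) do not enter the statistic, which is why only the two discordant counts are needed.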
Current, CTD, and other data were collected from the YAQUINA and other platforms in the coastal waters of Washington/Oregon from 28 January 1975 to 01 September 1975. Data were collected by Oregon State University (OSU) as part of the International Decade of Ocean Exploration / Coastal Upwelling Ecosystems Analysis (IDOE/CUEA). Data were processed by NODC into the NODC standard F015 Current Meter Components and F022 High-Resolution CTD/STD Output formats. Full format descriptions are available at nodc.noaa.gov/. Analog data are available for this accession by contacting NODC user services.
The F015 format contains time series measurements of ocean currents. These data are obtained from current meter moorings and represent the Eulerian method of current measurement, i.e., the meters are deployed at a fixed point and measure flow past a sensor. Position, bottom depth, sensor depth and meter characteristics are reported for each station. The data record includes values of east-west (u) and north-south (v) current vector components at specified date and time. Current direction is defined as the direction toward which the water is flowing with positive directions east and north. Data values may be subject to averaging or filtering and are typically reported at 10 - 15 minute time intervals. Water temperature, pressure and conductivity or salinity may also be reported. A text record is available for optional comments.
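The u/v convention described above (direction toward which the water flows, positive east and north) can be sketched as follows; the speed and bearing values are illustrative:

```python
import math

def current_components(speed, direction_deg):
    """Resolve a current speed and a 'direction toward' bearing (degrees
    clockwise from north) into east-west (u) and north-south (v)
    components, positive eastward and northward respectively."""
    rad = math.radians(direction_deg)
    return speed * math.sin(rad), speed * math.cos(rad)

# A 0.5 m/s flow toward the east (bearing 090):
u, v = current_components(0.5, 90.0)
print(round(u, 3), round(v, 3))  # 0.5 0.0
```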
The F022 format contains high-resolution data collected using CTD (conductivity-temperature-depth) and STD (salinity-temperature-depth) instruments. As they are lowered and raised in the oceans, these electronic devices provide nearly continuous profiles of temperature, salinity, and other parameters. Data values may be subject to averaging or filtering or obtained by interpolation and may be reported at depth intervals as fine as 1m. Cruise and instrument information, position, date, time and sampling interval are reported for each station. Environmental data at the time of the cast (meteorological and sea surface conditions) may also be reported. The data record comprises values of temperature, salinity or conductivity, density (computed sigma-t), and possibly dissolved oxygen or transmissivity at specified depth or pressure levels. Data may be reported at either equally or unequally spaced depth or pressure intervals. A text record is available for comments.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LSC (Leicester Scientific Corpus)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

[Version 2] A further cleaning is applied in Data Processing to the LSC abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

Getting Started
This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created to be used in future work on the quantification of the meaning of research texts, and to be made available for use in Natural Language Processing projects.
LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:
1. Authors: the list of authors of the paper
2. Title: the title of the paper
3. Abstract: the abstract of the paper
4. Categories: one or more categories from the list of categories [2]; the full list is presented in the file 'List_of_Categories.txt'
5. Research Areas: one or more research areas from the list of research areas [3]; the full list is presented in the file 'List_of_Research_Areas.txt'
6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4]
The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

Data Processing
Step 1: Downloading the data online. The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.
Step 2: Importing the dataset to R. The LSC was collected as TXT files. All documents were imported into R.
Step 3: Cleaning the data of documents with an empty abstract or without a category. As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories were removed.
Step 4: Identification and correction of concatenated words in abstracts. Medicine-related publications in particular use 'structured abstracts', divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates section headings with the first word of the section, producing words such as 'ConclusionHigher' and 'ConclusionsRT'. Such words were detected and identified by sampling medicine-related publications with human intervention, and each was split into two words: for instance, 'ConclusionHigher' becomes 'Conclusion' and 'Higher'. The section headings found in such abstracts are: Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), and Implications for health and nursing policy.
Step 5: Extracting (sub-setting) the data based on abstract length. After correction, the lengths of the abstracts were calculated. 'Length' indicates the total number of words in the text, calculated by the same rule as Microsoft Word's 'word count' [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit abstracts to between 30 and 500 words, in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.
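The heading-splitting and length-filtering logic of Steps 4 and 5 can be sketched with a regular expression and a word-count filter. The heading list below is a small subset of the full list above, and this is an illustration, not the authors' actual tool:

```python
import re

# A few of the section headings listed above (the full LSC list is longer).
HEADINGS = ["Background", "Methods", "Method", "Results", "Result",
            "Conclusions", "Conclusion", "Objectives", "Objective"]
heading_re = re.compile(r"\b(" + "|".join(HEADINGS) + r")(?=[A-Z][a-z])")

def split_concatenated_headings(text):
    """Step 4: insert a space between a section heading and the word fused
    to it, e.g. 'ConclusionHigher' -> 'Conclusion Higher'."""
    return heading_re.sub(r"\1 ", text)

def keep_by_length(abstracts, lo=30, hi=500):
    """Step 5: keep only abstracts whose word count falls within [lo, hi]."""
    return [a for a in abstracts if lo <= len(a.split()) <= hi]

print(split_concatenated_headings("ConclusionHigher doses were effective."))
# Conclusion Higher doses were effective.
```

The lookahead `(?=[A-Z][a-z])` only splits when a capitalised word is fused to the heading, so ordinary occurrences of these words in running text are left alone.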
Step 6: [Version 2] Cleaning copyright notices, permission policies, journal names and conference names from the LSC abstracts in Version 1. Conferences and journals can place a footer below the text of an abstract containing a copyright notice, permission policy, journal name, licence, authors' rights or conference name. The tool used for extracting and processing abstracts from the WoS database leaves such footers attached to the text; for example, casual observation shows that copyright notices such as 'Published by Elsevier Ltd.' appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC Version 1, removing copyright notices, conference names, journal names, authors' rights, licences and permission policies identified by sampling of abstracts.
Step 7: [Version 2] Re-extracting (sub-setting) the data based on abstract length. The cleaning procedure described in the previous step left some abstracts below our minimum length criterion (30 words); 474 texts were removed.
Step 8: Saving the dataset in CSV format. Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record per line, with the abstract, title, list of authors, list of categories, list of research areas, and times cited recorded in fields.
To access the LSC for research purposes, please email ns433@le.ac.uk.

References
[1] Web of Science. Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] American Psychological Association, Publication Manual. Washington, DC: American Psychological Association, 1983.
https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de435834
Abstract (en): The purpose of the Cooperative Agreement (CA) Research Program was to monitor risk factors, risk behaviors, and rates of HIV seroprevalence and seroincidence among out-of-treatment, multi-ethnic/racial injection drug users and crack cocaine users. The program evaluated the efficacy of experimental interventions designed to prevent, eliminate, or reduce HIV risk behaviors, and developed new treatment interventions. All participants received the standard intervention, which consisted of street-based outreach and HIV prevention counseling. Those assigned to enhanced interventions received more counseling sessions, educational videos, social gatherings, and support group activities. The public-use data file contains 31,088 respondent records, collected from 21 CA program facilities in the United States and one facility each in Puerto Rico and Brazil. The process data file contains 23 records of facility information that can be linked to individual respondents. Respondent interviews include a baseline Risk Behavior Assessment (completed prior to the first intervention) and a Follow-Up Assessment, conducted either three months or six months after the baseline survey. Respondent data were augmented with eligibility information, biological markers of drug use, HIV test results, and intervention assignment. At baseline and post-intervention, the surveys measured drug use and drug treatment, sexual activity and sex for money/drugs, arrests, work/income, HIV/STD/pregnancy status, perceptions of risk, and risk reduction behaviors. The process questionnaires were completed by staff or principal investigators at the 23 site locations. Process data describe the program structure and process, other intervention projects in the community, needle exchange programs and pharmacy syringe sales, and local HIV infection rates.
Drugs reported on include alcohol, marijuana/hashish, crack/cocaine, heroin (including speedball), non-prescription methadone, other opiates, and amphetamines. ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats, as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection: performed consistency checks; standardized missing values; created an online analysis version with question text; and checked for undocumented or out-of-range codes.
The study population was multi-ethnic/racial male and female drug injectors and crack users at risk for HIV in the United States. The Cooperative Agreement used a randomized and quasi-experimental design. Respondents were recruited using a targeted sampling strategy that involved mapping geographic areas of local drug use activity and HIV infection. These ethnographic and epidemiologic sampling techniques were utilized at the individual and community levels. Each site had a monthly recruitment goal of 35 multi-ethnic/racial male (70 percent) and female (30 percent) drug injectors and crack users who were at risk for HIV. Respondents were eligible if they had self-reported injection, crack, or cocaine use within the past 30 days, were at least 18 years of age at the time of baseline, were not currently enrolled in treatment, and had not been interviewed by the National AIDS Demonstration Research program or the CA program within the past year. Individuals or communities were randomly assigned to a standard or enhanced intervention track.
2008-10-23: New files were added. These files included one or more of the following: Stata setup, SAS transport (CPORT), SPSS system, Stata system, SAS supplemental syntax, and Stata supplemental syntax files, and a tab-delimited ASCII data file.
Funding institution(s): United States Department of Health and Human Services. National Institutes of Health. National Institute on Drug Abuse (N01DA-6-5052). Data were collected and prepared for release by CSR Incorporated, Washington, DC. To protect the privacy of respondents, all variables that could be used to identify individual clients or facilities have been encrypted, collapsed, or removed from the public-use files. These modifications should not affect the analytic uses of the files. All participants received the standard intervention, while those assigned to the enhanced intervention received more sessions. Additional information about the stu...
AI Generated Summary: The Ontario Data Catalogue is a data portal providing access to open datasets generated and maintained by the Ontario government. It allows users to search, access, visualize, and download data in various machine-readable formats, often through APIs, while also indicating licensing terms and data update frequencies. The catalogue also provides tools for data visualization and notifications for dataset updates. About: The Ontario government generates and maintains thousands of datasets. Since 2012, we have shared data with Ontarians via a data catalogue. Open data is data that is shared with the public. Click here to learn more about open data and why Ontario releases it. Ontario’s Digital and Data Directive states that all data must be open, unless there is good reason for it to remain confidential. Ontario’s Chief Digital and Data Officer also has the authority to make certain datasets available publicly. Datasets listed in the catalogue that are not open will have one of the following labels: If you want to use data you find in the catalogue, that data must have a licence – a set of rules that describes how you can use it. A licence: Most of the data available in the catalogue is released under Ontario’s Open Government Licence. However, each dataset may be shared with the public under other kinds of licences or no licence at all. If a dataset doesn’t have a licence, you don’t have the right to use the data. If you have questions about how you can use a specific dataset, please contact us. The Ontario Data Catalogue endeavors to publish open data in a machine readable format. For machine readable datasets, you can simply retrieve the file you need using the file URL. The Ontario Data Catalogue is built on CKAN, which means the catalogue has the following features you can use when building applications. APIs (Application programming interfaces) let software applications communicate directly with each other.
If you are using the catalogue in a software application, you might want to extract data from the catalogue through the catalogue API. Note: All Datastore API requests to the Ontario Data Catalogue must be made server-side. The catalogue's collection of dataset metadata (and dataset files) is searchable through the CKAN API. The Ontario Data Catalogue has more than just CKAN's documented search fields. You can also search these custom fields. You can also use the CKAN API to retrieve metadata about a particular dataset and check for updated files. Read the complete documentation for CKAN's API. Some of the open data in the Ontario Data Catalogue is available through the Datastore API. You can also search and access the machine-readable open data that is available in the catalogue. How to use the API feature: Read the complete documentation for CKAN's Datastore API. The Ontario Data Catalogue contains a record for each dataset that the Government of Ontario possesses. Some of these datasets will be available to you as open data. Others will not be available to you. This is because the Government of Ontario is unable to share data that would break the law or put someone's safety at risk. You can search for a dataset with a word that might describe a dataset or topic. Use words like “taxes” or “hospital locations” to discover what datasets the catalogue contains. You can search for a dataset from 3 spots on the catalogue: the homepage, the dataset search page, or the menu bar available across the catalogue. On the dataset search page, you can also filter your search results. You can select filters on the left hand side of the page to limit your search for datasets with your favourite file format, datasets that are updated weekly, datasets released by a particular ministry, or datasets that are released under a specific licence. Go to the dataset search page to see the filters that are available to make your search easier. 
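As an illustration of the API access described above, CKAN exposes catalogue search as an HTTP action named package_search. A minimal sketch of building such a request URL, assuming the Ontario catalogue's CKAN instance lives at data.ontario.ca (no request is actually made here):

```python
from urllib.parse import urlencode

# Sketch of building a CKAN package_search request URL against the
# Ontario Data Catalogue's CKAN instance. The host name is an assumption
# for illustration; query parameters follow CKAN's Action API.
BASE = "https://data.ontario.ca/api/3/action/package_search"

def search_url(query: str, rows: int = 10) -> str:
    """Return the package_search URL for a free-text catalogue query."""
    return BASE + "?" + urlencode({"q": query, "rows": rows})

print(search_url("hospital locations"))
```

The same pattern applies to the Datastore API's datastore_search action, which additionally takes a resource_id parameter identifying the file to query.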
You can also do a quick search by selecting one of the catalogue’s categories on the homepage. These categories can help you see the types of data we have on key topic areas. When you find the dataset you are looking for, click on it to go to the dataset record. Each dataset record will tell you whether the data is available and, if so, tell you about the data available. An open dataset might contain several data files. These files might represent different periods of time, different subsets of the dataset, different regions, language translations, or other breakdowns. You can select a file and either download it or preview it. Make sure to read the licence agreement to confirm you have permission to use the data the way you want. A non-open dataset may not be available for many reasons. Read more about non-open data. Read more about restricted data. Data that is non-open may still be subject to freedom of information requests.

The catalogue has tools that enable all users to visualize the data in the catalogue without leaving the catalogue – no additional software needed. Get automatic notifications when datasets are updated. You can choose to get notifications for individual datasets, an organization’s datasets, or the full catalogue. You don’t have to provide any personal information – just subscribe to our feeds using any feed reader you like, using the corresponding notification web addresses. Copy those addresses and paste them into your reader. Your feed reader will let you know when the catalogue has been updated.

The catalogue provides open data in several file formats (e.g., spreadsheets, geospatial data, etc.). Learn about each format and how you can access and use the data each file contains. A file that has a list of items and values separated by commas without formatting (e.g., colours, italics) or extra visual features. This format provides just the data that you would display in a table.
XLSX (Excel) files may be converted to CSV so they can be opened in a text editor. How to access the data: Open with any spreadsheet software application (e.g., Open Office Calc, Microsoft Excel) or text editor. Note: This format is considered machine-readable; it can be easily processed and used by a computer. Files that have visual formatting (e.g., bolded headers and colour-coded rows) can be hard for machines to understand, because these elements make a file more human-readable and less machine-readable.

A file that provides information without formatted text or extra visual features and that may not follow a pattern of separated values like a CSV. How to access the data: Open with any word processor or text editor available on your device (e.g., Microsoft Word, Notepad).

A spreadsheet file that may also include charts, graphs, and formatting. How to access the data: Open with a spreadsheet software application that supports this format (e.g., Open Office Calc, Microsoft Excel). Data can be converted to CSV for a non-proprietary version of the same data without formatted text or extra visual features.

A shapefile provides geographic information that can be used to create a map or perform geospatial analysis based on location, points/lines, and other data about the shape and features of the area. It includes required files (.shp, .shx, .dbf) and might include corresponding files (e.g., .prj). How to access the data: Open with a geographic information system (GIS) software program (e.g., QGIS).

A package of files and folders. The package can contain any number of different file types. How to access the data: Open with an unzipping software application (e.g., WinZIP, 7Zip). Note: If a ZIP file contains .shp, .shx, and .dbf file types, it is an ArcGIS ZIP: a package of shapefiles which provide information to create maps or perform geospatial analysis, and which can be opened with ArcGIS (a geographic information system software program).
A file that provides information related to a geographic area (e.g., phone number, address, average rainfall, number of owl sightings in 2011, etc.) and its geospatial location (i.e., points/lines). How to access the data: Open using a GIS software application to create a map or do geospatial analysis. It can also be opened with a text editor to view the raw information. Note: This format is machine-readable; it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

A text-based format for sharing data in a machine-readable way that can store data with more unconventional structures such as complex lists. How to access the data: Open with any text editor (e.g., Notepad) or access through a browser. Note: This format is machine-readable; it can be easily processed and used by a computer.

A text-based format to store and organize data in a machine-readable way that can store data with more unconventional structures (not just data organized in tables). How to access the data: Open with any text editor (e.g., Notepad). Note: This format is machine-readable; it can be easily processed and used by a computer.

A file that provides information related to an area (e.g., phone number, address, average rainfall, number of owl sightings in 2011, etc.) and its geospatial location (i.e., points/lines). How to access the data: Open with a geospatial software application that supports the KML format (e.g., Google Earth). Note: This format is machine-readable; it can be easily processed and used by a computer.
This format contains files with data from tables used for statistical analysis and data visualization of Statistics Canada census data. How to access the data: Open with the Beyond 20/20 application. A database which links and combines data from different files or
License: https://datacatalog.worldbank.org/public-licenses?fragment=cc
This dataset contains metadata (title, abstract, date of publication, field, etc.) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.
Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English-language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, or crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF or LaTeX file is needed to extract key information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus: around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the years 2000 to 2020 were considered, which eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. The fields that remain are Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
Due to the intensive computational resources required, a convenience sample of 1,037,748 articles was randomly drawn from the roughly 10 million articles in our restricted corpus.
The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.
To determine the country or countries of study in each academic article, two approaches are employed, based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches for the presence of ISO 3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, it can produce exclusion errors when a country’s name is spelled in a non-standard way.
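The regular-expression approach can be sketched in a few lines. This is an illustrative sketch only: the short name list is a placeholder for the full ISO 3166 set, and the function name is invented for this example.

```python
import re

# Placeholder for the full ISO 3166 country-name list used in the project.
COUNTRY_NAMES = ["Kenya", "India", "Brazil", "Republic of Korea"]

# \b word boundaries prevent matching a name inside a longer word
# (e.g., "India" is not matched inside "Indiana").
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in COUNTRY_NAMES) + r")\b",
    flags=re.IGNORECASE,
)

def countries_mentioned(text: str) -> set[str]:
    """Return the set of country names found in a title or abstract."""
    return {match.group(0).title() for match in PATTERN.finditer(text)}

print(countries_mentioned("We study maize yields in Kenya using survey data."))
```

A non-standard spelling (say, "Kenia") slips past this search, which is exactly the gap the NER approach below is meant to close.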
The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify named objects in text; we use the spaCy Python library. The NER algorithm extracts named entities from text, and it is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
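The linking rule is the union of the two detectors. A minimal sketch, with stub functions standing in for the real components (in the project, the NER side would filter spaCy's doc.ents for place entities and map them to country names; both stubs here are hard-coded for illustration):

```python
def regex_countries(text: str) -> set[str]:
    # Stub for the ISO 3166 regular-expression search.
    return {"Kenya"} if "Kenya" in text else set()

def ner_countries(text: str) -> set[str]:
    # Stub for spaCy NER, which can catch alternate spellings
    # that the fixed name list misses.
    return {"Kenya"} if "kenia" in text.lower() else set()

def linked_countries(text: str) -> set[str]:
    # A country found by EITHER approach is linked to the article;
    # an article can therefore be linked to several countries.
    return regex_countries(text) | ner_countries(text)
```

Under this rule, an abstract mentioning "Kenia" is still linked to Kenya via the NER side even though the regular expression misses it.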
The second task is to classify whether the paper uses data. A supervised machine learning approach is employed: 3,500 publications were first randomly selected and manually labeled by human raters through the Mechanical Turk service (Paszke et al. 2019).[1] To ensure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:
Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. Identifying whether an academic article is using data from any country
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated, or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please select multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable; for instance, the research may be theoretical with no specific country application. In other cases, the research article may involve multiple countries; in these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.
The median time a worker spent on an article, measured from when the worker accepted the article for classification to when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, reviewing the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time and cost $3,113,244, assuming the $3 per article paid to MTurk workers.
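Both figures follow directly from the numbers above; a back-of-the-envelope check (treating "years" as continuous calendar time):

```python
# Back-of-the-envelope check of the human-rating time and cost quoted above.
ARTICLES = 1_037_748
MINUTES_PER_ARTICLE = 25.4   # median classification time per article
COST_PER_ARTICLE = 3         # dollars paid per article on MTurk

total_minutes = ARTICLES * MINUTES_PER_ARTICLE
years = total_minutes / (60 * 24 * 365.25)   # continuous calendar years
cost = ARTICLES * COST_PER_ARTICLE

print(round(years))  # ~50
print(cost)          # 3113244
```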
A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for prediction (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% of the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to train a model that classifies articles based on the labeled data. Of the 3,500 articles hand-coded by the MTurk workers, 900 were fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of "uses data" was assigned if the model predicted an article used data with at least 90% confidence.
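The final labeling step is just a threshold on the model's predicted probability. A sketch of that decision rule (the probability values below are invented for illustration; in the project they would come from the DistilBERT classifier's output):

```python
# Sketch of the 90%-confidence decision rule applied to model outputs.
THRESHOLD = 0.90

def label(prob_uses_data: float) -> str:
    """Label an article 'uses data' only when the model is >= 90% confident."""
    return "uses data" if prob_uses_data >= THRESHOLD else "no data"

print(label(0.97))  # uses data
print(label(0.55))  # no data
```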
The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We treat the human raters as providing the ground truth. This may underestimate model performance if the workers sometimes got the classification wrong in ways that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If humans and the model make the same kinds of errors, then the performance reported here will be overestimated.
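Accuracy against this ground truth is simply the share of held-out articles on which the model agrees with the human rater. A toy sketch with invented labels:

```python
# Toy sketch of held-out accuracy: the fraction of articles where the
# model's prediction matches the human rater's label. Labels invented.
human = ["data", "no data", "data", "data", "no data"]
model = ["data", "no data", "no data", "data", "no data"]

accuracy = sum(h == m for h, m in zip(human, model)) / len(human)
print(accuracy)  # 0.8
```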
The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of