76 datasets found

Career promotions, research publications, Open Access dataset
ordo.open.ac.uk
zip
Updated Feb 28, 2022
Cite
Matteo Cancellieri; Nancy Pontika; David Pride; Petr Knoth; Hannah Metzler; Antonia Correia; Helene Brinken; Bikash Gyawali (2022). Career promotions, research publications, Open Access dataset [Dataset]. http://doi.org/10.21954/ou.rd.19228785.v1
Available download formats: zip
Unique identifier
https://doi.org/10.21954/ou.rd.19228785.v1
Dataset updated
Feb 28, 2022
Dataset provided by
The Open University
Authors
Matteo Cancellieri; Nancy Pontika; David Pride; Petr Knoth; Hannah Metzler; Antonia Correia; Helene Brinken; Bikash Gyawali
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is a compilation of processed data on citations and references for research papers, including their author, institution and open access information, for a selected sample of academics analysed using Microsoft Academic Graph (MAG) data and CORE. The data for this dataset was collected from December 2019 to January 2020.

Six countries (Austria, Brazil, Germany, India, Portugal, United Kingdom and United States) were the focus of the six questions which make up this dataset. There is one csv file per country and per question (36 files in total). More details about the creation of this dataset are available in the public ON-MERRIT D3.1 deliverable report.

The dataset is a combination of two different data sources: one part is a dataset created by analysing promotion policies across the target countries, while the second part is a set of data points available to understand the publishing behaviour. To facilitate the analysis, the dataset is organised in the following seven folders:

PRT
The dataset with the file name "PRT_policies.csv" contains the related information as extracted from promotion, review and tenure (PRT) policies.

Q1: What % of papers coming from a university are Open Access?
- Dataset Name format: oa_status_countryname_papers.csv
- Dataset Contents: Open Access (OA) status of all papers of all the universities listed in Times Higher Education World University Rankings (THEWUR) for the given country. A paper is marked OA if there is at least one OA link available. OA links are collected using the CORE Discovery API.
- Important considerations about this dataset:
  - Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors may belong to.
  - The service we used to recognise if a paper is OA, CORE Discovery, does not contain entries for all paperids in MAG. This implies that some of the records in the extracted dataset will have neither a true nor a false value for the is_OA field.
  - Only those records marked as true for the is_OA field can be said to be OA. Others with false or no value for the is_OA field have unknown status (i.e. not necessarily closed access).

Q2: How are papers, published by the selected universities, distributed across the three scientific disciplines of our choice?
- Dataset Name format: fsid_countryname_papers.csv
- Dataset Contents: For the given country, all papers for all the universities listed in THEWUR with the information of the fieldofstudy they belong to.
- Important considerations about this dataset:
  - MAG can associate a paper to multiple fieldofstudyid values. If a paper belongs to more than one of our fieldofstudyid values, separate records were created for the paper with each of those fieldofstudyid values.
  - MAG assigns a fieldofstudyid to every paper with a score. We preserve only those records whose score is more than 0.5 for any fieldofstudyid it belongs to.
  - Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors may belong to. Papers with authorship from multiple universities are counted once towards each of the universities concerned.

Q3: What is the gender distribution in authorship of papers published by the universities?
- Dataset Name format: author_gender_countryname_papers.csv
- Dataset Contents: All papers with their author names for all the universities listed in THEWUR.
- Important considerations about this dataset:
  - When there are multiple collaborators (authors) for the same paper, this dataset makes sure that only the records for collaborators from within selected universities are preserved.
  - An external script was executed to determine the gender of the authors. The script is available here.

Q4: Distribution of staff seniority (= number of years from their first publication until the last publication) in the given university.
- Dataset Name format: author_ids_countryname_papers.csv
- Dataset Contents: For a given country, all papers for authors with their publication year for all the universities listed in THEWUR.
- Important considerations about this work:
  - When there are multiple collaborators (authors) for the same paper, this dataset makes sure that only the records for collaborators from within selected universities are preserved.
  - Calculating staff seniority can be achieved in various ways. The most straightforward option is to calculate it as academic_age = MAX(year) - MIN(year) for each authorid.

Q5: Citation counts (incoming) for OA vs Non-OA papers published by the university.
- Dataset Name format: cc_oa_countryname_papers.csv
- Dataset Contents: OA status and OA links for all papers of all the universities listed in THEWUR and, for each of those papers, the count of incoming citations available in MAG.
- Important considerations about this dataset:
  - CORE Discovery was used to establish the OA status of papers.
  - Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors may belong to.
  - Only those records marked as true for the is_OA field can be said to be OA. Others with false or no value for the is_OA field have unknown status (i.e. not necessarily closed access).

Q6: Count of OA vs Non-OA references (outgoing) for all papers published by universities.
- Dataset Name format: rc_oa_countryname_-papers.csv
- Dataset Contents: Counts of all OA and unknown papers referenced by all papers published by all the universities listed in THEWUR.
- Important considerations about this dataset:
  - CORE Discovery was used to establish the OA status of papers being referenced.
  - Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors may belong to. Papers with authorship from multiple universities are counted once towards each of the universities concerned.

Additional files:
- _fieldsofstudy_mag_.csv: this file contains a dump of the fieldsofstudy table of MAG, mapping each of the ids to their actual field of study name.
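Two of the considerations above lend themselves to a short illustration: the staff seniority formula academic_age = MAX(year) - MIN(year) per authorid (Q4), and the rule that only is_OA = true counts as OA while false or empty values mean unknown status (Q1 and Q5). The following is a minimal sketch, not the authors' code. The CSV header names (paperid, authorid, year, is_OA) and the literal "true" spelling are assumptions, and the sample rows are invented.

```python
import csv
import io
from collections import Counter


def academic_age(rows):
    """Return {authorid: MAX(year) - MIN(year)} for an iterable of row dicts."""
    first, last = {}, {}
    for row in rows:
        author, year = row["authorid"], int(row["year"])
        first[author] = min(year, first.get(author, year))
        last[author] = max(year, last.get(author, year))
    return {author: last[author] - first[author] for author in first}


def oa_status_counts(rows):
    """Count records as 'oa' (is_OA is true) or 'unknown' (false or empty)."""
    counts = Counter()
    for row in rows:
        value = (row.get("is_OA") or "").strip().lower()
        counts["oa" if value == "true" else "unknown"] += 1
    return counts


# Tiny in-memory stand-ins for the real CSV files (all values are invented,
# and the column names are assumed rather than taken from the files).
author_csv = io.StringIO(
    "paperid,authorid,year\n"
    "p1,a1,2005\n"
    "p2,a1,2019\n"
    "p3,a2,2012\n"
)
oa_csv = io.StringIO(
    "paperid,is_OA\n"
    "p1,true\n"
    "p2,false\n"
    "p3,\n"
)

ages = academic_age(csv.DictReader(author_csv))
status = oa_status_counts(csv.DictReader(oa_csv))
print(ages)
print(dict(status))
```

Records with false and with no value are deliberately placed in the same "unknown" bucket, since the description says neither can be treated as closed access.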
World Population
kaggle.com
Updated Dec 29, 2021
Cite
khaIid (2021). World Population [Dataset]. https://www.kaggle.com/datasets/khaiid/world-population/code
Dataset updated
Dec 29, 2021
Dataset provided by
Kaggle
Authors
khaIid
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Area covered
World
Description
Content

The dataset has 6 columns, described as follows:

Rank: Country rank by population

Country: Country name

Region: Country region

Population: Country population

Percentage: Percentage of population worldwide

Date: Date when population was measured

Questions to be answered

What is the population of each region? Which country has the largest population in each region? What percentage of the world population do the first 10 countries account for?
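The three questions above can be sketched with the standard library alone. This is an illustrative sketch under assumptions: the column names come from the description, but the value formats (thousands separators in Population, a trailing % in Percentage) are guesses, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def to_number(text):
    """Parse values such as '1,000' or '17.9%' into a float (assumed formats)."""
    return float(text.replace(",", "").replace("%", "").strip())


def summarize(rows):
    """Return (population per region, largest country per region, top-10 share)."""
    region_totals = defaultdict(float)
    region_largest = {}
    records = []
    for row in rows:
        population = to_number(row["Population"])
        region = row["Region"]
        region_totals[region] += population
        if region not in region_largest or population > region_largest[region][1]:
            region_largest[region] = (row["Country"], population)
        records.append((int(row["Rank"]), to_number(row["Percentage"])))
    # Sum the Percentage column over the ten best-ranked countries.
    top10_share = sum(pct for _rank, pct in sorted(records)[:10])
    return dict(region_totals), region_largest, top10_share


# Invented sample rows; the real file is not reproduced here.
sample = io.StringIO(
    'Rank,Country,Region,Population,Percentage,Date\n'
    '1,Aland,North,"1,000",50%,2021-01-01\n'
    '2,Borland,South,600,30%,2021-01-01\n'
    '3,Cland,North,400,20%,2021-01-01\n'
)

totals, largest, top10_share = summarize(csv.DictReader(sample))
print(totals)
print(largest)
print(top10_share)
```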
COVID-19 Scholarly Production Dataset
data.mendeley.com
Updated Jul 7, 2020
+ more versions
Cite
Gisliany Alves (2020). COVID-19 Scholarly Production Dataset [Dataset]. http://doi.org/10.17632/kx7wwc8dzp.5
Unique identifier
https://doi.org/10.17632/kx7wwc8dzp.5
Dataset updated
Jul 7, 2020
Authors
Gisliany Alves
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
COVID-19 has been recognized as a global threat, and several studies are being conducted to contribute to the fight against and prevention of this pandemic. This work presents a scholarly production dataset focused on COVID-19, providing an overview of scientific research activities and making it possible to identify the countries, scientists and research groups most active in this task force to combat the coronavirus disease. The dataset is composed of 40,212 records of articles' metadata collected from the Scopus, PubMed, arXiv and bioRxiv databases from January 2019 to July 2020. The data were extracted using Python web scraping and preprocessed with pandas.
Top 20 most productive countries in terms of AI research in information...
figshare.com
plos.figshare.com
xls
Updated Jun 3, 2023
Cite
Kai-Yu Tang; Chun-Hua Hsiao; Gwo-Jen Hwang (2023). Top 20 most productive countries in terms of AI research in information science domain. [Dataset]. http://doi.org/10.1371/journal.pone.0266565.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0266565.t001
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Kai-Yu Tang; Chun-Hua Hsiao; Gwo-Jen Hwang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Top 20 most productive countries in terms of AI research in information science domain.
Global Roads Open Access Data Set, Version 1 (gROADSv1)
dataverse.harvard.edu
datasets.ai
+5 more
Updated Sep 9, 2025
+ more versions
Cite
Center for International Earth Science Information Network - CIESIN - Columbia University, and Information Technology Outreach Services - ITOS - University of Georgia (2025). Global Roads Open Access Data Set, Version 1 (gROADSv1) [Dataset]. http://doi.org/10.7910/DVN/NEXOVP
Unique identifier
https://doi.org/10.7910/DVN/NEXOVP
Dataset updated
Sep 9, 2025
Dataset provided by
Harvard Dataverse
Authors
Center for International Earth Science Information Network - CIESIN - Columbia University, and Information Technology Outreach Services - ITOS - University of Georgia
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1980 - Dec 31, 2010
Area covered
North America, South America, Oceania, Africa, Asia, Europe
Description
The Global Roads Open Access Data Set, Version 1 (gROADSv1) was developed under the auspices of the CODATA Global Roads Data Development Task Group. The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. All country road networks have been joined topologically at the borders, and many countries have been edited for internal topology. Source data for each country are provided in the documentation, and users are encouraged to refer to the readme file for use constraints that apply to a small number of countries. Because the data are compiled from multiple sources, the date range for road network representations ranges from the 1980s to 2010 depending on the country (most countries have no confirmed date), and spatial accuracy varies. The baseline global data set was compiled by the Information Technology Outreach Services (ITOS) of the University of Georgia. Updated data for 27 countries and 6 smaller geographic entities were assembled by Columbia University's Center for International Earth Science Information Network (CIESIN), with a focus largely on developing countries with the poorest data coverage. Purpose: to provide an open access, well documented global data set of roads between settlements using a consistent data model (UNSDI-T v.2) which is, to the extent possible, topologically integrated.
World - Twitter Sentiment By Country
kaggle.com
zip
Updated Nov 10, 2020
Cite
William Jiang (2020). World - Twitter Sentiment By Country [Dataset]. https://www.kaggle.com/wjia26/twittersentimentbycountry
Available download formats: zip (787579784 bytes)
Dataset updated
Nov 10, 2020
Authors
William Jiang
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Area covered
World
Description

Introduction

Ever wondered what people are saying about certain countries? Whether it's in a positive/negative light? What are the most commonly used phrases/words to describe the country? In this dataset I present tweets where a certain country gets mentioned in the hashtags (e.g. #HongKong, #NewZealand). It contains around 150 countries in the world. I've added an additional field called polarity which has the sentiment computed from the text field. Feel free to explore! Feedback is much appreciated!

Content

Each row represents a tweet. Tweet creation dates range from 12/07/2020 to 25/07/2020. The dataset will be updated monthly.
- The country can be derived from the file_name field (this field is very Tableau-friendly when it comes to plotting maps).
- The date on which the tweet was created is in the created_at field.
- The search query used to query the Twitter search engine is in the search_query field.
- The full text of the tweet is in the text field.
- The sentiment is in the polarity field (computed with the VADER model from NLTK).
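As a rough illustration of how these fields could be combined, the sketch below averages polarity per country. It is an assumption-laden sketch: the field names come from the description, but file_name is treated here as if it held the country label directly (the real values may need extra parsing), and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def mean_polarity_by_country(rows):
    """Average the polarity field per file_name, skipping unparsable values."""
    totals = defaultdict(lambda: [0.0, 0])  # country -> [sum, count]
    for row in rows:
        try:
            polarity = float(row["polarity"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric polarity: ignore the tweet
        bucket = totals[row["file_name"]]
        bucket[0] += polarity
        bucket[1] += 1
    return {country: total / count for country, (total, count) in totals.items()}


# Invented sample rows standing in for the real tweet file.
sample = io.StringIO(
    "file_name,created_at,search_query,text,polarity\n"
    "HongKong,2020-07-12,#HongKong,example tweet one,0.5\n"
    "HongKong,2020-07-13,#HongKong,example tweet two,-0.5\n"
    "NewZealand,2020-07-14,#NewZealand,example tweet three,0.25\n"
    "NewZealand,2020-07-15,#NewZealand,example tweet four,\n"
)

averages = mean_polarity_by_country(csv.DictReader(sample))
print(averages)
```

Given the note about duplicated tweet IDs before 22/07/2020, deduplicating rows before averaging would be a sensible extra step, though the description does not name an ID column to key on.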

Notes

There may be slight duplication of tweet IDs before 22/07/2020. I have since fixed this bug.

Acknowledgements

Thanks to the tweepy package for making the data extraction via Twitter API so easy.

Shameless Plug

Feel free to check out my blog if you want to learn how I built the data lake via AWS or for other data shenanigans.

Here's an App I built using a live version of this data.
Coronavirus COVID-19 Global Cases by the Center for Systems Science and...
github.com
systems.jhu.edu
+1 more
Cite
Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), Coronavirus COVID-19 Global Cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) [Dataset]. https://github.com/CSSEGISandData/COVID-19
Dataset provided by
Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
Area covered
Global
Description
2019 Novel Coronavirus COVID-19 (2019-nCoV) Visual Dashboard and Map:
https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6
Confirmed Cases by Country/Region/Sovereignty
Confirmed Cases by Province/State/Dependency
Deaths
Recovered
Downloadable data:
https://github.com/CSSEGISandData/COVID-19
Additional Information about the Visual Dashboard:
https://systems.jhu.edu/research/public-health/ncov
Flash Eurobarometer 239 (Young people and science) - Dataset - B2FIND
b2find.eudat.eu
Updated Apr 26, 2023
Cite
(2023). Flash Eurobarometer 239 (Young people and science) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/b5349e90-53eb-5e2b-81e8-4944c7c0a440
Dataset updated
Apr 26, 2023
Description
Attitudes of young people towards science. Topics: interest in each of the following topics: sports, politics, science and technology, economics, culture and entertainment; interest in each of the following subjects: information and communication technologies, earth and environment, universe, medical discoveries, new inventions and technologies; attitude towards selected statements on science and technology: science brings more benefits than harm, help eliminate hunger and poverty around the world, technology creates more jobs than it eliminates, science is too much influenced by profit, make lives healthier and more comfortable; attitude towards the following statements on the purpose of scientific research: should above all serve the development of knowledge, should above all serve economic development, should above all serve businesses and enterprises; awareness about innovations in the following areas of research: genetically modified food, nanotechnology, nuclear energy, mobile phones, human embryo research, brain research, computer and video surveillance techniques; attitude towards risks and advantages of the aforementioned research areas; most effective measures in tackling green-house effect and global warming; expected development in the following areas in the next twenty years in the own country: food quality, quality of air in cities, health, water quality, communication between people; assessment of the health risks of: air pollution caused by cars, pesticides used in plant production, genetically modified foods, fertilizers in underground water, vicinity of nuclear power plants, use of mobile phones, vicinity of high tension power lines, vicinity of chemical plants, new epidemics; preferred authorities to have biggest influence on decisions with regard to financing research: scientific community, government, citizens, private enterprises, research organisations, European Union, media; attitude towards the following statements on scientists: devoted to 
the good of humanity, dangerous power due to their knowledge; considerations to take up studies in the following fields: natural sciences, mathematics, engineering, biology or medicine, social sciences or humanities, economics; reasons for not taking up studies in the aforementioned fields; preferred kind of scientific profession: researcher in public sector, teacher, researcher in private sector, engineer, technician, health professional; attitude towards selected statements: young people’s interest in science is essential for future prosperity, girls and young women should be encouraged to take up careers in science, science classes at school are not appealing, national government should spend more money on scientific research, EU should spend more money on scientific research, need for better cooperation between member states and EU. Demography: sex; age; highest completed level of full time education; full time student; occupation of main income earner in the household; professional position of main income earner in the household; type of community. Additionally coded was: respondent ID; interviewer ID; language of the interview; country; date of interview; time of the beginning of the interview; duration of the interview; type of phone line; region; weighting factor.

Young people's interest in science and technology. Topics: interest in news about: sports, politics, science and technology, economics, culture and entertainment; interest in the topics: information and communication technologies, earth and environment, universe, human body and medicine, inventions and technologies; attitude towards science and technology (scale): science as benefit or harm, reduction of poverty, creation of jobs, science influenced by profit, making life easier; purpose of science: generation of knowledge, economic development, benefit for businesses; awareness of innovations in the areas of: genetically modified food, nanotechnology, mobile telephony, nuclear energy, embryo research, brain research, surveillance techniques, as well as assessment of the risks of these research fields for society; solving climate change through technology, lifestyle or legislation; improvement of the situation in the own country regarding: food quality as well as urban air and water quality, health of the population, communication between people; assessment of the risk to humanity from: air pollution, pesticides, genetically modified foods, groundwater pollution from fertilizers, nuclear power, mobile phones, high-voltage power lines, chemical plants, epidemics; preferred societal group to have the biggest influence on decisions about research funding; opinion of scientists: devoted people who work for the good of humanity, danger of the power of knowledge; interest in taking up studies; career goal; reasons against taking up studies; opinion on the importance of science for society (scale): essential for future prosperity, encouraging young people to take up scientific studies or careers in science, unattractiveness of science classes at school, more research funding by the national government as well as by the EU, demand for better coordination of research between EU member states. Demography: sex; age; highest level of education completed; full-time student; occupation of the main income earner in the household; professional position of the main income earner in the household; degree of urbanisation. Additionally coded: respondent ID; interviewer ID; language of the interview; country; date of interview; duration of the interview (start and end of interview); interview mode (mobile phone or landline); region; weighting factor.
TIMSS 1995 International Database
timss.bc.edu
ascii, sas, spss +1
Cite
TIMSS & PIRLS International Study Center, TIMSS 1995 International Database [Dataset]. https://timss.bc.edu/timss1995i/Database.html
Available download formats: sas, spss, stata, ascii
Dataset provided by
International Association for the Evaluation of Educational Achievement
TIMSS & PIRLS International Study Center [distributor]
Authors
TIMSS & PIRLS International Study Center
License
https://timssandpirls.bc.edu/Copyright/index.html
Time period covered
1995
Area covered
Germany, Lithuania, Spain, Switzerland, Colombia, Australia, South Africa, Canada, Mexico, Russian Federation
Dataset funded by
International Association for the Evaluation of Educational Achievement
Description
The Third International Mathematics and Science Study, known as TIMSS 1995, was the largest and most ambitious international study of student achievement conducted up to that time. In 1994-1995, it was conducted in more than 40 countries at five grade levels (the third, fourth, seventh, and eighth grades, and the final year of secondary school).
Students were tested in mathematics and science and extensive information about the teaching and learning of mathematics and science was collected from students, teachers, and school principals. Altogether, TIMSS tested and gathered contextual data for more than half a million students and administered questionnaires to thousands of teachers and school principals.
Also, TIMSS investigated the mathematics and science curricula of the participating countries through an analysis of curriculum guides, textbooks, and other curricular materials. The TIMSS results were released in 1996 and 1997 in a series of reports, providing valuable information to policy makers and practitioners in the participating countries about mathematics and science instruction and the achievement of their students. Technical reports and the complete international database also have been published.
The TIMSS international database contains a myriad of educational variables collected in more than 40 countries, including achievement results in mathematics and science for third- and fourth-grade students (Population 1), seventh- and eighth-grade students (Population 2), and students in the final year of secondary school (Population 3), as well as data from their teachers and their school principals.
Participating countries include: Argentina, Australia, Austria, Belgium (Flemish), Belgium (French), Bulgaria, Canada, Colombia, Cyprus, Czech Republic, Denmark, England, France, Germany, Greece, Hong Kong, Hungary, Iceland, Indonesia, Iran, Ireland, Israel, Italy, Japan, Korea, Kuwait, Latvia, Lithuania, Mexico, Netherlands, New Zealand, Norway, Philippines, Portugal, Romania, Russian Federation, Scotland, Singapore, Slovak Republic, Slovenia, South Africa, Spain, Sweden, Switzerland, Thailand, United States.
Current trends in scientific research on global warming: A bibliometric...
data.niaid.nih.gov
Updated Jan 21, 2020
Cite
R. Aleixandre-Benavent (2020). Current trends in scientific research on global warming: A bibliometric analysis (2005-2014) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1218021
Dataset updated
Jan 21, 2020
Dataset provided by
M. Bolaños-Pizarro
R. Aleixandre-Benavent
J.L. Aleixandre
J.L. Aleixandre-Tudó
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was created in the context of the project "Current trends in scientific research on global warming: A bibliometric analysis (2005-2014)".

Global warming is a topic of increasing public importance, but no scientometric studies on this topic have been published. The objective of this paper is to contribute to a better understanding of the scientific knowledge on global warming and its effects, as well as to investigate its evolution through the published papers included in the Web of Science database. Items under study were collected from the Web of Science database from Thomson Reuters. Bibliometric and social network analyses were performed to obtain indicators of scientific productivity, impact and collaboration between researchers, institutions and countries. A subject analysis was also carried out taking into account the keywords assigned to papers and the subject areas of journals. A total of 1,672 articles published from 2005 to 2014 were analysed. The most productive journals were Journal of Climate (n=95) and Geophysical Research Letters (n=78). The most frequent keywords were Climate Change (n=722), Model (n=216) and Temperature (n=196). The network of collaboration between countries shows the central position of the United States, together with other leading countries such as the United Kingdom, Germany, France and the People's Republic of China. Research on global warming grew steadily during the last decade. A large number of journals from several subject areas publish papers on the topic, including general-purpose journals with high impact factors. For almost all countries, the USA is the main collaborating country. The analysis of keywords shows that topics related to climate change, impact, temperature, models and variability are the most important concerns in global warming research.

The dataset consists of the following:

1) The list of papers included in the analyses: Papers.xlsx

This file contains 1672 titles, each line representing a paper (including title of the paper, journal ISSN and year of publication).

2) The list of authors: Authors.xlsx

This file contains all 4488 authors, each line representing an author (including full name, total number of papers and year of publication).

3) The list of scientific journals: Journals.xlsx

This file contains all 687 journals, each line representing a journal (including name of the journal, ISSN, total number of papers and year of publication).

4) The list of countries: Country.xlsx

This file contains all 84 countries, each line representing a country (including country name, total number of papers, total number of citations, and number of citations per paper).

5) The list of keywords: Keywords.xlsx

This file contains all 6422 keywords, each line representing a keyword (including keywords, number of papers and year of publication).
Kaggle DS Survey 2019
kaggle.com
Updated Dec 1, 2019
Cite
Alan Asri (2019). Kaggle DS Survey 2019 [Dataset]. https://www.kaggle.com/alanasri/kaggle-ds-survey-2019/tasks
Dataset updated
Dec 1, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Alan Asri
Description
Context

This notebook contains a thorough analysis and explanation of the survey conducted by Kaggle. The respondents vary in work background, age, place of residence, and the companies where they work. The survey questions cover the fields they work in, related to data science and machine learning.

Content

The following exploratory data analysis uses results from the survey conducted by Kaggle in 2019, in which respondents answered questions about machine learning and data science. The core points of the analysis are as follows:
1. Age distribution by formal education
2. Company and money spent on machine learning
3. Comparison of machine learning spending levels by company
4. Data scientists' experience and their compensation
5. Correlation between machine learning experience and salary
6. Correlation between being a data scientist and compensation
7. Favourite media sources on data science topics
8. Favourite media by age distribution; media most used by data scientists
9. Course platforms for data scientists
10. Job roles for each title; primary job of data scientists
11. Programming languages used regularly, by job title, especially for data scientists
12. Comparison of specific programming skills and compensation
13. Which programming language should an aspiring data scientist learn first?
14. Integrated development environments (IDEs) used on a regular basis
15. Top 5 IDEs and the countries using them (Microsoft is not dominant in the USA)
16. Notebooks most commonly used on a regular basis (Google dominates)
17. Hardware used for machine learning, by country and company
18. Job roles by specific company type
19. Computer vision methods most used by companies
20. Distribution of companies by country
21. Cloud products (Amazon dominates, Google follows)
22. Big data products (Amazon leads in enterprise, Google leads overall)

Athena survey of science engineering and Technology (ASSET) - Dataset -...
b2find.eudat.eu
Updated Oct 21, 2023
Cite
(2023). Athena survey of science engineering and Technology (ASSET) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/9701ea39-d474-5254-9af0-a48cb6142a57
Dataset updated
Oct 21, 2023
Description
The surveys contain quantitative data on position, seniority, subject area, contract type, salary, career history and some demographics (age, gender, family status). In addition there were a range of open-ended questions relating to experiences of employment, expectations for careers and views on what leads to success. A significant part of the project was to undertake a coding exercise of all of the open-ended questions in the survey. There are two SPSS data files – 4,282 in Higher Education and 2,444 in Research Institutes – covering 70/75 questions in HE and RI respectively from the survey. There are a further 300 variables, mostly indicator variables, which were derived in the quantitative and qualitative analysis. This project investigates the career patterns of research scientists in the UK using data collected by the Athena Survey of Science Engineering and Technology. It aims to identify the factors associated with a successful career, and to examine why the experiences of men and women in the profession differ so significantly. Specifically, women take home only 80% of the earnings of their male counterparts and, though they account for a third of the country’s research scientists, compose only 2% of the highest grades. It compares the experience of researchers employed by three different types of organisation, universities, research institutes and industry, and will assess the impact of each on career opportunity, progression and pay. The analysis of the factors determining pay and promotion will control for age, seniority, subject area and employer. It will utilise the descriptions that people give of their usual tasks and responsibilities, details of involvement in research projects, editing journals as indicators of productivity and prestige. This regression analysis will be supplemented by a qualitative analysis of what scientists report about their employment conditions and work environment, and how this has affected their career. 
We find evidence that female scientists in the UK face glass ceilings both in terms of pay and promotion. Not only do women earn less because they are less likely to be promoted, they are also likely to earn less when they are employed within the same grades. Interestingly, the point at which women hit the glass ceiling depends upon institutions. In Universities the glass ceiling is thickest at the point of promotion from senior lecturer to professor, a typical glass ceiling, whereas in Research Institutes women seem to face disadvantage in obtaining promotion from scientist (post-doc) to senior scientist, perhaps better described as a sticky floor. In both cases, these are the most demanding promotions but ceteris paribus they are significantly more demanding for women. Online survey using the Bristol Online Survey tool.
Climate Change: Earth Surface Temperature Data
kaggle.com
redivis.com
zip
Updated May 1, 2017
Cite
Berkeley Earth (2017). Climate Change: Earth Surface Temperature Data [Dataset]. https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data
Explore at:
Available download formats: zip (88843537 bytes)
Dataset updated
May 1, 2017
Dataset authored and provided by
Berkeley Earth (http://berkeleyearth.org/)
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Area covered
Earth
Description
Some say climate change is the biggest threat of our age while others say it’s a myth based on dodgy science. We are turning some of the data over to you so you can form your own view.

Even more than with other data sets that Kaggle has featured, there’s a huge amount of data cleaning and preparation that goes into putting together a long-time study of climate trends. Early data was collected by technicians using mercury thermometers, where any variation in the visit time impacted measurements. In the 1940s, the construction of airports caused many weather stations to be moved. In the 1980s, there was a move to electronic thermometers that are said to have a cooling bias.

Given this complexity, there are a range of organizations that collate climate trends data. The three most cited land and ocean temperature data sets are NOAA’s MLOST, NASA’s GISTEMP and the UK’s HadCrut.

We have repackaged the data from a newer compilation put together by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example by country). They publish the source data and the code for the transformations they applied. They also use methods that allow weather observations from shorter time series to be included, meaning fewer observations need to be thrown away.

In this dataset, we have included several files:

Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv):

Date: starts in 1750 for average land temperature and 1850 for max and min land temperatures and global ocean and land temperatures

LandAverageTemperature: global average land temperature in celsius

LandAverageTemperatureUncertainty: the 95% confidence interval around the average

LandMaxTemperature: global average maximum land temperature in celsius

LandMaxTemperatureUncertainty: the 95% confidence interval around the maximum land temperature

LandMinTemperature: global average minimum land temperature in celsius

LandMinTemperatureUncertainty: the 95% confidence interval around the minimum land temperature

LandAndOceanAverageTemperature: global average land and ocean temperature in celsius

LandAndOceanAverageTemperatureUncertainty: the 95% confidence interval around the global average land and ocean temperature
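
As a rough illustration of how the value and uncertainty columns listed above might be combined, the sketch below reads a tiny inline sample and derives a lower and upper bound per row. The sample values are made up, the `confidence_bounds` helper is hypothetical, and the code assumes that each uncertainty value is the half-width of the 95% confidence interval, which the description does not state explicitly. The header names follow the field list above and may differ in the actual file.

```python
import csv
import io

# Hypothetical two-row sample; the numbers are invented for illustration only.
SAMPLE = """Date,LandAverageTemperature,LandAverageTemperatureUncertainty,LandMaxTemperature
1750-01-01,3.0,3.5,
1850-01-01,0.75,1.25,8.25
"""


def confidence_bounds(row, value_col, uncertainty_col):
    """Return (low, high) for one row, or None if either field is missing or empty."""
    value = row.get(value_col, "")
    uncertainty = row.get(uncertainty_col, "")
    if not value or not uncertainty:
        return None
    v, u = float(value), float(uncertainty)
    # Assumption: the uncertainty is the half-width of the 95% interval.
    return (v - u, v + u)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
bounds = [
    confidence_bounds(r, "LandAverageTemperature", "LandAverageTemperatureUncertainty")
    for r in rows
]
```

Returning `None` for empty fields matters here because the max, min and land-and-ocean series start a century later than the land average, so early rows leave those columns blank.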

Other files include:

Global Average Land Temperature by Country (GlobalLandTemperaturesByCountry.csv)

Global Average Land Temperature by State (GlobalLandTemperaturesByState.csv)

Global Land Temperatures By Major City (GlobalLandTemperaturesByMajorCity.csv)

Global Land Temperatures By City (GlobalLandTemperaturesByCity.csv)

The raw data comes from the Berkeley Earth data page.
What are the prospects for citizen science in agriculture? Evidence from...
plos.figshare.com
datasetcatalog.nlm.nih.gov
docx
Updated May 30, 2023
Cite
Eskender Beza; Jonathan Steinke; Jacob van Etten; Pytrik Reidsma; Carlo Fadda; Sarika Mittra; Prem Mathur; Lammert Kooistra (2023). What are the prospects for citizen science in agriculture? Evidence from three continents on motivation and mobile telephone use of resource-poor farmers [Dataset]. http://doi.org/10.1371/journal.pone.0175700
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.1371/journal.pone.0175700
Dataset updated
May 30, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Eskender Beza; Jonathan Steinke; Jacob van Etten; Pytrik Reidsma; Carlo Fadda; Sarika Mittra; Prem Mathur; Lammert Kooistra
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As the sustainability of agricultural citizen science projects depends on volunteer farmers who contribute their time, energy and skills, understanding their motivation is important to attract and retain participants in citizen science projects. The objectives of this study were to assess 1) farmers’ motivations to participate as citizen scientists and 2) farmers’ mobile telephone usage. Building on motivational factors identified from previous citizen science studies, a questionnaire based methodology was developed which allowed the analysis of motivational factors and their relation to farmers’ characteristics. The questionnaire was applied in three communities of farmers, in countries from different continents, participating as citizen scientists. We used statistical tests to compare motivational factors within and among the three countries. In addition, the relations between motivational factors and farmers characteristics were assessed. Lastly, Principal Component Analysis (PCA) was used to group farmers based on their motivations. Although there was an overlap between the types of motivations, for Indian farmers a collectivistic type of motivation (i.e., contribute to scientific research) was more important than egoistic and altruistic motivations. For Ethiopian and Honduran farmers an egoistic intrinsic type of motivation (i.e., interest in sharing information) was most important. While fun has appeared to be an important egoistic intrinsic factor to participate in other citizen science projects, the smallholder farmers involved in this research valued ‘passing free time’ the lowest. Two major groups of farmers were distinguished: one motivated by sharing information (egoistic intrinsic), helping (altruism) and contribute to scientific research (collectivistic) and one motivated by egoistic extrinsic factors (expectation, expert interaction and community interaction). 
Country and education level were the two most important farmers’ characteristics that explain around 20% of the variation in farmers motivations. For educated farmers, contributing to scientific research was a more important motivation to participate as citizen scientists compared to less educated farmers. We conclude that motivations to participate in citizen science are different for smallholders in agriculture compared to other sectors. Citizen science does have high potential, but easy to use mechanisms are needed. Moreover, gamification may increase the egoistic intrinsic motivation of farmers.
Flash Eurobarometer 239 (Young people and science) - Dataset - B2FIND
b2find.eudat.eu
Updated Jul 29, 2025
Cite
(2025). Flash Eurobarometer 239 (Young people and science) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/62f2ec42-d910-5797-b100-c8623722c830
Explore at:
Dataset updated
Jul 29, 2025
Description
Attitudes of young people towards science. Topics: interest in each of the following topics: sports, politics, science and technology, economics, culture and entertainment; interest in each of the following subjects: information and communication technologies, earth and environment, universe, medical discoveries, new inventions and technologies; attitude towards selected statements on science and technology: science brings more benefits than harm, help eliminate hunger and poverty around the world, technology creates more jobs than it eliminates, science is too much influenced by profit, make lives healthier and more comfortable; attitude towards the following statements on the purpose of scientific research: should above all serve the development of knowledge, should above all serve economic development, should above all serve businesses and enterprises; awareness about innovations in the following areas of research: genetically modified food, nanotechnology, nuclear energy, mobile phones, human embryo research, brain research, computer and video surveillance techniques; attitude towards risks and advantages of the aforementioned research areas; most effective measures in tackling green-house effect and global warming; expected development in the following areas in the next twenty years in the own country: food quality, quality of air in cities, health, water quality, communication between people; assessment of the health risks of: 
air pollution caused by cars, pesticides used in plant production, genetically modified foods, fertilizers in underground water, vicinity of nuclear power plants, use of mobile phones, vicinity of high tension power lines, vicinity of chemical plants, new epidemics; preferred authorities to have biggest influence on decisions with regard to financing research: scientific community, government, citizens, private enterprises, research organisations, European Union, media; attitude towards the following statements on scientists: devoted to the good of humanity, dangerous power due to their knowledge; considerations to take up studies in the following fields: natural sciences, mathematics, engineering, biology or medicine, social sciences or humanities, economics; reasons for not taking up studies in the aforementioned fields; preferred kind of scientific profession: researcher in public sector, teacher, researcher in private sector, engineer, technician, health professional; attitude towards selected statements: young people’s interest in science is essential for future prosperity, girls and young women should be encouraged to take up careers in science, science classes at school are not appealing, national government should spend more money on scientific research, EU should spend more money on scientific research, need for better cooperation between member states and EU. Demography: sex; age; highest completed level of full time education; full time student; occupation of main income earner in the household; professional position of main income earner in the household; type of community. Additionally coded was: respondent ID; interviewer ID; language of the interview; country; date of interview; time of the beginning of the interview; duration of the interview; type of phone line; region; weighting factor.
QoG Basic Dataset
researchdata.se
gimi9.com
Updated Aug 6, 2024
Cite
Stefan Dahlberg; Aksel Sundström; Sören Holmberg; Bo Rothstein; Natalia Alvarado Pachon; Cem Mert Dalli (2024). QoG Basic Dataset [Dataset]. http://doi.org/10.18157/qogbasjan22
Explore at:
Available download formats: (62883921)
Unique identifier
https://doi.org/10.18157/qogbasjan22
Dataset updated
Aug 6, 2024
Dataset provided by
University of Gothenburg
Authors
Stefan Dahlberg; Aksel Sundström; Sören Holmberg; Bo Rothstein; Natalia Alvarado Pachon; Cem Mert Dalli
Time period covered
2015 - 2021
Description
The QoG Institute is an independent research institute within the Department of Political Science at the University of Gothenburg. Overall 30 researchers conduct and promote research on the causes, consequences and nature of Good Governance and the Quality of Government - that is, trustworthy, reliable, impartial, uncorrupted and competent government institutions.

The main objective of our research is to address the theoretical and empirical problem of how political institutions of high quality can be created and maintained. A second objective is to study the effects of Quality of Government on a number of policy areas, such as health, the environment, social policy, and poverty.

QoG Basic Dataset, which consists of approximately the 300 most used variables from QoG Standard Dataset, is a selection of variables that cover the most important concepts related to Quality of Government.

The QoG Basic CS dataset includes data from and around 2018. Data from 2018 is prioritized; if no data is available for a country for 2018, data for 2019 is included. If no data exists for 2019, data for 2017 is included, and so on up to a maximum of +/- 3 years.
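
The year-fallback rule described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the ordering beyond 2019 and 2017 (here 2020, 2016, 2021, 2015) is an assumption about what "and so on" means, not something the description confirms.

```python
def candidate_years(target=2018, max_offset=3):
    """Years to try in priority order: target, target+1, target-1, target+2, ..."""
    years = [target]
    for offset in range(1, max_offset + 1):
        # Assumption: the later year is tried before the earlier one at each distance.
        years.extend([target + offset, target - offset])
    return years


def pick_value(values_by_year, target=2018, max_offset=3):
    """Return (year, value) for the first candidate year with data, else None."""
    for year in candidate_years(target, max_offset):
        value = values_by_year.get(year)
        if value is not None:
            return year, value
    return None
```

For example, a country with observations only for 2017 and 2020 would contribute its 2017 value, while a country whose latest observation is from 2014 would have no value in the cross-section.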

In the QoG Basic TS dataset, data from 1946 to 2021 is included and the unit of analysis is country-year (e.g., Sweden-1946, Sweden-1947, etc.).

The primary aim of QoG is to conduct and promote research on corruption. One aim of the QoG Institute is to make publicly available cross-national comparative data on QoG and its correlates.
Global Roads Open Access Data Set, Version 1 (gROADSv1) - Dataset - NASA...
data.nasa.gov
Updated May 16, 2013
Cite
nasa.gov (2013). Global Roads Open Access Data Set, Version 1 (gROADSv1) - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/global-roads-open-access-data-set-version-1-groadsv1
Explore at:
Dataset updated
May 16, 2013
Dataset provided by
NASA (http://nasa.gov/)
Description
The Global Roads Open Access Data Set, Version 1 (gROADSv1) was developed under the auspices of the CODATA Global Roads Data Development Task Group. The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. All country road networks have been joined topologically at the borders, and many countries have been edited for internal topology. Source data for each country are provided in the documentation, and users are encouraged to refer to the readme file for use constraints that apply to a small number of countries. Because the data are compiled from multiple sources, the date range for road network representations ranges from the 1980s to 2010 depending on the country (most countries have no confirmed date), and spatial accuracy varies. The baseline global data set was compiled by the Information Technology Outreach Services (ITOS) of the University of Georgia. Updated data for 27 countries and 6 smaller geographic entities were assembled by Columbia University's Center for International Earth Science Information Network (CIESIN), with a focus largely on developing countries with the poorest data coverage.
Global Roads Open Access Data Set, Version 1 (gROADSv1)
covid19.esriuk.com
hub.arcgis.com
Updated Jul 10, 2015
Cite
Columbia (2015). Global Roads Open Access Data Set, Version 1 (gROADSv1) [Dataset]. https://covid19.esriuk.com/maps/33119a20274f42399ba5f4521bc89eb1
Explore at:
Dataset updated
Jul 10, 2015
Dataset authored and provided by
Columbia
Description
The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. The purpose is to provide an open access, well documented global data set of roads between settlements using a consistent data model (UNSDI-T v.2) which is, to the extent possible, topologically integrated.

Dataset Summary

The Global Roads Open Access Data Set, Version 1 (gROADSv1) was developed under the auspices of the CODATA Global Roads Data Development Task Group. The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. All country road networks have been joined topologically at the borders, and many countries have been edited for internal topology. Source data for each country are provided in the documentation, and users are encouraged to refer to the readme file for use constraints that apply to a small number of countries. Because the data are compiled from multiple sources, the date range for road network representations ranges from the 1980s to 2010 depending on the country (most countries have no confirmed date), and spatial accuracy varies. The baseline global data set was compiled by the Information Technology Outreach Services (ITOS) of the University of Georgia. Updated data for 27 countries and 6 smaller geographic entities were assembled by Columbia University's Center for International Earth Science Information Network (CIESIN), with a focus largely on developing countries with the poorest data coverage.

Documentation for the Global Roads Open Access Data Set, Version 1 (gROADSv1)

Recommended Citation

Center for International Earth Science Information Network - CIESIN - Columbia University, and Information Technology Outreach Services - ITOS - University of Georgia. 2013. Global Roads Open Access Data Set, Version 1 (gROADSv1). Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC). http://dx.doi.org/10.7927/H4VD6WCT. Accessed DAY MONTH YEAR.
Semantic Query Analysis from the Global Science Gateway - Dataset - B2FIND
b2find.eudat.eu
Updated Oct 12, 2024
Cite
(2024). Semantic Query Analysis from the Global Science Gateway - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/2cf68914-a4ff-535e-89bc-9b86b2ca555c
Explore at:
Dataset updated
Oct 12, 2024
Description
Nowadays web portals play an essential role in searching and retrieving information in the several fields of knowledge: they are ever more technologically advanced and designed for supporting the storage of a huge amount of information in natural language originating from the queries launched by users worldwide.

A good example is given by the WorldWideScience search engine:

The database is available at . It is based on a similar gateway, Science.gov, which is the major path to U.S. government science information, as it pulls together Web-based resources from various agencies. The information in the database is intended to be of high quality and authority, as well as the most current available from the participating countries in the Alliance, so users will find that the results will be more refined than those from a general search of Google. It covers the fields of medicine, agriculture, the environment, and energy, as well as basic sciences. Most of the information may be obtained free of charge (the database itself may be used free of charge) and is considered "open domain." As of this writing, there are about 60 countries participating in WorldWideScience.org, providing access to 50+ databases and information portals. Not all content is in English. (Bronson, 2009)

Given this scenario, we focused on building a corpus constituted by the query logs registered by the GreyGuide: Repository and Portal to Good Practices and Resources in Grey Literature and received by the WorldWideScience.org (The Global Science Gateway) portal: the aim is to retrieve information related to social media which as of today represent a considerable source of data more and more widely used for research ends.

This project includes eight months of query logs registered between July 2017 and February 2018 for a total of 445,827 queries. The analysis mainly concentrates on the semantics of the queries received from the portal clients: it is a process of information retrieval from a rich digital catalogue whose language is dynamic, is evolving and follows – as well as reflects – the cultural changes of our modern society.
COVID-19 Pandemic Wikipedia Readership
figshare.com
txt
Updated May 31, 2023
Cite
Isaac Johnson; Leila Zia; Joseph Allemandou; Marcel Ruiz Forns; Nuria Ruiz; Fabian Kaelin (2023). COVID-19 Pandemic Wikipedia Readership [Dataset]. http://doi.org/10.6084/m9.figshare.14548032.v3
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.14548032.v3
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Isaac Johnson; Leila Zia; Joseph Allemandou; Marcel Ruiz Forns; Nuria Ruiz; Fabian Kaelin
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This data release includes two Wikipedia datasets related to the readership of the project during the early COVID-19 pandemic period. The first dataset is COVID-19 article page views by country; the second dataset is one-hop navigation where one of the two pages is COVID-19 related. The data covers roughly the first six months of the pandemic, more specifically from January 1st 2020 to June 30th 2020. For more background on the pandemic in those months, see English Wikipedia's Timeline of the COVID-19 pandemic.

Wikipedia articles are considered COVID-19 related according to the methodology described here; the list of COVID-19 articles used for the released datasets is available in covid_articles.tsv. For simplicity and transparency, the same list of articles from 20 April 2020 was used for the entire dataset, though in practice new COVID-19-relevant articles were constantly being created as the pandemic evolved.

Privacy considerations

While this data is considered valuable for the insight that it can provide about information-seeking behaviors around the pandemic in its early months across diverse geographies, care must be taken to not inadvertently reveal information about the behavior of individual Wikipedia readers. We put in place a number of filters to release as much data as we can while minimizing the risk to readers.

The Wikimedia Foundation started to release most viewed articles by country from Jan 2021. At the beginning of the COVID-19 pandemic an exemption was made to store reader data about the pandemic with additional privacy protections:

- exclude the page views from users engaged in an edit session
- exclude reader data from specific countries (with a few exceptions)
- the aggregated statistics are based on 50% of reader sessions that involve a pageview to a COVID-19-related article (see covid_pages.tsv). As a control, a 1% random sample of reader sessions that have no pageviews to COVID-19-related articles was kept. In aggregate, we make sure this 1% non-COVID-19 sample and 50% COVID-19 sample represent less than 10% of pageviews for a country for that day. The randomization and filters occur on a daily cadence with all timestamps in UTC.
- exclude power users, i.e. userhashes with greater than 500 pageviews in a day. This doubles as another form of likely bot removal, protects very heavy users of the project, and also in theory would help reduce the chance of a single user heavily skewing the data.
- exclude readership from users of the iOS and Android Wikipedia apps.

In effect, the view counts in this dataset represent comparable trends rather than the total amount of traffic from a given country. For more background on readership data per country data, and the COVID-19 privacy protections in particular, see this phabricator.

To further minimize privacy risks, a k-anonymity threshold of 100 was applied to the aggregated counts. For example, a page needs to be viewed at least 100 times in a given country and week in order to be included in the dataset. In addition, the view counts are floored to a multiple of 100.

Datasets

The datasets published in this release are derived from a reader session dataset generated by the code in this notebook with the filtering described above. The raw reader session data itself will not be publicly available due to privacy considerations. The datasets described below are similar to the pageviews and clickstream data that the Wikimedia Foundation publishes already, with the addition of the country specific counts.

COVID-19 pageviews

The file covid_pageviews.tsv contains:

- pageview counts for COVID-19 related pages, aggregated by week and country
- k-anonymity threshold of 100
- example: In the 13th week of 2020 (23 March - 29 March 2020), the page 'Pandémie_de_Covid-19_en_Italie' on French Wikipedia was visited 11700 times from readers in Belgium
- as a control bucket, we include pageview counts to all pages aggregated by week and country. Due to privacy considerations during the collection of the data, the control bucket was sampled at ~1% of all view traffic. The view counts for the control title are thus proportional to the total number of pageviews to all pages.

The file is ~8 MB and contains ~134000 data points across the 27 weeks, 108 countries, and 168 projects.

Covid reader session bigrams

The file covid_session_bigrams.tsv contains:

- number of occurrences of visits to pages A -> B, where either A or B is a COVID-19 related article. Note that the bigrams are tuples (from, to) of articles viewed in succession; the underlying mechanism can be clicking on a link in an article, but it may also have been a new search or reading both articles based on links from third source articles. In contrast, the clickstream data is based on referral information only
- aggregated by month and country
- k-anonymity threshold of 100
- example: In March of 2020, there were 1000 occurrences of readers accessing the page es.wikipedia/SARS-CoV-2 followed by es.wikipedia/Orthocoronavirinae from Chile

The file is ~10 MB and contains ~90000 bigrams across the 6 months, 96 countries, and 56 projects.

Contact

Please reach out to research-feedback@wikimedia.org for any questions.
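
The k-anonymity threshold and flooring rule mentioned in the description can be illustrated with a minimal sketch. The function name `release_count` is hypothetical and this is not the code used to produce the release; it only mirrors the two stated rules (drop counts below 100, floor the rest to a multiple of 100).

```python
K_ANONYMITY_THRESHOLD = 100


def release_count(raw_count, k=K_ANONYMITY_THRESHOLD):
    """Return the publishable count, or None if the raw count is below the threshold."""
    if raw_count < k:
        return None  # below the threshold: the row is left out of the dataset
    return (raw_count // k) * k  # floor to a multiple of k
```

Under these rules any raw count from 11700 to 11799 would be published as 11700, which is why the released counts are better read as comparable trends than as exact traffic.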

Cite

Matteo Cancellieri; Nancy Pontika; David Pride; Petr Knoth; Hannah Metzler; Antonia Correia; Helene Brinken; Bikash Gyawali (2022). Career promotions, research publications, Open Access dataset [Dataset]. http://doi.org/10.21954/ou.rd.19228785.v1

Career promotions, research publications, Open Access dataset

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

Available download formats: zip

Unique identifier

https://doi.org/10.21954/ou.rd.19228785.v1

Dataset updated

Feb 28, 2022

Dataset provided by

The Open University

Authors

Matteo Cancellieri; Nancy Pontika; David Pride; Petr Knoth; Hannah Metzler; Antonia Correia; Helene Brinken; Bikash Gyawali

License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset is a compilation of processed data on citations and references for research papers, including their author, institution and open access information, for a selected sample of academics analysed using Microsoft Academic Graph (MAG) data and CORE. The data was collected between December 2019 and January 2020.

Six countries (Austria, Brazil, Germany, India, Portugal, United Kingdom and United States) were the focus of the six questions which make up this dataset. There is one csv file per country and per question (36 files in total). More details about the creation of this dataset are available in the public ON-MERRIT D3.1 deliverable report.

The dataset combines two data sources: one part was created by analysing promotion policies across the target countries, while the second part is a set of data points for understanding publishing behaviour. To facilitate the analysis, the dataset is organised in the following seven folders:

PRT
The file "PRT_policies.csv" contains the information extracted from promotion, review and tenure (PRT) policies.

Q1: What % of papers coming from a university are Open Access?
- Dataset Name format: oa_status_countryname_papers.csv
- Dataset Contents: Open Access (OA) status of all papers of all the universities listed in Times Higher Education World University Rankings (THEWUR) for the given country. A paper is marked OA if at least one OA link is available. OA links are collected using the CORE Discovery API.
- Important considerations about this dataset:
  - A paper with multiple authors is preserved only once for each of the distinct institutions its authors may belong to.
  - The service used to recognise whether a paper is OA, CORE Discovery, does not contain entries for all paperids in MAG. This implies that some records in the extracted dataset have neither a true nor a false value for the is_OA field.
  - Only records marked as true in the is_OA field can be said to be OA. Records with false or no value in the is_OA field have unknown status (i.e. they are not necessarily closed access).

Q2: How are papers, published by the selected universities, distributed across the three scientific disciplines of our choice?
- Dataset Name format: fsid_countryname_papers.csv
- Dataset Contents: For the given country, all papers for all the universities listed in THEWUR, with the fieldofstudy they belong to.
- Important considerations about this dataset:
  - MAG can associate a paper with multiple fieldofstudyid values. If a paper belongs to more than one of our fieldofstudyid values, a separate record was created for the paper for each of them.
  - MAG assigns each fieldofstudyid to a paper with a score. We preserve only those records whose score is more than 0.5 for the fieldofstudyid concerned.
  - A paper with multiple authors is preserved only once for each of the distinct institutions its authors may belong to. Papers with authorship from multiple universities are counted once towards each of the universities concerned.

Q3: What is the gender distribution in authorship of papers published by the universities?
- Dataset Name format: author_gender_countryname_papers.csv
- Dataset Contents: All papers with their author names for all the universities listed in THEWUR.
- Important considerations about this dataset:
  - When a paper has multiple collaborators (authors), only the records for collaborators from within the selected universities are preserved.
  - An external script was executed to determine the gender of the authors. The script is available here.

Q4: Distribution of staff seniority (= number of years from their first publication until the last publication) in the given university.
- Dataset Name format: author_ids_countryname_papers.csv
- Dataset Contents: For a given country, all papers for authors with their publication year for all the universities listed in THEWUR.
- Important considerations about this dataset:
  - When a paper has multiple collaborators (authors), only the records for collaborators from within the selected universities are preserved.
  - Staff seniority can be calculated in various ways. The most straightforward option is academic_age = MAX(year) - MIN(year) for each authorid.

Q5: Citation counts (incoming) for OA vs Non-OA papers published by the university.
- Dataset Name format: cc_oa_countryname_papers.csv
- Dataset Contents: OA status and OA links for all papers of all the universities listed in THEWUR and, for each of those papers, the count of incoming citations available in MAG.
- Important considerations about this dataset:
  - CORE Discovery was used to establish the OA status of papers.
  - A paper with multiple authors is preserved only once for each of the distinct institutions its authors may belong to.
  - Only records marked as true in the is_OA field can be said to be OA. Records with false or no value in the is_OA field have unknown status (i.e. they are not necessarily closed access).

Q6: Count of OA vs Non-OA references (outgoing) for all papers published by universities.
- Dataset Name format: rc_oa_countryname_-papers.csv
- Dataset Contents: Counts of all OA and unknown papers referenced by all papers published by all the universities listed in THEWUR.
- Important considerations about this dataset:
  - CORE Discovery was used to establish the OA status of the papers being referenced.
  - A paper with multiple authors is preserved only once for each of the distinct institutions its authors may belong to. Papers with authorship from multiple universities are counted once towards each of the universities concerned.

Additional files:
- fieldsofstudy_mag.csv: this file contains a dump of the fieldsofstudy table of MAG, mapping each of the ids to its actual field of study name.
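
The note on the is_OA field (Q1, Q5) and the seniority formula (Q4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' processing code. The column names is_OA, authorid and year come from the description above, but the exact CSV headers and the encoding of values (here the strings "true" and "false") are assumptions, and the function names oa_status_counts and academic_age are hypothetical.

```python
from collections import Counter


def oa_status_counts(rows):
    """Count records as 'oa' or 'unknown'.

    Only is_OA == true counts as OA; false or missing values are
    treated as unknown status, not as closed access.
    Assumes is_OA is stored as the text 'true'/'false' (or a bool).
    """
    counts = Counter()
    for row in rows:
        value = str(row.get("is_OA") or "").strip().lower()
        counts["oa" if value == "true" else "unknown"] += 1
    return counts


def academic_age(rows):
    """Return {authorid: MAX(year) - MIN(year)} over all records."""
    first, last = {}, {}
    for row in rows:
        author, year = row["authorid"], int(row["year"])
        first[author] = min(year, first.get(author, year))
        last[author] = max(year, last.get(author, year))
    return {author: last[author] - first[author] for author in first}
```

Rows can be read from any of the per-country files with csv.DictReader. Reporting "unknown" rather than "closed" keeps the OA share a lower bound, which matches the caveat that CORE Discovery does not cover every MAG paperid.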

