https://choosealicense.com/licenses/unknown/
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.
Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)
All bloggers included in the corpus fall into one of three age groups: - 8240 "10s" blogs (ages 13-17), - 8086 "20s" blogs (ages 23-27), - 2994 "30s" blogs (ages 33-47).
For each age group there are an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped, with two exceptions: individual posts within a single blog file are separated by the date of the following post, and links within a post are denoted by the label urllink.
The corpus may be freely used for non-commercial research purposes.
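As an illustration, here is a minimal Python sketch for working with the corpus files; it assumes the dot-separated file naming convention (id, gender, age, industry, sign) described above, and since the exact post/date markup is not specified here, it only counts urllink labels.

from pathlib import Path

def parse_blog_filename(path):
    # Split a corpus file name into the metadata fields described above
    # (assumed order: <id>.<gender>.<age>.<industry>.<sign>.xml).
    parts = Path(path).name.split(".")
    blogger_id, gender, age, industry, sign = parts[:5]
    return {"id": blogger_id, "gender": gender, "age": int(age),
            "industry": industry, "sign": sign}

def count_urllinks(path):
    # Links within a post are denoted by the label 'urllink'.
    text = Path(path).read_text(encoding="utf-8", errors="ignore")
    return text.count("urllink")

# Example (hypothetical file name):
# meta = parse_blog_filename("blogs/1000331.female.37.indUnk.Leo.xml")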
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The blogs in the blogmix are selected from the lists at bloggportalen.se: Most visited private blogs, Most visited professional blogs, and the local lists for different regions. Additional information, such as the blogger's location and age, is also retrieved from Bloggportalen. The material has not been checked manually, which means that spam may occur. Some English-language blogs have been removed when discovered, and some blogs could not be added for technical reasons. The time period covered ranges from the first to the most recent entries of the selected blogs, and the corpus is updated regularly. The material is sentence-scrambled.
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying GitHub repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.
Options to access the dataset
There are two ways to get access to the dataset:
In order to obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.
References
If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform,
  author    = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
  booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
  pages     = {1--7},
  title     = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
  year      = {2019}
}
@inproceedings{SrbaMonantMedicalDataset,
  author    = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
  numpages  = {11},
  title     = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
  year      = {2022},
  doi       = {10.1145/3477495.3531726},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3477495.3531726}
}
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, WordPress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data are stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset
The way to report substantial mistakes in the raw collected data or in the manual annotations is to create a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
First, the dataset contains so-called raw data (i.e., data extracted by the web monitoring module of the Monant platform and stored in exactly the same form as they appear on the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (numbers of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
sources.csv
articles.csv
article_media.csv
article_authors.csv
discussion_posts.csv
discussion_post_authors.csv
fact_checking_articles.csv
fact_checking_article_media.csv
claims.csv
feedback_facebook.csv
Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
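For a first look at the raw data, a minimal loading sketch is given below; it assumes the CSV files listed above sit in a local directory (path hypothetical) and makes no assumptions about their columns.

import pandas as pd
from pathlib import Path

RAW_FILES = [
    "sources.csv", "articles.csv", "article_media.csv", "article_authors.csv",
    "discussion_posts.csv", "discussion_post_authors.csv",
    "fact_checking_articles.csv", "fact_checking_article_media.csv",
    "claims.csv", "feedback_facebook.csv",
]

def load_raw_data(data_dir):
    # Load each raw CSV file into a pandas DataFrame, keyed by file stem.
    data_dir = Path(data_dir)
    return {Path(name).stem: pd.read_csv(data_dir / name) for name in RAW_FILES}

# tables = load_raw_data("monant_dataset")   # hypothetical local path
# print({name: df.shape for name, df in tables.items()})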
Annotations
Second, the dataset contains so-called annotations. Entity annotations describe individual raw data entities (e.g., article, source). Relation annotations describe relations between two such entities.
Each annotation is described by the following attributes:
category of annotation (annotation_category). Possible values: label (annotation corresponds to ground truth determined by human experts) and prediction (annotation was created by means of an AI method).
type of annotation (annotation_type_id). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
method which created the annotation (method_id). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
its value (value). The value is stored in JSON format and its structure differs according to the particular annotation type.
At the same time, annotations are associated with a particular object identified by:
entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation annotations).
The dataset specifically provides these entity annotations:
Source reliability (binary). Determines the validity of a source (website) on a binary scale with two options: reliable source and unreliable source.
Article veracity. Aggregated information about veracity from article-claim pairs.
The dataset specifically provides these relation annotations:
Fact-checking article to claim mapping. Determines mapping between fact-checking article and claim.
Claim presence. Determines presence of claim in article.
Claim stance. Determines stance of an article to a claim.
Annotations are contained in these CSV files (and corresponding REST API endpoints):
entity_annotations.csv
relation_annotations.csv
Note: The identities of the human annotators (emails provided in the annotation app) are anonymised.
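A minimal sketch of how the annotation files might be consumed is shown below; it assumes the attributes described above appear as CSV columns with the same names (annotation_category, annotation_type_id, method_id, value, entity_type, entity_id), which should be verified against the actual files.

import json
import pandas as pd

def load_entity_annotations(path="entity_annotations.csv"):
    # Load entity annotations and decode the JSON-encoded 'value' column.
    df = pd.read_csv(path)
    df["value"] = df["value"].apply(json.loads)
    return df

def ground_truth_only(df):
    # Keep only human-expert labels (annotation_category == 'label').
    return df[df["annotation_category"] == "label"]

# annotations = ground_truth_only(load_entity_annotations())
# article_annotations = annotations[annotations["entity_type"] == "articles"]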
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this paper, we aim to analyze the production of professional blogs by pre-service English teachers and the roles that such digital language practices may perform in the education of reflexive and critical language teachers. Specifically, we analyzed the blog posts that two pre-service teachers produced and the professional identities that are forged in such a digital language practice. The reported case study is of qualitative and interpretative nature. The data, composed by the pre-service English teachers' blog posts and their experiential narratives regarding the pedagogical practice they experienced, were generated in 2010 in an elective unit in a state university of the north of Paraná. The results demonstrate the emergence of identity conflicts due to the engagement of the pre-service English teachers in the production of digital language practices. These conflicts have generated an impulse towards the reconstruction of the identities of these future English language professionals.
http://rightsstatements.org/vocab/InC/1.0/
This dataset comprises a set of Twitter accounts in Singapore that are used for social bot profiling research conducted by the Living Analytics Research Centre (LARC) at Singapore Management University (SMU). Here a bot is defined as a Twitter account that generates contents and/or interacts with other users automatically (at least according to human judgment). In this research, Twitter bots have been categorized into three major types:
Broadcast bot. This bot aims at disseminating information to a general audience by providing, e.g., benign links to news, blogs or sites. Such a bot is often managed by an organization or a group of people (e.g., bloggers).
Consumption bot. The main purpose of this bot is to aggregate content from various sources and/or provide update services (e.g., horoscope readings, weather updates) for personal consumption or use.
Spam bot. This type of bot posts malicious content (e.g., to trick people by hijacking certain accounts or redirecting them to malicious sites), or promotes harmless but invalid/irrelevant content aggressively.
This categorization is general enough to cater for new, emerging types of bot (e.g., chatbots can be viewed as a special type of broadcast bots). The dataset was collected from 1 January to 30 April 2014 via the Twitter REST and streaming APIs. Starting from popular seed users (i.e., users having many followers), their follow, retweet, and user mention links were crawled. The data collection proceeds by adding those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. Using this procedure, a total of 159,724 accounts have been collected. To identify bots, the first step is to check active accounts who tweeted at least 15 times within the month of April 2014. These accounts were then manually checked and labelled, of which 589 bots were found. As many more human users are expected in the Twitter population, the remaining accounts were randomly sampled and manually checked. With this, 1,024 human accounts were identified. In total, this results in 1,613 labelled accounts. Related Publication: R. J. Oentaryo, A. Murdopo, P. K. Prasetyo, and E.-P. Lim. (2016). On profiling bots in social media. Proceedings of the International Conference on Social Informatics (SocInfo’16), 92-109. Bellevue, WA. https://doi.org/10.1007/978-3-319-47880-7_6
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: This dataset holds the content of one day's micro-blogs sampled from Weibo (http://weibo.com) in the form of bags-of-words.
Data set characteristics: Text. Number of micro-blogs: 189,223. Total number of words: 3,252,492. Size of the vocabulary: 20,942. Associated tasks: short text topic modeling, etc.
About preprocessing: For tokenization, we use NLPIR. Stop words and words with a term frequency of less than 20 were removed. Words containing only one Chinese character were also removed.
Data format: The released data are formatted as follows: [document_1] [document_2] ... [document_M], in which each line is one document. [document_i] is the i-th document of the dataset and consists of a list of Ni words/terms: [document_i] = [word_i1] [word_i2] ... [word_iNi], in which all word_ij are text strings separated by the blank character.
If you have any questions about the data set, please contact: jichang@buaa.edu.cn.
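Given the line-per-document, space-separated format described above, a minimal reading sketch (file name hypothetical) could look like this:

from collections import Counter

def read_bow_corpus(path, encoding="utf-8"):
    # One document per line; words/terms are separated by blank characters.
    documents = []
    with open(path, encoding=encoding) as f:
        for line in f:
            words = line.split()          # word_i1 ... word_iNi
            if words:
                documents.append(words)
    return documents

# docs = read_bow_corpus("weibo_one_day.txt")   # hypothetical file name
# vocab = Counter(word for doc in docs for word in doc)
# print(len(docs), len(vocab))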
Although the commercial name for the USAID University Learning Management System is CSOD InCompass, the agencies that use the system have renamed (or rebranded) their specific agency portals to meet their own needs. InCompass is a comprehensive talent management system that incorporates the following functional modules:
1) Learning -- The Learning module supports the management and tracking of training events and individual training records. Training events may be instructor-led or online. Courses may be managed within the system to provide descriptions, availability, and registration. Online content is stored on the system. Training information stored for individuals includes courses completed, scores, and courses registered for.
2) Connect -- The Connect module supports employee collaboration efforts. Features include communities of practice, expertise location, blogs, and knowledge-sharing support. Profile information that may be stored by the system includes job position, subject matter expertise, and previous accomplishments.
3) Performance -- The Performance module supports management of organizational goals and alignment of those goals to individual performance. The module supports managing skills and competencies for the organization. The module also supports employee performance reviews. The types of information gathered about employees include their skills, competencies, and performance evaluations.
4) Succession -- The Succession module supports workforce management and planning. The type of information gathered for this module includes prior work experience, skills, and competencies.
5) Extended Enterprise -- The Extended Enterprise module supports delivery of training outside of the organization. Training provided may be for a fee. The type of information collected for this module includes individual data for identifying the person for training records management and related information for commercial transactions.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset comes from the Open National Address Base project initiated by OpenStreetMap France.
For more information on this project: http://openstreetmap.fr/blogs/cquest/bano-banco
Origin of data
BANO is a composite database, made up from different sources:
Distribution format
These files are available in shapefile format in WGS84 projection (EPSG:4326), as well as in CSV format and, experimentally, as a GitHub project.
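As an illustration, a minimal sketch for loading one of these exports is shown below (file name hypothetical; the shapefile read requires geopandas).

import geopandas as gpd

# Minimal sketch: load a BANO export; the data are distributed in WGS84 (EPSG:4326).
addresses = gpd.read_file("bano_export.shp")   # hypothetical file name
print(addresses.crs, len(addresses))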
Description of content
For each address:
updates, corrections
To update and correct BANO data, simply make improvements directly in OpenStreetMap; they will be taken into account in the next update cycle.
A one-stop collaborative reporting/correction window will soon be set up to simplify the process of improving the content of the database. To participate in its co-construction, do not hesitate to contact us!
For any questions concerning the project or this dataset, you can contact bano@openstreetmap.fr
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset comes from the Open National Address Base project initiated by OpenStreetMap France.
For more information on this project: http://openstreetmap.fr/blogs/cquest/bano-banco
Origin of data
BANO is a composite database, made up from different sources:
Distribution format
These files are available in shapefile format in WGS84 projection (EPSG:4326), as well as in CSV format and, experimentally, as a GitHub project.
Description of content
For each address:
updates, corrections
To update and correct BANO data, simply make improvements directly in OpenStreetMap; they will be taken into account in the next update cycle.
A one-stop collaborative reporting/correction window will soon be set up to simplify the process of improving the content of the database. To participate in its co-construction, do not hesitate to contact us!
For any questions concerning the project or this dataset, you can contact bano@openstreetmap.fr
Dataset Card for hf-blog-posts-dpo_raw
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel, using the distilabel CLI:
distilabel pipeline run --config "https://huggingface.co/datasets/fdaudens/hf-blog-posts-dpo_raw/raw/main/pipeline.yaml"
or explore the configuration:
distilabel pipeline info --config…
See the full description on the dataset page: https://huggingface.co/datasets/fdaudens/hf-blog-posts-dpo_raw.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset maps the location of anti-social graffiti around the University of Edinburgh's central campus. The data were collected over a two-week period between 19 May and 2 June 2014, using a smartphone app called Fieldtrip GB (http://fieldtripgb.blogs.edina.ac.uk/). Multiple asset collectors were deployed to use a pre-defined data collection form which allowed users to log the following attributes: Date / Name of asset collector / Type of graffiti (image/tag/words/advert/.....) / What the graffiti was on (building/wall/lamppost/....) / What medium was used (paint/paper/chalk/....) / Density of graffiti / Photograph / Location. The data are by no means complete and realistically capture only around 50% of the graffiti in the study area. It is hoped that this dataset will be updated every 3 months to chart the distribution of graffiti over time. Once collected, data from the multiple asset collectors were merged in FtGB's authoring tool and exported as a CSV file. This was then imported into QGIS and saved as a vector dataset in ESRI Shapefile format. GIS vector data. This dataset was first accessioned in the EDINA ShareGeo Open repository on 2014-06-06 and migrated to Edinburgh DataShare on 2017-02-22.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
HadUK-Grid is a collection of gridded climate variables derived from the network of UK land surface observations. The data have been interpolated from meteorological station data onto a uniform grid to provide complete and consistent coverage across the UK. The dataset at 12 km resolution is derived from the associated 1 km x 1 km resolution to allow for comparison to data from climate projections. The dataset spans the period from 1836 to 2022, but the start time is dependent on climate variable and temporal resolution.
The gridded data are produced for daily, monthly, seasonal and annual timescales, as well as long term averages for a set of climatological reference periods. Variables include air temperature (maximum, minimum and mean), precipitation, sunshine, mean sea level pressure, wind speed, relative humidity, vapour pressure, days of snow lying, and days of ground frost.
This data set supersedes the previous versions of this dataset which also superseded UKCP09 gridded observations. Subsequent versions may be released in due course and will follow the version numbering as outlined by Hollis et al. (2018, see linked documentation).
The changes for v1.2.0.ceda HadUK-Grid datasets are as follows:
Added data for calendar year 2022
Added newly digitised data for monthly sunshine 1910-1918
Added Rainfall Rescue version 2 doi:10.5281/zenodo.7554242
Updated shapefiles used for production of area average statistics https://github.com/ukcp-data/ukcp-spatial-files
Updated controlled vocabulary for metadata assignment https://github.com/ukcp-data/UKCP18_CVs
Updated assignment of timepoint for some periods so that the datetime is the middle of the period (e.g. season) rather than a fixed offset from the period start.
Updated ordering of regions within regional values files. Alphabetical ordering.
Files use netcdf level 4 compression using gzip https://www.unidata.ucar.edu/blogs/developer/entry/netcdf_compression
Net changes to the input station data used to generate this dataset:
Total of 125601744 observations
122621050 (97.6%) unchanged
26700 (0.02%) modified for this version
2953994 (2.35%) added in this version
16315 (0.01%) deleted from this version
Changes to monthly rainfall 1836-1960
Total of 4823973 observations
3315657 (68.7%) unchanged
21029 (0.4%) modified for this version
1487287 (30.8%) added in this version
11155 (0.2%) deleted from this version
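To work with the gridded files, a minimal sketch is shown below; the file and variable names are hypothetical and should be replaced with those of the downloaded product (the files are NetCDF-4 with internal gzip compression, which xarray and the netCDF4 backend handle transparently).

import xarray as xr

# Minimal sketch: open a HadUK-Grid NetCDF file (hypothetical file and variable names).
ds = xr.open_dataset("tas_hadukgrid_uk_12km_mon_202201-202212.nc")
print(ds.data_vars)               # list the climate variables stored in the file
monthly_mean_temp = ds["tas"]     # assumed variable name for mean air temperature
print(monthly_mean_temp.mean().item())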
The primary purpose of these data is to facilitate monitoring of UK climate and research into climate change, impacts and adaptation. The datasets have been created by the Met Office with financial support from the Department for Business, Energy and Industrial Strategy (BEIS) and the Department for Environment, Food and Rural Affairs (DEFRA) in order to support the Public Weather Service Customer Group (PWSCG), the Hadley Centre Climate Programme, and the UK Climate Projections (UKCP18) project. The output from a number of data recovery activities relating to 19th and early 20th century data has been used in the creation of this dataset; these activities were supported by: the Met Office Hadley Centre Climate Programme; the Natural Environment Research Council project "Analysis of historic drought and water scarcity in the UK"; the UK Research & Innovation (UKRI) Strategic Priorities Fund UK Climate Resilience programme; the UK Natural Environment Research Council (NERC) Public Engagement programme; the National Centre for Atmospheric Science and the NERC GloSAT project; and the contribution of many thousands of public volunteers. The dataset is provided under the Open Government Licence.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset comes from the Open National Address Base project initiated by OpenStreetMap France.
For more information on this project: http://openstreetmap.fr/blogs/cquest/bano-banco
Origin of data
BANO is a composite database, made up from different sources:
Distribution format
These files are available in shapefile format in WGS84 projection (EPSG:4326), as well as in CSV format and, experimentally, as a GitHub project.
Description of content
For each address:
updates, corrections
To update and correct BANO data, simply make improvements directly in OpenStreetMap; they will be taken into account in the next update cycle.
A one-stop collaborative reporting/correction window will soon be set up to simplify the process of improving the content of the database. To participate in its co-construction, do not hesitate to contact us!
For any questions concerning the project or this dataset, you can contact bano@openstreetmap.fr
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We often create dioramas from LEGO bricks for use with our presentations, blogs, and social media posts. We find it's much more fun and effective to reenact meetings and other scenes than to try and use real-life images. It also saves us from the hassle of worrying about receiving permission to use a person's photo in our work. By popular demand, we have released some of our favorite images as open data for you to use.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Geographical boundaries of the EPCI (public establishments for inter-municipal cooperation), obtained by combining the municipal boundaries from OpenStreetMap with data from the Directorate General of Local Authorities dating from 2015.
These data come in part from crowdsourcing by contributors to the OpenStreetMap project and are therefore under the ODbL licence, which requires share-alike redistribution; the mandatory attribution must be "© the contributors of OpenStreetMap under ODbL license", in accordance with http://osm.org/copyright
The data come from the Directorate General of Local Authorities (DGCL) crossed with the municipal division from the OpenStreetMap map database. These were created from the cadastre made available by the DGFiP on cadastre.gouv.fr.
Source for EPCI 2015: http://www.collectivites-locales.gouv.fr/liste-et-composition-2015
These files are offered in shapefile format, in WGS84 projection with several levels of detail:
The topology is retained during the simplification process (see the cquest blog post on simplified administrative boundaries at http://openstreetmap.fr/blogs/cquest/).
These files contain all the EPCI contained in the DGCL file (see "Origin of data").
The following attributes are provided:
Previous versions available at: http://osm13.openstreetmap.fr/~cquest/openfla/export/
For any questions regarding these exports, you can contact exports@openstreetmap.fr
See also:
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
HadUK-Grid is a collection of gridded climate variables derived from the network of UK land surface observations. The data have been interpolated from meteorological station data onto a uniform grid to provide complete and consistent coverage across the UK. These data at 1 km resolution have been averaged across a set of discrete geographies defining UK countries consistent with data from UKCP18 climate projections. The dataset spans the period from 1836 to 2023, but the start time is dependent on climate variable and temporal resolution.
The gridded data are produced for daily, monthly, seasonal and annual timescales, as well as long term averages for a set of climatological reference periods. Variables include air temperature (maximum, minimum and mean), precipitation, sunshine, mean sea level pressure, wind speed, relative humidity, vapour pressure, days of snow lying, and days of ground frost.
This data set supersedes the previous versions of this dataset which also superseded UKCP09 gridded observations. Subsequent versions may be released in due course and will follow the version numbering as outlined by Hollis et al. (2018, see linked documentation).
The changes for v1.3.0.ceda HadUK-Grid datasets are as follows:
Added data for calendar year 2023
Added newly digitised data for monthly sunshine 1910-1918
Added Rainfall Rescue version 2 doi:10.5281/zenodo.7554242
Updated shapefiles used for production of area average statistics https://github.com/ukcp-data/ukcp-spatial-files
Updated controlled vocabulary for metadata assignment https://github.com/ukcp-data/UKCP18_CVs
Updated assignment of timepoint for some periods so that the datetime is the middle of the period (e.g. season) rather than a fixed offset from the period start.
Updated ordering of regions within regional values files. Alphabetical ordering.
Files use netcdf level 4 compression using gzip https://www.unidata.ucar.edu/blogs/developer/entry/netcdf_compression
Net changes to the input station data used to generate this dataset:
Total of 125601744 observations
122621050 (97.6%) unchanged
26700 (0.02%) modified for this version
2953994 (2.35%) added in this version
16315 (0.01%) deleted from this version
Changes to monthly rainfall 1836-1960
Total of 4823973 observations
3315657 (68.7%) unchanged
21029 (0.4%) modified for this version
1487287 (30.8%) added in this version
11155 (0.2%) deleted from this version
The primary purpose of these data is to facilitate monitoring of UK climate and research into climate change, impacts and adaptation. The datasets have been created by the Met Office with financial support from the Department for Business, Energy and Industrial Strategy (BEIS) and the Department for Environment, Food and Rural Affairs (DEFRA) in order to support the Public Weather Service Customer Group (PWSCG), the Hadley Centre Climate Programme, and the UK Climate Projections (UKCP18) project. The output from a number of data recovery activities relating to 19th and early 20th century data has been used in the creation of this dataset; these activities were supported by: the Met Office Hadley Centre Climate Programme; the Natural Environment Research Council project "Analysis of historic drought and water scarcity in the UK"; the UK Research & Innovation (UKRI) Strategic Priorities Fund UK Climate Resilience programme; the UK Natural Environment Research Council (NERC) Public Engagement programme; the National Centre for Atmospheric Science and the NERC GloSAT project; and the contribution of many thousands of public volunteers. The dataset is provided under the Open Government Licence.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
HadUK-Grid is a collection of gridded climate variables derived from the network of UK land surface observations. The data have been interpolated from meteorological station data onto a uniform grid to provide complete and consistent coverage across the UK. The dataset at 25 km resolution is derived from the associated 1 km x 1 km resolution to allow for comparison to data from UKCP18 climate projections. The dataset spans the period from 1836 to 2022, but the start time is dependent on climate variable and temporal resolution.
The gridded data are produced for daily, monthly, seasonal and annual timescales, as well as long term averages for a set of climatological reference periods. Variables include air temperature (maximum, minimum and mean), precipitation, sunshine, mean sea level pressure, wind speed, relative humidity, vapour pressure, days of snow lying, and days of ground frost.
This data set supersedes the previous versions of this dataset which also superseded UKCP09 gridded observations. Subsequent versions may be released in due course and will follow the version numbering as outlined by Hollis et al. (2018, see linked documentation).
The changes for v1.2.0.ceda HadUK-Grid datasets are as follows:
Added data for calendar year 2022
Added newly digitised data for monthly sunshine 1910-1918
Added Rainfall Rescue version 2 doi:10.5281/zenodo.7554242
Updated shapefiles used for production of area average statistics https://github.com/ukcp-data/ukcp-spatial-files
Updated controlled vocabulary for metadata assignment https://github.com/ukcp-data/UKCP18_CVs
Updated assignment of timepoint for some periods so that the datetime is the middle of the period (e.g. season) rather than a fixed offset from the period start.
Updated ordering of regions within regional values files. Alphabetical ordering.
Files use netcdf level 4 compression using gzip https://www.unidata.ucar.edu/blogs/developer/entry/netcdf_compression
Net changes to the input station data used to generate this dataset:
Total of 125601744 observations
122621050 (97.6%) unchanged
26700 (0.02%) modified for this version
2953994 (2.35%) added in this version
16315 (0.01%) deleted from this version
Changes to monthly rainfall 1836-1960
Total of 4823973 observations
3315657 (68.7%) unchanged
21029 (0.4%) modified for this version
1487287 (30.8%) added in this version
11155 (0.2%) deleted from this version
The primary purpose of these data is to facilitate monitoring of UK climate and research into climate change, impacts and adaptation. The datasets have been created by the Met Office with financial support from the Department for Business, Energy and Industrial Strategy (BEIS) and the Department for Environment, Food and Rural Affairs (DEFRA) in order to support the Public Weather Service Customer Group (PWSCG), the Hadley Centre Climate Programme, and the UK Climate Projections (UKCP18) project. The output from a number of data recovery activities relating to 19th and early 20th century data has been used in the creation of this dataset; these activities were supported by: the Met Office Hadley Centre Climate Programme; the Natural Environment Research Council project "Analysis of historic drought and water scarcity in the UK"; the UK Research & Innovation (UKRI) Strategic Priorities Fund UK Climate Resilience programme; the UK Natural Environment Research Council (NERC) Public Engagement programme; the National Centre for Atmospheric Science and the NERC GloSAT project; and the contribution of many thousands of public volunteers. The dataset is provided under the Open Government Licence.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
HadUK-Grid is a collection of gridded climate variables derived from the network of UK land surface observations. The data have been interpolated from meteorological station data onto a uniform grid to provide complete and consistent coverage across the UK. These data at 1 km resolution have been averaged across a set of discrete geographies defining UK river basins consistent with data from UKCP18 climate projections. The dataset spans the period from 1836 to 2022, but the start time is dependent on climate variable and temporal resolution.
The gridded data are produced for daily, monthly, seasonal and annual timescales, as well as long term averages for a set of climatological reference periods. Variables include air temperature (maximum, minimum and mean), precipitation, sunshine, mean sea level pressure, wind speed, relative humidity, vapour pressure, days of snow lying, and days of ground frost.
This data set supersedes the previous versions of this dataset which also superseded UKCP09 gridded observations. Subsequent versions may be released in due course and will follow the version numbering as outlined by Hollis et al. (2018, see linked documentation).
The changes for v1.2.0.ceda HadUK-Grid datasets are as follows:
Added data for calendar year 2022
Added newly digitised data for monthly sunshine 1910-1918
Added Rainfall Rescue version 2 doi:10.5281/zenodo.7554242
Updated shapefiles used for production of area average statistics https://github.com/ukcp-data/ukcp-spatial-files
Updated controlled vocabulary for metadata assignment https://github.com/ukcp-data/UKCP18_CVs
Updated assignment of timepoint for some periods so that the datetime is the middle of the period (e.g. season) rather than a fixed offset from the period start.
Updated ordering of regions within regional values files. Alphabetical ordering.
Files use netcdf level 4 compression using gzip https://www.unidata.ucar.edu/blogs/developer/entry/netcdf_compression
Net changes to the input station data used to generate this dataset:
Total of 125601744 observations
122621050 (97.6%) unchanged
26700 (0.02%) modified for this version
2953994 (2.35%) added in this version
16315 (0.01%) deleted from this version
Changes to monthly rainfall 1836-1960
Total of 4823973 observations
3315657 (68.7%) unchanged
21029 (0.4%) modified for this version
1487287 (30.8%) added in this version
11155 (0.2%) deleted from this version
The primary purpose of these data is to facilitate monitoring of UK climate and research into climate change, impacts and adaptation. The datasets have been created by the Met Office with financial support from the Department for Business, Energy and Industrial Strategy (BEIS) and the Department for Environment, Food and Rural Affairs (DEFRA) in order to support the Public Weather Service Customer Group (PWSCG), the Hadley Centre Climate Programme, and the UK Climate Projections (UKCP18) project. The output from a number of data recovery activities relating to 19th and early 20th century data has been used in the creation of this dataset; these activities were supported by: the Met Office Hadley Centre Climate Programme; the Natural Environment Research Council project "Analysis of historic drought and water scarcity in the UK"; the UK Research & Innovation (UKRI) Strategic Priorities Fund UK Climate Resilience programme; the UK Natural Environment Research Council (NERC) Public Engagement programme; the National Centre for Atmospheric Science and the NERC GloSAT project; and the contribution of many thousands of public volunteers. The dataset is provided under the Open Government Licence.
Data and code for Malejka et al. (2021), "Correlation analysis to investigate unconscious mental processes". A consensus among researchers is that much of our behaviour is based on rather automatic processes we are barely aware of and over which we have little control. Research suggests that exposure to subtle cues can have dramatic effects on our decisions. For instance, asking people to provide the last 2 digits of their social security number biases how much they are willing to pay for products and commodities. Similarly, according to some researchers, people are more likely to be impolite and disrespectful if they have been exposed to words related to rudeness while solving anagrams. Another line of research suggests that we take many of our (important) decisions when distracted and thinking about other things and that this 'unconscious thought' process actually improves the quality of our decisions. These studies pertain to a larger area of research usually called 'implicit cognition', which explores how unconscious mechanisms contribute to cognitive processes including perception, learning, memory, and decision making. This area of research has attracted a great deal of attention from the media and features frequently in popular science books, blogs, and documentaries. Some authors have even suggested that parts of this research could be used to improve our decisions in different domains at a societal level (for example, in health behaviour and pension planning). The present project focuses on a particular domain of this literature, implicit learning. Studies conducted in this area try to determine whether we are able to detect regularities in our environment without awareness of those regularities. In other words, these studies address whether we can learn something without realising that we are indeed learning it. In recent years there have been thousands of demonstrations of implicit learning effects in the scientific literature and, not surprisingly, this literature has become increasingly influential in all areas of psychology, with an important impact in our understanding of human cognition and psychopathology. Unfortunately, our previous research suggests that much of this evidence is undermined by fundamental methodological problems that preclude any strong conclusions about the reliability of unconscious learning effects. We have shown that many of these studies find unconscious learning because researchers use weaker methods to assess whether people are conscious of what they have learned than to assess whether learning has taken place. Naturally, this implies that learning is easily detected but awareness is not, which creates the illusion that learning has taken place unconsciously.
Finding evidence of awareness in these domains is important because it suggests that some degree of control may be available as well. In the present project we propose new methods for the study of unconscious learning. Many of the problems that we have detected in our previous research can be ameliorated by employing cutting-edge statistical analysis, including Bayesian and meta-analytic methods and model fitting. However, the validity of these approaches in the domain of implicit cognition remains untested. A second goal is to conduct a large-scale exploration of the prevalence and magnitude of these problems. Our previous studies have focused on a very particular effect studied in implicit learning research ('contextual cueing'). We suspect that many of these problems transcend this domain and affect a large proportion of current studies on implicit learning. The potential impact of this assessment is difficult to overestimate. Finally, we will set up a collaboration with other international laboratories working on this topic to gather the largest and most sensitive data set of implicit learning effects available so far. This data set will be publicly available for all researchers, which will make it a fundamental resource for the study of unconscious cognitive processes for many years to come.
The study examined Finnish researchers' use of different printed and electronic publications in their work, such as scientific journals, articles, books, reports, and social media. The study is part of Carol Tenopir and Donald W. King's survey series, launched in 1977, which has followed the reading practices of researchers in different countries and scientific fields. Finnish data were also collected in 2006, but that dataset has not been archived at the Finnish Social Science Data Archive. The 2016 project was partly funded by the Finnish Cultural Foundation. The survey charted researchers' common reading practices as well as the publishing of different types of scientific articles and other publications. The way that the respondents' work time was distributed between different types of tasks was charted, as well as how many publications of different types they had authored within the previous two years. It was also examined how researchers searched for information, published scientific work, and cited the work of others. Questions also covered how much time the respondents spent on reading articles, how many scientific articles and other types of publications they had read within the previous 30 days, how recent the publications that they read were, reasons for reading them, what language they were in, how they found the publications and received access, where they read the publications, which scientific field the publications represented, and how useful they considered different publication formats for their work. The significance of social media was charted with questions regarding, for instance, how important different services and tools were for their work (e.g. blogs, cloud services, institutional repositories, academic online communities, reference management software).
The respondents were also asked how important different features of electronic publications were (e.g. compatibility and readability on different devices, possibility to share publications, advanced navigation features, global language support, possibility to embed audio into publications). Background variables included scientific field, job title, age, and type of workplace. Sampling: non-probability, availability (self-selected) sample. Data collection mode: self-administered web questionnaire (CAWI).