20 datasets found
  1. blog_authorship_corpus

    • huggingface.co
    • paperswithcode.com
    Updated Jul 27, 2003
    Cite
    Bar-Ilan University (2003). blog_authorship_corpus [Dataset]. https://huggingface.co/datasets/barilan/blog_authorship_corpus
    Dataset updated
    Jul 27, 2003
    Dataset authored and provided by
    Bar-Ilan University
    License

    Unknown: https://choosealicense.com/licenses/unknown/

    Description

    The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.

    Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

    All bloggers included in the corpus fall into one of three age groups:

    • 8240 "10s" blogs (ages 13-17)
    • 8086 "20s" blogs (ages 23-27)
    • 2994 "30s" blogs (ages 33-47)

    For each age group there are an equal number of male and female bloggers.

    Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped, with two exceptions: individual posts within a single blogger's file are separated by the date of the following post, and links within a post are denoted by the label urllink.

    The corpus may be freely used for non-commercial research purposes.
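
    As a quick illustration, the corpus can be loaded straight from the Hugging Face Hub. This is a minimal sketch, assuming the datasets library is installed; the dataset ID comes from the citation above, while the split name and record fields are assumptions (check the dataset page, and note that script-based datasets may additionally require trust_remote_code=True):

    from datasets import load_dataset

    # Dataset ID taken from the citation above; split name is an assumption.
    ds = load_dataset("barilan/blog_authorship_corpus", split="train")

    # Inspect the available fields and one record; the card describes blogger
    # metadata (gender, age, industry, astrological sign) alongside the post text.
    print(ds.column_names)
    print(ds[0])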

  2. Blog mix 2013 (Bloggmix 2013) (2017-02-24) - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Nov 21, 2012
    Cite
    (2012). Blog mix 2013 (Bloggmix 2013) (2017-02-24) - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/17f02033-9108-50f7-80a0-95939c09764e
    Dataset updated
    Nov 21, 2012
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The blogs in the blog mix are selected from the lists at bloggportalen.se: Most visited private blogs, Most visited professional blogs, and the local lists for different regions. Additional information, such as the location and age of the blogger, is also retrieved from Bloggportalen. The material has not been checked manually, which means that spam may occur. Some English-language blogs have been removed when discovered, and some blogs could not be added for technical reasons. The time span of the material ranges from the first to the latest entries of the selected blogs, and the corpus is updated regularly. The material is sentence-scrambled.

  3. Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 22, 2022
    Cite
    Jakub Simko (2022). Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5996863
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Jakub Simko
    Matus Tomlein
    Ivan Srba
    Robert Moro
    Elena Stefancova
    Branislav Pecher
    Maria Bielikova
    Description

    Overview

    This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).

    The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.

    Its novelty and our main contributions lie in (1) a focus on medical news articles and blog posts, as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.

    The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).

    The accompanying Github repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.

    Options to access the dataset

    There are two ways to access the dataset:

    1. Static dump of the dataset available in the CSV format
    2. Continuously updated dataset available via REST API

    To obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.

    References

    If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:

    @inproceedings{SrbaMonantPlatform,
      author    = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
      booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
      pages     = {1--7},
      title     = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
      year      = {2019}
    }

    @inproceedings{SrbaMonantMedicalDataset,
      author    = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
      booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
      numpages  = {11},
      title     = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
      year      = {2022},
      doi       = {10.1145/3477495.3531726},
      publisher = {Association for Computing Machinery},
      address   = {New York, NY, USA},
      url       = {https://doi.org/10.1145/3477495.3531726}
    }

    Dataset creation process

    In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.

    Ethical considerations

    The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains the identities of the authors of the articles if they were stated in the original source; we retained this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.

    The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.

    As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.

    Lastly, the dataset also contains automatically predicted labels of claim presence and article stance, produced by our baselines described in the next section. These methods have their limitations and achieve only a certain accuracy, as reported in the paper; this should be taken into account when interpreting the predicted labels.

    Reporting mistakes in the dataset

    The way to report considerable mistakes in raw collected data or in manual annotations is to create a new issue in the accompanying Github repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.

    Dataset structure

    Raw data

    At first, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as they appear on the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (number of likes, shares, comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.

    Raw data are contained in these CSV files (and corresponding REST API endpoints):

    sources.csv

    articles.csv

    article_media.csv

    article_authors.csv

    discussion_posts.csv

    discussion_post_authors.csv

    fact_checking_articles.csv

    fact_checking_article_media.csv

    claims.csv

    feedback_facebook.csv

    Note: Personal information about discussion posts' authors (name, website, gravatar) are anonymised.

    Annotations

    Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw data entities (e.g., article, source). Relation annotations describe a relation between two such entities.

    Each annotation is described by the following attributes:

    category of annotation (annotation_category). Possible values: label (the annotation corresponds to ground truth determined by human experts) and prediction (the annotation was created by means of an AI method).

    type of annotation (annotation_type_id). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.

    method which created the annotation (method_id). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.

    its value (value). The value is stored in JSON format and its structure differs according to the particular annotation type.

    At the same time, annotations are associated with a particular object identified by:

    entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.

    entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation annotations).

    The dataset provides specifically these entity annotations:

    Source reliability (binary). Determines the validity of a source (website) on a binary scale with two options: reliable source and unreliable source.

    Article veracity. Aggregated information about veracity from article-claim pairs.

    The dataset provides specifically these relation annotations:

    Fact-checking article to claim mapping. Determines mapping between fact-checking article and claim.

    Claim presence. Determines presence of claim in article.

    Claim stance. Determines stance of an article to a claim.

    Annotations are contained in these CSV files (and corresponding REST API endpoints):

    entity_annotations.csv

    relation_annotations.csv

    Note: Identification of human annotators (the email provided in the annotation app) is anonymised.
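
    To make this structure concrete, here is a minimal sketch of how the entity annotations might be inspected once access has been granted. The file name comes from the list above; the exact CSV column names are assumptions based on the attribute names in this description:

    import json
    import pandas as pd

    # Load entity annotations (file name taken from the dataset description).
    annotations = pd.read_csv("entity_annotations.csv")

    # Keep human-labelled ground truth only, as opposed to AI-made predictions.
    labels = annotations[annotations["annotation_category"] == "label"]

    # The annotation value is stored as JSON; parse it into Python objects.
    labels = labels.assign(value=labels["value"].apply(json.loads))

    # Example: count how many labels exist per annotation type.
    print(labels.groupby("annotation_type_id").size())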

  4. Data from: THE PRODUCTION OF PROFESSIONAL BLOGS AS REFLEXIVE TOOLS IN...

    • scielo.figshare.com
    • figshare.com
    Updated May 31, 2023
    Cite
    Lucas Moreira dos Anjos-Santos; Vera Lúcia Lopes Cristovão (2023). THE PRODUCTION OF PROFESSIONAL BLOGS AS REFLEXIVE TOOLS IN PRE-SERVICE ENGLISH TEACHER EDUCATION [Dataset]. http://doi.org/10.6084/m9.figshare.7514594.v1
    Available download formats: jpeg
    Dataset updated
    May 31, 2023
    Dataset provided by
    SciELO journals
    Authors
    Lucas Moreira dos Anjos-Santos; Vera Lúcia Lopes Cristovão
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this paper, we aim to analyze the production of professional blogs by pre-service English teachers and the roles that such digital language practices may perform in the education of reflexive and critical language teachers. Specifically, we analyzed the blog posts that two pre-service teachers produced and the professional identities that are forged in such a digital language practice. The reported case study is of a qualitative and interpretative nature. The data, composed of the pre-service English teachers' blog posts and their experiential narratives regarding the pedagogical practice they experienced, were generated in 2010 in an elective unit at a state university in the north of Paraná. The results demonstrate the emergence of identity conflicts due to the engagement of the pre-service English teachers in the production of digital language practices. These conflicts have generated an impulse towards the reconstruction of the identities of these future English language professionals.

  5. Twitter bot profiling

    • figshare.com
    • researchdata.smu.edu.sg
    Updated May 31, 2023
    Cite
    Living Analytics Research Centre (2023). Twitter bot profiling [Dataset]. http://doi.org/10.25440/smu.12062706.v1
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    Living Analytics Research Centre
    License

    In Copyright: http://rightsstatements.org/vocab/InC/1.0/

    Description

    This dataset comprises a set of Twitter accounts in Singapore that are used for social bot profiling research conducted by the Living Analytics Research Centre (LARC) at Singapore Management University (SMU). Here a bot is defined as a Twitter account that generates contents and/or interacts with other users automatically (at least according to human judgment). In this research, Twitter bots have been categorized into three major types:

    • Broadcast bot. This bot aims at disseminating information to a general audience by providing, e.g., benign links to news, blogs or sites. Such a bot is often managed by an organization or a group of people (e.g., bloggers).

    • Consumption bot. The main purpose of this bot is to aggregate content from various sources and/or provide update services (e.g., horoscope readings, weather updates) for personal consumption or use.

    • Spam bot. This type of bot posts malicious content (e.g., to trick people by hijacking certain accounts or redirecting them to malicious sites), or aggressively promotes harmless but invalid/irrelevant content.

    This categorization is general enough to cater for new, emerging types of bot (e.g., chatbots can be viewed as a special type of broadcast bot). The dataset was collected from 1 January to 30 April 2014 via the Twitter REST and streaming APIs. Starting from popular seed users (i.e., users having many followers), their follow, retweet, and user mention links were crawled. The data collection proceeded by adding those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. Using this procedure, a total of 159,724 accounts were collected.

    To identify bots, the first step was to check active accounts that tweeted at least 15 times within the month of April 2014. These accounts were then manually checked and labelled, of which 589 bots were found. As many more human users are expected in the Twitter population, the remaining accounts were randomly sampled and manually checked. With this, 1,024 human accounts were identified. In total, this results in 1,613 labelled accounts.

    Related Publication: R. J. Oentaryo, A. Murdopo, P. K. Prasetyo, and E.-P. Lim. (2016). On profiling bots in social media. Proceedings of the International Conference on Social Informatics (SocInfo'16), 92-109. Bellevue, WA. https://doi.org/10.1007/978-3-319-47880-7_6

  6. Datasets of word network topic model

    • figshare.com
    Updated Jun 1, 2023
    Cite
    Jichang Zhao (2023). Datasets of word network topic model [Dataset]. http://doi.org/10.6084/m9.figshare.5572588.v1
    Available download formats: application/x-rar
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    figshare
    Authors
    Jichang Zhao
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: This dataset holds the content of one day's micro-blogs sampled from Weibo (http://weibo.com) in the form of bags-of-words.

    Data Set Characteristics: Text
    Number of Micro-blogs: 189,223
    Total Number of Words: 3,252,492
    Size of the Vocabulary: 20,942
    Associated Tasks: short-text topic modeling, etc.

    About Preprocessing: For tokenization, we use NLPIR. Stop words and words with a term frequency of less than 20 were removed. Words consisting of only one Chinese character were also removed.

    Data Format: Each line of the released data is one document:

    [document_1]
    [document_2]
    ...
    [document_M]

    [document_i] is the i-th document of the dataset and consists of a list of Ni words/terms:

    [document_i] = [word_i1] [word_i2] ... [word_iNi]

    All word_ij are text strings, separated by the blank character.

    If you have any questions about the dataset, please contact: jichang@buaa.edu.cn.
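
    Since each line is one whitespace-separated document, the released file can be parsed in a few lines. A minimal sketch (the file name is hypothetical; the format follows the description above):

    from collections import Counter

    def load_documents(path):
        # One document per line; words are separated by the blank character.
        with open(path, encoding="utf-8") as f:
            return [line.split() for line in f if line.strip()]

    docs = load_documents("weibo_bow.txt")  # hypothetical file name
    print(len(docs))   # number of micro-blogs (189,223 per the description)
    vocab = Counter(word for doc in docs for word in doc)
    print(len(vocab))  # vocabulary size (20,942 per the description)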

  7. Learning Management System

    • catalog.data.gov
    • datasets.ai
    Updated Jun 8, 2024
    Cite
    data.usaid.gov (2024). Learning Management System [Dataset]. https://catalog.data.gov/dataset/learning-management-system
    Dataset updated
    Jun 8, 2024
    Dataset provided by
    United States Agency for International Development (https://usaid.gov/)
    Description

    Although the commercial name for the USAID University Learning Management System is CSOD InCompass, the agencies that use the system have renamed (or rebranded) their specific agency portals to meet their own needs. InCompass is a comprehensive talent management system that incorporates the following functional modules:

    1) Learning -- The Learning module supports the management and tracking of training events and individual training records. Training events may be instructor-led or online. Courses may be managed within the system to provide descriptions, availability, and registration. Online content is stored on the system. Training information stored for individuals includes courses completed, scores, and courses registered for.

    2) Connect -- The Connect module supports employee collaboration efforts. Features include communities of practice, expertise location, blogs, and knowledge sharing support. Profile information that may be stored by the system includes job position, subject matter expertise, and previous accomplishments.

    3) Performance -- The Performance module supports management of organizational goals and alignment of those goals to individual performance. The module supports managing skills and competencies for the organization. The module also supports employee performance reviews. The types of information gathered about employees include their skills, competencies, and performance evaluation.

    4) Succession -- The Succession module supports workforce management and planning. The type of information gathered for this module includes prior work experience, skills, and competencies.

    5) Extended Enterprise -- The Extended Enterprise module supports delivery of training outside of the organization. Training provided may be for a fee. The type of information collected for this module includes individual data for identifying the person for training records management and related information for commercial transactions.

  8. National Open Address Database (BANO) - Seine-Saint-Denis

    • ckan.mobidatalab.eu
    Updated May 27, 2014
    Cite
    OpenStreetMap (2014). National Open Address Database (BANO) - Seine-Saint-Denis [Dataset]. https://ckan.mobidatalab.eu/dataset/national-open-bano-seine-saint-denis-address-base
    Available download formats: application/json, text/csv, application/zip
    Dataset updated
    May 27, 2014
    Dataset provided by
    OpenStreetMap (https://www.openstreetmap.org/)
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    Seine-Saint-Denis
    Description

    This dataset comes from the Open National Address Base project initiated by OpenStreetMap France.

    For more information on this project: http://openstreetmap.fr/blogs/cquest/bano-banco

    Origin of data

    BANO is a composite database, built from different sources:

    • OpenStreetMap
    • data available in opendata
    • address data collected on the cadastral site (source DGFiP 2014)

    Distribution format

    These files are available in shapefile format, in WGS84 projection (EPSG:4326), as well as in CSV format and, experimentally, as a GitHub project.

    Description of content

    For each address:

    • id (unique): code_insee + codefantoir + number
    • number: street number with suffix (e.g.: 1, 1BIS, 1D)
    • street: street name
    • post_code: 5-character postcode
    • city: name of the municipality
    • source: OSM = data directly from OpenStreetMap, OD = data from local opendata sources, CAD = data directly from the cadastre, C+O = cadastre data enriched by OSM (road name for example)
    • lat: latitude in WGS84 decimal degrees
    • lon: longitude in WGS84 decimal degrees

    Updates and corrections

    To update and correct BANO data, simply make improvements directly in OpenStreetMap; they will be taken into account in the next update cycle.

    A one-stop collaborative reporting/correction window will soon be set up to simplify the process of improving the content of the database. To participate in its co-construction, do not hesitate to contact us!

    For any questions concerning the project or this dataset, you can contact bano@openstreetmap.fr
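
    As a quick illustration of the content description above, here is a minimal sketch that loads one departmental BANO extract with pandas. The file name is hypothetical, and the assumption that the CSV columns follow the field order listed above should be checked against the actual download:

    import pandas as pd

    # Column names taken from the field list above; header handling may differ.
    columns = ["id", "number", "street", "post_code", "city", "source", "lat", "lon"]
    bano = pd.read_csv("bano-93.csv", names=columns, dtype={"post_code": str})  # hypothetical file name

    # Example: count addresses by origin (OSM, OD, CAD, C+O).
    print(bano["source"].value_counts())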

  9. National Open Address Database (BANO) - Val-d'Oise

    • ckan.mobidatalab.eu
    Updated May 27, 2014
    Cite
    OpenStreetMap (2014). National Open Address Database (BANO) - Val-d'Oise [Dataset]. https://ckan.mobidatalab.eu/dataset/national-address-base-open-bano-val-doise
    Available download formats: application/json, text/csv, application/zip
    Dataset updated
    May 27, 2014
    Dataset provided by
    OpenStreetMap (https://www.openstreetmap.org/)
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    Val-d'Oise, Oise
    Description

    This dataset comes from the Open National Address Base project initiated by OpenStreetMap France.

    For more information on this project: http://openstreetmap.fr/blogs/cquest/bano-banco

    Origin of data

    BANO is a composite database, built from different sources:

    • OpenStreetMap
    • data available in opendata
    • address data collected on the cadastral site (source DGFiP 2014)

    Distribution format

    These files are available in shapefile format, in WGS84 projection (EPSG:4326), as well as in CSV format and, experimentally, as a GitHub project.

    Description of content

    For each address:

    • id (unique): code_insee + codefantoir + number
    • number: street number with suffix (e.g.: 1, 1BIS, 1D)
    • street: street name
    • post_code: 5-character postcode
    • city: name of the municipality
    • source: OSM = data directly from OpenStreetMap, OD = data from local opendata sources, CAD = data directly from the cadastre, C+O = cadastre data enriched by OSM (road name for example)
    • lat: latitude in WGS84 decimal degrees
    • lon: longitude in WGS84 decimal degrees

    Updates and corrections

    To update and correct BANO data, simply make improvements directly in OpenStreetMap; they will be taken into account in the next update cycle.

    A one-stop collaborative reporting/correction window will soon be set up to simplify the process of improving the content of the database. To participate in its co-construction, do not hesitate to contact us!

    For any questions concerning the project or this dataset, you can contact bano@openstreetmap.fr

  10. hf-blog-posts-dpo_raw

    • huggingface.co
    • hf-proxy-cf.effarig.site
    Updated May 3, 2024
    Cite
    Florent Daudens (2024). hf-blog-posts-dpo_raw [Dataset]. https://huggingface.co/datasets/fdaudens/hf-blog-posts-dpo_raw
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 3, 2024
    Authors
    Florent Daudens
    Description

    Dataset Card for hf-blog-posts-dpo_raw

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel, using the distilabel CLI:

    distilabel pipeline run --config "https://huggingface.co/datasets/fdaudens/hf-blog-posts-dpo_raw/raw/main/pipeline.yaml"

    or explore the configuration:

    distilabel pipeline info --config…

    See the full description on the dataset page: https://huggingface.co/datasets/fdaudens/hf-blog-posts-dpo_raw.
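
    Beyond reproducing the pipeline, the generated data itself can be inspected directly. A minimal sketch, assuming the datasets library is installed (the split name is an assumption):

    from datasets import load_dataset

    # Dataset ID taken from the citation above; split name is an assumption.
    ds = load_dataset("fdaudens/hf-blog-posts-dpo_raw", split="train")
    print(ds)
    print(ds[0])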

  11. Graffiti around University of Edinburgh

    • dtechtive.com
    • find.data.gov.scot
    Updated Feb 22, 2017
    Cite
    University of Edinburgh (2017). Graffiti around University of Edinburgh [Dataset]. http://doi.org/10.7488/ds/1961
    Available download formats: zip (0.0038 MB), xml (0.0045 MB)
    Dataset updated
    Feb 22, 2017
    Dataset provided by
    University of Edinburgh
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    Edinburgh, UK
    Description

    This dataset maps the location of anti-social graffiti around the University of Edinburgh's central campus. The data was collected over a 2-week period between 19th May and 2nd June 2014, using a smartphone app called Fieldtrip GB (http://fieldtripgb.blogs.edina.ac.uk/). Multiple asset collectors were deployed to use a pre-defined data collection form which allowed users to log the following attributes: Date / Name of asset collector / Type of graffiti (image/tag/words/advert/.....) / What the graffiti was on (building/wall/lamppost/....) / What medium was used (paint/paper/chalk/....) / Density of graffiti / Photograph / Location. The data is by no means complete and realistically captured only around 50% of the graffiti in the study area. It is hoped that this dataset will be updated every 3 months to chart the distribution of graffiti over time. Once collected, data from the multiple asset collectors was merged in FtGB's authoring tool and exported as a CSV file. This was then imported into QGIS and saved as a vector dataset in ESRI Shapefile format. This dataset was first accessioned in the EDINA ShareGeo Open repository on 2014-06-06 and migrated to Edinburgh DataShare on 2017-02-22.

  12. HadUK-Grid Gridded Climate Observations on a 12km grid over the UK,...

    • catalogue.ceda.ac.uk
    Updated Jul 24, 2023
    Cite
    Dan Hollis; Mark McCarthy; Michael Kendon; Tim Legg (2023). HadUK-Grid Gridded Climate Observations on a 12km grid over the UK, v1.2.0.ceda (1836-2022) [Dataset]. https://catalogue.ceda.ac.uk/uuid/640d33e0cf99477990f7fee35a101850
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    Authors
    Dan Hollis; Mark McCarthy; Michael Kendon; Tim Legg
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Time period covered
    Jan 1, 1836 - Dec 31, 2022
    Area covered
    Variables measured
    time, latitude, longitude, air_temperature, relative_humidity, duration_of_sunshine, projection_x_coordinate, projection_y_coordinate, lwe_thickness_of_precipitation_amount
    Description

    HadUK-Grid is a collection of gridded climate variables derived from the network of UK land surface observations. The data have been interpolated from meteorological station data onto a uniform grid to provide complete and consistent coverage across the UK. The dataset at 12 km resolution is derived from the associated 1 km x 1 km resolution to allow for comparison to data from climate projections. The dataset spans the period from 1836 to 2022, but the start time is dependent on climate variable and temporal resolution.

    The gridded data are produced for daily, monthly, seasonal and annual timescales, as well as long term averages for a set of climatological reference periods. Variables include air temperature (maximum, minimum and mean), precipitation, sunshine, mean sea level pressure, wind speed, relative humidity, vapour pressure, days of snow lying, and days of ground frost.

    This dataset supersedes previous versions, which in turn superseded the UKCP09 gridded observations. Subsequent versions may be released in due course and will follow the version numbering outlined by Hollis et al. (2018, see linked documentation).

    The changes for v1.2.0.ceda HadUK-Grid datasets are as follows:

    • Added data for calendar year 2022

    • Added newly digitised data for monthly sunshine 1910-1918

    • Net changes to the input station data used to generate this dataset:

      • Total of 125,601,744 observations
      • 122,621,050 (97.6%) unchanged
      • 26,700 (0.02%) modified for this version
      • 2,953,994 (2.35%) added in this version
      • 16,315 (0.01%) deleted from this version

    • Changes to monthly rainfall 1836-1960:

      • Total of 4,823,973 observations
      • 3,315,657 (68.7%) unchanged
      • 21,029 (0.4%) modified for this version
      • 1,487,287 (30.8%) added in this version
      • 11,155 (0.2%) deleted from this version

    The primary purpose of these data is to facilitate monitoring of UK climate and research into climate change, impacts and adaptation. The datasets have been created by the Met Office with financial support from the Department for Business, Energy and Industrial Strategy (BEIS) and the Department for Environment, Food and Rural Affairs (DEFRA) in order to support the Public Weather Service Customer Group (PWSCG), the Hadley Centre Climate Programme, and the UK Climate Projections (UKCP18) project. The output from a number of data recovery activities relating to 19th and early 20th Century data has been used in the creation of this dataset. These activities were supported by: the Met Office Hadley Centre Climate Programme; the Natural Environment Research Council project "Analysis of historic drought and water scarcity in the UK"; the UK Research & Innovation (UKRI) Strategic Priorities Fund UK Climate Resilience programme; the UK Natural Environment Research Council (NERC) Public Engagement programme; the National Centre for Atmospheric Science and the NERC GloSAT project; and the contribution of many thousands of public volunteers. The dataset is provided under the Open Government Licence.
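
    HadUK-Grid data from CEDA are distributed as NetCDF, so a typical first step is to open one file with xarray. A minimal sketch; the file name and the short variable name "tas" are assumptions (actual names follow the HadUK-Grid file-naming conventions in the linked documentation), while the coordinate names come from the variable list above:

    import xarray as xr

    # Open one HadUK-Grid NetCDF file (hypothetical file name).
    ds = xr.open_dataset("tas_hadukgrid_uk_12km_mon_202201-202212.nc")

    # Average mean air temperature over the 12 km grid for each time step.
    tas = ds["tas"]  # variable name assumed
    print(tas.mean(dim=("projection_y_coordinate", "projection_x_coordinate")))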

  13. National Open Address Database (BANO) - Yvelines

    • ckan.mobidatalab.eu
    Updated May 27, 2014
    Cite
    OpenStreetMap (2014). National Open Address Database (BANO) - Yvelines [Dataset]. https://ckan.mobidatalab.eu/dataset/national-open-bano-yvelines-address-base
    Available download formats: text/csv, application/json, application/zip
    Dataset updated
    May 27, 2014
    Dataset provided by
    OpenStreetMap (https://www.openstreetmap.org/)
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    Yvelines
    Description

    This dataset comes from the Open National Address Base project initiated by OpenStreetMap France.

    For more information on this project: http://openstreetmap.fr/blogs/cquest/bano-banco

    Origin of data

    BANO is a composite database, built from different sources:

    • OpenStreetMap
    • data available in opendata
    • address data collected on the cadastral site (source DGFiP 2014)

    Distribution format

    These files are available in shapefile format, in WGS84 projection (EPSG:4326), as well as in CSV format and, experimentally, as a GitHub project.

    Description of content

    For each address:

    • id (unique): code_insee + codefantoir + number
    • number: street number with suffix (e.g.: 1, 1BIS, 1D)
    • street: street name
    • post_code: 5-character postcode
    • city: name of the municipality
    • source: OSM = data directly from OpenStreetMap, OD = data from local opendata sources, CAD = data directly from the cadastre, C+O = cadastre data enriched by OSM (road name for example)
    • lat: latitude in WGS84 decimal degrees
    • lon: longitude in WGS84 decimal degrees

    Updates and corrections

    To update and correct BANO data, simply make improvements directly in OpenStreetMap; they will be taken into account in the next update cycle.

    A one-stop collaborative reporting/correction window will soon be set up to simplify the process of improving the content of the database. To participate in its co-construction, do not hesitate to contact us!

    For any questions concerning the project or this dataset, you can contact bano@openstreetmap.fr

  14. LEGO Diorama Images

    • data.wprdc.org
    • catalog.data.gov
    Updated Feb 26, 2025
    Cite
    Western Pennsylvania Regional Data Center (2025). LEGO Diorama Images [Dataset]. https://data.wprdc.org/dataset/lego-diorama-images
    Available download formats: jpeg, jpeg (59694)
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    Western Pennsylvania Regional Data Center
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We often create dioramas from LEGO bricks for use with our presentations, blogs, and social media posts. We find it's much more fun and effective to reenact meetings and other scenes than to try and use real-life images. It also saves us from the hassle of worrying about receiving permission to use a person's photo in our work. By popular demand, we have released some of our favorite images as open data for you to use.

  15. Outlines of the EPCI 2015

    • data.europa.eu
    Cite
    OpenStreetMap, Outlines of the EPCI 2015 [Dataset]. https://data.europa.eu/data/datasets/54f63501c751df466f882844?locale=en
    Available download formats: esri shape
    Dataset authored and provided by
    OpenStreetMap (https://www.openstreetmap.org/)
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Geographical contours of the EPCI (French intercommunal authorities), produced by crossing the communal boundaries from OpenStreetMap with 2015 data from the General Directorate of Local Authorities (DGCL).

    These data come in part from crowdsourcing carried out by the contributors to the OpenStreetMap project and are therefore under the ODbL licence, which requires share-alike and the mandatory attribution "© the contributors of OpenStreetMap under ODbL license", in accordance with http://osm.org/copyright

    Description of the contents of the "epci" files

    Origin of data

    The data come from the Directorate General of Local Authorities (DGCL) crossed with the municipal division from the OpenStreetMap map database. These were created from the cadastre made available by the DGFiP on cadastre.gouv.fr.

    Source for EPCI 2015: http://www.collectivites-locales.gouv.fr/liste-et-composition-2015

    Format

    These files are offered in shapefile format, in WGS84 projection with several levels of detail:

    • simplification at 5m
    • simplification at 50m
    • simplification at 100m

    The topology is retained during the simplification process (see the openstreetmap.fr blog post on simplified administrative boundaries).

    Content

    These files contain all the EPCI contained in the DGCL file (see "Origin of data").

    The following attributes are provided:

    • siren_epci: SIREN code assigned by INSEE to EPCI (source Min. Intérieur)
    • name_epci: name of EPCI (source Min. Intérieur)
    • ptot_epci: total population of the EPCI (source Min. Intérieur)
    • nb_comm: number of municipalities in the EPCI
    • surf_km2: EPCI area in km2 on the spheroid WGS84 (rounded per hectare before simplification)
    • short_name: Abbreviated name (source OSM)
    • wikipedia: Wikipedia article about EPCI (language code + article name, example: "en:Community of communes of Larmont")
    • web: EPCI website (source OSM)
    • osm_id: OSM relation ID at the time of export
    • name_osm: EPCI name in OSM
    • type_epci: EPCI type (OSM source)
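
    Since the export is a shapefile in WGS84, the attributes above can be explored with geopandas. A minimal sketch (the file name is hypothetical; the attribute names come from the list above):

    import geopandas as gpd

    # Load one of the simplified EPCI exports (hypothetical file name).
    epci = gpd.read_file("epci-2015-5m.shp")

    # Example: the ten most populous EPCI, using the attributes described above.
    top = epci[["siren_epci", "name_epci", "ptot_epci"]].sort_values(
        "ptot_epci", ascending=False
    )
    print(top.head(10))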

    History

    • 20-12-2013: first generation of the file, based on the OSM municipal division at 19-12-2013
    • 12-02-2014: second generation of the file with the EPCI 2014, and municipal division OSM of 19-12-2013
    • 13-02-2014: addition of web, wikipedia and osm_id fields
    • 06-03-2014: third generation of the file with the EPCI 2014 and the municipal division OSM at 06-03-2014
    • 06-03-2014: addition of OpenStreetMap name and EPCI type
    • 03-03-2014: fourth generation of the file with the EPCI 2015 and the municipal division OSM at 01-01-2015

    Previous versions available at: http://osm13.openstreetmap.fr/~cquest/openfla/export/

    For any questions regarding these exports, you can contact exports@openstreetmap.fr


  16. HadUK-Grid Climate Observations by UK countries, v1.3.0.ceda (1836-2023)

    • catalogue.ceda.ac.uk
    Updated Jul 18, 2024
    Cite
    Dan Hollis; Emily Carlisle; Michael Kendon; Stephen Packman; Amy Doherty (2024). HadUK-Grid Climate Observations by UK countries, v1.3.0.ceda (1836-2023) [Dataset]. https://catalogue.ceda.ac.uk/uuid/a508838f92c74005a26b9277eae59a7c
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    Authors
    Dan Hollis; Emily Carlisle; Michael Kendon; Stephen Packman; Amy Doherty
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Time period covered
    Jan 1, 1836 - Dec 31, 2023
    Area covered
    Variables measured
    time, region, area_type, wind_speed, air_temperature, relative_humidity, surface_temperature, duration_of_sunshine, surface_snow_binary_mask, air_pressure_at_sea_level, and 4 more
    Description

    HadUK-Grid is a collection of gridded climate variables derived from the network of UK land surface observations. The data have been interpolated from meteorological station data onto a uniform grid to provide complete and consistent coverage across the UK. These data at 1 km resolution have been averaged across a set of discrete geographies defining UK countries consistent with data from UKCP18 climate projections. The dataset spans the period from 1836 to 2023, but the start time is dependent on climate variable and temporal resolution.

    The gridded data are produced for daily, monthly, seasonal and annual timescales, as well as long term averages for a set of climatological reference periods. Variables include air temperature (maximum, minimum and mean), precipitation, sunshine, mean sea level pressure, wind speed, relative humidity, vapour pressure, days of snow lying, and days of ground frost.

    This dataset supersedes previous versions, which in turn superseded the UKCP09 gridded observations. Subsequent versions may be released in due course and will follow the version numbering outlined by Hollis et al. (2018, see linked documentation).

    The changes for v1.3.0.ceda HadUK-Grid datasets are as follows:

    • Added data for calendar year 2023

    • Added newly digitised data for monthly sunshine 1910-1918

    • Net changes to the input station data used to generate this dataset:

      • Total of 125,601,744 observations
      • 122,621,050 (97.6%) unchanged
      • 26,700 (0.02%) modified for this version
      • 2,953,994 (2.35%) added in this version
      • 16,315 (0.01%) deleted from this version

    • Changes to monthly rainfall 1836-1960:

      • Total of 4,823,973 observations
      • 3,315,657 (68.7%) unchanged
      • 21,029 (0.4%) modified for this version
      • 1,487,287 (30.8%) added in this version
      • 11,155 (0.2%) deleted from this version

    The primary purpose of these data is to facilitate monitoring of UK climate and research into climate change, impacts and adaptation. The datasets have been created by the Met Office with financial support from the Department for Business, Energy and Industrial Strategy (BEIS) and the Department for Environment, Food and Rural Affairs (DEFRA) in order to support the Public Weather Service Customer Group (PWSCG), the Hadley Centre Climate Programme, and the UK Climate Projections (UKCP18) project. The output from a number of data recovery activities relating to 19th and early 20th Century data has been used in the creation of this dataset. These activities were supported by: the Met Office Hadley Centre Climate Programme; the Natural Environment Research Council project "Analysis of historic drought and water scarcity in the UK"; the UK Research & Innovation (UKRI) Strategic Priorities Fund UK Climate Resilience programme; the UK Natural Environment Research Council (NERC) Public Engagement programme; the National Centre for Atmospheric Science and the NERC GloSAT project; and the contribution of many thousands of public volunteers. The dataset is provided under the Open Government Licence.

  17. HadUK-Grid Gridded Climate Observations on a 25km grid over the UK,...

    • catalogue.ceda.ac.uk
    Updated Jul 24, 2023
    Cite
    Dan Hollis; Mark McCarthy; Michael Kendon; Tim Legg (2023). HadUK-Grid Gridded Climate Observations on a 25km grid over the UK, v1.2.0.ceda (1836-2022) [Dataset]. https://catalogue.ceda.ac.uk/uuid/0545f37fb7124df381d42468eb63c144
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    Authors
    Dan Hollis; Mark McCarthy; Michael Kendon; Tim Legg
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Time period covered
    Jan 1, 1836 - Dec 31, 2022
    Area covered
    Variables measured
    time, latitude, longitude, air_temperature, relative_humidity, projection_x_coordinate, projection_y_coordinate, air_pressure_at_sea_level, lwe_thickness_of_precipitation_amount
    Description

    HadUK-Grid is a collection of gridded climate variables derived from the network of UK land surface observations. The data have been interpolated from meteorological station data onto a uniform grid to provide complete and consistent coverage across the UK. The dataset at 25 km resolution is derived from the associated 1 km x 1 km resolution to allow for comparison to data from UKCP18 climate projections. The dataset spans the period from 1836 to 2022, but the start time is dependent on climate variable and temporal resolution.

    The gridded data are produced for daily, monthly, seasonal and annual timescales, as well as long term averages for a set of climatological reference periods. Variables include air temperature (maximum, minimum and mean), precipitation, sunshine, mean sea level pressure, wind speed, relative humidity, vapour pressure, days of snow lying, and days of ground frost.

    This dataset supersedes previous versions, which in turn superseded the UKCP09 gridded observations. Subsequent versions may be released in due course and will follow the version numbering outlined by Hollis et al. (2018, see linked documentation).

    The changes for v1.2.0.ceda HadUK-Grid datasets are as follows:

    • Added data for calendar year 2022

    • Added newly digitised data for monthly sunshine 1910-1918

    • Net changes to the input station data used to generate this dataset:

      • Total of 125,601,744 observations
      • 122,621,050 (97.6%) unchanged
      • 26,700 (0.02%) modified for this version
      • 2,953,994 (2.35%) added in this version
      • 16,315 (0.01%) deleted from this version

    • Changes to monthly rainfall 1836-1960:

      • Total of 4,823,973 observations
      • 3,315,657 (68.7%) unchanged
      • 21,029 (0.4%) modified for this version
      • 1,487,287 (30.8%) added in this version
      • 11,155 (0.2%) deleted from this version

    The primary purpose of these data is to facilitate monitoring of UK climate and research into climate change, impacts and adaptation. The datasets have been created by the Met Office with financial support from the Department for Business, Energy and Industrial Strategy (BEIS) and the Department for Environment, Food and Rural Affairs (DEFRA) in order to support the Public Weather Service Customer Group (PWSCG), the Hadley Centre Climate Programme, and the UK Climate Projections (UKCP18) project. The output from a number of data recovery activities relating to 19th and early 20th Century data has been used in the creation of this dataset. These activities were supported by: the Met Office Hadley Centre Climate Programme; the Natural Environment Research Council project "Analysis of historic drought and water scarcity in the UK"; the UK Research & Innovation (UKRI) Strategic Priorities Fund UK Climate Resilience programme; the UK Natural Environment Research Council (NERC) Public Engagement programme; the National Centre for Atmospheric Science and the NERC GloSAT project; and the contribution of many thousands of public volunteers. The dataset is provided under the Open Government Licence.

  18. HadUK-Grid Climate Observations by UK river basins, v1.2.0.ceda (1836-2022)

    • catalogue.ceda.ac.uk
    Updated Jul 24, 2023
    Cite
    Dan Hollis; Mark McCarthy; Michael Kendon; Tim Legg (2023). HadUK-Grid Climate Observations by UK river basins, v1.2.0.ceda (1836-2022) [Dataset]. https://catalogue.ceda.ac.uk/uuid/e6822428e4124c5986b689a37fda10bc
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    Authors
    Dan Hollis; Mark McCarthy; Michael Kendon; Tim Legg
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Time period covered
    Jan 1, 1836 - Dec 31, 2022
    Area covered
    Variables measured
    time, region, area_type, air_temperature, surface_temperature, surface_snow_binary_mask, air_pressure_at_sea_level, surface_snow_area_fraction, water_vapor_partial_pressure_in_air, lwe_thickness_of_precipitation_amount, and 1 more
    Description

    HadUK-Grid is a collection of gridded climate variables derived from the network of UK land surface observations. The data have been interpolated from meteorological station data onto a uniform grid to provide complete and consistent coverage across the UK. These data at 1 km resolution have been averaged across a set of discrete geographies defining UK river basins consistent with data from UKCP18 climate projections. The dataset spans the period from 1836 to 2022, but the start time is dependent on climate variable and temporal resolution.

    The gridded data are produced for daily, monthly, seasonal and annual timescales, as well as long term averages for a set of climatological reference periods. Variables include air temperature (maximum, minimum and mean), precipitation, sunshine, mean sea level pressure, wind speed, relative humidity, vapour pressure, days of snow lying, and days of ground frost.

    This dataset supersedes previous versions, which in turn superseded the UKCP09 gridded observations. Subsequent versions may be released in due course and will follow the version numbering outlined by Hollis et al. (2018, see linked documentation).

    The changes for v1.2.0.ceda HadUK-Grid datasets are as follows:

    • Added data for calendar year 2022

    • Added newly digitised data for monthly sunshine 1910-1918

    • Net changes to the input station data used to generate this dataset:

      • Total of 125,601,744 observations
      • 122,621,050 (97.6%) unchanged
      • 26,700 (0.02%) modified for this version
      • 2,953,994 (2.35%) added in this version
      • 16,315 (0.01%) deleted from this version

    • Changes to monthly rainfall 1836-1960:

      • Total of 4,823,973 observations
      • 3,315,657 (68.7%) unchanged
      • 21,029 (0.4%) modified for this version
      • 1,487,287 (30.8%) added in this version
      • 11,155 (0.2%) deleted from this version

    The primary purpose of these data is to facilitate monitoring of UK climate and research into climate change, impacts and adaptation. The datasets have been created by the Met Office with financial support from the Department for Business, Energy and Industrial Strategy (BEIS) and the Department for Environment, Food and Rural Affairs (DEFRA) in order to support the Public Weather Service Customer Group (PWSCG), the Hadley Centre Climate Programme, and the UK Climate Projections (UKCP18) project. The output from a number of data recovery activities relating to 19th and early 20th Century data has been used in the creation of this dataset. These activities were supported by: the Met Office Hadley Centre Climate Programme; the Natural Environment Research Council project "Analysis of historic drought and water scarcity in the UK"; the UK Research & Innovation (UKRI) Strategic Priorities Fund UK Climate Resilience programme; the UK Natural Environment Research Council (NERC) Public Engagement programme; the National Centre for Atmospheric Science and the NERC GloSAT project; and the contribution of many thousands of public volunteers. The dataset is provided under the Open Government Licence.

  19. Correlation Analysis to Investigate Unconscious Mental Processes, 2018-2021...

    • b2find.dkrz.de
    Updated Apr 26, 2023
    Cite
    (2023). Correlation Analysis to Investigate Unconscious Mental Processes, 2018-2021 - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/33ce5cd1-ae8b-5c30-aacc-441cf0b1fe6c
    Explore at:
    Dataset updated
    Apr 26, 2023
    Description

    Data and code for Malejka et al. (2021), "Correlation analysis to investigate unconscious mental processes".

    A consensus among researchers is that much of our behaviour is based on rather automatic processes that we are barely aware of and over which we have little control. Research suggests that exposure to subtle cues can have dramatic effects on our decisions. For instance, asking people to provide the last two digits of their social security number biases how much they are willing to pay for products and commodities. Similarly, according to some researchers, people are more likely to be impolite and disrespectful if they have been exposed to words related to rudeness while solving anagrams. Another line of research suggests that we take many of our (important) decisions when distracted and thinking about other things, and that this 'unconscious thought' process actually improves the quality of our decisions. These studies pertain to a larger area of research usually called 'implicit cognition', which explores how unconscious mechanisms contribute to cognitive processes including perception, learning, memory, and decision making. This area of research has attracted a great deal of attention from the media and features frequently in popular science books, blogs, and documentaries. Some authors have even suggested that parts of this research could be used to improve our decisions in different domains at a societal level (for example, in health behaviour and pension planning).

    The present project focuses on a particular domain of this literature: implicit learning. Studies conducted in this area try to determine whether we are able to detect regularities in our environment without awareness of those regularities; in other words, whether we can learn something without realising that we are indeed learning it. In recent years there have been thousands of demonstrations of implicit learning effects in the scientific literature and, not surprisingly, this literature has become increasingly influential in all areas of psychology, with an important impact on our understanding of human cognition and psychopathology. Unfortunately, our previous research suggests that much of this evidence is undermined by fundamental methodological problems that preclude any strong conclusions about the reliability of unconscious learning effects. We have shown that many of these studies find unconscious learning because researchers use weaker methods to assess whether people are conscious of what they have learned than to assess whether learning has taken place. Naturally, this implies that learning is easily detected but awareness is not, which creates the illusion that learning has taken place unconsciously.

    Finding evidence of awareness in these domains is important because it suggests that some degree of control may be available as well. In the present project we propose new methods for the study of unconscious learning. Many of the problems that we have detected in our previous research can be ameliorated by employing cutting-edge statistical analyses, including Bayesian and meta-analytic methods and model fitting; however, the validity of these approaches in the domain of implicit cognition remains untested. A second goal is to conduct a large-scale exploration of the prevalence and magnitude of these problems. Our previous studies have focused on one particular effect studied in implicit learning research ('contextual cueing'), but we suspect that many of these problems transcend this domain and affect a large proportion of current studies on implicit learning; the potential impact of this assessment is difficult to overestimate. Finally, we will set up a collaboration with other international laboratories working on this topic to gather the largest and most sensitive data set of implicit learning effects available so far. This data set will be publicly available to all researchers, which will make it a fundamental resource for the study of unconscious cognitive processes for many years to come.
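
    The archived data and analysis code are not reproduced in this catalogue record. As a rough illustration of the statistical point at issue (an observed correlation between an awareness measure and a learning measure is attenuated by the unreliability of both measures), the following minimal Python sketch simulates the problem and applies Spearman's classical correction for attenuation. The effect size, noise levels and sample size are invented for the illustration; this is not the authors' actual pipeline.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500          # simulated participants
    rho = 0.6        # true latent correlation between awareness and learning

    # Draw correlated latent scores, then add measurement noise to each.
    latent = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    sd_a, sd_l = 1.0, 0.5                      # measurement-error SDs (invented)
    obs_awareness = latent[:, 0] + rng.normal(0, sd_a, n)
    obs_learning = latent[:, 1] + rng.normal(0, sd_l, n)

    # Reliability = true-score variance / observed variance = 1 / (1 + sd**2).
    rel_a, rel_l = 1 / (1 + sd_a**2), 1 / (1 + sd_l**2)

    r_obs = np.corrcoef(obs_awareness, obs_learning)[0, 1]
    r_disattenuated = r_obs / np.sqrt(rel_a * rel_l)   # Spearman's correction

    print(f"true latent correlation: {rho:.2f}")
    print(f"observed (attenuated):   {r_obs:.2f}")
    print(f"disattenuated estimate:  {r_disattenuated:.2f}")
    ```

    The observed correlation comes out well below 0.6 even though awareness and learning are strongly related at the latent level, which is precisely the kind of artefact that can make learning look unconscious when the awareness measure is noisy.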

  20. d

    Tutkijoiden lukemisen käytännöt 2016 - Dataset - B2FIND

    • b2find.dkrz.de
    Updated May 9, 2023
    Cite
    (2023). Tutkijoiden lukemisen käytännöt 2016 - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/ecfad660-bd13-5a63-ba35-9e4b50924301
    Explore at:
    Dataset updated
    May 9, 2023
    Description

    The study ("Tutkijoiden lukemisen käytännöt 2016", i.e. "Researchers' reading practices 2016") examined Finnish researchers' use of different printed and electronic publications in their work, such as scientific journals, articles, books, reports, and social media. The study is part of Carol Tenopir and Donald W. King's survey series, launched in 1977, which has followed the reading practices of researchers in different countries and scientific fields. Finnish data were also collected in 2006, but that dataset has not been archived at the Finnish Social Science Data Archive. The 2016 project was partly funded by the Finnish Cultural Foundation.

    The survey charted researchers' common reading practices as well as the publishing and reading of scientific articles and other types of publications. It charted how the respondents' work time was distributed between different types of tasks and how many publications of different types they had authored within the previous two years. It also examined how researchers searched for information, published scientific work, and cited the work of others. Questions further covered how much time the respondents spent on reading articles, how many scientific articles and other publications they had read within the previous 30 days, how recent the publications were, the reasons for reading them, what language they were in, how the respondents found the publications and gained access, where they read them, which scientific field the publications represented, and how useful they considered different publication formats for their work.

    The significance of social media was charted with questions on, for instance, how important different services and tools were for the respondents' work (e.g. blogs, cloud services, institutional repositories, academic online communities, reference management software). The respondents were also asked how important different features of electronic publications were (e.g. compatibility and readability on different devices, the possibility to share publications, advanced navigation features, global language support, and the possibility to embed audio into publications). Finally, they were asked how their reading or sharing of scientific literature had changed in recent years and how it would change in the coming years. Background variables included scientific field, job title, age, and type of workplace.

    Sampling procedure: non-probability, availability sample. Data collection mode: self-administered web questionnaire (CAWI).

  21. Not seeing a result you expected?
    Learn how you can add new datasets to our index.
