31 datasets found

n
New Site, AL annual median income by work experience and sex dataset: Aged...
neilsberg.com
Updated Feb 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neilsberg Research (2025). New Site, AL annual median income by work experience and sex dataset: Aged 15+, 2010-2023 (in 2023 inflation-adjusted dollars) // 2025 Edition [Dataset]. https://www.neilsberg.com/insights/new-site-al-income-by-gender/
Explore at:
Dataset updated
Feb 27, 2025
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Income for Male Population, Income for Female Population, Income for Male Population working full time, Income for Male Population working part time, Income for Female Population working full time, Income for Female Population working part time
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset presents median income data over a decade or more for males and females categorized by Total, Full-Time Year-Round (FT), and Part-Time (PT) employment in New Site. It showcases annual income, providing insights into gender-specific income distributions and the disparities between full-time and part-time work. The dataset can be utilized to gain insights into gender-based pay disparity trends and explore the variations in income for male and female individuals.

Key observations: Insights from 2023

Based on our analysis ACS 2019-2023 5-Year Estimates, we present the following observations: - All workers, aged 15 years and older: In New Site, the median income for all workers aged 15 years and older, regardless of work hours, was $52,083 for males and $21,667 for females.
These income figures highlight a substantial gender-based income gap in New Site. Women, regardless of work hours, earn 42 cents for each dollar earned by men. This significant gender pay gap, approximately 58%, underscores concerning gender-based income inequality in the town of New Site.
- Full-time workers, aged 15 years and older: In New Site, among full-time, year-round workers aged 15 years and older, males earned a median income of $55,156, while females earned $33,816, leading to a 39% gender pay gap among full-time workers. This illustrates that women earn 61 cents for each dollar earned by men in full-time roles. This level of income gap emphasizes the urgency to address and rectify this ongoing disparity, where women, despite working full-time, face a more significant wage discrepancy compared to men in the same employment roles.
Remarkably, across all roles, including non-full-time employment, women displayed a similar gender pay gap percentage. This indicates a consistent gender pay gap scenario across various employment types in New Site, showcasing a consistent income pattern irrespective of employment status.

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. All incomes have been adjusting for inflation and are presented in 2023-inflation-adjusted dollars.

Gender classifications include:

Male

Female

Employment type classifications include:

Full-time, year-round: A full-time, year-round worker is a person who worked full time (35 or more hours per week) and 50 or more weeks during the previous calendar year.

Part-time: A part-time worker is a person who worked less than 35 hours per week during the previous calendar year.

Variables / Data Columns

Year: This column presents the data year. Expected values are 2010 to 2023

Male Total Income: Annual median income, for males regardless of work hours

Male FT Income: Annual median income, for males working full time, year-round

Male PT Income: Annual median income, for males working part time

Female Total Income: Annual median income, for females regardless of work hours

Female FT Income: Annual median income, for females working full time, year-round

Female PT Income: Annual median income, for females working part time

Good to know

Margin of Error

Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presening these estimates in your research.

Custom data

If you do need custom data for any of your research project, report or presentation, you can contact our research staff at research@neilsberg.com for a feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

Neilsberg Research Team curates, analyze and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is a part of the main dataset for New Site median household income by race. You can refer the same here
CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection
zenodo.org
data.niaid.nih.gov
Updated Oct 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.5775511
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.5775511
Dataset updated
Oct 23, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
Description
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please fill out the form and upload the Data Sharing Agreement at Google Form.

Citation

Please cite our work as

@article{shahi2021overview, title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection}, author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas}, journal={Working Notes of CLEF}, year={2021} }

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

Subtask 3: Multi-class fake news detection of news articles (English) Sub-task A would detect fake news designed as a four-class classification problem. The training data will be released in batches and roughly about 900 articles with the respective label. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other- An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Input Data

The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

ID- Unique identifier of the news article

Title- Title of the news article

text- Text mentioned inside the news article

our rating - class of the news article as false, partially false, true, other

Output data format

public_id- Unique identifier of the news article

predicted_rating- predicted class

Sample File

public_id, predicted_rating 1, false 2, true

Sample file

public_id, predicted_domain 1, health 2, crime

Additional data for Training

To train your model, the participant can use additional data with a similar format; some datasets are available over the web. We don't provide the background truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible sources:

Fakenews Classification Datasets

Fake News Detection Challenge KDD 2020

FakeNewsNet

IMPORTANT!

We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like elections, COVID-19 etc.

Evaluation Metrics

This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.

Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

Submission Link: Coming soon

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1.https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeno, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
t
Dataset: on the similarity of web measurements under different experimental...
service.tib.eu
Updated Nov 28, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Dataset: on the similarity of web measurements under different experimental setups - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/rdr-doi-10-35097-1719
Explore at:
Dataset updated
Nov 28, 2024
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Abstract: Measurement studies are essential for research and industry alike to understand the Web's inner workings better and help quantify specific phenomena. Performing such studies is demanding due to the dynamic nature and size of the Web. An experiment's careful design and setup are complex, and many factors might affect the results. However, while several works have independently observed differences in the outcome of an experiment (e.g., the number of observed trackers) based on the measurement setup, it is unclear what causes such deviations. This work investigates the reasons for these differences by visiting 1.7M webpages with five different measurement setups. Based on this, we build dependency trees' for each page and cross-compare the nodes in the trees. The results show that the measured trees differ considerably, that the cause of differences can be attributed to specific nodes, and that even identical measurement setups can produce different results. Abstract: Measurement studies are essential for research and industry alike to understand the Web's inner workings better and help quantify specific phenomena. Performing such studies is demanding due to the dynamic nature and size of the Web. An experiment's careful design and setup are complex, and many factors might affect the results. However, while several works have independently observed differences in the outcome of an experiment (e.g., the number of observed trackers) based on the measurement setup, it is unclear what causes such deviations. This work investigates the reasons for these differences by visiting 1.7M webpages with five different measurement setups. Based on this, we builddependency trees' for each page and cross-compare the nodes in the trees. The results show that the measured trees differ considerably, that the cause of differences can be attributed to specific nodes, and that even identical measurement setups can produce different results. TechnicalRemarks: This repository hosts the dataset corresponding to the paper "On the Similarity of Web Measurements Under Different Experimental Setups", which was published at the Proceedings of the 23nd ACM Internet Measurement Conference 2023.
Z
Data from: Five Years of COVID-19 Discourse on Instagram: A Labeled...
data.niaid.nih.gov
zenodo.org
Updated Oct 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thakur, Ph.D., Nirmalya (2024). Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13896352
Explore at:
Dataset updated
Oct 21, 2024
Dataset authored and provided by
Thakur, Ph.D., Nirmalya
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please cite the following paper when using this dataset:

N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)

Abstract

The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.

For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.

The Instagram posts in this dataset are present in 161 different languages out of which the top 10 languages in terms of frequency are English (343041 posts), Spanish (30220 posts), Hindi (15832 posts), Portuguese (15779 posts), Indonesian (11491 posts), Tamil (9592 posts), Arabic (9416 posts), German (7822 posts), Italian (5162 posts), Turkish (4632 posts)

There are 535,021 distinct hashtags in this dataset with the top 10 hashtags in terms of frequency being #covid19 (169865 posts), #covid (132485 posts), #coronavirus (117518 posts), #covid_19 (104069 posts), #covidtesting (95095 posts), #coronavirusupdates (75439 posts), #corona (39416 posts), #healthcare (38975 posts), #staysafe (36740 posts), #coronavirusoutbreak (34567 posts)

The following is a description of the attributes present in this dataset

Post ID: Unique ID of each Instagram post

Post Description: Complete description of each post in the language in which it was originally published

Date: Date of publication in MM/DD/YYYY format

Language code: Language code (for example: “en”) that represents the language of the post as detected using the Google Translate API

Full Language: Full form of the language (for example: “English”) that represents the language of the post as detected using the Google Translate API

Sentiment: Results of sentiment analysis (using the preprocessed version of each post) where each post was classified as positive, negative, or neutral

Open Research Questions

This dataset is expected to be helpful for the investigation of the following research questions and even beyond:

How does sentiment toward COVID-19 vary across different languages?

How has public sentiment toward COVID-19 evolved from 2020 to the present?

How do cultural differences affect social media discourse about COVID-19 across various languages?

How has COVID-19 impacted mental health, as reflected in social media posts across different languages?

How effective were public health campaigns in shifting public sentiment in different languages?

What patterns of vaccine hesitancy or support are present in different languages?

How did geopolitical events influence public sentiment about COVID-19 in multilingual social media discourse?

What role does social media discourse play in shaping public behavior toward COVID-19 in different linguistic communities?

How does the sentiment of minority or underrepresented languages compare to that of major world languages regarding COVID-19?

What insights can be gained by comparing the sentiment of COVID-19 posts in widely spoken languages (e.g., English, Spanish) to those in less common languages?

All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
TREC 2022 Deep Learning test collection
s.cnmilf.com
data.nist.gov
+1more
Updated May 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
Explore at:
Dataset updated
May 9, 2023
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).Certain machine learning based methods, such as methods based on deep learning are known to require very large datasets for training. Lack of such large scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large scale datasets to TREC, and create a focused research effort with a rigorous blind evaluation of ranker for the passage ranking and document ranking tasks.Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought in to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
d
Coresignal | Web Data | Job Postings Data | Largest Professional Network +...
datarade.ai
.json, .csv
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Coresignal, Coresignal | Web Data | Job Postings Data | Largest Professional Network + Indeed Jobs + 3 Other Sources | Global / 437M+ Records / Updated Monthly [Dataset]. https://datarade.ai/data-products/coresignal-web-data-job-postings-data-largest-professio-coresignal
Explore at:
.json, .csvAvailable download formats
Dataset authored and provided by
Coresignal
Area covered
Jamaica, United Arab Emirates, Tokelau, Kyrgyzstan, Isle of Man, Gibraltar, Guadeloupe, Japan, Bhutan, Trinidad and Tobago
Description
➡️ You can choose from multiple data formats, delivery frequency options, and delivery methods

➡️ Extensive web datasets with job posting data from 5 leading B2B data sources

➡️ Jobs API designed for effortless search and enrichment (accessible using a user-friendly self-service tool)

➡️ Fresh data: daily updates, easy change tracking with dedicated data fields, and a constant flow of new data

➡️ You get all the necessary resources for evaluating our web dataset: a free consultation, a data sample, or free credits for testing the API.

✅ For HR tech

Job posting web dataset can provide insights into the demand for different types of jobs and skills, as well as trends in job postings over time. With access to historical data, companies can develop predictive models.

✅ For Investors

Explore expansion trends, analyze hiring practices, and predict company or industry growth rates, enabling the extraction of actionable strategic and operational insights. At a larger scale of analysis, Job Postings Database can be leveraged to forecast market trends and predict the growth of specific industries.

✅ For Lead generation

Coresignal’s Job Postings Data is ideal for lead generation and determining purchasing intent. In B2B sales, job postings can help identify the best time to approach a prospective client.

➡️ Why 400+ data-powered businesses choose Coresignal:

Experienced data provider (in the market since 2016)

Exceptional client service

Responsible and secure data collection
Spanish Wine Quality Dataset
kaggle.com
Updated Apr 26, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
fedesoriano (2022). Spanish Wine Quality Dataset [Dataset]. https://www.kaggle.com/datasets/fedesoriano/spanish-wine-quality-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 26, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
fedesoriano
Description
Similar Datasets

CERN Proton Collision Dataset: LINK

Airfoil Self-Noise Dataset: LINK

CERN Electron Collision Data: LINK

Wind Speed Prediction Dataset: LINK

Stellar Classification Dataset - SDSS17: LINK

Context

This dataset is related to red variants of spanish wines. The dataset describes several popularity and description metrics their effect on it's quality. The datasets can be used for classification or regression tasks. The classes are ordered and not balanced (i.e. the quality goes from almost 5 to 4 points). The task is to predict either the quality of wine or the prices using the given data.

Content

The dataset contains 7500 different types of red wines from Spain with 11 features that describe their price, rating, and even some flavor description. The was collected by me using web scraping from different sources (from wine specialized pages to supermarkets). Please acknowledge the hard work to obtain and create this dataset, you can upvote it if you find it useful to use on your projects :)

If the dataset becomes popular I will probably try to create a bigger version with wines from other countries and a wider spectrum of ratings.

Attribute Information

winery: Winery name

wine: Name of the wine

year: Year in which the grapes were harvested

rating: Average rating given to the wine by the users [from 1-5]

num_reviews: Number of users that reviewed the wine

country: Country of origin [Spain]

region: Region of the wine

price: Price in euros [€]

type: Wine variety

body: Body score, defined as the richness and weight of the wine in your mouth [from 1-5]

acidity: Acidity score, defined as wine's “pucker” or tartness; it's what makes a wine refreshing and your tongue salivate and want another sip [from 1-5]

Citation Request

If you want to cite this data:

fedesoriano. (April 2022). Spanish Wine Quality Dataset. Retrieved [Date Retrieved] from https://www.kaggle.com/datasets/fedesoriano/spanish-wine-quality-dataset
Data cleaning using unstructured data
zenodo.org
zip
Updated Jul 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rihem Nasfi; Rihem Nasfi; Antoon Bronselaer; Antoon Bronselaer (2024). Data cleaning using unstructured data [Dataset]. http://doi.org/10.5281/zenodo.13135983
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13135983
Dataset updated
Jul 30, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rihem Nasfi; Rihem Nasfi; Antoon Bronselaer; Antoon Bronselaer
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this project, we work on repairing three datasets:

Trials design: This dataset was obtained from the European Union Drug Regulating Authorities Clinical Trials Database (EudraCT) register and the ground truth was created from external registries. In the dataset, multiple countries, identified by the attribute country_protocol_code, conduct the same clinical trials which is identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial.

Trials population: This dataset delineates the demographic origins of participants in clinical trials primarily conducted across European countries. This dataset include structured attributes indicating whether the trial pertains to a specific gender, age group or healthy volunteers. Each of these categories is labeled as (`1') or (`0') respectively denoting whether it is included in the trials or not. It is important to note that the population category should remain consistent across all countries conducting the same clinical trial identified by an eudract_number. The ground truth samples in the dataset were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. Additionally, the dataset comprises other unstructured attributes that categorize the inclusion criteria for trial participants such as inclusion.

Allergens: This dataset contains information about products and their allergens. The data was collected from the German version of the `Alnatura' (Access date: 24 November, 2020), a free database of food products from around the world `Open Food Facts', and the websites: `Migipedia', 'Piccantino', and `Das Ist Drin'. There may be overlapping products across these websites. Each product in the dataset is identified by a unique code. Samples with the same code represent the same product but are extracted from a differentb source. The allergens are indicated by (‘2’) if present, or (‘1’) if there are traces of it, and (‘0’) if it is absent in a product. The dataset also includes information on ingredients in the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients.

N.B: Each '.zip' file contains a set of 5 '.csv' files which are part of the afro-mentioned datasets:

"{dataset_name}_train.csv": samples used for the ML-model training. (e.g "allergens_train.csv")

"{dataset_name}_test.csv": samples used to test the the ML-model performance. (e.g "allergens_test.csv")

"{dataset_name}_golden_standard.csv": samples represent the ground truth of the test samples. (e.g "allergens_golden_standard.csv")

"{dataset_name}_parker_train.csv": samples repaired using Parker Engine used for the ML-model training. (e.g "allergens_parker_train.csv")

"{dataset_name}_parker_train.csv": samples repaired using Parker Engine used to test the the ML-model performance. (e.g "allergens_parker_test.csv")
SAPFLUXNET: A global database of sap flow measurements
zenodo.org
explore.openaire.eu
+1more
zip
Updated Sep 26, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rafael Poyatos; Rafael Poyatos; Víctor Granda; Víctor Granda; Víctor Flo; Víctor Flo; Roberto Molowny-Horas; Roberto Molowny-Horas; Kathy Steppe; Kathy Steppe; Maurizio Mencuccini; Maurizio Mencuccini; Jordi Martínez-Vilalta; Jordi Martínez-Vilalta (2020). SAPFLUXNET: A global database of sap flow measurements [Dataset]. http://doi.org/10.5281/zenodo.3697807
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3697807
Dataset updated
Sep 26, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rafael Poyatos; Rafael Poyatos; Víctor Granda; Víctor Granda; Víctor Flo; Víctor Flo; Roberto Molowny-Horas; Roberto Molowny-Horas; Kathy Steppe; Kathy Steppe; Maurizio Mencuccini; Maurizio Mencuccini; Jordi Martínez-Vilalta; Jordi Martínez-Vilalta
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
General description

SAPFLUXNET contains a global database of sap flow and environmental data, together with metadata at different levels.
SAPFLUXNET is a harmonised database, compiled from contributions from researchers worldwide. This version (0.1.4) contains more than 200 datasets, from all over the World, covering a broad range of bioclimatic conditions.
More information on the coverage can be found here: http://sapfluxnet.creaf.cat/shiny/sfn_progress_dashboard/.

The SAPFLUXNET project has been developed by researchers at CREAF and other institutions (http://sapfluxnet.creaf.cat/#team), coordinated by Rafael Poyatos (CREAF, http://www.creaf.cat/staff/rafael-poyatos-lopez), and funded by two Spanish Young Researcher's Grants (SAPFLUXNET, CGL2014-55883-JIN; DATAFORUSE, RTI2018-095297-J-I00 ) and an Alexander von Humboldt Research Fellowship for Experienced Researchers).

Variables and units

SAPFLUXNET contains whole-plant sap flow and environmental variables at sub-daily temporal resolution. Both sap flow and environmental time series have accompanying flags in a data frame, one for sap flow and another for environmental
variables. These flags store quality issues detected during the quality control process and can be used to add further quality flags.

Metadata contain relevant variables informing about site conditions, stand characteristics, tree and species attributes, sap flow methodology and details on environmental measurements. To learn more about variables, units and data flags please use the functionalities implemented in the sapfluxnetr package (https://github.com/sapfluxnet/sapfluxnetr). In particular, have a look at the package vignettes using R:

# remotes::install_github( # 'sapfluxnet/sapfluxnetr', # build_opts = c("--no-resave-data", "--no-manual", "--build-vignettes") # ) library(sapfluxnetr) # to list all vignettes vignette(package='sapfluxnetr') # variables and units vignette('metadata-and-data-units', package='sapfluxnetr') # data flags vignette('data-flags', package='sapfluxnetr')

Data formats

SAPFLUXNET data can be found in two formats: 1) RData files belonging to the custom-built 'sfn_data' class and 2) Text files in .csv format. We recommend using the sfn_data objects together with the sapfluxnetr package, although we also provide the text files for convenience. For each dataset, text files are structured in the same way as the slots of sfn_data objects; if working with text files, we recommend that you check the data structure of 'sfn_data' objects in the corresponding vignette.

Working with sfn_data files

To work with SAPFLUXNET data, first they have to be downloaded from Zenodo, maintaining the folder structure. A first level in the folder hierarchy corresponds to file format, either RData files or csv's. A second level corresponds to how sap flow is expressed: per plant, per sapwood area or per leaf area. Please note that interconversions among the magnitudes have been performed whenever possible. Below this level, data have been organised per dataset. In the case of RData files, each dataset is contained in a sfn_data object, which stores all data and metadata in different slots (see the vignette 'sfn-data-classes'). In the case of csv files, each dataset has 9 individual files, corresponding to metadata (5), sap flow and environmental data (2) and their corresponding data flags (2).

After downloading the entire database, the sapfluxnetr package can be used to:
- Work with data from a single site: data access, plotting and time aggregation.
- Select the subset datasets to work with.
- Work with data from multiple sites: data access, plotting and time aggregation.

Please check the following package vignettes to learn more about how to work with sfn_data files:

Quick guide

Metadata and data units

sfn_data classes

Custom aggregation

Memory and parallelization

Working with text files

We recommend to work with sfn_data objects using R and the sapfluxnetr package and we do not currently provide code to work with text files.

Data issues and reporting

Please report any issue you may find in the database by sending us an email: sapfluxnet@creaf.uab.cat.

Temporary data fixes, detected but not yet included in released versions will be published in SAPFLUXNET main web page ('Known data errors').

Data access, use and citation

This version of the SAPFLUXNET database is open access. We are working on a data paper describing the database, but, before its publication, please cite this Zenodo entry if SAPFLUXNET is used in any publication.
f
Data from: Mpox Narrative on Instagram: A Labeled Multilingual Dataset of...
figshare.com
xlsx
Updated Oct 12, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nirmalya Thakur (2024). Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.27072247.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.27072247.v1
Dataset updated
Oct 12, 2024
Dataset provided by
figshare
Authors
Nirmalya Thakur
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please cite this paper when using this dataset: N. Thakur, “Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis,” arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292Abstract: The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed and updated regarding various aspects of the outbreaks. As a result, in the last few years, researchers from different disciplines have focused on the development of social media datasets focusing on different virus outbreaks. No prior work in this field has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper (stated above) aims to address this research gap. It presents this multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. This dataset contains Instagram posts about mpox in 52 languages.For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post intoone of the fine-grain sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutralhate or not hateanxiety/stress detected or no anxiety/stress detected.These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.The 52 distinct languages in which Instagram posts are present in the dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.The following is a description of the attributes present in this dataset:Post ID: Unique ID of each Instagram postPost Description: Complete description of each post in the language in which it was originally publishedDate: Date of publication in MM/DD/YYYY formatLanguage: Language of the post as detected using the Google Translate APITranslated Post Description: Translated version of the post description. All posts which were not in English were translated into English using the Google Translate API. No language translation was performed for English posts.Sentiment: Results of sentiment analysis (using the preprocessed version of the translated Post Description) where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, and neutralHate: Results of hate speech detection (using the preprocessed version of the translated Post Description) where each post was classified as hate or not hateAnxiety or Stress: Results of anxiety or stress detection (using the preprocessed version of the translated Post Description) where each post was classified as stress/anxiety detected or no stress/anxiety detected.All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
Esports Performance Rankings and Results
kaggle.com
Updated Dec 12, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2022). Esports Performance Rankings and Results [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-collegiate-esports-performance-with-bu/suggestions
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 12, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
The Devastator
Description
Esports Performance Rankings and Results

Performance Rankings and Results from Multiple Esports Platforms

By [source]

About this dataset

This dataset provides a detailed look into the world of competitive video gaming in universities. It covers a wide range of topics, from performance rankings and results across multiple esports platforms to the individual team and university rankings within each tournament. With an incredible wealth of data, fans can discover statistics on their favorite teams or explore the challenges placed upon university gamers as they battle it out to be the best. Dive into the information provided and get an inside view into the world of collegiate esports tournaments as you assess all things from Match ID, Team 1, University affiliations, Points earned or lost in each match and special Seeds or UniSeeds for exceptional teams. Of course don't forget about exploring all the great Team Names along with their corresponding websites for further details on stats across tournaments!

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

Download Files First, make sure you have downloaded the CS_week1, CS_week2, CS_week3 and seeds datasets on Kaggle. You will also need to download the currentRankings file for each week of competition. All files should be saved using their originally assigned name in order for your analysis tools to read them properly (ie: CS_week1.csv).

Understand File Structure Once all data has been collected and organized into separate files on your desktop/laptop computer/mobile device/etc., it's time to become familiar with what type of information is included in each file. The main folder contains three main data files: week1-3 and seedings. The week1-3 contain teams matched against one another according to university, point score from match results as well as team name and website URL associated with university entry; whereas the seedings include a ranking system amongst university entries which are accompanied by information regarding team names, website URLs etc.. Furthermore, there is additional file featured which contains currentRankings scores for each individual player/teams for an first given period of competition (ie: first week).

Analyzing Data Now that everything is set up on your end it’s time explore! You can dive deep into trends amongst universities or individual players in regards to specific match performances or standings overall throughout weeks of competition etc… Furthermore you may also jumpstart insights via further creation of graphs based off compiled date from sources taken from BUECTracker dataset! For example let us say we wanted compare two universities- let's say Harvard University v Cornell University - against one another since beginning of event i we shall extract respective points(column),dates(column)(found under result tab) ,regions(csilluminating North America vs Europe etc)general stats such as maps played etc.. As well any other custom ideas which would come along in regards when dealing with similar datasets!

Research Ideas

Analyze the performance of teams and identify areas for improvement for better performance in future competitions.

Assess which esports platforms are the most popular among gamers.

Gain a better understanding of player rankings across different regions, based on rankings system, to create targeted strategies that could boost individual players' scoring potential or team overall success in competitive gaming events

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: CS_week1.csv | Column name | Description | |:---------------|:----------------------------------------------| | Match ID | Unique identifier for each match. (Integer) | | Team 1 | Name of the first team in the match. (String) | | University | University associated with the team. (String) |

File: CS_week1_currentRankings.csv | Column name | Description | |:--------------|:-----------------------------------------------------------|...
Z
Dataset for: The Evolution of the Manosphere Across the Web
data.niaid.nih.gov
Updated Aug 30, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Emiliano De Cristofaro (2020). Dataset for: The Evolution of the Manosphere Across the Web [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4007912
Explore at:
Dataset updated
Aug 30, 2020
Dataset provided by
Emiliano De Cristofaro
Savvas Zannettou
Manoel Horta Ribeiro
Jeremy Blackburn
Summer Long
Stephanie Greenberg
Barry Bradlyn
Gianluca Stringhini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Evolution of the Manosphere Across the Web

We make available data related to subreddit and standalone forums from the manosphere.

We also make available Perspective API annotations for all posts.

You can find the code in GitHub.

Please cite this paper if you use this data:

@article{ribeiroevolution2021, title={The Evolution of the Manosphere Across the Web}, author={Ribeiro, Manoel Horta and Blackburn, Jeremy and Bradlyn, Barry and De Cristofaro, Emiliano and Stringhini, Gianluca and Long, Summer and Greenberg, Stephanie and Zannettou, Savvas}, booktitle = {{Proceedings of the 15th International AAAI Conference on Weblogs and Social Media (ICWSM'21)}}, year={2021} }

Reddit data

We make available data for forums and for relevant subreddits (56 of them, as described in subreddit_descriptions.csv). These are available, 1 line per post in each subreddit Reddit in /ndjson/reddit.ndjson. A sample for example is:

{ "author": "Handheld_Gaming", "date_post": 1546300852, "id_post": "abcusl", "number_post": 9.0, "subreddit": "Braincels", "text_post": "Its been 2019 for almost 1 hour And I am at a party with 120 people, half of them being foids. The last year had been the best in my life. I actually was happy living hope because I was redpilled to the death.

Now that I am blackpilled I see that I am the shortest of all men and that I am the only one with a recessed jaw.

Its over. Its only thanks to my age old friendship with chads and my social skills I had developed in the past year that a lot of men like me a lot as a friend.

No leg lengthening syrgery is gonna save me. Ignorance was a bliss. Its just horror now seeing that everyone can make out wirth some slin hoe at the party.

I actually feel so unbelivably bad for turbomanlets. Life as an unattractive manlet is a pain, I cant imagine the hell being an ugly turbomanlet is like. I would have roped instsntly if I were one. Its so unfair.

Tallcels are fakecels and they all can (and should) suck my cock.

If I were 17cm taller my life would be a heaven and I would be the happiest man alive.

Just cope and wait for affordable body tranpslants.", "thread": "t3_abcusl" }

Forums

We here describe the .sqlite and .ndjson files that contain the data from the following forums.

(avfm) --- https://d2ec906f9aea-003845.vbulletin.net (incels) --- https://incels.co/ (love_shy) --- http://love-shy.com/lsbb/ (redpilltalk) --- https://redpilltalk.com/ (mgtow) --- https://www.mgtow.com/forums/ (rooshv) --- https://www.rooshvforum.com/ (pua_forum) --- https://www.pick-up-artist-forum.com/ (the_attraction) --- http://www.theattractionforums.com/

The files are in folders /sqlite/ and /ndjson.

2.1 .sqlite

All the tables in the sqlite. datasets follow a very simple {key:value} format. Each key is a thread name (for example /threads/housewife-is-like-a-job.123835/) and each value is a python dictionary or a list. This file contains three tables:

idx each key is the relative address to a thread and maps to a post. Each post is represented by a dict:

"type": (list) in some forums you can add a descriptor such as [RageFuel] to each topic, and you may also have special types of posts, like sticked/pool/locked posts.
"title": (str) title of the thread; "link": (str) link to the thread; "author_topic": (str) username that created the thread; "replies": (int) number of replies, may differ from number of posts due to difference in crawling date; "views": (int) number of views; "subforum": (str) name of the subforum; "collected": (bool) indicates if raw posts have been collected; "crawled_idx_at": (str) datetime of the collection.

processed_posts each key is the relative address to a thread and maps to a list with posts (in order). Each post is represented by a dict:

"author": (str) author's username; "resume_author": (str) author's little description; "joined_author": (str) date author joined; "messages_author": (int) number of messages the author has; "text_post": (str) text of the main post; "number_post": (int) number of the post in the thread; "id_post": (str) unique post identifier (depends), for sure unique within thread; "id_post_interaction": (list) list with other posts ids this post quoted; "date_post": (str) datetime of the post, "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw'); "thread": (str) same as key; "crawled_at": (str) datetime of the collection.

raw_posts each key is the relative address to a thread and maps to a list with unprocessed posts (in order). Each post is represented by a dict:

"post_raw": (binary) raw html binary; "crawled_at": (str) datetime of the collection.

2.2 .ndjson

Each line consists of a json object representing a different comment with the following fields:

"author": (str) author's username; "resume_author": (str) author's little description; "joined_author": (str) date author joined; "messages_author": (int) number of messages the author has; "text_post": (str) text of the main post; "number_post": (int) number of the post in the thread; "id_post": (str) unique post identifier (depends), for sure unique within thread; "id_post_interaction": (list) list with other posts ids this post quoted; "date_post": (str) datetime of the post, "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw'); "thread": (str) same as key; "crawled_at": (str) datetime of the collection.

Perspective

We also run each post and reddit post through perspective, the files are located in the /perspective/ folder. They are compressed with gzip. One example output

{ "id_post": 5200, "hate_output": { "text": "I still can\u2019t wrap my mind around both of those articles about these c~~~s sleeping with poor Haitian Men. Where\u2019s the uproar?, where the hell is the outcry?, the \u201cpig\u201d comments or the \u201ccreeper comments\u201d. F~~~ing hell, if roles were reversed and it was an article about Men going to Europe where under 18 sex in legal, you better believe they would crucify the writer of that article and DEMAND an apology by the paper that wrote it.. This is exactly what I try and explain to people about the double standards within our modern society. A bunch of older women, wanna get their kicks off by sleeping with poor Men, just before they either hit or are at menopause age. F~~~ing unreal, I\u2019ll never forget going to Sweden and Norway a few years ago with one of my buddies and his girlfriend who was from there, the legal age of consent in Norway is 16 and in Sweden it\u2019s 15. I couldn\u2019t believe it, but my friend told me \u201c hey, it\u2019s normal here\u201d . Not only that but the age wasn\u2019t a big different in other European countries as well. One thing i learned very quickly was how very Misandric Sweden as well as Denmark were.", "TOXICITY": 0.6079781, "SEVERE_TOXICITY": 0.53744453, "INFLAMMATORY": 0.7279288, "PROFANITY": 0.58842486, "INSULT": 0.5511079, "OBSCENE": 0.9830818, "SPAM": 0.17009115 } }

Working with sqlite

A nice way to read some of the files of the dataset is using SqliteDict, for example:

from sqlitedict import SqliteDict processed_posts = SqliteDict("./data/forums/incels.sqlite", tablename="processed_posts")

for key, posts in processed_posts.items(): for post in posts: # here you could do something with each post in the dataset pass

Helpers

Additionally, we provide two .sqlite files that are helpers used in the analyses. These are related to reddit, and not to the forums! They are:

channel_dict.sqlite a sqlite where each key corresponds to a subreddit and values are lists of dictionaries users who posted on it, along with timestamps.

author_dict.sqlite a sqlite where each key corresponds to an author and values are lists of dictionaries of the subreddits they posted on, along with timestamps.

These are used in the paper for the migration analyses.

Examples and particularities for forums

Although we did our best to clean the data and be consistent across forums, this is not always possible. In the following subsections we talk about the particularities of each forum, directions to improve the parsing which were not pursued as well as give some examples on how things work in each forum.

6.1 incels

Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

types: for the incel forums the special types associated with each thread in the idx table are “Sticky”, “Pool”, “Closed”, and the custom types added by users, such as [LifeFuel]. These last ones are all in brackets. You can see some examples of these in the on the example thread page.

quotes: quotes in this forum were quite nice and thus, all quotations are deterministic.

6.2 LoveShy

Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

types: no types were parsed. There are some rules in the forum, but not significant.

quotes: quotes were obtained from exact text+author match, or author match + a jaccard
r
ccPDB - Compilation and Creation of datasets from PDB
rrid.site
neuinfo.org
+1more
Updated Jul 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). ccPDB - Compilation and Creation of datasets from PDB [Dataset]. http://identifiers.org/RRID:SCR_005870
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_005870
Dataset updated
Jul 15, 2025
Description
ccPDB (Compilation and Creation of datasets from PDB) is designed to provide service to scientific community working in the field of function or structure annoation of proteins. This database of datasets is based on Protein Data Bank (PDB), where all datasets were derived from PDB. ccPDB have four modules; i) compilation of datasets, ii) creation of datasets, iii) web services and iv) Important links. * Compilation of Datasets: Datasets at ccPDB can be classified in two categories, i) datasets collected from literature and ii) datasets compiled from PDB. We are in process of collecting PDB datasetsfrom literature and maintaining at ccPDB. We are also requesting community to suggest datasets. In addition, we generate datasets from PDB, these datasets were generated using commonly used standard protocols like non-redundant chains, structures solved at high resolution. * Creation of datasets: This module developed for creating customized datasets where user can create a dataset using his/her conditions from PDB. This module will be useful for those users who wish to create a new dataset as per ones requirement. This module have six steps, which are described in help page. * Web Services: We integrated following web services in ccPDB; i) Analyze of PDB ID service allows user to submit their PDB on around 40 servers from single point, ii) BLAST search allows user to perform BLAST search of their protein against PDB, iii) Structural information service is designed for annotating a protein structure from PDB ID, iv) Search in PDB facilitate user in searching structures in PDB, v)Generate patterns service facility to generate different types of patterns required for machine learning techniques and vi) Download useful information allows user to download various types of information for a given set of proteins (PDB IDs). * Important Links: One of major objectives of this web site is to provide links to web servers related to functional annotation of proteins. In first phase we have collected and compiled these links in different categories. In future attempt will be made to collect as many links as possible.
E
SAS: Semantic Artist Similarity Dataset
live.european-language-grid.eu
zenodo.org
txt
Updated Oct 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). SAS: Semantic Artist Similarity Dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7418
Explore at:
txtAvailable download formats
Dataset updated
Oct 28, 2023
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Semantic Artist Similarity dataset consists of two datasets of artists entities with their corresponding biography texts, and the list of top-10 most similar artists within the datasets used as ground truth. The dataset is composed by a corpus of 268 artists and a slightly larger one of 2,336 artists, both gathered from Last.fm in March 2015. The former is mapped to the MIREX Audio and Music Similarity evaluation dataset, so that its similarity judgments can be used as ground truth. For the latter corpus we use the similarity between artists as provided by the Last.fm API. For every artist there is a list with the top-10 most related artists. In the MIREX dataset there are 188 artists with at least 10 similar artists, the other 80 artists have less than 10 similar artists. In the Last.fm API dataset all artists have a list of 10 similar artists. There are 4 files in the dataset.mirex_gold_top10.txt and lastfmapi_gold_top10.txt have the top-10 lists of artists for every artist of both datasets. Artists are identified by MusicBrainz ID. The format of the file is one line per artist, with the artist mbid separated by a tab with the list of top-10 related artists identified by their mbid separated by spaces.artist_mbid \t artist_mbid_top10_list_separated_by_spaces mb2uri_mirex and mb2uri_lastfmapi.txt have the list of artists. In each line there are three fields separated by tabs. First field is the MusicBrainz ID, second field is the last.fm name of the artist, and third field is the DBpedia uri.artist_mbid \t lastfm_name \t dbpedia_uri There are also 2 folders in the dataset with the biography texts of each dataset. Each .txt file in the biography folders is named with the MusicBrainz ID of the biographied artist. Biographies were gathered from the Last.fm wiki page of every artist.Using this datasetWe would highly appreciate if scientific publications of works partly based on the Semantic Artist Similarity dataset quote the following publication:Oramas, S., Sordo M., Espinosa-Anke L., & Serra X. (In Press). A Semantic-based Approach for Artist Similarity. 16th International Society for Music Information Retrieval Conference.We are interested in knowing if you find our datasets useful! If you use our dataset please email us at mtg-info@upf.edu and tell us about your research. https://www.upf.edu/web/mtg/semantic-similarity
o
Data from: PANACEA dataset - Heterogeneous COVID-19 Claims
explore.openaire.eu
data.niaid.nih.gov
Updated Jul 15, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Miguel Arana-Catania; Elena Kochkina; Arkaitz Zubiaga; Maria Liakata; Rob Procter; Yulan He (2022). PANACEA dataset - Heterogeneous COVID-19 Claims [Dataset]. http://doi.org/10.5281/zenodo.6493847
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6493847
Dataset updated
Jul 15, 2022
Authors
Miguel Arana-Catania; Elena Kochkina; Arkaitz Zubiaga; Maria Liakata; Rob Procter; Yulan He
Description
The peer-reviewed publication for this dataset has been presented in the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), and can be accessed here: https://arxiv.org/abs/2205.02596. Please cite this when using the dataset. This dataset contains a heterogeneous set of True and False COVID claims and online sources of information for each claim. The claims have been obtained from online fact-checking sources, existing datasets and research challenges. It combines different data sources with different foci, thus enabling a comprehensive approach that combines different media (Twitter, Facebook, general websites, academia), information domains (health, scholar, media), information types (news, claims) and applications (information retrieval, veracity evaluation). The processing of the claims included an extensive de-duplication process eliminating repeated or very similar claims. The dataset is presented in a LARGE and a SMALL version, accounting for different degrees of similarity between the remaining claims (excluding respectively claims with a 90% and 99% probability of being similar, as obtained through the MonoT5 model). The similarity of claims was analysed using BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) with MonoT5 re-ranking (Nogueira et al., 2020), and BERTScore (Zhang et al., 2019). The processing of the content also involved removing claims making only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; and entries with claims or fact-checking sources in languages other than English. The claims were analysed to identify types of claims that may be of particular interest, either for inclusion or exclusion depending on the type of analysis. The following types were identified: (1) Multimodal; (2) Social media references; (3) Claims including questions; (4) Claims including numerical content; (5) Named entities, including: PERSON − People, including fictional; ORGANIZATION − Companies, agencies, institutions, etc.; GPE − Countries, cities, states; FACILITY − Buildings, highways, etc. These entities have been detected using a RoBERTa base English model (Liu et al., 2019) trained on the OntoNotes Release 5.0 dataset (Weischedel et al., 2013) using Spacy. The original labels for the claims have been reviewed and homogenised from the different criteria used by each original fact-checker into the final True and False labels. The data sources used are: - The CoronaVirusFacts/DatosCoronaVirus Alliance Database. https://www.poynter.org/ifcn-covid-19-misinformation/ - CoAID dataset (Cui and Lee, 2020) https://github.com/cuilimeng/CoAID - MM-COVID (Li et al., 2020) https://github.com/bigheiniu/MM-COVID - CovidLies (Hossain et al., 2020) https://github.com/ucinlp/covid19-data - TREC Health Misinformation track https://trec-health-misinfo.github.io/ - TREC COVID challenge (Voorhees et al., 2021; Roberts et al., 2020) https://ir.nist.gov/covidSubmit/data.html The LARGE dataset contains 5,143 claims (1,810 False and 3,333 True), and the SMALL version 1,709 claims (477 False and 1,232 True). The entries in the dataset contain the following information: - Claim. Text of the claim. - Claim label. The labels are: False, and True. - Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institutions sites, and peer-reviewed scientific journals. - Original information source. Information about which general information source was used to obtain the claim. - Claim type. The different types, previously explained, are: Multimodal, Social Media, Questions, Numerical, and Named Entities. Funding. This work was supported by the UK Engineering and Physical Sciences Research Council (grant no. EP/V048597/1, EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by the UK Research and Innovation (grant no. EP/V030302/1, EP/V020579/1). References - Arana-Catania M., Kochkina E., Zubiaga A., Liakata M., Procter R., He Y.. Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. NAACL 2022 https://arxiv.org/abs/2205.02596 - Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at trec-3. Nist Special Publication Sp,109:109. - Fabio Crestani, Mounia Lalmas, Cornelis J Van Rijsbergen, and Iain Campbell. 1998. “is this document relevant?. . . probably” a survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552. - Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc. - Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pre-trained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Pro...
d
Assessor - Commercial Valuation Data
catalog.data.gov
datacatalog.cookcountyil.gov
+1more
Updated Apr 12, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
datacatalog.cookcountyil.gov (2025). Assessor - Commercial Valuation Data [Dataset]. https://catalog.data.gov/dataset/assessor-commercial-valuation-data
Explore at:
Dataset updated
Apr 12, 2025
Dataset provided by
datacatalog.cookcountyil.gov
Description
Commercial valuation data collected and maintained by the Cook County Assessor's Office, from 2021 to present. The office uses this data primarily for valuation and reporting. This dataset consolidates the individual Excel workbooks available on the Assessor's website into a single shared format. Properties are valued using similar valuation methods within each model group, per township, per year (in the year the township is reassessed). This dataset has been cleaned minimally, only enough to fit the source Excel workbooks together - because models are updated for each township in the year it is reassessed, users should expect inconsistencies within columns across time and townships. When working with Parcel Index Numbers (PINs) make sure to zero-pad them to 14 digits. Some datasets may lose leading zeros for PINs when downloaded. This data is property-level. Each 14-digit key PIN represents one commercial property. Commercial properties can and often do encompass multiple PINs. Additional notes: Current property class codes, their levels of assessment, and descriptions can be found on the Assessor's website. Note that class codes details can change across time. Data will be updated yearly, once the Assessor has finished mailing first pass values. If users need more up-to-date information they can access it through the Assessor's website. The Assessor's Office reassesses roughly one third of the county (a triad) each year. For commercial valuations, this means each year of data only contain the triad that was reassessed that year. Which triads and their constituent townships have been reassessed recently as well the year of their reassessment can be found in the Assessor's assessment calendar. One KeyPIN is one Commercial Entity. Each KeyPIN (entity) can be comprised of one single PIN (parcel), or multiple PINs as designated in the pins column. Additionally, each KeyPIN might have multiple rows if it is associated with different class codes or model groups. This can occur because many of Cook County's parcels have multiple class codes associated with them if they have multiple uses (such as residential and commercial). Users should not expect this data to be unique by any combination of available columns. Commercial properties are calculated by first determining a property’s use (office, retail, apartments, industrial, etc.), then the property is grouped with similar or like-kind property types. Next, income generated by the property such as rent or incidental income streams like parking or advertising signage is examined. Next, market-level vacancy based on location and property type is examined. In addition, new construction that has not yet been leased is also considered. Finally, expenses such as property taxes, insurance, repair and maintenance costs, property management fees, and service expenditures for professional services are examined. Once a snapshot of a property’s income statement is captured based on market data, a standard valuation metric called a “capitalization rate” to convert income to value is applied. This data was used to produce initial valuations mailed to property owners. It does not incorporate any subsequent changes to a property’s class, characteristics, valuation, or assessed value from appeals.Township codes can be found in the legend of this map. For more information on the sourcing of attached data and the preparation of this datase
f
Datasets description.
plos.figshare.com
xls
Updated Jan 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Klaifer Garcia; Pedro Shiguihara; Lilian Berton (2024). Datasets description. [Dataset]. http://doi.org/10.1371/journal.pone.0296929.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0296929.t002
Dataset updated
Jan 26, 2024
Dataset provided by
PLOS ONE
Authors
Klaifer Garcia; Pedro Shiguihara; Lilian Berton
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Every day thousands of news are published on the web and filtering tools can be used to extract knowledge on specific topics. The categorization of news into a predefined set of topics is a subject widely studied in the literature, however, most works are restricted to documents in English. In this work, we make two contributions. First, we introduce a Portuguese news dataset collected from WikiNews an open-source media that provide news from different sources. Since there is a lack of datasets for Portuguese, and an existing one is from a single news channel, we aim to introduce a dataset from different news channels. The availability of comprehensive datasets plays a key role in advancing research. Second, we compare different architectures for Portuguese news classification, exploring different text representations (BoW, TF-IDF, Embedding) and classification techniques (SVM, CNN, DJINN, BERT) for documents in Portuguese, covering classical methods and current technologies. We show the trade-off between accuracy and training time for this application. We aim to show the capabilities of available algorithms and the challenges faced in the area.
C
City of Pittsburgh Traffic Count
data.wprdc.org
datasets.ai
csv, geojson
Updated Jun 9, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
City of Pittsburgh (2024). City of Pittsburgh Traffic Count [Dataset]. https://data.wprdc.org/dataset/traffic-count-data-city-of-pittsburgh
Explore at:
csv, geojson(421434)Available download formats
Dataset updated
Jun 9, 2024
Dataset provided by
City of Pittsburgh
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Pittsburgh
Description
This traffic-count data is provided by the City of Pittsburgh's Department of Mobility & Infrastructure (DOMI). Counters were deployed as part of traffic studies, including intersection studies, and studies covering where or whether to install speed humps. In some cases, data may have been collected by the Southwestern Pennsylvania Commission (SPC) or BikePGH.

Data is currently available for only the most-recent count at each location.

Traffic count data is important to the process for deciding where to install speed humps. According to DOMI, they may only be legally installed on streets where traffic counts fall below a minimum threshhold. Residents can request an evaluation of their street as part of DOMI's Neighborhood Traffic Calming Program. The City has also shared data on the impact of the Neighborhood Traffic Calming Program in reducing speeds.

Different studies may collect different data. Speed hump studies capture counts and speeds. SPC and BikePGH conduct counts of cyclists. Intersection studies included in this dataset may not include traffic counts, but reports of individual studies may be requested from the City. Despite the lack of count data, intersection studies are included to facilitate data requests.

Data captured by different types of counting devices are included in this data. StatTrak counters are in use by the City, and capture data on counts and speeds. More information about these devices may be found on the company's website. Data includes traffic counts and average speeds, and may also include separate counts of bicycles.

Tubes are deployed by both SPC and BikePGH and used to count cyclists. SPC may also deploy video counters to collect data.

NOTE: The data in this dataset has not updated since 2021 because of a broken data feed. We're working to fix it.
Arizona State University Flixster Data Set
academictorrents.com
bittorrent
Updated Dec 23, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Flixter (2013). Arizona State University Flixster Data Set [Dataset]. https://academictorrents.com/details/4960373ea6dec89153639b0975ea92f9e3d3c914
Explore at:
bittorrent(36140875)Available download formats
Dataset updated
Dec 23, 2013
Dataset provided by
Flixster.comhttps://www.facebook.com/FlixsterMovies
Authors
Flixter
License
https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified
Description
Flixster is a social movie site allowing users to share movie ratings, discover new movies and meet others with similar movie taste. Number of Nodes: 2523386 Number of Edges: 9197338 Missing Values? no Source: N/A Data Set Information: 2 files are included: 1. nodes.csv — it s the file of all the users. This file works as a dictionary of all the users in this data set. It s useful for fast reference. It contains all the node ids used in the dataset 2. edges.csv — this is the friendship network among the users. The friends are represented using edges. Here is an example. 1,2 This means user with id "1" is friend with user id "2". Attribute Information: Flixster is a social movie site allowing users to share movie ratings, discover new movies and meet others with similar movie taste. This contains the friendship network crawled in December 2010 by Javier Parra (Javier.Parra@asu.edu). For easier understanding, all the contents are organized in CSV file form
R
WageIndicator Survey
datasets.iza.org
zip
Updated Jan 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Research Data Center of IZA (IDSC) (2024). WageIndicator Survey [Dataset]. http://doi.org/10.15185/wif.1
Explore at:
zip(4109134), zip(1429922892)Available download formats
Unique identifier
https://doi.org/10.15185/wif.1
Dataset updated
Jan 29, 2024
Dataset provided by
Research Data Center of IZA (IDSC)
License
https://www.iza.org/wc/dataverse/IIL-1.0.pdfhttps://www.iza.org/wc/dataverse/IIL-1.0.pdf
Time period covered
2000 - 2021
Area covered
Angola, Italy, Indonesia, France, Zimbabwe, Zambia, Kazakhstan, India, Azerbaijan, Sweden
Description
The WageIndicator Survey is a continuous, multilingual, multi-country web-survey, counducted across 65 countries since 2000. The web-survey generates cross sectional and longitudinal data which might provide data especially about wages, benefits, working hours, working conditions and industrial relations. The survey has detailed questions about earnings, benefits, working conditions, employment contracts and training, as well as questions about education, occupation, industry and household characteristics. The WageIndicator Survey is a multilingual questionnaire and aims to collect information on wages and working conditions. As labour markets and wage setting processes vary across countries, country specific translations have been favoured over literal translations. The WageIndicator Survey includes regularly extra survey questions for project targeting specific countries, for specific groups or about specific events. These projects usually address a specific audience (employees of a company, employees in an industry, readers of a magazine, members of a trade union or an occupational association, and alike). The data of the project questions are included in the dataset. Bias: Non-Probability web based surveys are problematic because not every individual has the same probability of being selected into the survey. The probability of being selected depends on national or regional internet access rates and on numbers of visitors accessing the webiste. Data of such surveys form a convenience rather than a probability sample. Due to the non-probability based nature of the survey and its selectivity the obtained results cannot be generalized for the population of interest; i.e. the labor force. Comparisons with representative studies found an underrepresentation of male labour force, part-timers, older age groups, and low educated persons. Besides other strategies to reduce the bias the WageIndicators provides different weighting schemes in order to correct for selection bias. Data Characteristics: The data is organised in annual releases. The data of the period 2000-2005 is released as one dataset. Each data release consists of a dataset with continuous variables and one with project variables. The continuous variables can be merged across years. All variable and value labels are in English. The data does not include the text variables and verbatims form open-ended survey questions, these are available in Excel-Format upon request. Spatial Coverage: The survey started in 2000 in the Netherlands. Since 2004, websites have been launched in many European countries, in North and South America and in countries in Asia. From 2008 on web sites have been launched in more African countries, as well as in Indonesia and in a number of post-Soviet countries. For each country, the questions have been translated. Multilingual countries employ multilingual questionnaires. Country-specific translations and locally accepted terminology have been favored over literal translations.

Facebook

Twitter

Click to copy link

Link copied

Cite

Neilsberg Research (2025). New Site, AL annual median income by work experience and sex dataset: Aged 15+, 2010-2023 (in 2023 inflation-adjusted dollars) // 2025 Edition [Dataset]. https://www.neilsberg.com/insights/new-site-al-income-by-gender/

New Site, AL annual median income by work experience and sex dataset: Aged 15+, 2010-2023 (in 2023 inflation-adjusted dollars) // 2025 Edition

Explore at:

Dataset updated

Feb 27, 2025

Dataset authored and provided by

Neilsberg Research

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Variables measured

Income for Male Population, Income for Female Population, Income for Male Population working full time, Income for Male Population working part time, Income for Female Population working full time, Income for Female Population working part time

Dataset funded by

Neilsberg Research

Description

About this dataset

Context

The dataset presents median income data over a decade or more for males and females categorized by Total, Full-Time Year-Round (FT), and Part-Time (PT) employment in New Site. It showcases annual income, providing insights into gender-specific income distributions and the disparities between full-time and part-time work. The dataset can be utilized to gain insights into gender-based pay disparity trends and explore the variations in income for male and female individuals.

Key observations: Insights from 2023

Based on our analysis ACS 2019-2023 5-Year Estimates, we present the following observations: - All workers, aged 15 years and older: In New Site, the median income for all workers aged 15 years and older, regardless of work hours, was $52,083 for males and $21,667 for females.

These income figures highlight a substantial gender-based income gap in New Site. Women, regardless of work hours, earn 42 cents for each dollar earned by men. This significant gender pay gap, approximately 58%, underscores concerning gender-based income inequality in the town of New Site.

- Full-time workers, aged 15 years and older: In New Site, among full-time, year-round workers aged 15 years and older, males earned a median income of $55,156, while females earned $33,816, leading to a 39% gender pay gap among full-time workers. This illustrates that women earn 61 cents for each dollar earned by men in full-time roles. This level of income gap emphasizes the urgency to address and rectify this ongoing disparity, where women, despite working full-time, face a more significant wage discrepancy compared to men in the same employment roles.

Remarkably, across all roles, including non-full-time employment, women displayed a similar gender pay gap percentage. This indicates a consistent gender pay gap scenario across various employment types in New Site, showcasing a consistent income pattern irrespective of employment status.

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. All incomes have been adjusting for inflation and are presented in 2023-inflation-adjusted dollars.

Gender classifications include:

Male
Female

Employment type classifications include:

Full-time, year-round: A full-time, year-round worker is a person who worked full time (35 or more hours per week) and 50 or more weeks during the previous calendar year.
Part-time: A part-time worker is a person who worked less than 35 hours per week during the previous calendar year.

Variables / Data Columns

Year: This column presents the data year. Expected values are 2010 to 2023
Male Total Income: Annual median income, for males regardless of work hours
Male FT Income: Annual median income, for males working full time, year-round
Male PT Income: Annual median income, for males working part time
Female Total Income: Annual median income, for females regardless of work hours
Female FT Income: Annual median income, for females working full time, year-round
Female PT Income: Annual median income, for females working part time

Good to know

Margin of Error

Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presening these estimates in your research.

Custom data

If you do need custom data for any of your research project, report or presentation, you can contact our research staff at research@neilsberg.com for a feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

Neilsberg Research Team curates, analyze and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is a part of the main dataset for New Site median household income by race. You can refer the same here

Clear search

Close search

Google apps

Main menu

New Site, AL annual median income by work experience and sex dataset: Aged...

About this dataset

Content

Inspiration

Recommended for further research

CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection

Dataset: on the similarity of web measurements under different experimental...

Data from: Five Years of COVID-19 Discourse on Instagram: A Labeled...

TREC 2022 Deep Learning test collection

Coresignal | Web Data | Job Postings Data | Largest Professional Network +...

Spanish Wine Quality Dataset

Similar Datasets

Context

Content

Attribute Information

Citation Request

Data cleaning using unstructured data

SAPFLUXNET: A global database of sap flow measurements

Data from: Mpox Narrative on Instagram: A Labeled Multilingual Dataset of...

Esports Performance Rankings and Results

Esports Performance Rankings and Results

Performance Rankings and Results from Multiple Esports Platforms

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Dataset for: The Evolution of the Manosphere Across the Web

ccPDB - Compilation and Creation of datasets from PDB

SAS: Semantic Artist Similarity Dataset

Data from: PANACEA dataset - Heterogeneous COVID-19 Claims

Assessor - Commercial Valuation Data

Datasets description.

City of Pittsburgh Traffic Count

Arizona State University Flixster Data Set

WageIndicator Survey

New Site, AL annual median income by work experience and sex dataset: Aged 15+, 2010-2023 (in 2023 inflation-adjusted dollars) // 2025 Edition

About this dataset

Content

Inspiration

Recommended for further research