19 datasets found
  1. Dataset: On the Similarity of Web Measurements Under Different Experimental...

    • b2find.eudat.eu
    Updated Oct 11, 2024
    Cite
    (2024). Dataset: On the Similarity of Web Measurements Under Different Experimental Setups - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/5f12f3aa-ad88-553f-afaa-c265e2f2356e
    Explore at:
    Dataset updated
    Oct 11, 2024
    Description

    Measurement studies are essential for research and industry alike to understand the Web's inner workings better and help quantify specific phenomena. Performing such studies is demanding due to the dynamic nature and size of the Web. An experiment's careful design and setup are complex, and many factors might affect the results. However, while several works have independently observed differences in the outcome of an experiment (e.g., the number of observed trackers) based on the measurement setup, it is unclear what causes such deviations. This work investigates the reasons for these differences by visiting 1.7M webpages with five different measurement setups. Based on this, we build "dependency trees" for each page and cross-compare the nodes in the trees. The results show that the measured trees differ considerably, that the cause of differences can be attributed to specific nodes, and that even identical measurement setups can produce different results. This repository hosts the dataset corresponding to the paper "On the Similarity of Web Measurements Under Different Experimental Setups", which was published in the Proceedings of the 23rd ACM Internet Measurement Conference 2023.
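
    The paper defines its own tree-construction procedure; purely as an illustration of the cross-comparison idea, here is a minimal Python sketch that compares the node sets of two "dependency trees" from different setups using Jaccard similarity. The input format (initiator/request URL pairs) and hostname-level node granularity are assumptions, not the paper's exact method.

    # Illustrative sketch only: the paper's own tree construction is more involved.
    # Input: (initiator URL, requested URL) pairs from one page visit; nodes are
    # taken at hostname granularity, which is an assumption made here.
    from urllib.parse import urlparse

    def tree_nodes(request_log):
        """Collect the set of hostnames appearing in a page's request log."""
        nodes = set()
        for initiator, requested in request_log:
            nodes.add(urlparse(initiator).hostname)
            nodes.add(urlparse(requested).hostname)
        return nodes

    def jaccard(a, b):
        """Jaccard similarity of two node sets (1.0 for two empty sets)."""
        return len(a & b) / len(a | b) if a | b else 1.0

    setup_a = [("https://example.com/", "https://cdn.example.com/app.js"),
               ("https://cdn.example.com/app.js", "https://tracker.example.net/px")]
    setup_b = [("https://example.com/", "https://cdn.example.com/app.js")]

    print(f"node similarity: {jaccard(tree_nodes(setup_a), tree_nodes(setup_b)):.2f}")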

  2. Company Datasets for Business Profiling

    • datarade.ai
    Updated Feb 23, 2017
    Cite
    Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
    Explore at:
    Available download formats: .json, .xml, .csv, .xls
    Dataset updated
    Feb 23, 2017
    Dataset authored and provided by
    Oxylabs
    Area covered
    Bangladesh, Canada, Tunisia, Taiwan, British Indian Ocean Territory, Moldova (Republic of), Nepal, Andorra, Isle of Man, Northern Mariana Islands
    Description

    Company Datasets for valuable business insights!

    Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

    These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

    • Owler: Gain valuable business insights and competitive intelligence.
    • AngelList: Receive fresh startup data transformed into actionable insights.
    • CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies.
    • Craft.co: Make data-informed business decisions with Craft.co's company datasets.
    • Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

    We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

    • Company name;
    • Size;
    • Founding date;
    • Location;
    • Industry;
    • Revenue;
    • Employee count;
    • Competitors.

    You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

    Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

    With Oxylabs Datasets, you can count on:

    • Fresh and accurate data collected and parsed by our expert web scraping team.
    • Time and resource savings, allowing you to focus on data analysis and achieving your business goals.
    • A customized approach tailored to your specific business needs.
    • Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!

  3. Dataset: on the similarity of web measurements under different experimental...

    • service.tib.eu
    Updated Nov 28, 2024
    + more versions
    Cite
    (2024). Dataset: on the similarity of web measurements under different experimental setups - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/rdr-doi-10-35097-1719
    Explore at:
    Dataset updated
    Nov 28, 2024
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Abstract: Measurement studies are essential for research and industry alike to understand the Web's inner workings better and help quantify specific phenomena. Performing such studies is demanding due to the dynamic nature and size of the Web. An experiment's careful design and setup are complex, and many factors might affect the results. However, while several works have independently observed differences in the outcome of an experiment (e.g., the number of observed trackers) based on the measurement setup, it is unclear what causes such deviations. This work investigates the reasons for these differences by visiting 1.7M webpages with five different measurement setups. Based on this, we build "dependency trees" for each page and cross-compare the nodes in the trees. The results show that the measured trees differ considerably, that the cause of differences can be attributed to specific nodes, and that even identical measurement setups can produce different results.

    TechnicalRemarks: This repository hosts the dataset corresponding to the paper "On the Similarity of Web Measurements Under Different Experimental Setups", which was published in the Proceedings of the 23rd ACM Internet Measurement Conference 2023.

  4. Job Offers Web Scraping Search

    • kaggle.com
    Updated Feb 11, 2023
    Cite
    The Devastator (2023). Job Offers Web Scraping Search [Dataset]. https://www.kaggle.com/datasets/thedevastator/job-offers-web-scraping-search
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Job Offers Web Scraping Search

    Targeted Results to Find the Optimal Work Solution

    By [source]

    About this dataset

    This dataset collects job offers from web scraping which are filtered according to specific keywords, locations and times. This data gives users rich and precise search capabilities to uncover the best working solution for them. With the information collected, users can explore options that match their personal situation, skillset and preferences in terms of location and schedule. The columns provide detailed information around job titles, employer names, locations, time frames and other necessary parameters, so you can make a smart choice for your next career opportunity.

    How to use the dataset

    This dataset is a great resource for those looking to find an optimal work solution based on keywords, location and time parameters. With this information, users can quickly and easily search through job offers that best fit their needs. Here are some tips on how to use this dataset to its fullest potential:

    • Start by identifying what type of job offer you want to find. The keyword column will help you narrow down your search by allowing you to search for job postings that contain the word or phrase you are looking for.

    • Next, consider where the job is located – the Location column tells you where in the world each posting is from so make sure it’s somewhere that suits your needs!

    • Finally, consider when the position is available – look at the Time frame column, which gives an indication of when each posting was made, as well as whether it's a full-time/part-time role or even a casual/temporary position from day one, so make sure it meets your requirements before applying!

    • Additionally, if details such as hours per week or further schedule information are important criteria, there is also info provided in the Horari and Temps Oferta columns! Now that all three criteria have been ticked off – keywords, location and time frame – take a look at the Empresa (Company Name) and Nom_Oferta (Post Name) columns too, in order to get an idea of who will be employing you should you land the gig!

      All these pieces of data put together should give any motivated individual all they need to seek out an optimal work solution – keep hunting and good luck!

    Research Ideas

    • Machine learning can be used to group job offers in order to facilitate the identification of similarities and differences between them. This could allow users to target their search for a work solution more specifically.
    • The data can be used to compare job offerings across different areas or types of jobs, enabling users to make better-informed decisions in terms of their career options and goals.
    • It may also provide insight into the local job market, enabling companies and employers to identify where there is potential for new opportunities or possible trends that may have previously gone unnoticed.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: web_scraping_information_offers.csv

    | Column name  | Description                          |
    |:-------------|:-------------------------------------|
    | Nom_Oferta   | Name of the job offer. (String)      |
    | Empresa      | Company offering the job. (String)   |
    | Ubicació     | Location of the job offer. (String)  |
    | Temps_Oferta | Time of the job offer. (String)      |
    | Horari       | Schedule of the job offer. (String)  |
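
    As a quick illustration of filtering these columns with pandas (the keyword and location values below are placeholders):

    # Filter the job-offers CSV by a keyword in the title and a location.
    import pandas as pd

    df = pd.read_csv("web_scraping_information_offers.csv")

    mask = (
        df["Nom_Oferta"].str.contains("analyst", case=False, na=False)
        & df["Ubicació"].str.contains("Barcelona", case=False, na=False)
    )
    print(df.loc[mask, ["Nom_Oferta", "Empresa", "Horari", "Temps_Oferta"]].head())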

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

  5. Data from: Reproducibility and Replicability of Web Measurement Studies

    • radar.kit.edu
    • radar-service.eu
    tar
    Updated Jun 24, 2023
    Cite
    Matteo Große-Kampmann; Tobias Urban; Christian Wressnegger; Thorsten Holz; Norbert Pohlmann; Nurullah Demir (2023). Reproducibility and Replicability of Web Measurement Studies [Dataset]. http://doi.org/10.35097/1560
    Explore at:
    Available download formats: tar (294,064,087,552 bytes)
    Dataset updated
    Jun 24, 2023
    Dataset provided by
    Holz, Thorsten
    Karlsruhe Institute of Technology
    Wressnegger, Christian
    Große-Kampmann, Matteo
    Urban, Tobias
    Pohlmann, Norbert
    Demir, Nurullah
    Authors
    Matteo Große-Kampmann; Tobias Urban; Christian Wressnegger; Thorsten Holz; Norbert Pohlmann; Nurullah Demir
    Description

    This dataset holds additional material to the paper "Reproducibility and Replicability of Web Measurement Studies" submitted to the ACM Web Conference 2022. It contains the measurement data (requests, responses, visited URLs, cookies, and LocalStorage objects) we have collected from 25 different profiles. All data is in CSV format (exported from the Google BigQuery service) and can be imported into any database.

    Table sizes (according to Google BigQuery):
    • Cookies: 2.8 GB
    • LocalStorage: 6 GB
    • Requests: 626.6 GB
    • Responses: 501.6 GB
    • URL: 38 MB
    • Visits: 935 MB

    Note: Although our paper does not include the analysis for the collected Cookie and LocalStorage objects, we publish them for further studies. You can find further information about our study in our repository on GitHub.
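
    Since the tables ship as BigQuery CSV exports, a minimal sketch of importing one of them into a local SQLite database in chunks (the file and table names are assumptions based on the description above; the larger tables will need substantial disk space):

    # Chunked import of one exported CSV table into SQLite.
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("web_measurements.db")
    for chunk in pd.read_csv("requests.csv", chunksize=100_000):
        chunk.to_sql("requests", conn, if_exists="append", index=False)
    conn.close()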

  6. Data from: Mpox Narrative on Instagram: A Labeled Multilingual Dataset of...

    • figshare.com
    xlsx
    Updated Oct 12, 2024
    + more versions
    Cite
    Nirmalya Thakur (2024). Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.27072247.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Oct 12, 2024
    Dataset provided by
    figshare
    Authors
    Nirmalya Thakur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite this paper when using this dataset: N. Thakur, "Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis," arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292

    Abstract: The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed and updated regarding various aspects of the outbreaks. As a result, in the last few years, researchers from different disciplines have focused on the development of social media datasets focusing on different virus outbreaks. No prior work in this field has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper (stated above) aims to address this research gap. It presents this multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. This dataset contains Instagram posts about mpox in 52 languages.

    For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset.

    After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post into:
    • one of the fine-grain sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral
    • hate or not hate
    • anxiety/stress detected or no anxiety/stress detected

    These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.

    The 52 distinct languages in which Instagram posts are present in the dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.

    The following is a description of the attributes present in this dataset:
    • Post ID: Unique ID of each Instagram post
    • Post Description: Complete description of each post in the language in which it was originally published
    • Date: Date of publication in MM/DD/YYYY format
    • Language: Language of the post as detected using the Google Translate API
    • Translated Post Description: Translated version of the post description. All posts which were not in English were translated into English using the Google Translate API. No language translation was performed for English posts.
    • Sentiment: Results of sentiment analysis (using the preprocessed version of the translated Post Description) where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, and neutral
    • Hate: Results of hate speech detection (using the preprocessed version of the translated Post Description) where each post was classified as hate or not hate
    • Anxiety or Stress: Results of anxiety or stress detection (using the preprocessed version of the translated Post Description) where each post was classified as stress/anxiety detected or no stress/anxiety detected

    All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
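
    As an illustration of working with the labeled attributes, a hedged pandas sketch (the file name is hypothetical; the column names follow the description above but may be spelled differently in the actual file; .xlsx support requires openpyxl):

    # Load the spreadsheet and tabulate the three label columns.
    import pandas as pd

    df = pd.read_excel("mpox_instagram_dataset.xlsx")  # hypothetical filename

    for column in ["Sentiment", "Hate", "Anxiety or Stress"]:
        print(df[column].value_counts(), end="\n\n")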

  7. Reference to the data from: WageIndicator continuous web-survey on work and...

    • b2find.eudat.eu
    Updated Nov 11, 2024
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Dataset updated
    Nov 11, 2024
    Description

    The WageIndicator Survey is a continuous, multilingual, multi-country web-survey, conducted across 65 countries since 2000. The web-survey generates cross-sectional and longitudinal data, especially about wages, benefits, working hours, working conditions and industrial relations. The survey has detailed questions about earnings, benefits, working conditions, employment contracts and training, as well as questions about education, occupation, industry and household characteristics.

    Research Focus: The WageIndicator Survey is a multilingual questionnaire and aims to collect information on wages and working conditions. As labour markets and wage setting processes vary across countries, country-specific translations have been favoured over literal translations. The WageIndicator Survey regularly includes extra survey questions for projects targeting specific countries, specific groups or specific events. These projects usually address a specific audience (employees of a company, employees in an industry, readers of a magazine, members of a trade union or an occupational association, and alike). The data of the project questions are included in the dataset.

    Sample: The target population of the WageIndicator is the labour force, that is, individuals in paid employment as well as job seekers. In addition to workers in formal dependent employment, the survey aims to include apprentices, employers, own-account workers, freelancers, workers in family businesses, workers in the informal sector, unemployed workers, job seekers, individuals who never had a job, as well as retired workers and housewives, school pupils or students with a job on the side, and persons performing voluntary work. The WageIndicator data is derived from a volunteer survey, inviting web visitors to the national WageIndicator websites to complete the web-survey. Annually, the websites receive millions of web visitors.

    Bias: Non-probability web-based surveys are problematic because not every individual has the same probability of being selected into the survey. The probability of being selected depends on national or regional internet access rates and on the number of visitors accessing the website. Data of such surveys form a convenience rather than a probability sample. Due to the non-probability nature of the survey and its selectivity, the obtained results cannot be generalized to the population of interest, i.e., the labour force. Comparisons with representative studies found an underrepresentation of the male labour force, part-timers, older age groups, and low-educated persons. Besides other strategies to reduce the bias, the WageIndicator provides different weighting schemes in order to correct for selection bias.

    Data Characteristics: The data is organised in annual releases. The data for the period 2000-2005 is released as one dataset. Each data release consists of a dataset with continuous variables and one with project variables. The continuous variables can be merged across years. All variable and value labels are in English. The data does not include the text variables and verbatims from open-ended survey questions; these are available in Excel format upon request.

    Spatial Coverage: The survey started in 2000 in the Netherlands. Since 2004, websites have been launched in many European countries, in North and South America and in countries in Asia. From 2008 on, websites have been launched in more African countries, as well as in Indonesia and in a number of post-Soviet countries. For each country, the questions have been translated. Multilingual countries employ multilingual questionnaires. Country-specific translations and locally accepted terminology have been favored over literal translations.

    Rights: Due to the confidential character of the WageIndicator microdata, direct access to the data is only provided by means of research contracts. Access is in principle restricted to universities and research institutes.
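
    WageIndicator documents its own weighting schemes; purely as a generic illustration of the post-stratification idea mentioned above, a small sketch with made-up shares:

    # Generic post-stratification illustration (not WageIndicator's own scheme):
    # a respondent's weight is the population share of their stratum divided by
    # that stratum's share in the sample. The shares below are made-up numbers.
    population_share = {"male": 0.52, "female": 0.48}   # labour-force shares
    sample_share = {"male": 0.40, "female": 0.60}       # volunteer-survey shares

    weights = {g: population_share[g] / sample_share[g] for g in population_share}
    print(weights)  # males up-weighted (1.30), females down-weighted (0.80)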

  8. Data from: Five Years of COVID-19 Discourse on Instagram: A Labeled...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 21, 2024
    Cite
    Thakur, Ph.D., Nirmalya (2024). Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13896352
    Explore at:
    Dataset updated
    Oct 21, 2024
    Dataset authored and provided by
    Thakur, Ph.D., Nirmalya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite the following paper when using this dataset:

    N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)

    Abstract

    The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.

    For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.

    The Instagram posts in this dataset are present in 161 different languages, of which the top 10 languages in terms of frequency are English (343041 posts), Spanish (30220 posts), Hindi (15832 posts), Portuguese (15779 posts), Indonesian (11491 posts), Tamil (9592 posts), Arabic (9416 posts), German (7822 posts), Italian (5162 posts), and Turkish (4632 posts).

    There are 535,021 distinct hashtags in this dataset, with the top 10 hashtags in terms of frequency being #covid19 (169865 posts), #covid (132485 posts), #coronavirus (117518 posts), #covid_19 (104069 posts), #covidtesting (95095 posts), #coronavirusupdates (75439 posts), #corona (39416 posts), #healthcare (38975 posts), #staysafe (36740 posts), and #coronavirusoutbreak (34567 posts).

    The following is a description of the attributes present in this dataset:

    Post ID: Unique ID of each Instagram post

    Post Description: Complete description of each post in the language in which it was originally published

    Date: Date of publication in MM/DD/YYYY format

    Language code: Language code (for example: “en”) that represents the language of the post as detected using the Google Translate API

    Full Language: Full form of the language (for example: “English”) that represents the language of the post as detected using the Google Translate API

    Sentiment: Results of sentiment analysis (using the preprocessed version of each post) where each post was classified as positive, negative, or neutral
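
    As a sketch of how such positive/negative/neutral labels can be produced with the two tools named in the abstract (VADER and the cardiffnlp twitter-xlm-roberta-base-sentiment checkpoint); the authors' exact preprocessing and thresholds are not specified here, so the ±0.05 compound cut-offs below are the common VADER convention, not necessarily theirs:

    # Label a post with VADER and with the multilingual XLM-R sentiment model.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    from transformers import pipeline

    vader = SentimentIntensityAnalyzer()
    xlmr = pipeline("sentiment-analysis",
                    model="cardiffnlp/twitter-xlm-roberta-base-sentiment")

    def vader_label(text):
        compound = vader.polarity_scores(text)["compound"]
        if compound >= 0.05:
            return "positive"
        if compound <= -0.05:
            return "negative"
        return "neutral"

    post = "Stay safe and get vaccinated!"
    print(vader_label(post), xlmr(post)[0]["label"])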

    Open Research Questions

    This dataset is expected to be helpful for the investigation of the following research questions and even beyond:

    How does sentiment toward COVID-19 vary across different languages?

    How has public sentiment toward COVID-19 evolved from 2020 to the present?

    How do cultural differences affect social media discourse about COVID-19 across various languages?

    How has COVID-19 impacted mental health, as reflected in social media posts across different languages?

    How effective were public health campaigns in shifting public sentiment in different languages?

    What patterns of vaccine hesitancy or support are present in different languages?

    How did geopolitical events influence public sentiment about COVID-19 in multilingual social media discourse?

    What role does social media discourse play in shaping public behavior toward COVID-19 in different linguistic communities?

    How does the sentiment of minority or underrepresented languages compare to that of major world languages regarding COVID-19?

    What insights can be gained by comparing the sentiment of COVID-19 posts in widely spoken languages (e.g., English, Spanish) to those in less common languages?

    All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).

  9. SAS: Semantic Artist Similarity Dataset

    • live.european-language-grid.eu
    • zenodo.org
    txt
    Updated Oct 28, 2023
    Cite
    (2023). SAS: Semantic Artist Similarity Dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7418
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 28, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Semantic Artist Similarity dataset consists of two datasets of artist entities with their corresponding biography texts, and the list of top-10 most similar artists within the datasets used as ground truth. The dataset is composed of a corpus of 268 artists and a slightly larger one of 2,336 artists, both gathered from Last.fm in March 2015. The former is mapped to the MIREX Audio and Music Similarity evaluation dataset, so that its similarity judgments can be used as ground truth. For the latter corpus we use the similarity between artists as provided by the Last.fm API. For every artist there is a list with the top-10 most related artists. In the MIREX dataset there are 188 artists with at least 10 similar artists; the other 80 artists have less than 10 similar artists. In the Last.fm API dataset all artists have a list of 10 similar artists.

    There are 4 files in the dataset. mirex_gold_top10.txt and lastfmapi_gold_top10.txt have the top-10 lists of artists for every artist of both datasets. Artists are identified by MusicBrainz ID. The format of the file is one line per artist, with the artist mbid separated by a tab from the list of top-10 related artists identified by their mbid, separated by spaces:

    artist_mbid \t artist_mbid_top10_list_separated_by_spaces

    mb2uri_mirex and mb2uri_lastfmapi.txt have the list of artists. In each line there are three fields separated by tabs. The first field is the MusicBrainz ID, the second field is the last.fm name of the artist, and the third field is the DBpedia URI:

    artist_mbid \t lastfm_name \t dbpedia_uri

    There are also 2 folders in the dataset with the biography texts of each dataset. Each .txt file in the biography folders is named with the MusicBrainz ID of the biographied artist. Biographies were gathered from the Last.fm wiki page of every artist.

    Using this dataset: We would highly appreciate if scientific publications of works partly based on the Semantic Artist Similarity dataset quote the following publication: Oramas, S., Sordo M., Espinosa-Anke L., & Serra X. (In Press). A Semantic-based Approach for Artist Similarity. 16th International Society for Music Information Retrieval Conference.

    We are interested in knowing if you find our datasets useful! If you use our dataset please email us at mtg-info@upf.edu and tell us about your research. https://www.upf.edu/web/mtg/semantic-similarity
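
    A small parsing sketch for the top-10 files, following the line format given above:

    # Parse "artist_mbid \t mbid1 mbid2 ... mbid10" lines into a dict.
    def load_top10(path):
        top10 = {}
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                artist, related = line.rstrip("\n").split("\t")
                top10[artist] = related.split(" ")
        return top10

    gold = load_top10("mirex_gold_top10.txt")
    print(len(gold), "artists loaded")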

  10. Rsdd Dataset

    • universe.roboflow.com
    zip
    Updated Jul 16, 2025
    Cite
    RDMO (2025). Rsdd Dataset [Dataset]. https://universe.roboflow.com/rdmo/rsdd/model/6
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 16, 2025
    Dataset authored and provided by
    RDMO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Damages Bounding Boxes
    Description

    Potholes are a common problem on damaged roads, where people stumble, vehicles get damaged, and drivers lose control of their cars. The maintenance of roads is a costly necessity that developing countries' authorities often struggle to deliver in time. The present dataset was collected to develop a prioritisation system that combines deep learning models and traditional computer vision techniques to automate the analysis of road irregularities reported by citizens [1]. Although the images in the dataset come from different sources (e.g. web scraping), we attribute the authorship of a main portion to the well-known RDD2020 dataset [2]. For the labelled images, we enhanced the original annotations by relabelling and focusing on four categories: crocodile cracks, lateral cracks, longitudinal cracks and potholes. We iteratively filtered bad samples and improved the annotations. As a result, the trained object detection models have allowed us to better discriminate road damage severity, in contrast to just detecting potholes. Here are a few use cases for this dataset:

    1. Automated Road Inspection: Researchers working on transportation and mobility issues could use the RSDD dataset to train models and embed them inside UAVs and road quality survey vehicles to make road damage inspection more efficient. Most cities in the world inspect failures in person and rely on manual record-keeping. Developing an edge-AI-based system could help automatically detect and log road damage on longer roads, improving road damage maintenance prioritisation.

    2. Road Maintenance and Repair: Municipalities and public works departments could use the RSDD dataset to train models to analyse road conditions server-side, prioritising repair activities by identifying more serious damage such as crocodile cracks and potholes.

    3. Autonomous Vehicles: Developers in the autonomous vehicle industry could utilize the RSDD dataset to enhance the situational awareness capability of their vehicles. By recognising road damage, AI systems could make more informed navigation decisions, enhancing safety and efficiency.

    [1] E. Salcedo, M. Jaber, and J. Requena Carrión, “A novel road maintenance prioritisation system based on computer vision and crowdsourced reporting,” Journal of Sensor and Actuator Networks, vol. 11, no. 1, p. 15, 2022. doi:10.3390/jsan11010015

    [2] D. Arya, H. Maeda, S. K. Ghosh, D. Toshniwal, and Y. Sekimoto, “RDD2020: An annotated image dataset for Automatic Road Damage Detection Using Deep Learning,” Data in Brief, vol. 36, p. 107133, 2021. doi:10.1016/j.dib.2021.107133
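
    For experimentation, a hedged sketch of pulling the dataset with the roboflow Python client; the workspace ("rdmo"), project ("rsdd") and version (6) are inferred from the dataset URL above, the export format is one of several choices, and the API key is a placeholder:

    # Download the dataset via the Roboflow client (requires an account API key).
    from roboflow import Roboflow

    rf = Roboflow(api_key="YOUR_API_KEY")            # placeholder key
    project = rf.workspace("rdmo").project("rsdd")   # inferred from the URL
    dataset = project.version(6).download("yolov8")  # export format is a choice
    print(dataset.location)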

    Citation

    If you use this dataset in your research, please consider citing the following work:

    @article{Salcedo_Jaber_Requena_Carrión_2022,
      title={A novel road maintenance prioritisation system based on computer vision and crowdsourced reporting},
      volume={11},
      DOI={10.3390/jsan11010015},
      number={1},
      journal={Journal of Sensor and Actuator Networks},
      author={Salcedo, Edwin and Jaber, Mona and Requena Carrión, Jesús},
      year={2022},
      pages={15}
    }

    You are also encouraged to explore the original dataset provided by D. Arya et al.

  11. Job Postings Dataset for Labour Market Research and Insights

    • datarade.ai
    Updated Sep 20, 2023
    Cite
    Oxylabs (2023). Job Postings Dataset for Labour Market Research and Insights [Dataset]. https://datarade.ai/data-products/job-postings-dataset-for-labour-market-research-and-insights-oxylabs
    Explore at:
    Available download formats: .json, .xml, .csv, .xls
    Dataset updated
    Sep 20, 2023
    Dataset authored and provided by
    Oxylabs
    Area covered
    Zambia, Switzerland, British Indian Ocean Territory, Anguilla, Tajikistan, Jamaica, Togo, Kyrgyzstan, Luxembourg, Sierra Leone
    Description

    Introducing Job Posting Datasets: Uncover labor market insights!

    Elevate your recruitment strategies, forecast future labor industry trends, and unearth investment opportunities with Job Posting Datasets.

    Job Posting Datasets Source:

    1. Indeed: Access datasets from Indeed, a leading employment website known for its comprehensive job listings.

    2. Glassdoor: Receive ready-to-use employee reviews, salary ranges, and job openings from Glassdoor.

    3. StackShare: Access StackShare datasets to make data-driven technology decisions.

    Job Posting Datasets provide meticulously acquired and parsed data, freeing you to focus on analysis. You'll receive clean, structured, ready-to-use job posting data, including job titles, company names, seniority levels, industries, locations, salaries, and employment types.

    Choose your preferred dataset delivery options for convenience:

    Receive datasets in various formats, including CSV, JSON, and more. Opt for storage solutions such as AWS S3, Google Cloud Storage, and more. Customize data delivery frequencies, whether one-time or per your agreed schedule.

    Why Choose Oxylabs Job Posting Datasets:

    1. Fresh and accurate data: Access clean and structured job posting datasets collected by our seasoned web scraping professionals, enabling you to dive into analysis.

    2. Time and resource savings: Focus on data analysis and your core business objectives while we efficiently handle the data extraction process cost-effectively.

    3. Customized solutions: Tailor our approach to your business needs, ensuring your goals are met.

    4. Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA best practices.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Effortlessly access fresh job posting data with Oxylabs Job Posting Datasets.

  12. Friedrich W. Nietzsche Bibliography

    • kaggle.com
    zip
    Updated Jun 17, 2021
    Cite
    Akoua Orsot (2021). Friedrich W. Nietzsche Bibliography [Dataset]. https://www.kaggle.com/akouaorsot/nietzsches-bibliography
    Explore at:
    Available download formats: zip (3,796,498 bytes)
    Dataset updated
    Jun 17, 2021
    Authors
    Akoua Orsot
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Hello Fellow Kagglers, this is my first dataset, and I wanted to bring something at the intersection of what I like the most: philosophy and data. In that regard, this is for data and philosophy enthusiasts, particularly those interested in pessimism, of which Nietzsche was a key thinker. This dataset is a CSV file that contains the corpus of each of his most famous books, from Beyond Good and Evil to Thus Spoke Zarathustra. Though the initial intent was a Natural Language Processing task, it is yours to explore, and be creative, as the possibilities in data are infinite.

    After web scraping the original texts, I created some functions to clean and tokenize them. So, you will find an auto-increment column and four other columns as follows: book-title, publishing_date, text, text_clean.

    And above all, it is thanks to Project Gutenberg, a phenomenal platform for all book lovers and knowledge-avid people generally, that I could obtain those texts at no cost. So, please support them in their continuous effort to make knowledge accessible: https://www.gutenberg.org/
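
    The cleaning and tokenizing functions themselves are not included in the dataset; a rough sketch of what such a step could look like:

    # Rough illustration of a cleaning/tokenizing step (not the author's code).
    import re

    def clean_text(text):
        """Lowercase, strip non-letters, collapse whitespace."""
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        return re.sub(r"\s+", " ", text).strip()

    def tokenize(text):
        return clean_text(text).split()

    print(tokenize("Thus Spoke Zarathustra: A Book for All and None."))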

    In the following bullet points, I propose possible exploration routes, but do not feel constrained; feel free to go above and beyond:
    1. An exploratory analysis of term frequency
    2. A word cloud of Nietzsche's ideas
    3. A recommendation system for someone wanting to read those books with an evolving string of ideas

  13. National Hydrography Dataset Plus Version 2.1

    • resilience.climate.gov
    • oregonwaterdata.org
    • +4more
    Updated Aug 16, 2022
    + more versions
    Cite
    Esri (2022). National Hydrography Dataset Plus Version 2.1 [Dataset]. https://resilience.climate.gov/maps/4bd9b6892530404abfe13645fcb5099a
    Explore at:
    Dataset updated
    Aug 16, 2022
    Dataset authored and provided by
    Esri (http://esri.com/)
    Area covered
    Description

    The National Hydrography Dataset Plus (NHDPlus) maps the lakes, ponds, streams, rivers and other surface waters of the United States. Created by the US EPA Office of Water and the US Geological Survey, the NHDPlus provides mean annual and monthly flow estimates for rivers and streams. Additional attributes provide connections between features, facilitating complicated analyses. For more information on the NHDPlus dataset see the NHDPlus v2 User Guide.

    Dataset Summary
    Phenomenon Mapped: Surface waters and related features of the United States and associated territories, not including Alaska.
    Geographic Extent: The United States, not including Alaska, Puerto Rico, Guam, US Virgin Islands, Marshall Islands, Northern Marianas Islands, Palau, Federated States of Micronesia, and American Samoa
    Projection: Web Mercator Auxiliary Sphere
    Visible Scale: Visible at all scales, but the layer draws best at scales larger than 1:1,000,000
    Source: EPA and USGS
    Update Frequency: There is no new data since this 2019 version, so no updates are planned
    Publication Date: March 13, 2019

    Prior to publication, the NHDPlus network and non-network flowline feature classes were combined into a single flowline layer. Similarly, the NHDPlus Area and Waterbody feature classes were merged under a single schema. Attribute fields were added to the flowline and waterbody layers to simplify symbology and enhance the layer's pop-ups. Fields added include Pop-up Title, Pop-up Subtitle, On or Off Network (flowlines only), Esri Symbology (waterbodies only), and Feature Code Description. All other attributes are from the original NHDPlus dataset. No-data values -9999 and -9998 were converted to Null values for many of the flowline fields.

    What can you do with this layer?
    Feature layers work throughout the ArcGIS system. Generally your workflow with feature layers will begin in ArcGIS Online or ArcGIS Pro. Below are just a few of the things you can do with a feature service in Online and Pro.

    ArcGIS Online
    • Add this layer to a map in the map viewer. The layer is limited to scales of approximately 1:1,000,000 or larger, but a vector tile layer created from the same data can be used at smaller scales to produce a webmap that displays across the full range of scales. The layer or a map containing it can be used in an application.
    • Change the layer's transparency and set its visibility range.
    • Open the layer's attribute table and make selections. Selections made in the map or table are reflected in the other. Center on selection allows you to zoom to features selected in the map or table, and show selected records allows you to view the selected records in the table.
    • Apply filters. For example, you can set a filter to show larger streams and rivers using the mean annual flow attribute or the stream order attribute.
    • Change the layer's style and symbology.
    • Add labels and set their properties.
    • Customize the pop-up.
    • Use as an input to the ArcGIS Online analysis tools. This layer works well as a reference layer with the trace downstream and watershed tools. The buffer tool can be used to draw protective boundaries around streams, and the extract data tool can be used to create copies of portions of the data.

    ArcGIS Pro
    • Add this layer to a 2d or 3d map.
    • Use as an input to geoprocessing. For example, copy features allows you to select then export portions of the data to a new feature class.
    • Change the symbology and the attribute field used to symbolize the data.
    • Open the table and make interactive selections with the map.
    • Modify the pop-ups.
    • Apply definition queries to create sub-sets of the layer.

    This layer is part of the ArcGIS Living Atlas of the World, which provides an easy way to explore the landscape layers and many other beautiful and authoritative maps on hundreds of topics.

    Questions? Please leave a comment below if you have a question about this layer, and we will get back to you as soon as possible.
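
    A hedged sketch of the stream-order filtering described above, using the ArcGIS API for Python. The item ID comes from the layer URL in the citation; the field name StreamOrde is the usual NHDPlus stream-order attribute but should be verified against the layer's schema before use.

    # Query larger streams and rivers from the hosted feature layer.
    from arcgis.gis import GIS

    gis = GIS()  # anonymous connection to ArcGIS Online
    item = gis.content.get("4bd9b6892530404abfe13645fcb5099a")
    flowlines = item.layers[0]  # assumption: flowlines are the first layer

    result = flowlines.query(where="StreamOrde >= 4",  # hypothetical field name
                             out_fields="*",
                             result_record_count=10)
    print(len(result.features))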

  14. Fils - APPLICATION OF OPEN WEB PATTERNS AND STRUCTURED DATA ON THE WEB TO...

    • search.dataone.org
    • hydroshare.org
    • +1more
    Updated Apr 15, 2022
    Cite
    Douglas Fils (2022). Fils - APPLICATION OF OPEN WEB PATTERNS AND STRUCTURED DATA ON THE WEB TO GEOINFORMATICS [Dataset]. https://search.dataone.org/view/sha256%3A24011857dfb0df4de44933e0adde5a6e2b1dec90a73ef9cae9f854f2d91ff2ba
    Explore at:
    Dataset updated
    Apr 15, 2022
    Dataset provided by
    Hydroshare
    Authors
    Douglas Fils
    Description

    FILS, Douglas, Ocean Leadership, 1201 New York Ave, NW, 4th Floor, Washington, DC 20005, SHEPHERD, Adam, Woods Hole Oceanographic Inst, 266 Woods Hole Road, Woods Hole, MA 02543-1050 and LINGERFELT, Eric, Earth Science Support Office, Boulder, CO 80304

    The growth in the amount of geoscience data on the internet is paralleled by the need to address issues of data citation, access and reuse. Additionally, new research tools are driving a demand for machine-accessible data as part of researcher workflows. In the commercial sector, elements of this have been addressed by the use of the Schema.org vocabulary encoded via JSON-LD and coupled with web publishing patterns. Adaptable publishing approaches are already in use by many data facilities as they work to address publishing and FAIR patterns. While these often lack the structured data elements, such workflows could be leveraged to additionally implement schema.org-style publishing patterns.
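
    As an illustration of the publishing pattern discussed above, a minimal schema.org Dataset description serialized as JSON-LD; all field values are placeholders, not from Project 418.

    # Build and serialize a schema.org "Dataset" description as JSON-LD.
    import json

    dataset_jsonld = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": "Example ocean-core measurements",        # placeholder
        "description": "Placeholder dataset description.",
        "identifier": "https://doi.org/10.xxxx/example",  # placeholder DOI
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": "https://example.org/data.csv",
            "encodingFormat": "text/csv",
        },
    }

    # Typically embedded in a landing page inside
    # <script type="application/ld+json"> ... </script>
    print(json.dumps(dataset_jsonld, indent=2))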

    This presentation will report on work that grew out of the EarthCube Council of Data Facilities, known as Project 418. Project 418 was a proof of concept funded by the EarthCube Science Support Office to explore the approach of publishing JSON-LD with schema.org and extensions by a set of NSF data facilities. The goal was focused on using this approach to describe dataset resources and evaluating the use of this structured metadata to address discovery. Additionally, we will discuss growing interest by Google and others in leveraging this approach for dataset discovery.

    The work scoped 47,650 datasets from 10 NSF-funded data facilities. Across these datasets, the harvester found 54,665 data download URLs, approximately 560K dataset variables, and 35K unique identifiers (DOIs, IGSNs or ORCIDs).

    The various publishing workflows used by the involved data facilities will be presented along with the harvesting and interface developments. Details on how resources were indexed into text, spatial and graph systems and used for search interfaces will be presented along with future directions underway building on this foundation.

  15. Data from: Time Use Longitudinal Panel Study, 1975-1981

    • icpsr.umich.edu
    • abacus.library.ubc.ca
    ascii, sas, spss +1
    Updated Jan 12, 2006
    Cite
    Juster, F. Thomas; Hill, Martha S.; Stafford, Frank P.; Unknown (2006). Time Use Longitudinal Panel Study, 1975-1981 [Dataset]. http://doi.org/10.3886/ICPSR09054.v2
    Explore at:
    Available download formats: ascii, stata, spss, sas
    Dataset updated
    Jan 12, 2006
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Juster, F. Thomas; Hill, Martha S.; Stafford, Frank P.; Unknown
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/9054/terms

    Area covered
    United States
    Description

    The 1975-1981 TIME USE LONGITUDINAL PANEL STUDY dataset combines a round of data collected in 1981 with the principal investigators' earlier TIME USE IN ECONOMIC AND SOCIAL ACCOUNTS, 1975-1976 (ICPSR 7580), collected by F. Thomas Juster, Paul Courant, et al. This combined data collection consists of data from 620 respondents, their spouses if they were married at the time of first contact, and up to three children between the ages of three and seventeen living in the household. The key features which characterized the 1975 time use study were repeated in 1981. In both of the data collection years, adult individuals provided four time diaries as well as extensive information related to their time use in the four waves of data collection. Information pertaining to the household was collected, as well as identical measures from respondents and spouses for all person-specific information. Selected children provided two time diary reports (one for a school day and one non-school day), an academic achievement measure, and survey measures pertaining to school and family life. In addition, teacher ratings were obtained. For each adult individual who remained in the sample through the 1981 study, a time budget was constructed from his or her time diaries containing the number of minutes per week spent in each of some 223 mutually exclusive and exhaustive activities. These measures provide a description of how the sample individuals were currently allocating their time and are comparable to the 87 activity measures created from their 1975 diaries. In addition, respondent and spouse time aggregates were converted to parent time aggregates for mothers and fathers of children in the sample. To facilitate analyses on spouses, a merged data file was created for 868 couples in which both husband and wife had complete Wave I data in either 1975-1976 or 1981.

  16. Descriptions of popular movies (polish language)

    • kaggle.com
    Updated Jun 21, 2024
    Cite
    Michal Bogacz (2024). Descriptions of popular movies (polish language) [Dataset]. https://www.kaggle.com/michau96/descriptions-of-popular-movies-polish-language/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Michal Bogacz
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Filmweb is the largest film web portal in Poland. It works in a very similar way to IMDb and is the second largest movie database in the world after IMDb.com (as of March 26, 2015, Filmweb contained information about 598,775 movies, 50,529 series and 2,278,691 film people). The portal was established in 1998 and has gained great recognition and wide popularity in Poland. On the website, users can rate movies, take part in discussions, and add various content related to the films (information, reviews). The database below presents information about the most popular (most rated) films, with an emphasis on the descriptions of these films in Polish, created by administrators and Polish users.

    Content

    The data was obtained using web scraping. Python (version 3.10) with the "BeautifulSoup", "requests", "re", "pandas", "numpy" and "datetime" packages was used for this process, together with the "SelectorGadget" add-on, which made working with the site easier. Each row in the database refers to one movie. The first 4 columns hold information about the original title of the movie, the average user rating, the number of votes, and the worldwide box office if such information exists. The next column holds a short description of the movie of at most 241 characters (most are between 50 and 200), a 1-2 sentence summary. The next 10 columns contain longer descriptions: each user can add their own description, which is later verified by admins. The number of descriptions may differ between movies, but all of them are in Polish (UTF-8 encoding).
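
    A hedged sketch of the requests + BeautifulSoup approach described above; the URL and CSS selector are placeholders, not Filmweb's actual markup.

    # Fetch a page and extract one element with requests + BeautifulSoup.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://www.filmweb.pl/film/example",  # placeholder URL
                        headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")

    description = soup.select_one(".filmPosterSection__plot")   # hypothetical selector
    print(description.get_text(strip=True) if description else "not found")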

    Photo by GR Stocks on Unsplash

  17. Dataset: A Systematic Literature Review on the topic of High-value datasets

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 23, 2023
    Cite
    Magdalena Ciesielska (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
    Explore at:
    Dataset updated
    Jun 23, 2023
    Dataset provided by
    Charalampos Alexopoulos
    Andrea Miletič
    Magdalena Ciesielska
    Nina Rizun
    Anastasija Nikiforova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb). It is being made public both to act as supplementary data for the "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (the pre-print is available in Open Access here: https://arxiv.org/abs/2305.10234) and to allow other researchers to use these data in their own work.

    The protocol is intended for the Systematic Literature Review on the topic of high-value datasets, with the aim to gather information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected as the result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.

    Methodology

    To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic was studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).

    These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the results to papers where these objects were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 articles were found unique and were further checked for relevance. As a result, a total of 9 articles were further examined. Each study was independently examined by at least two authors.
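
    For illustration, a minimal sketch of such a deduplication step over merged database exports; the file and column names here are assumptions, not part of the published protocol.

        import pandas as pd

        # Hypothetical export files from the three queried databases.
        exports = ["scopus_export.csv", "wos_export.csv", "dgrl_export.csv"]
        merged = pd.concat([pd.read_csv(f) for f in exports], ignore_index=True)

        # Normalise titles so the same paper indexed by two databases matches.
        merged["title_key"] = merged["Title"].str.lower().str.strip()
        unique = merged.drop_duplicates(subset="title_key").drop(columns="title_key")
        print(len(unique), "unique articles after deduplication")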

    To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

    Test procedure

    Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the protocol is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by the third researcher.

    Description of the data in this data set

    Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for relevant studies. Spreadsheet #2 provides the list of results after the search over the three indexing databases, i.e. before filtering out irrelevant studies.

    The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

    Descriptive information
    1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
    2) Complete reference - the complete source information to refer to the study
    3) Year of publication - the year in which the study was published
    4) Journal article / conference paper / book chapter - the type of the paper: {journal article, conference paper, book chapter}
    5) DOI / Website - a link to the website where the study can be found
    6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
    7) Availability in OA - availability of the article in Open Access
    8) Keywords - keywords of the paper as indicated by the authors
    9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

    Approach- and research design-related information
    10) Objective / RQ - the research objective / aim, established research questions
    11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR, etc.)
    12) Contributions - the contributions of the study
    13) Method - whether the study uses a qualitative, quantitative, or mixed-methods approach
    14) Availability of the underlying research data - whether there is a reference to publicly available underlying research data, e.g., transcriptions of interviews, collected data, or an explanation of why these data are not shared
    15) Period under investigation - the period (or moment) in which the study was conducted
    16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is it used in the study?

    Quality- and relevance-related information
    17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)
    18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused on HVD determination; secondary - HVD is mentioned but not studied (e.g., as part of discussion, future work, etc.))

    HVD determination-related information
    19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
    20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
    21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
    22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
    23) Data - what data do HVD cover?
    24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

    Format of the files: .xls, .csv (for the first spreadsheet only), .odt, .docx

    Licenses or restrictions: CC-BY

    For more info, see README.txt

  18. Passive Surveillance Index

    • opendata.transport.nsw.gov.au
    Updated Sep 28, 2021
    + more versions
    Cite
    opendata.transport.nsw.gov.au (2021). Passive Surveillance Index [Dataset]. https://opendata.transport.nsw.gov.au/data/dataset/passive-surveillance-index
    Explore at:
    Dataset updated
    Sep 28, 2021
    Dataset provided by
    Transport for NSW, http://www.transport.nsw.gov.au/
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes the final Technical Note and accompanying GIS datasets delivered by Cardno and UNSW for their proof-of-concept Passive Surveillance Index (PSI) trial in Parramatta, run through TfNSW's Safety After Dark Innovation Challenge (SADIC). The PSI scores walking routes based on quantifiable indicators. The tool may serve as a starting point for planners to make informed decisions on how safe-city design may factor in passive surveillance. The web map visually displays the PSI for different times of the night across the trial area. The website works best in the Google Chrome browser. Contact: Elizabeth Muscat, elizabeth.muscat@cardno.com.au. Output: the SADIC PSI Data zip file and technical report.

  19. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Explore at:
    csv (available download format)
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Julie R. Campos Arias
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. Its purpose is to store the datasets that were used in some of the studies serving as research material for this Master's thesis, together with the datasets used in the experimental part of this work.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet, which denote the unique id, the polarity index of the text, and the tweet text, respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains ratings and reviews for 1K+ Amazon products, as per the details listed on the official website of Amazon. The data was scraped in January 2023 from the official website of Amazon.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating-inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage (see the sketch below).

    This data was collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv
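
    Since the negative and positive rows are stored in two contiguous blocks, a minimal pandas sketch of the recommended shuffle (the file name is taken from above; everything else is standard pandas):

        import pandas as pd

        df = pd.read_csv("data_rt.csv")
        # Shuffle the rows so positive and negative samples are interleaved.
        df = df.sample(frac=1, random_state=42).reset_index(drop=True)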

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in.
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
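
    For illustration, a minimal sketch of how a polarity score and a categorical division label of this kind can be derived. The TextBlob call is the library's real API; the threshold mapping is an assumption for illustration, since the exact thresholds used for this dataset are not stated.

        from textblob import TextBlob

        def label_review(review: str):
            # Polarity is a float in [-1.0, 1.0]; thresholds below are assumed.
            polarity = TextBlob(review).sentiment.polarity
            if polarity > 0:
                division = "positive"
            elif polarity < 0:
                division = "negative"
            else:
                division = "neutral"
            return polarity, division

        print(label_review("The sound quality is great"))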

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, for learning how to train machine-learning models for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
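
    A minimal sketch of how such a categorical label can be derived from the star rating; the cut-off values are assumptions for illustration, not the values used when this dataset was edited.

        import pandas as pd

        df = pd.read_csv("AllProductReviews2.csv")

        def star_to_division(stars: float) -> str:
            # Hypothetical cut-offs: 4-5 stars positive, 3 neutral, 1-2 negative.
            if stars >= 4:
                return "positive"
            if stars == 3:
                return "neutral"
            return "negative"

        df["division"] = df["ReviewStar"].apply(star_to_division)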

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review, raw) and division (manually added - categorical label generated using the overall score).
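
    A minimal pandas sketch for working with the unixReviewTime column; the conversion shown is a common approach, not part of the dataset itself.

        import pandas as pd

        df = pd.read_csv("Musical_instruments_reviews2.csv")
        # Convert the unix timestamp (seconds) to a readable datetime column.
        df["reviewDate"] = pd.to_datetime(df["unixReviewTime"], unit="s")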

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  20. Not seeing a result you expected?
    Learn how you can add new datasets to our index.
