38 datasets found
  1. Similarweb's Surge: A Sign of Digital Dominance? (SMWB) (Forecast)

    • kappasignal.com
    Updated May 22, 2024
    Cite
    KappaSignal (2024). Similarweb's Surge: A Sign of Digital Dominance? (SMWB) (Forecast) [Dataset]. https://www.kappasignal.com/2024/05/similarwebs-surge-sign-of-digital.html
    Explore at:
    Dataset updated
    May 22, 2024
    Dataset authored and provided by
    KappaSignal
    License

    https://www.kappasignal.com/p/legal-disclaimer.html

    Description

    This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

    Similarweb's Surge: A Sign of Digital Dominance? (SMWB)

    Financial data:

    • Historical daily stock prices (open, high, low, close, volume)

    • Fundamental data (e.g., market capitalization, price-to-earnings (P/E) ratio, dividend yield, earnings per share (EPS), price-to-earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price-to-sales ratio, credit rating)

    • Technical indicators (e.g., moving averages, RSI, MACD, average directional index, Aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution (A/D) line, parabolic SAR, Bollinger Bands, Fibonacci levels, Williams percent range, commodity channel index)

    Machine learning features:

    • Feature engineering based on financial data and technical indicators

    • Sentiment analysis data from social media and news articles

    • Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

    Potential Applications:

    • Stock price prediction

    • Portfolio optimization

    • Algorithmic trading

    • Market sentiment analysis

    • Risk management

    Use Cases:

    • Researchers investigating the effectiveness of machine learning in stock market prediction

    • Analysts developing quantitative trading Buy/Sell strategies

    • Individuals interested in building their own stock market prediction models

    • Students learning about machine learning and financial applications

    Additional Notes:

    • The dataset may include different levels of granularity (e.g., daily, hourly)

    • Data cleaning and preprocessing are essential before model training

    • Regular updates are recommended to maintain the accuracy and relevance of the data
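
    The sketch below is a minimal, hedged illustration of the feature-engineering step listed above: it computes two of the named technical indicators (a 20-day moving average and a 14-day RSI) from daily prices with pandas. The file name and column names are assumptions, since the dataset's exact layout is not specified here.

    ```python
    import pandas as pd

    # Hypothetical input: daily OHLCV data with "date" and "close" columns.
    prices = pd.read_csv("smwb_daily.csv", parse_dates=["date"]).set_index("date")

    # 20-day simple moving average of the closing price.
    prices["sma_20"] = prices["close"].rolling(window=20).mean()

    # 14-day RSI, approximating Wilder's smoothing with an exponential mean.
    delta = prices["close"].diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / 14, adjust=False).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / 14, adjust=False).mean()
    prices["rsi_14"] = 100 - 100 / (1 + gain / loss)

    print(prices[["close", "sma_20", "rsi_14"]].tail())
    ```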

  2. Google Play Store Apps

    • kaggle.com
    Updated Feb 3, 2019
    Cite
    Lavanya (2019). Google Play Store Apps [Dataset]. https://www.kaggle.com/lava18/google-play-store-apps/home
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 3, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lavanya
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    [ADVISORY] IMPORTANT

    Instructions for citation:

    If you use this dataset anywhere in your work, kindly cite it as follows: L. Gupta, "Google Play Store Apps," Feb 2019. [Online]. Available: https://www.kaggle.com/lava18/google-play-store-apps

    Context

    While many public datasets (on Kaggle and the like) provide Apple App Store data, there are not many counterpart datasets available for Google Play Store apps anywhere on the web. On digging deeper, I found that the iTunes App Store page deploys a nicely indexed, appendix-like structure that allows simple and easy web scraping. The Google Play Store, on the other hand, uses sophisticated modern techniques (such as dynamic page loading with jQuery), which makes scraping more challenging.

    Content

    Each app (row) has values for category, rating, size, and more.
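
    As a minimal sketch (not part of the dataset description), the table can be loaded and lightly cleaned with pandas; the file name googleplaystore.csv and the column names Category and Rating are assumptions to be checked against the actual headers on the Kaggle page.

    ```python
    import pandas as pd

    # Hypothetical file and column names; verify against the downloaded CSV.
    apps = pd.read_csv("googleplaystore.csv")
    print(apps.columns.tolist())                     # inspect the available per-app fields

    # Ratings scraped from store pages may contain non-numeric values; coerce and drop them.
    apps["Rating"] = pd.to_numeric(apps["Rating"], errors="coerce")
    print(apps.dropna(subset=["Rating"]).groupby("Category")["Rating"].mean().head())
    ```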

    Acknowledgements

    This information is scraped from the Google Play Store. This app information would not be available without it.

    Inspiration

    The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

  3. RICO dataset

    • kaggle.com
    Updated Dec 2, 2021
    Cite
    Onur Gunes (2021). RICO dataset [Dataset]. https://www.kaggle.com/onurgunes1993/rico-dataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 2, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Onur Gunes
    Description

    Context

    Data-driven models help mobile app designers understand best practices and trends, and can be used to make predictions about design performance and support the creation of adaptive UIs. This paper presents Rico, the largest repository of mobile app designs to date, created to support five classes of data-driven applications: design search, UI layout generation, UI code generation, user interaction modeling, and user perception prediction. To create Rico, we built a system that combines crowdsourcing and automation to scalably mine design and interaction data from Android apps at runtime. The Rico dataset contains design data from more than 9.3k Android apps spanning 27 categories. It exposes visual, textual, structural, and interactive design properties of more than 66k unique UI screens. To demonstrate the kinds of applications that Rico enables, we present results from training an autoencoder for UI layout similarity, which supports query-by-example search over UIs.

    Content

    Rico was built by mining Android apps at runtime via human-powered and programmatic exploration. Like its predecessor ERICA, Rico’s app mining infrastructure requires no access to — or modification of — an app’s source code. Apps are downloaded from the Google Play Store and served to crowd workers through a web interface. When crowd workers use an app, the system records a user interaction trace that captures the UIs visited and the interactions performed on them. Then, an automated agent replays the trace to warm up a new copy of the app and continues the exploration programmatically, leveraging a content-agnostic similarity heuristic to efficiently discover new UI states. By combining crowdsourcing and automation, Rico can achieve higher coverage over an app’s UI states than either crawling strategy alone. In total, 13 workers recruited on UpWork spent 2,450 hours using apps on the platform over five months, producing 10,811 user interaction traces. After collecting a user trace for an app, we ran the automated crawler on the app for one hour.

    Acknowledgements

    UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN https://interactionmining.org/rico

    Inspiration

    The Rico dataset is large enough to support deep learning applications. We trained an autoencoder to learn an embedding for UI layouts, and used it to annotate each UI with a 64-dimensional vector representation encoding visual layout. This vector representation can be used to compute structurally — and often semantically — similar UIs, supporting example-based search over the dataset. To create training inputs for the autoencoder that embed layout information, we constructed a new image for each UI capturing the bounding box regions of all leaf elements in its view hierarchy, differentiating between text and non-text elements. Rico’s view hierarchies obviate the need for noisy image processing or OCR techniques to create these inputs.
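
    A minimal sketch of example-based search over such embeddings is shown below; the file name ui_layout_vectors.npy and the assumption that each row is the 64-dimensional layout vector of one UI screen are illustrative, not part of the official release description.

    ```python
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Hypothetical file: one 64-dimensional layout vector per UI screen.
    embeddings = np.load("ui_layout_vectors.npy")      # shape: (n_screens, 64)

    index = NearestNeighbors(n_neighbors=6, metric="euclidean").fit(embeddings)

    query_id = 0                                       # pick any screen as the query
    distances, neighbors = index.kneighbors(embeddings[query_id : query_id + 1])
    print(neighbors[0][1:])                            # 5 most similar screens, excluding the query itself
    ```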

  4. Ads.txt / App-ads.txt for advertisement compliance

    • datarade.ai
    .json, .csv, .txt
    Updated Jan 1, 2024
    Cite
    Datandard (2024). Ads.txt / App-ads.txt for advertisement compliance [Dataset]. https://datarade.ai/data-products/ads-txt-app-ads-txt-for-advertisement-compliance-datandard
    Explore at:
    .json, .csv, .txt (available download formats)
    Dataset updated
    Jan 1, 2024
    Dataset authored and provided by
    Datandard
    Area covered
    Turks and Caicos Islands, Mauritius, Chad, Iraq, Yemen, French Polynesia, Sint Maarten (Dutch part), Fiji, Latvia, Grenada
    Description

    In today's digital landscape, data transparency and compliance are paramount. Organizations across industries are striving to maintain trust and adhere to regulations governing data privacy and security. To support these efforts, we present our comprehensive Ads.txt and App-Ads.txt dataset.

    Key Benefits of Our Dataset:

    • Coverage: Our dataset offers a comprehensive view of the Ads.txt and App-Ads.txt files, providing valuable information about publishers, advertisers, and the relationships between them. You gain a holistic understanding of the digital advertising ecosystem.
    • Multiple Data Formats: We understand that flexibility is essential. Our dataset is available in multiple formats, including .CSV, .JSON, and more. Choose the format that best suits your data processing needs.
    • Global Scope: Whether your business operates in a single country or spans multiple continents, our dataset is tailored to meet your needs. It provides data from various countries, allowing you to analyze regional trends and compliance.
    • Top-Quality Data: Quality matters. Our dataset is meticulously curated and continuously updated to deliver the most accurate and reliable information. Trust in the integrity of your data for critical decision-making.
    • Seamless Integration: We've designed our dataset to seamlessly integrate with your existing systems and workflows. No disruptions, just enhanced compliance and efficiency.

    The Power of Ads.txt & App-Ads.txt: Ads.txt (Authorized Digital Sellers) and App-Ads.txt (Authorized Sellers for Apps) are industry standards developed by the Interactive Advertising Bureau (IAB) to increase transparency and combat ad fraud. These files specify which companies are authorized to sell digital advertising inventory on a publisher's website or app. Understanding and maintaining these files is essential for data compliance and the prevention of unauthorized ad sales.
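
    For illustration, a minimal sketch of parsing ads.txt / app-ads.txt records is shown below; it assumes the standard IAB line layout (ad-system domain, seller account ID, relationship, optional certification authority ID) and uses an invented sample line.

    ```python
    # Each data line is "ad-system domain, seller account ID, relationship[, cert authority ID]".
    def parse_ads_txt(text: str):
        records = []
        for raw in text.splitlines():
            line = raw.split("#", 1)[0].strip()   # drop comments and surrounding whitespace
            if not line or "=" in line:           # skip blanks and variables such as contact=...
                continue
            fields = [f.strip() for f in line.split(",")]
            if len(fields) >= 3:
                records.append({
                    "ad_system_domain": fields[0].lower(),
                    "seller_account_id": fields[1],
                    "relationship": fields[2].upper(),          # DIRECT or RESELLER
                    "cert_authority_id": fields[3] if len(fields) > 3 else None,
                })
        return records

    sample = "greenadexchange.com, XF7342, DIRECT, 5jyxf8k54  # banner ads"
    print(parse_ads_txt(sample))
    ```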

    How Can You Benefit? - Data Compliance: Ensure that your organization adheres to industry standards and regulations by monitoring Ads.txt and App-Ads.txt files effectively. - Ad Fraud Prevention: Identify unauthorized sellers and take action to prevent ad fraud, ultimately protecting your revenue and brand reputation. - Strategic Insights: Leverage the data in these files to gain insights into your competitors, partners, and the broader digital advertising landscape. - Enhanced Decision-Making: Make data-driven decisions with confidence, armed with accurate and up-to-date information about your advertising partners. - Global Reach: If your operations span the globe, our dataset provides insights into the Ads.txt and App-Ads.txt files of publishers worldwide.

    Multiple Data Formats for Your Convenience: - CSV (Comma-Separated Values): A widely used format for easy data manipulation and analysis in spreadsheets and databases. - JSON (JavaScript Object Notation): Ideal for structured data and compatibility with web applications and APIs. - Other Formats: We understand that different organizations have different preferences and requirements. Please inquire about additional format options tailored to your needs.

    Data That You Can Trust:

    We take data quality seriously. Our team of experts curates and updates the dataset regularly to ensure that you receive the most accurate and reliable information available. Your confidence in the data is our top priority.

    Seamless Integration:

    Integrate our Ads.txt and App-Ads.txt dataset effortlessly into your existing systems and processes. Our goal is to enhance your compliance efforts without causing disruptions to your workflow.

    In Conclusion:

    Transparency and compliance are non-negotiable in today's data-driven world. Our Ads.txt and App-Ads.txt dataset empowers you with the knowledge and tools to navigate the complexities of the digital advertising ecosystem while ensuring data compliance and integrity. Whether you're a Data Protection Officer, a data compliance professional, or a business leader, our dataset is your trusted resource for maintaining data transparency and safeguarding your organization's reputation and revenue.

    Get Started Today:

    Don't miss out on the opportunity to unlock the power of data transparency and compliance. Contact us today to learn more about our Ads.txt and App-Ads.txt dataset, available in multiple formats and tailored to your specific needs. Join the ranks of organizations worldwide that trust our dataset for a compliant and transparent future.

  5. Network traffic and code for machine learning classification

    • data.mendeley.com
    Updated Feb 20, 2020
    + more versions
    Cite
    Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2
    Explore at:
    Dataset updated
    Feb 20, 2020
    Authors
    Víctor Labayen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified into 5 different activities (Video, Bulk, Idle, Web, and Interactive), and the label is given in the filename. There is also a file (mapping.csv) with the mapping between the host's IP address, the csv/pcap filename, and the activity label.

    Activities:

    • Interactive: applications that perform real-time interactions to provide a suitable user experience, such as editing a file in Google Docs or remote CLI sessions over SSH.
    • Bulk data transfer: applications that transfer large-volume files over the network, for example SCP/FTP applications and direct downloads of large files from web servers such as Mediafire, Dropbox, or the university repository.
    • Web browsing: all the traffic generated while searching and consuming different web pages, for example several blogs, news sites, and the university's Moodle.
    • Video playback: traffic from applications that consume video in streaming or pseudo-streaming. The best-known servers used are Twitch and YouTube, but the university online classroom has also been used.
    • Idle behaviour: the background traffic generated by the user's computer when the user is idle. This traffic has been captured with every application closed and with some pages open (e.g., Google Docs, YouTube, and several web pages), but always without user interaction.

    The capture is performed on a network probe attached, via a SPAN port, to the router that forwards the user's network traffic. The traffic is stored in pcap format with the full packet payload. In the csv files, every non-TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): timestamp, protocol, payload size, source and destination IP address, and source and destination UDP/TCP port. The fields are also included as a header in every csv file.
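
    A minimal sketch of turning one per-packet csv trace into simple per-trace features is shown below; the file name and exact header spellings are assumptions, while the field order follows the description above.

    ```python
    import pandas as pd

    # Hypothetical file name; one csv per trace, one line per packet.
    pkts = pd.read_csv("web_trace_01.csv")

    # Assumed header spellings, in the field order given in the description above.
    pkts.columns = ["timestamp", "protocol", "payload_size",
                    "ip_src", "ip_dst", "port_src", "port_dst"]

    features = {
        "n_packets": len(pkts),
        "total_payload_bytes": int(pkts["payload_size"].sum()),
        "mean_payload_bytes": float(pkts["payload_size"].mean()),
        "duration_s": float(pkts["timestamp"].max() - pkts["timestamp"].min()),  # assumes numeric timestamps
        "n_dst_ports": int(pkts["port_dst"].nunique()),
    }
    print(features)
    ```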

    The amount of data is stated as follows:

    • Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
    • Video: 23 traces, 4496 s, 1405 MBytes
    • Web: 23 traces, 4203 s, 148 MBytes
    • Interactive: 42 traces, 8934 s, 30.5 MBytes
    • Idle: 52 traces, 6341 s, 0.69 MBytes

    The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.

  6. Data from: iRaPCA and SOMoC: Development and Validation of Web Applications...

    • acs.figshare.com
    text/x-python
    Updated Jun 1, 2023
    Cite
    Denis N. Prada Gori; Manuel A. Llanos; Carolina L. Bellera; Alan Talevi; Lucas N. Alberca (2023). iRaPCA and SOMoC: Development and Validation of Web Applications for New Approaches for the Clustering of Small Molecules [Dataset]. http://doi.org/10.1021/acs.jcim.2c00265.s002
    Explore at:
    text/x-python (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Denis N. Prada Gori; Manuel A. Llanos; Carolina L. Bellera; Alan Talevi; Lucas N. Alberca
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The clustering of small molecules implies the organization of a group of chemical structures into smaller subgroups with similar features. Clustering has important applications to sample chemical datasets or libraries in a representative manner (e.g., to choose, from a virtual screening hit list, a chemically diverse subset of compounds to be submitted to experimental confirmation, or to split datasets into representative training and validation sets when implementing machine learning models). Most strategies for clustering molecules are based on molecular fingerprints and hierarchical clustering algorithms. Here, two open-source in-house methodologies for clustering of small molecules are presented: iterative Random subspace Principal Component Analysis clustering (iRaPCA), an iterative approach based on feature bagging, dimensionality reduction, and K-means optimization; and Silhouette Optimized Molecular Clustering (SOMoC), which combines molecular fingerprints with the Uniform Manifold Approximation and Projection (UMAP) and Gaussian Mixture Model algorithm (GMM). In a benchmarking exercise, the performance of both clustering methods has been examined across 29 datasets containing between 100 and 5000 small molecules, comparing these results with those given by two other well-known clustering methods, Ward and Butina. iRaPCA and SOMoC consistently showed the best performance across these 29 datasets, both in terms of within-cluster and between-cluster distances. Both iRaPCA and SOMoC have been implemented as free Web Apps and standalone applications, to allow their use to a wide audience within the scientific community.
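
    As a rough, hedged sketch of the SOMoC-style recipe described above (fingerprints, then UMAP, then Gaussian Mixture clustering scored by silhouette), the following uses RDKit, umap-learn, and scikit-learn on a toy set of molecules; the parameter choices are illustrative, not the authors' exact configuration.

    ```python
    import numpy as np
    import umap
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.metrics import silhouette_score
    from sklearn.mixture import GaussianMixture

    # Toy molecule set; a real run would use a library of hundreds to thousands of molecules.
    smiles = ["CCO", "CCN", "CC(C)O", "CCCCCC", "CCCCCO", "c1ccccc1",
              "c1ccccc1O", "c1ccncc1", "CC(=O)O", "CCOC(=O)C"]

    # Morgan (ECFP-like) fingerprints collected into a binary matrix.
    X = np.zeros((len(smiles), 2048), dtype=np.float32)
    for i, smi in enumerate(smiles):
        fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), radius=2, nBits=2048)
        X[i] = list(fp)

    # Non-linear dimensionality reduction with UMAP, then Gaussian Mixture clustering.
    embedding = umap.UMAP(n_components=2, n_neighbors=5, random_state=0).fit_transform(X)
    labels = GaussianMixture(n_components=2, random_state=0).fit_predict(embedding)

    # Silhouette score (higher is better) as the cluster-quality criterion.
    if len(set(labels)) > 1:
        print("silhouette:", silhouette_score(embedding, labels))
    print(labels)
    ```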

  7. Data from: Reorganized Dataset

    • universe.roboflow.com
    zip
    Updated Mar 20, 2023
    Cite
    bruce baur (2023). Reorganized Dataset [Dataset]. https://universe.roboflow.com/bruce-baur/reorganized/dataset/2
    Explore at:
    zip (available download formats)
    Dataset updated
    Mar 20, 2023
    Dataset authored and provided by
    bruce baur
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Gui Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Automated UI Testing: "Reorganized" can be employed to perform automated UI testing for web and mobile applications, helping developers quickly identify and verify the presence of specific UI elements such as buttons, fields, and images to ensure that each component is functioning properly and meets design specifications.

    2. Accessibility Enhancement: Utilizing "Reorganized" can help in improving the accessibility of websites and applications by automatically identifying and labeling different GUI elements, enabling screen reader software to provide more accurate and detailed information for visually impaired users.

    3. UI Design Evaluation: "Reorganized" can assist in analyzing and comparing UI designs of different applications to evaluate consistency, user experience and adherence to design principles. By identifying specific elements, it can provide insights to designers on which areas need improvement or adjustments.

    4. Content Curation and Classification: The computer vision model can be used to analyze and sort through large collections of web pages or applications to categorize and curate content based on the presence of specific GUI elements like text, images, buttons, etc. This can be helpful in creating repositories, educational material, or designing targeted advertisements.

    5. Website Migration and Conversion: Using "Reorganized" can significantly speed up the process of migrating or converting websites, especially when transitioning from one content management system to another. By identifying and extracting GUI elements, it becomes easier to map these elements to a new system and ensure a seamless transfer.

  8. Alternative Data Market Analysis North America, Europe, APAC, South America,...

    • technavio.com
    Cite
    Technavio, Alternative Data Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, Canada, China, UK, Mexico, Germany, Japan, India, Italy, France - Size and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/alternative-data-market-industry-analysis
    Explore at:
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Canada, United States, United Kingdom, Global
    Description


    Alternative Data Market Size 2025-2029

    The alternative data market size is forecast to increase by USD 60.32 billion, at a CAGR of 52.5% between 2024 and 2029.

    The market is experiencing significant growth, driven by the increased availability and diversity of data sources. This expanding data landscape is fueling the rise of alternative data-driven investment strategies across various industries. However, the market faces challenges related to data quality and standardization. As companies increasingly rely on alternative data to inform business decisions, ensuring data accuracy and consistency becomes paramount. Addressing these challenges requires robust data management systems and collaboration between data providers and consumers to establish industry-wide standards. Companies that effectively navigate these dynamics can capitalize on the wealth of opportunities presented by alternative data, driving innovation and competitive advantage.

    What will be the Size of the Alternative Data Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
    The market continues to evolve, with new applications and technologies shaping its dynamics. Predictive analytics and deep learning are increasingly being integrated into business intelligence systems, enabling more accurate risk management and sales forecasting. Data aggregation from various sources, including social media and web scraping, enriches datasets for more comprehensive quantitative analysis. Data governance and metadata management are crucial for maintaining data accuracy and ensuring data security. Real-time analytics and cloud computing facilitate decision support systems, while data lineage and data timeliness are essential for effective portfolio management. Unstructured data, such as sentiment analysis and natural language processing, provide valuable insights for various sectors. Machine learning algorithms and execution algorithms are revolutionizing trading strategies, from proprietary trading to high-frequency trading. Data cleansing and data validation are essential for maintaining data quality and relevance. Standard deviation and regression analysis are essential tools for financial modeling and risk management. Data enrichment and data warehousing are crucial for data consistency and completeness, allowing for more effective customer segmentation and sales forecasting. Data security and fraud detection are ongoing concerns, with advancements in technology continually addressing new threats. The market's continuous dynamism is reflected in its integration of various technologies and applications. From data mining and data visualization to supply chain optimization and pricing optimization, the market's evolution is driven by the ongoing unfolding of market activities and evolving patterns.

    How is this Alternative Data Industry segmented?

    The alternative data industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023, for the following segments.

    • Type: Credit and debit card transactions, Social media, Mobile application usage, Web scraped data, Others
    • End-user: BFSI, IT and telecommunication, Retail, Others
    • Geography: North America (US, Canada, Mexico), Europe (France, Germany, Italy, UK), APAC (China, India, Japan), Rest of World (ROW)

    By Type Insights

    The credit and debit card transactions segment is estimated to witness significant growth during the forecast period.Alternative data derived from card and debit card transactions plays a pivotal role in business intelligence, offering valuable insights into consumer spending behaviors. This data is essential for market analysts, financial institutions, and businesses aiming to optimize strategies and enhance customer experiences. Two primary categories exist within this data segment: credit card transactions and debit card transactions. Credit card transactions reveal consumers' discretionary spending patterns, luxury purchases, and credit management abilities. By analyzing this data through quantitative methods, such as regression analysis and time series analysis, businesses can gain a deeper understanding of consumer preferences and trends. Debit card transactions, on the other hand, provide insights into essential spending habits, budgeting strategies, and daily expenses. This data is crucial for understanding consumers' practical needs and lifestyle choices. Machine learning algorithms, such as deep learning and predictive analytics, can be employed to uncover patterns and trends in debit card transactions, enabling businesses to tailor their offerings and services accordingly. Data governance, data security, and data accuracy are critical considerations when dealing with sensitive financial d

  9. Data from: Crowd and community sourcing to update authoritative LULC data in...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    txt
    Updated Jul 22, 2024
    Cite
    Ana-Maria Olteanu-Raimond; Marie-Dominique Van Damme; Julie Marcuzzi; Tobias Sturn; Ludovic Fraval; Marie Gombert; Laurence Jolivet; Linda See; Timothé Royer; Simon Fauret (2024). Crowd and community sourcing to update authoritative LULC data in urban areas [Dataset]. http://doi.org/10.5281/zenodo.3691827
    Explore at:
    txt (available download formats)
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ana-Maria Olteanu-Raimond; Marie-Dominique Van Damme; Julie Marcuzzi; Tobias Sturn; Ludovic Fraval; Marie Gombert; Laurence Jolivet; Linda See; Timothé Royer; Simon Fauret
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The French National Mapping Agency (Institut National de l'Information Géographique et Forestière - IGN) is responsible for producing and maintaining the spatial data sets for all of France. At the same time, it must satisfy the needs of different stakeholders who are responsible for decisions at multiple levels, from local to national. IGN produces many different maps, including detailed road networks and land cover/land use maps over time. The information contained in these maps is crucial for many of the decisions made about urban planning, resource management, and landscape restoration, as well as other environmental issues in France. Recently, IGN has started the process of creating high-resolution land use/land cover (LULC) maps, aimed at developing smart and accurate monitoring services of LULC over time. To help update and validate the French LULC database, citizens and interested stakeholders can contribute using the Paysages mobile and web applications. This approach presents an opportunity to evaluate the integration of citizens in the IGN process of updating and validating LULC data.

    Dataset 1: Change detection validation 2019

    This dataset contains web-based validations of changes detected by time-series (2016-2019) analysis of Sentinel-2 satellite imagery. Validation was conducted using two high-resolution orthophotos, from 2016 and 2019 respectively, as reference data. Two tools were used: the Paysages web application and LACO-Wiki. Both tools used the same validation design (blind validation) and the same options. For each detected change, contributors were asked to validate whether there was a change and, if so, to choose a LU or LC class from a pre-defined list of classes.

    The dataset has the following characteristics:

    • Time period of the change detection: 2016-2019.
    • Time period of data collection: February 2019-December 2019
    • Total number of contributors: 105
    • Number of validated changes: 1048; each change was validated by 1 to 6 contributors.
    • Region of interest: Toulouse and surrounding areas

    Associated files: 1- Change validation locations.png, 1-Change validation 2019 – Attributes.csv, 1-Change validation 2019.csv, 1-Change validation 2019.geoJSON

    This dataset is licensed under a Creative Commons Attribution 4.0 International. It is attributed to the LandSense Citizen Observatory, IGN-France, and GeoVille.

    Dataset 2: Land use classification 2019

    The aim of this data collection campaign was to improve the LU classification of authoritative LULC data (OCS-GE 2016 ©IGN) for built-up areas. Using the Paysages web platform, contributors were asked to choose a land use value from a list of pre-defined values for each location.

    The dataset has the following characteristics:

    • Time period of data collection: August 2019
    • Types of contributors: Surveyors from the production department of IGN
    • Total number of contributors: 5
    • Total number of observations: 2711
    • Data specifications of the OCS-GE ©IGN
    • Region of interest: Toulouse and surrounding areas

    Associated files: 2- LU classification points.png, 2-LU classification 2019 – Attributes.csv, 2-LU classification 2019.csv, 2-LU classification 2019.geoJSON

    This dataset is licensed under a Creative Commons Attribution 4.0 International. It is attributed to the LandSense Citizen Observatory, IGN-France and the International Institute for Applied Systems Analysis.

    Dataset 3: In-situ validation 2018

    The aim of this data collection campaign was to collect in-situ (ground-based) information, using the Paysages mobile application, to update authoritative LULC data. Contributors visited pre-determined locations, took photographs of the point location and in the four cardinal directions away from the point, and answered a few questions related to the task. Two tasks were defined:

    • Classify the point by choosing a LU class among three classes: industrial (US2), commercial (US3), or residential (US5).
    • Validate changes detected by the LandSense Change Detection Service: for each newly detected change, the contributor was requested to validate the change and choose a LU and LC class from a pre-defined list of classes.

    The dataset has the following characteristics:

    • Time period of data collection: June 2018 – October 2018
    • Types of contributors: students from the School of Agricultural and Life Sciences and citizens
    • Total number of contributors: 26
    • Total number of observations: 281
    • Total number of photos: 421
    • Region of interest: Toulouse and surrounding areas

    Associated files: 3- Insitu locations.png, 3- Insitu validation 2018 – Attributes.csv, 3- Insitu validation 2018.csv, 3- Insitu validation 2018.geoJSON

    This dataset is licensed under a Creative Commons Attribution 4.0 International. It is attributed to the LandSense Citizen Observatory, IGN-France.

    This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no 689812.

  10. Cross-language corpora of privacy policies

    • zenodo.org
    • explore.openaire.eu
    • +1more
    csv, zip
    Updated Jun 17, 2023
    Cite
    Francesco Ciclosi; Silvia Vidor; Fabio Massacci (2023). Cross-language corpora of privacy policies [Dataset]. http://doi.org/10.5281/zenodo.7729546
    Explore at:
    csv, zip (available download formats)
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francesco Ciclosi; Silvia Vidor; Fabio Massacci
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of three different privacy policy corpora (in English and Italian) composed of 81 unique privacy policy texts spanning the period 2018-2021. This dataset makes available an example of three corpora of privacy policies. The first corpus is the English-language corpus, the original used in the study by Tang et al. [2]. The other two are cross-language corpora built (one, the source corpus, in English, and the other, the replication corpus, in Italian, which is the language of a potential replication study) from the first corpus.

    The policies were collected from:

    1. the Alexa top 10 Italy and U.S. websites rank;
    2. the Play Store apps rank in the "most profitable games" category of the Play Store for Italy and the U.S.

    We manually analyzed the Alexa top 10 Italy websites as of November 2021. Analogously, we analyzed selected apps that, in the same period, had ranked better in the "most profitable games" category of the Play Store for Italy.

    All the privacy policies are ANSI-encoded text files and have been manually read and verified.
    The dataset is helpful as a starting point for building comparable cross-language privacy policies corpora. The availability of these comparable cross-language privacy policies corpora helps replicate studies in different languages.
    Details on the methodology can be found in the accompanying paper.

    The available files are as follows:

    • policies-texts.zip --> contains a directory of text files with the policy texts. File names are the SHA1 hashes of the policy text.
    • policy-metadata.csv --> Contains a CSV file with the metadata for each privacy policy.
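
    A minimal sketch for checking that each file name matches the SHA1 hash of the policy text is shown below; the extraction directory and the assumption that the hash is computed over the raw file bytes are illustrative.

    ```python
    import hashlib
    from pathlib import Path

    # Assumes policies-texts.zip has been extracted into ./policies-texts
    for path in sorted(Path("policies-texts").glob("*")):
        digest = hashlib.sha1(path.read_bytes()).hexdigest()
        status = "matches" if digest == path.stem else "differs"
        print(path.name, status)
    ```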

    This dataset is the original dataset used in the publication [1]. The original English U.S. corpus is described in the publication [2].

    [1] F. Ciclosi, S. Vidor and F. Massacci. "Building cross-language corpora for human understanding of privacy policies." Workshop on Digital Sovereignty in Cyber Security: New Challenges in Future Vision. Communications in Computer and Information Science. Springer International Publishing, 2023, In press.

    [2] J. Tang, H. Shoemaker, A. Lerner, and E. Birrell. Defining Privacy: How Users Interpret Technical Terms in Privacy Policies. Proceedings on Privacy Enhancing Technologies, 3:70–94, 2021.

  11. Example dataset for DASiRe

    • data.niaid.nih.gov
    Updated Dec 20, 2021
    Cite
    Marisol Salgado (2021). Example dataset for DASiRe [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5792671
    Explore at:
    Dataset updated
    Dec 20, 2021
    Dataset provided by
    Marisol Salgado
    Amit Fenn
    Chit Tong Lio
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Direct Alternative Splicing Regulator predictor (DASiRe) is a web application that allows non-expert users to perform different types of splicing analysis from RNA-seq experiments and also incorporates ChIP-seq data of a DNA-binding protein of interest to evaluate whether its presence is associated with the splicing changes detected in the RNA-seq dataset.

    DASiRe is an accessible web-based platform that performs the analysis of raw RNA-seq and ChIP-seq data to study the relationship between DNA-binding proteins and alternative splicing regulation. It provides a fully integrated pipeline that takes raw reads from RNA-seq and performs extensive splicing analysis by incorporating the three current methodological approaches to study alternative splicing: isoform switching, exon and event-level. Once the initial splicing analysis is finished, DASiRe performs ChIP-seq peak enrichment in the spliced genes detected by each one of the three approaches.

  12. Continuous MODIS land surface temperature dataset over the Eastern...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Feb 11, 2021
    + more versions
    Cite
    Shilo Shiff (2021). Continuous MODIS land surface temperature dataset over the Eastern Mediterranean [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3583123
    Explore at:
    Dataset updated
    Feb 11, 2021
    Dataset provided by
    Shilo Shiff
    Lensky, M Itamar
    Helman, David
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Mediterranean Sea, Eastern Mediterranean
    Description

    A continuous dataset of Land Surface Temperature (LST) is vital for climatological and environmental studies. LST can be regarded as a combination of seasonal mean temperature (climatology) and daily anomaly, which is attributed mainly to the synoptic-scale atmospheric circulation (weather). To reproduce LST in cloudy pixels, time series (2002-2019) of cloud-free 1km MODIS Aqua LST images were generated and the pixel-based seasonality (climatology) was calculated using temporal Fourier analysis. To add the anomaly, we used the NCEP Climate Forecast System Version 2 (CFSv2) model, which provides air surface temperature under both cloudy and clear sky conditions. The combination of the two sources of data enables the estimation of LST in cloudy pixels.

    Data structure

    The dataset consists of geo-located continuous LST (Day, Night, and Daily), including estimated LST values for cloudy pixels. The spatial domain of the data is the Eastern Mediterranean, at the resolution of the MYD11A1 product (~1 km). Data are stored in GeoTIFF format as signed 16-bit integers using a scale factor of 0.02, with one file per day, each containing 4 bands (Night LST Cont., Day LST Cont., Daily Average LST Cont., QA). The QA band stores information about the presence of cloud in the original pixel: if both original files (Day LST and Night LST) had NoData due to clouds, the QA value is 0; a QA value of 1 indicates NoData in the original Day LST, 2 indicates NoData in the Night LST, and 3 indicates valid data for both day and night. File names follow the naming convention LST_<year><month><day>.tif, where <year> is the year, <month> the month, and <day> the day. The files for each year (2002-2019) are compressed into a ZIP file. The same data are also provided in NetCDF format, where each file represents a whole year and consists of the same 4 bands (Night LST Cont., Day LST Cont., Daily Average LST Cont., QA) for each day.

    The file LSTcont_validation.tif contains the validation dataset, in which the MAE, RMSE, and Pearson (r) of the validation against true LST are provided. Data are stored in GeoTIFF format as signed 32-bit floats, with the same spatial extent and resolution as the LSTcont dataset, in a single file containing three bands (MAE, RMSE, and Pearson_r). The same data with the same structure are also provided in NetCDF format.

    How to use

    The data can be read in various programming languages, such as Python, IDL, or MATLAB, and can be visualized in a GIS program such as ArcGIS or QGIS. A short animation demonstrating how to visualize the data with the open-source QGIS program is available in the project's GitHub code repository.
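
    A minimal sketch for reading one daily GeoTIFF in Python with rasterio is shown below; the file name is a placeholder, the band order follows the data structure described above, and the assumption that the 0.02 scale factor yields kelvin follows standard MODIS LST conventions.

    ```python
    import numpy as np
    import rasterio

    # Placeholder file name following the LST_<year><month><day>.tif convention.
    with rasterio.open("LST_20190101.tif") as src:
        night, day, daily, qa = src.read()        # 4 int16 bands, in the order described above

    scale = 0.02                                   # stored integers -> temperature (assumed kelvin)
    day_lst = day.astype(np.float32) * scale

    # QA: 0 = cloudy in both original files, 1 = Day LST missing, 2 = Night LST missing, 3 = valid in both.
    reconstructed_day = np.isin(qa, (0, 1))        # day values that were gap-filled by LSTcont
    print(day_lst.shape, float(reconstructed_day.mean()))
    ```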

    Web application

    The LSTcont web application (https://shilosh.users.earthengine.app/view/continuous-lst) is an Earth Engine app. The interface includes a map and a date picker. The user can select a date (July 2002 – present) and visualize LSTcont for that day anywhere on the globe. The web app calculates LSTcont on the fly from ready-made global climatological files. LSTcont can be downloaded as a GeoTIFF with 5 bands in this order: Mean daily LSTcont, Night original LST, Night LSTcont, Day original LST, Day LSTcont.

    Code availability

    Datasets for other regions can be easily produced on the GEE platform with the code provided in the project's GitHub code repository.

  13. ToS;DR policies dataset (raw) - 21/07/2023

    • zenodo.org
    csv
    Updated May 5, 2025
    Cite
    Zenodo (2025). ToS;DR policies dataset (raw) - 21/07/2023 [Dataset]. http://doi.org/10.5281/zenodo.15012282
    Explore at:
    csv (available download formats)
    Dataset updated
    May 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    This dataset has been collected and annotated by Terms of Service; Didn't Read (ToS;DR), an independent project aimed at analyzing and summarizing the terms of service and privacy policies of various online services. ToS;DR helps users understand the legal agreements they accept when using online platforms by categorizing and evaluating specific cases related to these policies.

    The dataset includes structured information on individual cases, broader topics, specific services, detailed documents, and key points extracted from legal texts.

    • Cases refer to individual legal cases or specific issues related to the terms of service or privacy policies of a particular online service. Each case typically focuses on a specific aspect of a service's terms, such as data collection, user rights, content ownership, or security practices.

      • id, a unique id for each case (incremental).
      • classification, one of these values: good, bad, neutral, blocker.
      • score, values range from 0 to 100.
      • title.
      • description.
      • topic_id, connecting the case with its topic.
      • created_at.
      • updated_at.
      • privacy_related, a flag indicating whether the case is privacy-related.
      • docbot_regex, the regular expression used to check for specific words in the quoted text.
    • Topics are general categories or themes that encompass various cases. They help organize and group similar cases together based on the type of issues they address. For example, "Data Collection" could be a topic that includes cases related to how a service collects and uses user data.

      • id, a unique id for each topic (incremental).
      • title.
      • subtitle, a short description.
      • description.
      • created_at.
      • updated_at.
    • Services represent specific online platforms, websites, or applications that have their own terms of service and privacy policies.

      • id, a unique id for each service (incremental).
      • name.
      • url.
      • created_at.
      • updated_at.
      • wikipedia, wikipedia url of the service.
      • keywords.
      • related, connecting the service with a known similar service in the same field.
      • slug, derived from the name (lowercase, no spaces, and so on).
      • is_comprehensively_reviewed, a flag indicating whether the service is comprehensively reviewed.
      • rating, overall rating for the service based on all cases.
      • status, indicating whether the service is deleted (deleted, NaN).
    • Points are individual statements or aspects within a case that highlight important information about a service's terms of service or privacy policy. These points can be positive (e.g., strong privacy protections) or negative (e.g., data sharing with third parties).

      • id, a unique id for each point (incremental).
      • rank, all values are zero.
      • title, mostly similar to the case title.
      • source, url of the source.
      • status, one of those values (approved, declined, pending, changes-requested, disputed, draft).
      • analysis.
      • created_at.
      • updated_at.
      • service_id, connecting the point with its service.
      • quote_text, quoted text from the source containing the information for this point.
      • case_id, connecting the point with the related case.
      • old_id, used for data migration.
      • quote_start, index of the first character of the quoted text in the document.
      • quote_end, index of the last character of the quoted text in the document.
      • service_needs_rating_update, all values are False.
      • document_id, connecting the point with the related document.
      • annotation_ref.
    • Documents refer to the original terms of service and privacy policies of the services that are being analyzed on TOSDR. These documents are the source of information for the cases, points, and ratings provided on the platform. TOSDR links to the actual documents, so users can review the full details if they choose to.

      • id, a unique id for each document (incremental).
      • name, the name of the document (e.g., privacy policy, cookie policy).
      • url, url of the document.
      • xpath.
      • text, the actual document.
      • created_at.
      • updated_at.
      • service_id, connecting the document with its service.
      • reviewed, a flag indicating whether the document has been reviewed.
      • status, indicating whether it is deleted or not (deleted, NaN).
      • crawler_server, the server used to crawl the document.
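
    A minimal sketch for joining the tables described above with pandas is shown below; the CSV file names are assumptions, while the join keys (case_id, service_id) follow the column descriptions in this entry.

    ```python
    import pandas as pd

    # Hypothetical per-table CSV files exported from the raw dump.
    points = pd.read_csv("points.csv")
    cases = pd.read_csv("cases.csv")
    services = pd.read_csv("services.csv")

    # Keep only approved points, then attach their case and service records.
    approved = points[points["status"] == "approved"]
    merged = (approved
              .merge(cases, left_on="case_id", right_on="id", suffixes=("", "_case"))
              .merge(services, left_on="service_id", right_on="id", suffixes=("", "_service")))

    # e.g. count approved points per case classification (good / bad / neutral / blocker).
    print(merged.groupby("classification").size())
    ```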

  14. UiPad

    • huggingface.co
    Updated Oct 2, 2024
    Cite
    MacPaw Way Ltd. (2024). UiPad [Dataset]. https://huggingface.co/datasets/MacPaw/UiPad
    Explore at:
    Dataset updated
    Oct 2, 2024
    Dataset provided by
    MacPaw
    Authors
    MacPaw Way Ltd.
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    UiPad - UI Parsing and Accessibility Dataset

    Curated by: MacPaw Way Ltd.
    Language(s): Mostly EN, UA
    License: MIT

    Overview UiPad is a dataset created for the IASA Champ 2024 Challenge, focusing on the accessibility and interface understanding of MacOS applications. With growing interest in AI-driven user interface analysis, the dataset aims to bridge the gap in available resources for desktop app accessibility. While mobile apps and web platforms benefit from datasets like RICO and… See the full description on the dataset page: https://huggingface.co/datasets/MacPaw/UiPad.

  15. Data from: Site-specific management of cotton root rot using airborne and...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data from: Site-specific management of cotton root rot using airborne and high resolution satellite imagery and variable rate technology [Dataset]. https://catalog.data.gov/dataset/data-from-site-specific-management-of-cotton-root-rot-using-airborne-and-high-resolution-s-9a191
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    Cotton root rot is a century-old cotton disease that now can be effectively controlled with Topguard Terra fungicide. Because this disease tends to occur in the same general areas within fields in recurring years, site-specific application of the fungicide only to infested areas can be as effective as and considerably more economical than uniform application. The overall objective of this research was to demonstrate how site-specific fungicide application could be implemented based on historical remote sensing imagery and using variable-rate technology. Procedures were developed for creating binary prescription maps from historical airborne and high-resolution satellite imagery. Two different variable-rate liquid control systems were adapted to two existing cotton planters, respectively, for site-specific fungicide application at planting. One system was used for site-specific application on multiple fields in 2015 and 2016 near Edroy, Texas, and the other system was used on multiple fields in both years near San Angelo, Texas. Airborne multispectral imagery taken during the two growing seasons was used to monitor the performance of the site-specific treatments. Results based on prescription maps derived from historical airborne and satellite imagery of two fields in 2015 and one field in 2016 are reported in this article. Two years of field experiments showed that the prescription maps and the variable-rate systems performed well and that site-specific fungicide treatments effectively controlled cotton root rot. Reduction in fungicide use was 41%, 43%, and 63% for the three fields, respectively. The methodologies and results of this research will provide cotton growers, crop consultants, and agricultural dealers with practical guidelines for implementing site-specific fungicide application using historical imagery and variable-rate technology for effective management of cotton root rot.

    Resources in this dataset:

    • Resource Title: A ground picture of cotton root rot. File Name: IMG_0124.JPG. Resource Description: A cotton root rot-infested area in a cotton field near Edroy, TX.
    • Resource Title: An aerial image of a cotton field. File Name: Color-infrared image of a field.jpg. Resource Description: Aerial color-infrared (CIR) image of a cotton field infested with cotton root rot.
    • Resource Title: As-applied fungicide application data. File Name: Jim Ermis-Farm 1-Field 11 Fungicide Application.csv. Resource Description: As-applied fungicide application rates for variable rate application of Topguard to a cotton field infested with cotton rot

  16. Pattern-based GIS for understanding content of very large Earth Science...

    • data.amerigeoss.org
    • data.wu.ac.at
    html
    Updated Jan 29, 2020
    Cite
    United States (2020). Pattern-based GIS for understanding content of very large Earth Science datasets [Dataset]. https://data.amerigeoss.org/dataset/pattern-based-gis-for-understanding-content-of-very-large-earth-science-datasets1
    Explore at:
    html (available download formats)
    Dataset updated
    Jan 29, 2020
    Dataset provided by
    United States
    Area covered
    Earth
    Description

    The research focus in the field of remotely sensed imagery has shifted from the collection and warehousing of data, tasks for which a mature technology already exists, to auto-extraction of information and knowledge discovery from this valuable resource, tasks for which technology is still under active development. In particular, intelligent algorithms for the analysis of very large rasters, either high-resolution images or medium-resolution global datasets, which are becoming more and more prevalent, are lacking. We propose to develop the Geospatial Pattern Analysis Toolbox (GeoPAT), a computationally efficient, scalable, and robust suite of algorithms that supports GIS processes such as segmentation, unsupervised/supervised classification of segments, query and retrieval, and change detection in giga-pixel and larger rasters. At the core of the technology that underpins GeoPAT is the novel concept of pattern-based image analysis. Unlike pixel-based or object-based (OBIA) image analysis, GeoPAT partitions an image into overlapping square scenes containing 1,000-100,000 pixels and performs further processing on those scenes using pattern signature and pattern similarity, concepts first developed in the field of Content-Based Image Retrieval. This fusion of methods from two different areas of research results in an orders-of-magnitude performance boost in application to very large images without sacrificing quality of the output.

    GeoPAT v.1.0 already exists as a GRASS GIS add-on that has been developed and tested on medium-resolution, continental-scale datasets including the National Land Cover Dataset and the National Elevation Dataset. The proposed project will develop GeoPAT v.2.0, a much improved and extended version of the present software. We estimate an overall entry TRL for GeoPAT v.1.0 of 3-4 and a planned exit TRL for GeoPAT v.2.0 of 5-6. Moreover, several important new functionalities will be added. Proposed improvements include conversion of GeoPAT from a GRASS add-on into stand-alone software capable of being integrated with other systems, full implementation of a web-based interface, new modules extending its applicability to high-resolution images/rasters and medium-resolution climate data, extension to the spatio-temporal domain, enabling hierarchical search and segmentation, development of improved pattern signatures and their similarity measures, parallelization of the code, and implementation of a divide-and-conquer strategy to speed up selected modules.

    The proposed technology will contribute to a wide range of Earth Science investigations and missions by enabling extraction of information from diverse types of very large datasets. Analyzing an entire dataset, without the need to sub-divide it due to software limitations, offers the important advantage of uniformity and consistency. We propose to demonstrate the utilization of GeoPAT technology on two specific applications. The first application is a web-based, real-time, visual search engine for local physiography utilizing query-by-example on the entire, global-extent SRTM 90 m resolution dataset. The user selects a region where a process of interest is known to occur, and the search engine identifies other areas around the world with similar physiographic character and thus potential for a similar process. The second application is monitoring urban areas in their entirety at high resolution, including mapping of impervious surfaces and identifying settlements for improved disaggregation of census data.

  17. UVP6Net: plankton images captured with the UVP6 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Aug 8, 2018
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Dataset updated
    Aug 8, 2018
    Description

    Plankton was imaged with the UVP6 in contrasted oceanic regions. The full images were processed by the UVP6 firmware and the regions of interest (ROIs) around each individual object were recorded. A set of associated features was measured on the objects (see Picheral et al. 2021, doi:10.1002/lom3.10475, for more information). All objects were classified by a limited number of operators into 110 different classes using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr). This dataset corresponds to the 634,459 objects with an area greater than 73 pixels (equivalent spherical diameter of 9.8 pixels, corresponding to the default size limit of 620 µm in the UVP6 configuration). The different files provide information about the features of the objects and their taxonomic identification, as well as the raw images. For the purpose of training machine learning classifiers, the images in each class were split into training, validation, and test sets with proportions of 70%, 15%, and 15%. An additional folder is provided, which includes the subset of images used to train the single embedded classification model of the UVP6 actually deployed on the NKE CTS5 floats (10.5281/zenodo.10694203). These images correspond to UVP6Net objects filtered to retain only those with a size of 79 pixels, to fit the 645 µm class from EcoPart, resulting in a total of 595,595 objects. The taxonomic identification was also made coarser (from 110 classes to 20) to ensure adequate performance of the classification model on power-constrained hardware. Images in this subset display objects as shades of grey/white on a black background.

    1. The folder UVP6Net_data.tar contains:

    taxa.csv.gz: table of the classification of each object in the dataset, with columns:
    - objid: unique object identifier in EcoTaxa (integer number)
    - taxon_level1: taxonomic name corresponding to the level 1 classification
    - lineage_level1: taxonomic lineage corresponding to the level 1 classification
    - taxon_level2: name of the taxon corresponding to the level 2 classification
    - plankton: whether the object is plankton or not (boolean)
    - set: set the image belongs to (train: training, val: validation, test: test)
    - img_path: local path of the image (for the level 1 taxon), named according to the object id

    features_native.csv.gz: table of metadata of each object, including the different features processed by the UVPapp application. All features are computed on the object only, excluding the background. All area/length measures are in pixels. All grey levels are encoded in 8 bits (0 = black, 255 = white). With columns:
    - objid: unique object identifier in EcoTaxa (integer number)
    - and 62 features: area, mean, stddev, mode, min, max, perim, width, height, major, minor, angle, circ, feret, intden, median, skew, kurt, %area, area_exc, fractal, skelarea, slope, histcum1/2/3, nb1/nb2/nb3, symetrieh, symetriev, symetriehc, symetrievc, convperim, convarea, fcons, thickr, elongation, range, meanpos, cv, sr, perimareaexc, feretareaexc, perimferet, perimmajor, circex, kurt_mean, skew_mean, convperim_perim, convarea_area, symetrieh_area, symetriev_area, nb1/nb2/nb3_area, nb1/nb2/nb3_range, median_mean/median_mean_range, skeleton_area
    See OBJECT measurements at https://doi.org/10.5281/zenodo.14704250 for definitions.

    features_skimage.csv.gz: table of morphological features recomputed with skimage.measure.regionprops on the ROIs produced by the UVP6 firmware. See http://scikit-image.org/docs/dev/api/skimage.measure.html#skimage.measure.regionprops for documentation.

    inventory.tsv: tree view of the taxonomy and the number of images in each taxon, displayed as text. With columns:
    - lineage_level1: taxonomic lineage corresponding to the level 1 classification
    - taxon_level1: name of the taxon corresponding to the level 1 classification
    - n: number of objects in each class

    2. The folder UVP6Net_imgs.tar contains:

    imgs: images of each object, named according to the object id (objid) and sorted into subdirectories according to their taxon.

    3. The folder UVPEC_imgs.tar contains:

    imgs: images of each object on a black background, stored in the format required to train an embedded classifier with the UVPEC package (https://github.com/ecotaxa/uvpec), i.e. each image is stored as "objid.jpg" in folders corresponding to their taxon (20 different classes), named "taxon_name_taxon_id".

    4. In addition:

    map.png: map of the sampling locations, to give an idea of the diversity sampled in this dataset.
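
    As a small, hypothetical illustration of how features similar to those in features_skimage.csv.gz could be recomputed for a single ROI, the sketch below derives a few region properties with skimage.measure.regionprops. The image path, the threshold used to separate the object from the black background, and the selected properties are assumptions for the example, not the exact processing used to build the dataset.

      from skimage import io, measure

      # Hypothetical path: images are sorted into one subdirectory per taxon.
      img = io.imread("imgs/Copepoda/12345678.png", as_gray=True)

      # Crude object/background separation; the dataset's own pipeline may differ.
      mask = img > img.mean()
      labels = measure.label(mask)
      props = measure.regionprops(labels, intensity_image=img)

      # Keep the largest connected component, assumed to be the imaged object.
      obj = max(props, key=lambda p: p.area)
      print({
          "area": obj.area,
          "perimeter": obj.perimeter,
          "eccentricity": obj.eccentricity,
          "mean_intensity": obj.mean_intensity,
      })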

  18. Context Ad Clicks Dataset

    • kaggle.com
    Updated Feb 9, 2021
    Cite
    Möbius (2021). Context Ad Clicks Dataset [Dataset]. https://www.kaggle.com/arashnic/ctrtest/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Möbius
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The dataset was generated by an e-commerce website that sells a variety of products on its online platform. The website records the behaviour of its users and stores it as a log. However, most of the time users do not buy a product instantly; there is a time gap during which the customer might surf the internet and perhaps visit competitor websites. To improve product sales, the website owner has hired an ad-tech company, which built a system that shows ads for the owner's products on its partner websites. If a user comes to the owner's website and searches for a product, and then visits these partner websites or apps, the previously viewed items or similar items are shown to them as ads. If the user clicks such an ad, he/she is redirected to the owner's website and might buy the product.

    The task is to predict the probability of a user clicking the ad shown to them on the partner websites over the next 7 days, on the basis of historical view-log data, ad-impression data, and user data.

    Content

    You are provided with the view log of users (2018/10/15 - 2018/12/11) and the product descriptions collected from the owner's website. We also provide the training and test data containing details of ad impressions at the partner websites (Train + Test). The train data contains the impression logs during 2018/11/15 – 2018/12/13, along with the label specifying whether the ad was clicked or not. Your model will be evaluated on the test data, which has impression logs during 2018/12/12 – 2018/12/18 without the labels. You are provided with the following files (a minimal loading sketch follows the list):

    • train.zip: This contains 3 files; a description of each is given below:
      • train.csv
      • view_log.csv
      • item_data.csv

    • test.csv: This file contains the impressions for which the participants need to predict the click rate.
    • sample_submission.csv: This file contains the format in which you have to submit your predictions.
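
    A minimal, hypothetical loading sketch in Python/pandas is given below; the join keys (user_id, item_id) and the aggregated feature are assumptions about the schema, so check the actual column names after unzipping the files.

      import pandas as pd

      # Load the provided files (paths assume they sit in the working directory).
      train = pd.read_csv("train.csv")
      view_log = pd.read_csv("view_log.csv")
      item_data = pd.read_csv("item_data.csv")

      # Hypothetical feature: number of items each user viewed in the log,
      # attached to the impression-level training rows.
      views = view_log.merge(item_data, on="item_id", how="left")
      n_views = views.groupby("user_id").size().rename("n_views").reset_index()
      train = train.merge(n_views, on="user_id", how="left").fillna({"n_views": 0})
      print(train.head())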

    Inspiration

  • Predict the probability of a user clicking the ad shown to them on the partner websites over the next 7 days, on the basis of historical view log data, ad impression data, and user data.

    The evaluation metric could be the area under the ROC curve (AUC) between the predicted probability and the observed target.
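
    For instance, with scikit-learn the metric can be computed as follows (the values below are made up for illustration):

      from sklearn.metrics import roc_auc_score

      # Observed clicks (0/1) and predicted click probabilities for a few impressions.
      y_true = [0, 0, 1, 0, 1, 1, 0, 1]
      y_pred = [0.10, 0.35, 0.70, 0.20, 0.80, 0.55, 0.40, 0.90]
      print(f"ROC AUC: {roc_auc_score(y_true, y_pred):.3f}")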

  19. EpiLexO - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Jan 10, 2025
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Dataset updated
    Jan 10, 2025
    Description

    EpiLexO is a user-friendly web application for creating and editing an integrated system of language resources for ancient fragmentary languages centered on the lexicon, in compliance with current digital humanities and Linked Open Data principles. EpiLexO allows for the editing of lexica with all relevant cross-references: linking them to their testimonies, as well as to bibliographic information and other (external) resources and common vocabularies. The front-end application rests on a Service-Oriented Architecture with two main back-end components, the LexO-server (\handle) and the CASH-server (1github), which manage lexica and textual documents, respectively, via RESTful web-service APIs, plus additional services for the management of other aspects such as access and authentication, XML rendering, etc. All code is available at https://github.com/DigItAnt/. The application has been developed in the context of a project on the languages of fragmentary attestation of ancient Italy, but it can be applied to other similar contexts.

  20. Data from: Twitter Big Data as A Resource For Exoskeleton Research: A...

    • search.dataone.org
    Updated Nov 8, 2023
    Cite
    Thakur, Nirmalya (2023). Twitter Big Data as A Resource For Exoskeleton Research: A Large-Scale Dataset of about 140,000 Tweets and 100 Research Questions [Dataset]. http://doi.org/10.7910/DVN/VPPTRF
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Thakur, Nirmalya
    Description

    Please cite the following paper when using this dataset: N. Thakur, "Twitter Big Data as a Resource for Exoskeleton Research: A Large-Scale Dataset of about 140,000 Tweets and 100 Research Questions," Preprints, 2022, DOI: 10.20944/preprints202206.0383.v1

    Abstract: Exoskeleton technology has been advancing rapidly in the recent past due to its multitude of applications and use cases in assisted living, the military, healthcare, firefighting, and industry. With the projected increase in the diverse uses of exoskeletons in the next few years in these application domains and beyond, it is crucial to study, interpret, and analyze user perspectives, public opinion, reviews, and feedback related to exoskeletons, for which a dataset is necessary. The Internet of Everything era of today's living, characterized by people spending more time on the Internet than ever before, holds the potential for developing such a dataset by mining relevant web behavior data from social media communications, which have increased exponentially in the last few years. Twitter, one such social media platform, is highly popular amongst all age groups; its users communicate on diverse topics including, but not limited to, news, current events, politics, emerging technologies, family, relationships, and career opportunities via tweets, while sharing their views, opinions, perspectives, and feedback. Therefore, this work presents a dataset of about 140,000 tweets related to exoskeletons that were mined over a period of 5 years, from May 21, 2017, to May 21, 2022. The tweets contain diverse forms of communications and conversations which communicate user interests, user perspectives, public opinion, reviews, feedback, suggestions, etc., related to exoskeletons.

    Instructions: The dataset contains only tweet identifiers (Tweet IDs), in line with Twitter's terms and conditions, which allow the redistribution of Twitter data for research purposes only. The IDs need to be hydrated before use. Hydration of a Tweet ID is the process of retrieving a tweet's complete information (such as the text of the tweet, username, user ID, date and time, etc.) using its ID. The Hydrator application (download: https://github.com/DocNow/hydrator/releases; step-by-step tutorial: https://towardsdatascience.com/learn-how-to-easily-hydrate-tweets-a0f393ed340e#:~:text=Hydrating%20Tweets) or any similar application may be used for hydrating this dataset.

    Data Description: This dataset consists of 7 .txt files. The number of Tweet IDs and the date range of the associated tweets in each file are as follows:
    • Exoskeleton_TweetIDs_Set1.txt: 22,945 Tweet IDs, July 20, 2021 - May 21, 2022
    • Exoskeleton_TweetIDs_Set2.txt: 19,416 Tweet IDs, Dec 1, 2020 - July 19, 2021
    • Exoskeleton_TweetIDs_Set3.txt: 16,673 Tweet IDs, April 29, 2020 - Nov 30, 2020
    • Exoskeleton_TweetIDs_Set4.txt: 16,208 Tweet IDs, Oct 5, 2019 - Apr 28, 2020
    • Exoskeleton_TweetIDs_Set5.txt: 17,983 Tweet IDs, Feb 13, 2019 - Oct 4, 2019
    • Exoskeleton_TweetIDs_Set6.txt: 34,009 Tweet IDs, Nov 9, 2017 - Feb 12, 2019
    • Exoskeleton_TweetIDs_Set7.txt: 11,351 Tweet IDs, May 21, 2017 - Nov 8, 2017

    Here, the last date for May is May 21, as it was the most recent date at the time of data collection. The dataset will be updated soon to incorporate more recent tweets.
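
    As a minimal sketch of the preparation step before hydration (assuming the seven .txt files sit in the working directory and contain one Tweet ID per line), the following Python snippet collects and de-duplicates the IDs and splits them into batches of 100, a common per-request limit for tweet-lookup endpoints. The hydration itself should then be done with the Hydrator application or a similar tool, as described above.

      from pathlib import Path

      # Read all seven Tweet ID files and de-duplicate the IDs.
      ids = set()
      for path in sorted(Path(".").glob("Exoskeleton_TweetIDs_Set*.txt")):
          ids.update(line.strip() for line in path.read_text().splitlines() if line.strip())

      # Split into batches of 100 IDs for a hydration tool or API client.
      id_list = sorted(ids)
      batches = [id_list[i:i + 100] for i in range(0, len(id_list), 100)]
      print(f"{len(id_list)} unique Tweet IDs in {len(batches)} batches")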
