34 datasets found
  1. COVID19_datasets

    • kaggle.com
    zip
    Updated Apr 2, 2022
    Cite
    Suradech Kongkiatpaiboon (2022). COVID19_datasets [Dataset]. https://www.kaggle.com/datasets/suradechk/covid19-datasets/discussion
    Available download formats: zip (136322570 bytes)
    Dataset updated
    Apr 2, 2022
    Authors
    Suradech Kongkiatpaiboon
    Description

    Collected COVID-19 datasets from various sources as part of the DAAN-888 course, Penn State, Spring 2022. Collaborators: Mohamed Abdelgayed, Heather Beckwith, Mayank Sharma, Suradech Kongkiatpaiboon, and Alex Stroud

    **1 - COVID-19 Data in the United States** Source: The data is collected from multiple official public health sources by NY Times journalists and compiled into a single file. Description: Daily count of new COVID-19 cases and deaths for each state. Data is updated daily and runs from 1/21/2020 to 2/4/2022. URL: https://github.com/nytimes/covid-19-data/blob/master/us-states.csv Data size: 38,814 rows and 5 columns.
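    As a quick start, a minimal pandas sketch for loading this file (illustrative only): the GitHub blob URL above is swapped for its raw.githubusercontent.com counterpart, and the cases/deaths columns are assumed to be cumulative totals, as in the NYT repository.

    ```python
    import pandas as pd

    # Raw-file counterpart of the GitHub blob link above (an assumption about
    # how the file is fetched, not part of the dataset description)
    url = ("https://raw.githubusercontent.com/nytimes/covid-19-data/"
           "master/us-states.csv")

    df = pd.read_csv(url, parse_dates=["date"])

    # Derive daily new cases per state from the cumulative counts
    df = df.sort_values(["state", "date"])
    df["new_cases"] = df.groupby("state")["cases"].diff()
    print(df.shape)
    ```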

    **2 - Mask-Wearing Survey Data** Source: The New York Times is releasing estimates of mask usage by county in the United States. Description: This data comes from a large number of interviews conducted online by the global data and survey firm Dynata, at the request of The New York Times. The firm asked a question about mask usage to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level. URL: https://github.com/nytimes/covid-19-data/blob/master/mask-use/mask-use-by-county.csv Data size: 3,142 rows and 6 columns

    **3a - Vaccine Data – Global** Source: This data comes from the US Centers for Disease Control and Prevention (CDC), Our World in Data (OWiD) and the World Health Organization (WHO). Description: Time series data of vaccine doses administered and the number of fully and partially vaccinated people by country. This data was last updated on February 3, 2022. URL: https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/global_data/time_series_covid19_vaccine_global.csv
    Data Size: 162,521 rows and 8 columns

    **3b - Vaccine Data – United States** Source: The data comprises individual states' public dashboards and data from the US Centers for Disease Control and Prevention (CDC). Description: Time series data of the total vaccine doses shipped and administered, by manufacturer and dose number (first or second), by state. This data was last updated on February 3, 2022. URL: https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/us_data/time_series/vaccine_data_us_timeline.csv
    Data Size: 141,503 rows and 13 columns

    **4 - Testing Data** Source: The data comprises individual states' public dashboards and data from the U.S. Department of Health & Human Services. Description: Time series data of total tests administered by county and state. This data was last updated on January 25, 2022. URL: https://github.com/govex/COVID-19/blob/master/data_tables/testing_data/county_time_series_covid19_US.csv
    Data size: 322,154 rows and 8 columns

    **5 – US State and Territorial Public Mask Mandates** Source: Data from state and territory executive orders, administrative orders, resolutions, and proclamations is gathered from government websites and cataloged and coded by one coder using Microsoft Excel, with quality checking provided by one or more other coders. Description: US State and Territorial Public Mask Mandates from April 10, 2020 through August 15, 2021, by county by day. URL: https://data.cdc.gov/Policy-Surveillance/U-S-State-and-Territorial-Public-Mask-Mandates-Fro/62d6-pm5i Data Size: 1,593,869 rows and 10 columns

    **6 – Case Counts & Transmission Level** Source: This open-source dataset contains seven data items that describe community transmission levels across all counties. This dataset provides the same numbers used to show transmission maps on the COVID Data Tracker and contains reported daily transmission levels at the county level. The dataset is updated every day to include the most current day's data. Defined calculation procedures are used to classify the transmission level as low, moderate, substantial, or high.
    Description: US State and County case counts and transmission level from 16-Aug-2021 to 03-Feb-2022 URL: https://data.cdc.gov/Public-Health-Surveillance/United-States-COVID-19-County-Level-of-Community-T/8396-v7yb Data Size: 550,702 rows and 7 columns

    **7 - World Cases & Vaccination Counts** Source: This is an open-source dataset collected and maintained by Our World in Data (OWID). OWID provides research and data to make progress against the world's largest problems.
    Description: This dataset includes vaccinations, tests & positivity, hospital & ICU, confirmed cases, confirmed deaths, reproduction rate, policy responses and other variables of interest. URL: https://github.com/owid/covid-19-data/tree/master/public/data Data Size: 157,000 rows and 67 columns

    **8 - COVID-19 Data in the European Union** Source: This is an open-source dataset collected and maintained by ECDC. It is an EU agency aimed at strengthening Europe's defenses against infectious diseases.
    Description: This dataset co...

  2. Experiment 4: perceived size of test and reference arrays with lines only...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated May 31, 2023
    Cite
    J Daniel McCarthy; Colin Kupitz; Gideon P Caplovitz (2023). Experiment 4: perceived size of test and reference arrays with lines only present either within the interior of the elements or connecting the elements [Dataset]. http://doi.org/10.6084/m9.figshare.157060.v2
    Available download formats: xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    J Daniel McCarthy; Colin Kupitz; Gideon P Caplovitz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In experiment 4a, the first five rows of data indicate the proportion of times participants perceived an unbound test array as being larger than a fixed unbound reference array. The second five rows indicate the proportion of times participants perceived an array with a line connecting the local elements as being larger than a fixed unbound reference array. Each proportion was calculated from 20 trials.

    In experiment 4b, the first five rows of data indicate the proportion of times participants perceived an unbound test array as being larger than a fixed unbound reference array. The second five rows indicate the proportion of times participants perceived an array with a line intersecting only the interiors of the elements as being larger than a fixed unbound reference array. Each proportion was calculated from 20 trials.

  3. Data Center Construction Market Analysis, Size, and Forecast 2025-2029:...

    • technavio.com
    pdf
    Updated Aug 9, 2025
    Cite
    Technavio (2025). Data Center Construction Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, Italy, and UK), APAC (China, Japan, and South Korea), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/data-center-construction-market-size-industry-analysis
    Available download formats: pdf
    Dataset updated
    Aug 9, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Canada, United States
    Description


    Data Center Construction Market Size 2025-2029

    The data center construction market size is projected to increase by USD 41 billion at a CAGR of 8.8% from 2024 to 2029. Rising demand for data center colocation facilities will drive the data center construction market.

    Major Market Trends & Insights

    Europe dominated the market and is expected to account for 32% of the market's growth during the forecast period.
    By Application - Enterprise segment was valued at USD 23.20 billion in 2023
    By Type - Electrical construction segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 70.71 billion
    Market Future Opportunities: USD 41.00 billion
    CAGR : 8.8%
    Europe: Largest market in 2023
    

    Market Summary

    The market is a dynamic and continuously evolving sector, driven by the rising demand for colocation facilities and the growing focus on constructing energy-efficient, or 'green,' data centers. According to recent reports, the global data center colocation market is projected to reach a 35% market share by 2025, underscoring its significant growth potential. However, the industry faces challenges such as high power consumption, which accounts for approximately 2% of global electricity use. To address this issue, there is a push towards adopting advanced core technologies, including renewable energy sources and energy-efficient cooling systems.
    Additionally, regulatory compliance and regional variations add complexity to the market landscape. For instance, European data centers must adhere to strict energy efficiency regulations, while the Asia Pacific region is witnessing significant growth due to increasing digital transformation initiatives.
    

    What will be the Size of the Data Center Construction Market during the forecast period?


    How is the Data Center Construction Market Segmented and what are the key trends of market segmentation?

    The data center construction industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Application
    
      Enterprise
      Cloud
      Colocation
      Hyperscale
    
    
    Type
    
      Electrical construction
      Mechanical construction
      General construction
    
    
    Geography
    
      North America
    
        US
        Canada
    
    
      Europe
    
        France
        Germany
        Italy
        UK
    
    
      APAC
    
        China
        Japan
        South Korea
    
    
      South America
    
        Brazil
    
    
      Rest of World (ROW)
    

    By Application Insights

    The enterprise segment is estimated to witness significant growth during the forecast period.

    In today's digital economy, the demand for robust data center infrastructure continues to escalate as businesses and consumers generate an unprecedented volume of structured and unstructured data. Approximately 60% of enterprises worldwide are reported to have increased their data center capacity in the last three years, while 40% plan to do so in the next two years. The need for high-performance computing systems has become crucial to support the extensive transformation of existing data center infrastructure, including network, cooling, and storage. Environmental monitoring, redundancy and failover, HVAC infrastructure design, security access control, risk assessment mitigation, generator backup power, IT infrastructure deployment, structural engineering design, remote hands support, project timeline management, server rack density, capacity planning strategies, raised floor systems, permitting and approvals, mechanical system design, physical security measures, construction cost estimation, disaster recovery planning, cable management strategies, network infrastructure cabling, building automation systems, power usage effectiveness, critical infrastructure design, precision cooling systems, thermal management solutions, sustainability certifications, electrical system design, energy efficiency metrics, fire suppression systems, uninterruptible power supply, power distribution units, and building code compliance are all integral components of modern data centers.


    The Enterprise segment was valued at USD 23.20 billion in 2019 and showed a gradual increase during the forecast period.

    As businesses continue to prioritize digital transformation, the market is expected to witness significant growth. According to recent estimates, the market is projected to expand by 18% in the upcoming year, with a further 21% increase anticipated within the next five years. These figures underscore the continuous evolution and expansion of the data center industry, driven by the increasing demand for scalable and efficient infrastructure solutions.


    Regional Analysis

    Europe is estimated to contribute 32% to the growth of the global market.

  4. Data from: Variability, plot size and border effect in lettuce trials in...

    • datasetcatalog.nlm.nih.gov
    • scielo.figshare.com
    Updated Mar 14, 2018
    Cite
    Lopes, Sidinei José; Lúcio, Alessandro Dal’Col; Filho, Alberto Cargnelutti; Olivoto, Tiago; Santos, Daniel (2018). Variability, plot size and border effect in lettuce trials in protected environment [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000614350
    Dataset updated
    Mar 14, 2018
    Authors
    Lopes, Sidinei José; Lúcio, Alessandro Dal’Col; Filho, Alberto Cargnelutti; Olivoto, Tiago; Santos, Daniel
    Description

    ABSTRACT The variability within cultivation rows may reduce the accuracy of experiments conducted in a complete randomized block design if the rows are considered as blocks; however, little is known about this variability in protected environments. Thus, our aim was to study the variability of the fresh mass of lettuce shoots growing in a protected environment, and to verify the border effect and the size of the experimental unit in minimizing the productive variability. Data from two uniformity trials carried out in a greenhouse in autumn and spring growing seasons were used. The statistical analyses considered cultivation rows parallel to the lateral openings of the greenhouse and columns perpendicular to these openings. Different scenarios were simulated by excluding rows and columns to generate several border arrangements and to use different sizes of the experimental unit. For each scenario, a homogeneity test of variances between the remaining rows and columns was performed, and the variance and coefficient of variation were calculated. There is variability among rows in trials with lettuce in plastic greenhouses, and the use of borders does not bring benefits in terms of reducing the coefficient of variation or minimizing the cases of heterogeneous variances among rows. In experiments with lettuce in a plastic greenhouse, the use of an experimental unit size greater than or equal to two plants provides homogeneity of variances among rows and columns and, therefore, allows the use of a completely randomized design.

  5. 2022 Bikeshare Data -Reduced File Size -All Months

    • kaggle.com
    zip
    Updated Mar 8, 2023
    Cite
    Kendall Marie (2023). 2022 Bikeshare Data -Reduced File Size -All Months [Dataset]. https://www.kaggle.com/datasets/kendallmarie/2022-bikeshare-data-all-months-combined
    Available download formats: zip (98884 bytes)
    Dataset updated
    Mar 8, 2023
    Authors
    Kendall Marie
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This is a condensed version of the raw data obtained through the Google Data Analytics Course, made available by Lyft and the City of Chicago under this license (https://ride.divvybikes.com/data-license-agreement).

    I originally did my study in another platform, and the original files were too large to upload to Posit Cloud in full. Each of the 12 monthly files contained anywhere from 100k to 800k rows. Therefore, I decided to reduce the number of rows drastically by performing grouping, summaries, and thoughtful omissions in Excel for each csv file. What I have uploaded here is the result of that process.

    Data is grouped by: month, day, rider_type, bike_type, and time_of_day. total_rides represents the count of rides in each grouping, which is also the number of original rows that were combined to make the new summarized row; avg_ride_length is the calculated average of all data in each grouping.

    Be sure that you use weighted averages if you want to calculate the mean of avg_ride_length for different subgroups as the values in this file are already averages of the summarized groups. You can include the total_rides value in your weighted average calculation to weigh properly.
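    For example, a minimal pandas sketch of that weighted-average calculation (the file name is a placeholder; column names are taken from the description above):

    ```python
    import pandas as pd

    df = pd.read_csv("2022_bikeshare_summary.csv")  # placeholder file name

    # Each row's avg_ride_length is already a group average, so weight it by
    # total_rides when averaging across groups
    overall = (df["avg_ride_length"] * df["total_rides"]).sum() / df["total_rides"].sum()

    # The same weighted mean, per rider_type
    by_rider = (
        df.assign(weighted=df["avg_ride_length"] * df["total_rides"])
          .groupby("rider_type")
          .agg(weighted_sum=("weighted", "sum"), rides=("total_rides", "sum"))
    )
    by_rider["mean_ride_length"] = by_rider["weighted_sum"] / by_rider["rides"]
    print(overall, by_rider["mean_ride_length"], sep="\n")
    ```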

    9 Columns:

    • date - year, month, and day in date format; includes all days in 2022
    • day_of_week - actual day of week as character. Set up a new sort order if needed.
    • rider_type - values are either 'casual', those who pay per ride, or 'member', for riders who have annual memberships.
    • bike_type - values are 'classic' (non-electric, traditional bikes) or 'electric' (e-bikes).
    • time_of_day - divides the day into 6 equal time frames, 4 hours each, starting at 12AM. Each individual ride was placed into one of these time frames using the time they STARTED their rides, even if the ride was long enough to end in a later time frame. This column was added to help summarize the original dataset.
    • total_rides - count of all individual rides in each grouping (row). This column was added to help summarize the original dataset.
    • avg_ride_length - the calculated average of all rides in each grouping (row). Look to total_rides to know how many original ride length values were included in this average. This column was added to help summarize the original dataset.
    • min_ride_length - minimum ride length of all rides in each grouping (row). This column was added to help summarize the original dataset.
    • max_ride_length - maximum ride length of all rides in each grouping (row). This column was added to help summarize the original dataset.

    Please note: the time_of_day column has inconsistent spacing. Use mutate(time_of_day = gsub(" ", "", time_of_day)) to remove all spaces.

    Revisions

    Below is the list of revisions I made in Excel before uploading the final csv files to the R environment:

    • Deleted station location columns and lat/long as much of this data was already missing.

    • Deleted ride id column since each observation was unique and I would not be joining with another table on this variable.

    • Deleted rows pertaining to "docked bikes" since there were no member entries for this type and I could not compare member vs casual rider data. I also received no information in the project details about what constitutes a "docked" bike.

    • Used ride start time and end time to calculate a new column called ride_length (by subtracting), and deleted all rows with 0 and 1 minute results, which were explained in the project outline as being related to staff tasks rather than users. An example would be taking a bike out of rotation for maintenance.

    • Placed start time into a range of times (time_of_day) in order to group more observations while maintaining general time data. time_of_day now represents a time frame when the bike ride BEGAN. I created six 4-hour time frames, beginning at 12AM (see the sketch after this list).

    • Added a Day of Week column, with Sunday = 1 and Saturday = 7, then changed from numbers to the actual day names.

    • Used pivot tables to group total_rides, avg_ride_length, min_ride_length, and max_ride_length by date, rider_type, bike_type, and time_of_day.

    • Combined into one csv file with all months, containing fewer than 9,000 rows (instead of several million).
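    As referenced above, a small Python sketch of that time_of_day bucketing (illustrative only; the exact label format in the published file may differ):

    ```python
    from datetime import datetime

    def hour_label(h: int) -> str:
        """Format an hour (0-24) in 12AM/4AM/.../8PM style."""
        return f"{(h - 1) % 12 + 1}{'AM' if h % 24 < 12 else 'PM'}"

    def time_of_day(start: datetime) -> str:
        """Bucket a ride's START time into one of six 4-hour frames from 12AM."""
        frame = start.hour // 4  # 0..5
        return f"{hour_label(frame * 4)}-{hour_label(frame * 4 + 4)}"

    print(time_of_day(datetime(2022, 7, 4, 14, 30)))  # "12PM-4PM"
    ```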

  6. Enterprise Data Platform Asset Total and Size

    • mydata.iowa.gov
    • data.iowa.gov
    csv, xlsx, xml
    Updated Nov 18, 2025
    Cite
    (2025). Enterprise Data Platform Asset Total and Size [Dataset]. https://mydata.iowa.gov/w/v4nh-jbad/default?cur=jLt_9ZSUFnx&from=JdiBSQYXv5R
    Available download formats: csv, xml, xlsx
    Dataset updated
    Nov 18, 2025
    Description

    This query returns the total number of assets published on the Enterprise Data Platform and the total number of rows, columns and values published in datasets.

  7. Training and testing XRD dataset for crystallite size and microstrain...

    • entrepot.recherche.data.gouv.fr
    image/x-silx-numpy +1
    Updated Nov 20, 2025
    Cite
    Alexandre BOULLE; Arthur SOUESME (2025). Training and testing XRD dataset for crystallite size and microstrain determination using deep neural networks [Dataset]. http://doi.org/10.57745/SVQART
    Available download formats: text/markdown (1068), image/x-silx-numpy (6059958836), image/x-silx-numpy (673347924)
    Dataset updated
    Nov 20, 2025
    Dataset provided by
    Recherche Data Gouv
    Authors
    Alexandre BOULLE; Arthur SOUESME
    License

    https://spdx.org/licenses/etalab-2.0.html

    Time period covered
    Oct 1, 2023 - Oct 1, 2026
    Dataset funded by
    Région Nouvelle Aquitaine
    Description

    Numpy tensors to train and test a convolutional neural network dedicated to determining crystallite size and/or microstrain from X-ray diffraction (XRD) data:

    • train_size.npz: training dataset with only crystallite size
    • test_size.npz: testing dataset with only crystallite size
    • train_size_strain.npz: training dataset with crystallite size and microstrain
    • test_size_strain.npz: testing dataset with crystallite size and microstrain

    Each dataset contains the XRD data and the labels ("ground truth") in the form of 2D tensors, with 10501 data points (columns) for the XRD data and 24 labels (columns) for the labels. Training data contain 71971 rows; testing data contain 7997 rows.

    Example python script to read the data:

    ```python
    import numpy as np

    train = np.load("train_size.npz")
    train_data, train_label = train["train_data"], train["train_label"]
    print(f"Train data shape: {train_data.shape}, Train labels shape: {train_label.shape}")
    ```

    Jupyter notebooks to train and test a neural network can be found here: https://github.com/aboulle/LPA-NN

  8. Company Data: Company Size, Address, Contact Details and Business Scope

    • datarade.ai
    .csv, .xls
    Updated Mar 22, 2023
    Cite
    C Insights Africa (2023). Company Data: Company Size, Address, Contact Details and Business Scope [Dataset]. https://datarade.ai/data-products/company-data-company-size-address-contact-details-and-busi-c-insights-africa
    Available download formats: .csv, .xls
    Dataset updated
    Mar 22, 2023
    Dataset authored and provided by
    C Insights Africa
    Area covered
    Nigeria
    Description

    C Insights Africa's company database contains details of more than 10,000 organizations in Nigeria, ranging from large corporates to mid-sized and small companies. Our database contains attributes such as company size, address(es), contact details, type of business and related companies (where applicable). Marketing and sales executives can enrich their pipeline with our database, while business development teams or C-suite executives interested in finding new partners/frontiers are sure to find this database invaluable.

  9. Data from: Trade-offs between growth rate, tree size and lifespan of...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated May 26, 2016
    Cite
    Christof Bigler (2016). Trade-offs between growth rate, tree size and lifespan of mountain pine (Pinus montana) in the Swiss National Park [Dataset]. http://doi.org/10.5061/dryad.d2680
    Available download formats: zip
    Dataset updated
    May 26, 2016
    Dataset provided by
    Dryad
    Authors
    Christof Bigler
    Time period covered
    Jul 6, 2015
    Area covered
    Swiss National Park, canton of Grisons, Switzerland
    Description

    A within-species trade-off between growth rates and lifespan has been observed across different taxa of trees; however, there is some uncertainty whether this trade-off also applies to shade-intolerant tree species. The main objective of this study was to investigate the relationships between radial growth, tree size and lifespan of shade-intolerant mountain pines. For 200 dead standing mountain pines (Pinus montana) located along gradients of aspect, slope steepness and elevation in the Swiss National Park, radial annual growth rates and lifespan were reconstructed. While early growth (i.e. mean tree-ring width over the first 50 years) correlated positively with diameter at the time of tree death, a negative correlation resulted with lifespan, i.e. rapidly growing mountain pines face a trade-off of reaching a large diameter at the cost of early tree death. Slowly growing mountain pines may reach a large diameter and a long lifespan, but risk dying young at a small size. Early gro...

  10. Bioinformatics Services Market Size and Forecast (2025 - 2035), Global and...

    • wemarketresearch.com
    csv, pdf
    Updated May 20, 2025
    Cite
    We Market Research (2025). Bioinformatics Services Market Size and Forecast (2025 - 2035), Global and Regional Growth, Trend, Share and Industry Analysis Report Coverage: By Service Type (Data Analysis & Interpretation, Sequencing Services, Data Management Services, Software & Tool Development, Consulting Services, Outsourcing Services, Others); Application (Genomics, Proteomics, Transcriptomics, Pharmacogenomics, Clinical Diagnostics, Personalized Medicine and Others) End-user (Pharmaceutical & Biotechnology Companies, Academic & Research Institutes, Hospitals & Healthcare Institutions, Contract Research Organizations (CROs) and Others) and Geography. [Dataset]. https://wemarketresearch.com/reports/bioinformatics-services-market/1735
    Available download formats: pdf, csv
    Dataset updated
    May 20, 2025
    Dataset authored and provided by
    We Market Research
    License

    https://wemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2035
    Area covered
    Worldwide
    Description

    The Bioinformatics Services Market will grow from $4.3B in 2025 to $15.7B by 2035, at a CAGR of 12.6%, driven by rising demand for biologics and biosimilars.

    Market Size in 2025: USD 4.3 Billion
    Market Forecast in 2035: USD 15.7 Billion
    CAGR % 2025-2035: 12.6%
    Base Year: 2024
    Historic Data: 2020-2024
    Forecast Period: 2025-2035
    Report USP: Production, Consumption, company share, company heatmap, company production capacity, growth factors and more
    Segments Covered: By Service Type, By Application, By End-user
    Regional Scope: North America, Europe, APAC, Latin America, Middle East and Africa
    Country Scope: U.S., Canada, U.K., Germany, France, Italy, Spain, Benelux, Nordic Countries, Russia, China, India, Japan, South Korea, Australia, Indonesia, Thailand, Mexico, Brazil, Argentina, Saudi Arabia, UAE, Egypt, South Africa, Nigeria
  11. Visualizing Chicago Crime Data

    • kaggle.com
    zip
    Updated Jul 1, 2022
    Cite
    Elijah Toumoua (2022). Visualizing Chicago Crime Data [Dataset]. https://www.kaggle.com/datasets/elijahtoumoua/chicago-analysis-of-crime-data-dashboard
    Available download formats: zip (94861784 bytes)
    Dataset updated
    Jul 1, 2022
    Authors
    Elijah Toumoua
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Chicago
    Description

    Prelude

    This dataset is a cleaned version of the Chicago Crime Dataset, which can be found here. All rights for the dataset go to the original owners. The purpose of this dataset is to display my skills in visualizations and creating dashboards. To be specific, I will attempt to create a dashboard that will allow users to see metrics for a specific crime within a given year using filters and metrics. Due to this, there will not be much of a focus on the analysis of the data, but there will be portions discussing the validity of the dataset, the steps I took to clean the data, and how I organized it. The cleaned datasets can be found below, the Query (which utilized BigQuery) can be found here and the Tableau dashboard can be found here.

    About the Dataset

    Important Facts

    The dataset comes directly from the City of Chicago's website under the page "City Data Catalog." The data is gathered directly from the Chicago Police's CLEAR (Citizen Law Enforcement Analysis and Reporting) system and is updated daily to present the information accurately. This means that a crime on a specific date may be changed later to better reflect the case. The dataset covers crimes from 2001 up to seven days prior to today's date.

    Reliability

    Using the ROCCC method, we can see that:

    • The data has high reliability: The data covers the entirety of Chicago over a little more than two decades. It covers all the wards within Chicago and even gives the street names. While we may not have an idea of how big the sample size is, I do believe that the dataset has high reliability since it geographically covers the entirety of Chicago.
    • The data has high originality: The dataset was gained directly from the Chicago Police Dept. using their database, so we can say this dataset is original.
    • The data is somewhat comprehensive: While we do have important information such as the types of crimes committed and their geographic location, I do not think this gives us proper insights as to why these crimes take place. We can pinpoint the location of the crime, but we are limited by the information we have. How hot was the day of the crime? Did the crime take place in a neighborhood with low income? I believe that these key factors prevent us from getting proper insights as to why these crimes take place, so I would say that this dataset is subpar in how comprehensive it is.
    • The data is current: The dataset is updated frequently to display crimes that took place seven days prior to today's date and may even update past crimes as more information comes to light. Due to the frequent updates, I do believe the data is current.
    • The data is cited: As mentioned prior, the data is collected directly from the police's CLEAR system, so we can say that the data is cited.

    Processing the Data

    Cleaning the Dataset

    The purpose of this step is to clean the dataset such that there are no outliers in the dashboard. To do this, we are going to do the following: * Check for any null values and determine whether we should remove them. * Update any values where there may be typos. * Check for outliers and determine if we should remove them.

    The following steps will be explained in the code segments below. (I used BigQuery for this, so the code follows BigQuery's syntax.)

    ```sql
    -- Examining the dataset
    -- There are over 7.5 million rows of data
    -- Putting a limit so it does not take a long time to run
    SELECT *
    FROM `portfolioproject-350601.ChicagoCrime.Crime`
    LIMIT 1000;

    -- Seeing which points are null
    -- There are 85,000 null points, so we can exclude them; that's not a
    -- significant amount since it is only ~1.3% of the dataset
    -- Most of the null points are in the lat and long, which we will need later
    -- Because we don't have the full address, we can't estimate the lat and
    -- long in SQL, so we will have to delete the rows with null data
    SELECT *
    FROM `portfolioproject-350601.ChicagoCrime.Crime`
    WHERE unique_key IS NULL OR case_number IS NULL OR date IS NULL
       OR primary_type IS NULL OR location_description IS NULL OR arrest IS NULL
       OR longitude IS NULL OR latitude IS NULL;

    -- Deleting all null rows
    DELETE FROM `portfolioproject-350601.ChicagoCrime.Crime`
    WHERE unique_key IS NULL OR case_number IS NULL OR date IS NULL
       OR primary_type IS NULL OR location_description IS NULL OR arrest IS NULL
       OR longitude IS NULL OR latitude IS NULL;

    -- Checking for any duplicates in the unique keys
    -- None to be found
    SELECT unique_key, COUNT(unique_key) FROM `portfolioproject-350601.ChicagoCrime....
    ```

  12. Study of Data Orchestration Tool Market by Cloud based and...

    • futuremarketinsights.com
    html, pdf
    Updated Apr 12, 2024
    Cite
    Sudip Saha (2024). Study of Data Orchestration Tool Market by Cloud based and Telecommunications from 2024 to 2034 [Dataset]. https://www.futuremarketinsights.com/reports/data-orchestration-tool-market
    Available download formats: html, pdf
    Dataset updated
    Apr 12, 2024
    Authors
    Sudip Saha
    License

    https://www.futuremarketinsights.com/privacy-policy

    Time period covered
    2024 - 2034
    Area covered
    Worldwide
    Description

    Organizations have been overwhelmed with vast amounts of data generated from various sources, such as enterprise applications, IoT devices, social media platforms, as well as cloud services. Effectively harnessing this data to drive business insights and innovation has become a critical imperative for organizations seeking to maintain competitiveness and relevance in their respective industries.

    Data Orchestration Tool Market Estimated Size in 2024: US$ 1.3 billion
    Projected Market Value in 2034: US$ 4.3 billion
    Value-based CAGR from 2024 to 2034: 12.1%

    Country-wise Insights

    CAGR through 2034 by country:
    • The United States: 8.1%
    • Germany: 5.3%
    • China: 12.6%
    • Japan: 4.2%
    • Australia and New Zealand: 7.3%

    Category-wise Insights

    Segment shares in 2024:
    • Cloud Based: 62.3%
    • Telecommunications: 24.2%

    Report Scope

    Estimated Market Size in 2024: US$ 1.3 billion
    Projected Market Valuation in 2034: US$ 4.3 billion
    Value-based CAGR 2024 to 2034: 12.1%
    Forecast Period: 2024 to 2034
    Historical Data Available for: 2019 to 2023
    Market Analysis: Value in US$ Billion
    Key Regions Covered
    • North America
    • Latin America
    • Western Europe
    • Eastern Europe
    • South Asia and Pacific
    • East Asia
    • The Middle East & Africa
    Key Market Segments Covered
    • Deployment Model
    • Industry
    • Region
    Key Countries Profiled
    • The United States
    • Canada
    • Brazil
    • Mexico
    • Germany
    • France
    • Spain
    • Italy
    • Russia
    • Poland
    • Czech Republic
    • Romania
    • India
    • Bangladesh
    • Australia
    • New Zealand
    • China
    • Japan
    • South Korea
    • GCC countries
    • South Africa
    • Israel
    Key Companies Profiled
    • AWS
    • Google
    • SAP
    • Microsoft
    • Prefect
    • Dagster
    • Luigi
    • Metaflow

  13. A Litopenaeus vannamei shrimp dataset with images and corresponding...

    • data.mendeley.com
    Updated Jul 1, 2024
    Cite
    Fernando Joaquín Ramírez-Coronel (2024). A Litopenaeus vannamei shrimp dataset with images and corresponding weight-size measurements for the development of artificial intelligence-based biomass estimation and organism detection algorithms [Dataset]. http://doi.org/10.17632/h8tcn6ykky.2
    Dataset updated
    Jul 1, 2024
    Authors
    Fernando Joaquín Ramírez-Coronel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was compiled with the ultimate goal of developing non-invasive computer vision algorithms for assessing shrimp biometrics and biomass estimation. The main folder, labeled "DATASET," contains five sub-folders—DB1, DB2, DB3, DB4, and DB5—each filled with images of shrimps. Additionally, each sub-folder is accompanied by an Excel file that includes manually measured data for the shrimps pictured. The files are named respectively: DB1_INDUSTRIAL_FARM_1, DB2_INDUSTRIAL_FARM_2_C1, DB3_INDUSTRIAL_FARM_2_C2, DB4_ACADEMIC_POND_S1, and DB5_ACADEMIC_POND_S2.

    Here’s a detailed description of the contents of each sub-folder and its corresponding Excel file:

    1) DB1 includes 490 PNG images of 22 shrimps taken from one pond at an industrial farm. The associated Excel file, DB1_INDUSTRIAL_FARM_1, contains columns for: SAMPLE: Reflecting the number of individual shrimps (22 entries or rows). LENGTH (cm): Measuring from the rostrum (near the eyes) to the start of the tail. WEIGHT (g): Recorded using a scale. COMPLETE SHRIMP IMAGES: Indicates if at least one full-body image is available (1) or not (0).

    2) DB2 consists of 2002 PNG images of 58 shrimps. The Excel file, DB2_INDUSTRIAL_FARM_2_C1, includes: SAMPLE: Number of shrimps (58 entries or rows). CEPHALOTHORAX (cm): Total length of the cephalothorax. LENGTH (cm) and WEIGHT (g): Similar measurements as DB1. COMPLETE SHRIMP IMAGES: Presence (1) or absence (0) of full-body images.

    3) DB3 contains 1719 PNG images of 50 shrimps, with its Excel file, DB3_INDUSTRIAL_FARM_2_C2, documenting: SAMPLE: Number of shrimps (50 entries or rows). Measurements and categories identical to DB2.

    4) DB4 encompasses 635 PNG images of 20 shrimps, detailed in the Excel file DB4_ACADEMIC_POND_S1. This includes: SAMPLE: Number of shrimps (20 entries or rows). CEPHALOTHORAX (cm), LENGTH (cm), WEIGHT (g), and COMPLETE SHRIMP IMAGES: Documented as in other datasets.

    5) DB5 includes 661 PNG images of 20 shrimps, with DB5_ACADEMIC_POND_S2 as the corresponding Excel file. The file mirrors the structure and measurements of DB4.

    The images in each folder are named "sm_n", where m is the shrimp sample number and n is the number of the picture of that shrimp. This carefully structured dataset provides comprehensive biometric data on shrimps, facilitating the development of algorithms aimed at non-invasive measurement techniques. This will likely be pivotal in enhancing the precision of biomass estimation in aquaculture farming, utilizing advanced statistical morphology analysis and machine learning techniques.
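    A minimal Python sketch of indexing images by sample under that naming convention (the folder path is a placeholder, and the underscore separator is taken from the description above):

    ```python
    import re
    from pathlib import Path

    # Match names like "s3_12.png": shrimp sample 3, picture 12
    PATTERN = re.compile(r"s(\d+)_(\d+)\.png$", re.IGNORECASE)

    def index_images(folder: str) -> dict[int, list[Path]]:
        """Map shrimp sample number -> list of image paths in one DB sub-folder."""
        index: dict[int, list[Path]] = {}
        for path in Path(folder).glob("*.png"):
            match = PATTERN.search(path.name)
            if match:
                index.setdefault(int(match.group(1)), []).append(path)
        return index

    images = index_images("DATASET/DB1")  # placeholder local path
    ```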

    CHANGES FROM VERSION 1:

    The cephalothorax metric is the length rather than the width. That was an error in the first version. The name in the columns also had a typo, which has been corrected (from CEPHALOTORAX to CEPHALOTHORAX).

  14. Data Center Cooling Market Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Mar 18, 2025
    Cite
    Market Report Analytics (2025). Data Center Cooling Market Report [Dataset]. https://www.marketreportanalytics.com/reports/data-center-cooling-market-10650
    Available download formats: doc, ppt, pdf
    Dataset updated
    Mar 18, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Data Center Cooling market is experiencing robust growth, projected to reach $1452.12 million in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 6.78% from 2025 to 2033. This expansion is fueled by several key factors. The increasing density of data centers, driven by the exponential growth of data generated globally, necessitates advanced cooling solutions to prevent overheating and ensure optimal performance. Furthermore, rising energy costs and growing concerns about environmental sustainability are pushing the adoption of energy-efficient cooling technologies like liquid cooling and adiabatic cooling systems. The market is segmented by cooling type, with room-cooling, rack-cooling, and row-cooling solutions catering to diverse data center needs and sizes. Leading companies are aggressively pursuing innovative strategies, including mergers and acquisitions, strategic partnerships, and research and development investments, to strengthen their market positions and capitalize on this burgeoning market. Geographic expansion, particularly in rapidly developing economies in Asia-Pacific and other regions with increasing data center deployments, presents significant growth opportunities. However, challenges such as high initial investment costs associated with advanced cooling systems and the need for skilled professionals to manage and maintain these complex technologies may act as restraints. The competitive landscape is marked by the presence of both established players and emerging technology companies. Major players like 3M, Daikin, Schneider Electric, and Vertiv are leveraging their technological expertise and extensive distribution networks to maintain their dominance. Meanwhile, smaller, innovative companies are introducing niche solutions and challenging the incumbents. The market's future growth trajectory hinges on technological advancements, the evolution of data center designs, and the ongoing demand for environmentally sustainable cooling solutions. The consistent need for reliable, energy-efficient, and scalable cooling infrastructure will be the primary driver of this market's continued expansion throughout the forecast period.

  15. Mobile Payment Data Protection Market Analysis by Contactless and Remote...

    • futuremarketinsights.com
    html, pdf
    Updated Jun 14, 2024
    Cite
    Sudip Saha (2024). Mobile Payment Data Protection Market Analysis by Contactless and Remote Tokenization through 2034 [Dataset]. https://www.futuremarketinsights.com/reports/mobile-payment-data-protection-market
    Available download formats: html, pdf
    Dataset updated
    Jun 14, 2024
    Authors
    Sudip Saha
    License

    https://www.futuremarketinsights.com/privacy-policy

    Time period covered
    2024 - 2034
    Area covered
    Worldwide
    Description

    The global mobile payment data protection market size is anticipated to grow notably, to USD 730,843.3 million in 2024 from USD 659,096.1 million in 2023. The industry is expected to sustain this expansion, reaching USD 2,366,892.7 million by 2034 at a CAGR of 12.5% through 2034.

    Estimated Global Mobile Payment Data Protection Market Size, 2024: USD 730,843.3 million
    Projected Global Mobile Payment Data Protection Market Size, 2034: USD 2,366,892.7 million
    Value-based CAGR (2024 to 2034): 12.5%

    Semi Annual Market Update

    H1 (2023 to 2033): 9.8%
    H2 (2023 to 2033): 10.2%
    H1 (2024 to 2034): 10%
    H2 (2024 to 2034): 10.2%

    Country-wise Insights

    CAGR from 2024 to 2034 by country:
    • Australia: 16%
    • China: 13%
    • United States: 9.3%
    • Germany: 7.9%
    • Japan: 7.2%

    Category-wise Insights

    • Contactless Tokenisation (Product): 56.2% value share (2024)
    • Banking and Financial Service (End User): 33.7% value share (2024)
  16. Synthetic datasets reflecting the shRNA-seq knockdown ENCODE data for HepG2...

    • zenodo.org
    txt
    Updated Jun 28, 2024
    Cite
    Erik Sonnhammer lab; Claudia Kutter lab (2024). Synthetic datasets reflecting the shRNA-seq knockdown ENCODE data for HepG2 and K562 with coresponding GRN [Dataset]. http://doi.org/10.5281/zenodo.12165429
    Available download formats: txt
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Erik Sonnhammer lab; Claudia Kutter lab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 19, 2024
    Description

    Synthetic data correspond to the ENCODE data for cell lines HepG2 (https://www.encodeproject.org/biosamples/ENCBS282XVK/) and K562 (https://www.encodeproject.org/biosamples/ENCBS023XVB/). The data and networks were generated using GeneSPIDER (publicly available at https://bitbucket.org/sonnhammergrni/genespider/).

    Table 1. Description of the files

    data_HepG2like_SNR_L=0.0054699_diff=1.6188e-05.txt: Synthetic gene expression knockdown (shRNA-seq) data imitating the ENCODE data for the HepG2 cell line. Data size: 232 RBPs vs 464 experiments (2 replicates). SNR_L is the signal-to-noise ratio. The difference (diff) value gives the difference between the replicate correlation coefficients of the real and synthetic ENCODE data. Columns represent experiments, rows represent genes.
    data_K562like_SNR_L=0.0028692_diff=0.00017339.txt: Synthetic gene expression knockdown (shRNA-seq) data imitating the ENCODE data for the K562 cell line. Data size: 232 RBPs vs 464 experiments (2 replicates). SNR_L is the signal-to-noise ratio. The difference (diff) value gives the difference between the replicate correlation coefficients of the real and synthetic ENCODE data. Columns represent experiments, rows represent genes.
    network_HEPG2like_sparsity4.txt: Synthetic scale-free gene regulatory network compatible with data_HepG2like_SNR_L=0.0054699_diff=1.6188e-05.txt. Sparsity (average node degree) is 4, including self-loops. Direction should be read from columns to rows.
    network_K562like_sparsity4.txt: Synthetic scale-free gene regulatory network compatible with data_K562like_SNR_L=0.0028692_diff=0.00017339.txt. Sparsity (average node degree) is 4, including self-loops. Direction should be read from columns to rows.
    perturbations_HepG2&K562_2replicates.txt: Perturbation matrix with information about the knocked-down RBPs. Data size: 232 RBPs vs 464 experiments (2 replicates).
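    A minimal numpy sketch for loading these matrices, assuming plain whitespace-delimited numeric text with no header row (the record does not state the exact layout):

    ```python
    import numpy as np

    # Synthetic expression matrix: expect 232 gene rows x 464 experiment columns
    expr = np.loadtxt("data_HepG2like_SNR_L=0.0054699_diff=1.6188e-05.txt")
    print(expr.shape)

    # Network matrix: direction reads from columns to rows, i.e. entry (i, j)
    # is the regulatory effect of gene j on gene i
    grn = np.loadtxt("network_HEPG2like_sparsity4.txt")
    ```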

    Created by Garbulowski et al. (2024) as a part of the work entitled "Comprehensive analysis of the RBP regulome reveals functional modules and drug candidates in liver cancer"

  17. House Price Regression Dataset

    • kaggle.com
    zip
    Updated Sep 6, 2024
    Cite
    Prokshitha Polemoni (2024). House Price Regression Dataset [Dataset]. https://www.kaggle.com/datasets/prokshitha/home-value-insights
    Available download formats: zip (27045 bytes)
    Dataset updated
    Sep 6, 2024
    Authors
    Prokshitha Polemoni
    Description

    Home Value Insights: A Beginner's Regression Dataset

    This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.

    Features:

    1. Square_Footage: The size of the house in square feet. Larger homes typically have higher prices.
    2. Num_Bedrooms: The number of bedrooms in the house. More bedrooms generally increase the value of a home.
    3. Num_Bathrooms: The number of bathrooms in the house. Houses with more bathrooms are typically priced higher.
    4. Year_Built: The year the house was built. Older houses may be priced lower due to wear and tear.
    5. Lot_Size: The size of the lot the house is built on, measured in acres. Larger lots tend to add value to a property.
    6. Garage_Size: The number of cars that can fit in the garage. Houses with larger garages are usually more expensive.
    7. Neighborhood_Quality: A rating of the neighborhood’s quality on a scale of 1-10, where 10 indicates a high-quality neighborhood. Better neighborhoods usually command higher prices.
    8. House_Price (Target Variable): The price of the house, which is the dependent variable you aim to predict.

    Potential Uses:

    1. Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.

    2. Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.

    3. Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.

    4. Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.
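    To make uses (1) and (4) concrete, a minimal scikit-learn sketch (the CSV file name is a placeholder; column names follow the feature list above):

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("home_value_insights.csv")  # placeholder file name

    X = df.drop(columns=["House_Price"])
    y = df["House_Price"]

    # 5-fold cross-validated R^2 for a baseline linear regression
    scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    print(f"R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
    ```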

    Versatility:

    • The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.

    • It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.

    • This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.

  18. Mineral spectral refractive index and bulk optical property dataset for...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Dec 5, 2024
    Cite
    Zhang, Yuheng; Saito, Masanori; Yang, Ping; Schuster, Gregory; Trepte, Charles (2024). Mineral spectral refractive index and bulk optical property dataset for aerosol studies [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8144788
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Texas A&M University
    NASA Langley Research Center
    Authors
    Zhang, Yuheng; Saito, Masanori; Yang, Ping; Schuster, Gregory; Trepte, Charles
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Version 1.3, updated 11/15/2024.

    Added a file, 'NewRegionalSamples.xlsx', with mineral composition information for 27 regional dust samples, along with the refractive index data.

    All refractive index files here have 127 rows (wavelengths) and 27 columns (samples):

    'kall27_coarse.dat' is the imaginary part of the coarse mode.

    'kall27_fine.dat' is the imaginary part of the fine mode.

    'nall27_coarse.dat' is the real part of the coarse mode.

    'nall27_fine.dat' is the real part of the fine mode.

    Version 1.2, updated 04/23/2024. Major changes: changed all the data file names to a new format, "mix"+{property name}+{number}, and rearranged the number of mixing samples.

    Updated all the bulk optical property data. This version uses constant values of standard deviation in the lognormal size distribution settings for the coarse mode and the fine mode, respectively.

    The phase matrices are separated from the other bulk properties due to their large file sizes. The readme file is updated correspondingly. The information on scattering angles (498 angles in total) is uploaded as "TAMUdust2020_Angle.dat".

    Added supplemental file data in 'Supplemental.tar.gz'.

    Additional refractive indices are zipped in 'AdditionalRefInd.tar.gz'

    Version 1.1, updated 03/14/2024. Major changes:

    • Added mixed bulk properties for "0 (99% coarse + 1% fine)" and "11 (2.0 µm coarse + 0.4 µm fine)".
    • Added "reff.dat" in 'BulkProperties.tar.gz'. The data include four columns: fine mode fraction, bulk projected area, bulk volume, effective radius r_eff. The information is for mixed sample numbers 0 to 11, each corresponding to one row.
    • Added refractive indices for chlorite, mica, smectite, pyroxene, vermiculite and pyroxenes. These groups can be applied in some other models.

    Version 1.0, uploaded 01/02/2024.

    This database includes supplemental data and files for the following publication:

    Sensitivities of Spectral Optical Properties of Dust Aerosols to their Mineralogical and Microphysical Properties. Yuheng Zhang, M. Saito, P. Yang, G. L. Schuster, and C. R. Trepte, J. Geophys. Res. Atmos. 2024.

    The supplemental data include:

    1) 'GroupRefInd.tar.gz': Mineral (group) refractive index files. E.g., 1All_Illite.dat contains the complex refractive index file of the illite group. Format (from left to right columns): Wavelength (unit: µm), Real part (n), Imaginary part (k), standard deviation of n, standard deviation of k.

    The file 'fine_log.dat' includes the mean and standard deviation values of n and k for all the generated fine mode dust samples at 11,044 wavelengths from 0.2 to 50 micron.

    The file 'fine_log127.dat' only includes the values at 127 wavelengths from 0.2 to 50 micron (defined in 'swav.txt' and 'lwav.txt'), and is used for the bulk property computations.

    The files 'coarse_log.dat' and 'coarse_log127.dat' are for the coarse mode dust samples.

    2) 'CompositionFraction.xlsx': Mineral composition data sources/references and composition data (mean and standard deviation values of each group).

    'Vlog_coarse.dat': Randomly generated VOLUME FRACTION of 9 mineral groups for the coarse mode dust. Left to right: Illite, Kaolinite, Montmorillonite (Other clays), Quartz, Feldspar, Carbonate, Gypsum (Sulphate), Hematite, Goethite.

    'Vlog_fine.dat': For the fine mode dust.

    3) 'RefSources.xlsx': The data source references of mineral refractive indices. We didn't include the olivine, other silicates, soot and titanium-rich minerals in the paper, but the refractive indices are available for those who are interested. The chlorite, mica and vermiculite groups are mentioned in some studies, and we included the refractive indices for these minerals as well.

    4) 'DustSamples.tar.gz': Dust sample refractive index files. The files are enclosed in four folders: fine_sw/, fine_lw/, coarse_sw/ and coarse_lw/.

    fine: fine mode. coarse: coarse mode.

    'sw' means shortwave (< 4 µm, in total 76 wavelengths defined in 'swav.txt') while 'lw' means longwave (>= 4 µm, in total 51 wavelengths defined in 'lwav.txt').

    All files start with 'rdn', which means that they are computed based on randomly generated composition (data given in sheet 2 of 'CompositionFraction.xlsx').

    The four-digit number after 'rdn' is the index of each dust sample; there are 5,000 samples in total. The sample composition is the same for the same sample index within the same size mode (fine/coarse). Data file format (columns, left to right): real part, imaginary part.
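    Based on the naming scheme above, one way to load a given sample in Python is sketched below (the '.dat' extension and exact directory layout are assumptions):

    ```python
    import numpy as np

    def load_dust_sample(mode: str, band: str, index: int) -> np.ndarray:
        """Load one dust sample refractive index file.

        mode: 'fine' or 'coarse'; band: 'sw' or 'lw'; index: 1..5000.
        Assumes names like fine_sw/rdn0001.dat with columns: real, imaginary.
        """
        return np.loadtxt(f"{mode}_{band}/rdn{index:04d}.dat")

    nk = load_dust_sample("fine", "sw", 1)
    n, k = nk[:, 0], nk[:, 1]  # real and imaginary parts per wavelength
    ```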

    5) 'BulkProperties.tar.gz': Bulk property files (excluding phase matrices).

    'mixqx.dat' file format (columns, left to right): Extinction efficiency (Qext), Scattering efficiency (Qsca), Backscattering efficiency (Qbck), and Asymmetry coefficient (Qasy). To obtain the asymmetry factor, use Qasy/Qsca.
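    For example, a minimal Python sketch for computing the asymmetry factor from one of these files (the file name is illustrative; columns as listed above):

    ```python
    import numpy as np

    # Columns as described: Qext, Qsca, Qbck, Qasy.
    qext, qsca, qbck, qasy = np.loadtxt("BulkProperties/mixq100.dat").T

    g = qasy / qsca  # asymmetry factor, per the note above
    ```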

    'mixbkx.dat' files format (from left to right columns): P11(pi) P12(pi) P22(pi) P33(pi) P34(pi) P44(pi).

    'x' refers to the number at the end of the file name. It ranges from 100 to 112; each value represents one setting of coarse and fine mode effective radius and volume fraction (see details in "reff.dat").

    'reff.dat' contains the effective radius information of the mixture. It has 7 columns: File number "x", Fine mode volume fraction, Fine mode effective radius (µm), Coarse mode effective radius (µm), Bulk projected area (µm^2), Bulk volume (µm^3), Bulk effective radius (µm).

    6) 'PhaseMatrices.tar.gz': Phase matrix data.

    'mixphswx.dat' files contain phase matrix results at 532 nm (shortwave). Columns (left to right): P11, P12, P22, P33, P34, P44.

    'mixphlwx.dat' files contain phase matrix results at 10.5 µm (longwave).

    Each data file has 635,000 rows (127 wavelengths × 5,000 samples). Rows 1–127 are sample 1, rows 128–254 are sample 2, and so on. We suggest using the MATLAB function 'reshape(property, 127, 5000)' on each column when processing the data.
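    For users working in Python rather than MATLAB, a rough NumPy equivalent might look as follows (the file name is illustrative; note that order='F' reproduces MATLAB's column-major reshape):

    ```python
    import numpy as np

    # One column of a 635,000-row file: 127 wavelengths x 5,000 samples,
    # with rows 1-127 belonging to sample 1, rows 128-254 to sample 2, etc.
    prop = np.loadtxt("PhaseMatrices/mixphsw100.dat")[:, 0]

    prop_2d = prop.reshape(127, 5000, order="F")  # [wavelength, sample]
    sample_2 = prop_2d[:, 1]                      # rows 128-254 of the flat file
    ```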

    7) 'Supplemental.tar.gz'

    The archive contains the data files mentioned in the supplemental file of the paper, including the adjusted source data files for the nine mineral groups.

    The supplemental bulk property files are named based on the figure number.

    8) 'AdditionalRefInd.tar.gz'

    We also include additional refractive indices for chlorite, smectite, vermiculite, mica, dolomite, titanium-rich minerals, pyroxenes and soot. These data can be useful in other models.

    For more detailed information and datasets, please contact: Yuheng Zhang, yuheng98@tamu.edu or yuhengz98@qq.com.

  19. Data from: Size of plots for experiments with cactus pear cv. Gigante

    • scielo.figshare.com
    jpeg
    Updated May 31, 2023
    Bruno V. C. Guimarães; Sérgio L. R. Donato; Ignacio Aspiazú; Alcinei M. Azevedo; Abner J. de Carvalho (2023). Size of plots for experiments with cactus pear cv. Gigante [Dataset]. http://doi.org/10.6084/m9.figshare.8092634.v1
    Explore at:
    jpeg (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Bruno V. C. Guimarães; Sérgio L. R. Donato; Ignacio Aspiazú; Alcinei M. Azevedo; Abner J. de Carvalho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The definition of experimental plot size is an essential tool to ensure precision in the statistical analysis of experiments. The objective of this study was to estimate the plot size for cactus pear cv. Gigante using the Modified Maximum Curvature Method under the semi-arid conditions of Northeastern Brazil. The uniformity test was conducted at the Federal Institute of Bahia, Guanambi Campus, Bahia state, Brazil, during the agricultural period from 2009 to 2011. The spatial arrangement was composed of ten rows with 50 plants each; the evaluated area was formed by the eight central rows with 48 plants per row, totaling 384 plants and an area of 153.60 m². The following variables were evaluated: plant height; length, width, and thickness of cladode; number of cladodes; total area of cladodes; cladode area; and green mass yield in the third production cycle. In the evaluations, each plant was considered a basic experimental unit (BEU) with an area of 0.4 m², comprising 384 basic units (BU); adjacent units were combined to form 15 pre-established plot sizes with rectangular shapes arranged along rows. The characteristics total area of cladodes and green mass yield require larger plot sizes to be evaluated with greater experimental accuracy. For experimental evaluation of cactus pear cv. Gigante, the plot size should be eight plants in the direction of the crop row.

  20. Retail Market Basket Transactions Dataset

    • kaggle.com
    Updated Aug 25, 2025
    Wasiq Ali (2025). Retail Market Basket Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/retail-market-basket-transactions-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Wasiq Ali
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    The Market_Basket_Optimisation dataset is a classic transactional dataset often used in association rule mining and market basket analysis.
    It consists of multiple transactions where each transaction represents the collection of items purchased together by a customer in a single shopping trip.

    • File Name: Market_Basket_Optimisation.csv
    • Format: CSV (Comma-Separated Values)
    • Structure: Each row corresponds to one shopping basket. Each column in that row contains an item purchased in that basket (see the loading sketch after this list).
    • Nature of Data: Transactional, categorical, sparse.
    • Primary Use Case: Discovering frequent itemsets and association rules to understand shopping patterns, product affinities, and to build recommender systems.
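    As referenced above, a minimal Python sketch for loading the file into a list of transactions (assuming a plain, headerless CSV with trailing empty cells):

    ```python
    import csv

    # Read each row as one basket, dropping empty trailing cells.
    with open("Market_Basket_Optimisation.csv", newline="") as f:
        transactions = [[item for item in row if item.strip()]
                        for row in csv.reader(f)]

    print(len(transactions))   # expected: 7,501 baskets
    print(transactions[0])     # items in the first basket
    ```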

    Detailed Information

    📊 Dataset Composition

    • Transactions: 7,501 (each row = one basket).
    • Items (unique): Around 120 distinct products (e.g., bread, mineral water, chocolate).
    • Columns per row: Up to 20 item slots (not fixed; most baskets fill far fewer).
    • Data Type: Purely categorical (no numerical or continuous features).
    • Missing Values: Present as empty cells (since most baskets do not fill all 20 columns).
    • Duplicates: Some baskets may appear more than once — this is acceptable in transactional data as multiple customers can buy the same set of items.

    🛒 Nature of Transactions

    • Basket Definition: Each row captures items bought together during a single visit to the store.
    • Variability: Basket size varies from 1 to 20 items. Some customers buy only one product, while others purchase a full set of groceries.
    • Sparsity: Since there are ~120 unique items but only a handful appear in each basket, the dataset is sparse; most entries in the one-hot encoded representation are zeros (see the encoding sketch below).
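    A sketch of that one-hot encoding, reusing the transactions list from the loading sketch above. It uses mlxtend's TransactionEncoder, which is one common choice, not the only one:

    ```python
    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder

    # One boolean column per item, one row per basket; most entries are False.
    te = TransactionEncoder()
    onehot = te.fit(transactions).transform(transactions)
    basket_df = pd.DataFrame(onehot, columns=te.columns_)

    print(basket_df.shape)   # roughly (7501, ~120)
    ```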

    🔎 Examples of Data

    Example transaction rows (simplified):

    Item 1          Item 2          Item 3      Item 4    ...
    Bread           Butter          Jam
    Mineral water   Chocolate       Eggs        Milk
    Spaghetti       Tomato sauce    Parmesan

    Here, empty cells mean no item was purchased in that slot.

    📈 Applications of This Dataset

    This dataset is frequently used in data mining, analytics, and recommendation systems. Common applications include:

    1. Association Rule Mining (Apriori, FP-Growth; see the mining sketch after this list):

      • Discover rules like {Bread, Butter} ⇒ {Jam} with high support and confidence.
      • Identify cross-selling opportunities.
    2. Product Affinity Analysis:

      • Understand which items tend to be purchased together.
      • Helps with store layout decisions (placing related items near each other).
    3. Recommendation Engines:

      • Build systems that suggest "You may also like" products.
      • Example: If a customer buys pasta and tomato sauce, recommend cheese.
    4. Marketing Campaigns:

      • Bundle promotions and discounts on frequently co-purchased products.
      • Personalized offers based on buying history.
    5. Inventory Management:

      • Anticipate demand for certain product combinations.
      • Prevent stockouts of items that drive the purchase of others.
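    A minimal mining sketch for application 1, using mlxtend's apriori and association_rules on the one-hot frame built in the encoding sketch above (the thresholds are illustrative and should be tuned):

    ```python
    from mlxtend.frequent_patterns import apriori, association_rules

    frequent = apriori(basket_df, min_support=0.01, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.2)

    # Inspect the strongest product affinities, ranked by lift.
    cols = ["antecedents", "consequents", "support", "confidence", "lift"]
    print(rules.sort_values("lift", ascending=False)[cols].head(10))
    ```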

    📌 Key Insights Potentially Hidden in the Dataset

    • Popular Items: Some items (like mineral water, eggs, spaghetti) occur far more frequently than others.
    • Product Pairs: Frequent pairs and triplets (e.g., pasta + sauce + cheese) reflect natural meal-prep combinations.
    • Basket Size Distribution: Most customers buy fewer than 5 items, but a small fraction buy 10+ items, showing long-tail behavior (see the sketch after this list).
    • Seasonality (if extended with timestamps): Certain items might show peaks in demand during weekends or holidays (though timestamps are not included in this dataset).
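    The popularity and basket-size claims are easy to check directly from the transactions list built in the loading sketch:

    ```python
    from collections import Counter

    sizes = Counter(len(t) for t in transactions)
    print(sorted(sizes.items()))       # basket-size distribution, 1..20 items

    items = Counter(item for t in transactions for item in t)
    print(items.most_common(5))        # the most popular items
    ```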

    📂 Dataset Limitations

    1. No Customer Identifiers:

      • We cannot track repeated purchases by the same customer.
      • Analysis is limited to basket-level insights.
    2. No Timestamps:

      • No temporal analysis (trends over time, seasonality) is possible.
    3. No Quantities or Prices:

      • We only know whether an item was purchased, not how many units or its cost.
    4. Sparse & Noisy:

      • Many baskets are small (1–2 items), which may produce weak or trivial rules.

    🔮 Potential Extensions

    • Synthetic Timestamps: Assign simulated timestamps to study temporal buying patterns.
    • Add Customer IDs: Merging with external data would enable personalized recommendations.
    • Price Data: Adding cost allows for profit-driven association rules (not just frequency-based).
    • Deep Learning Models: Sequence models (RNNs, Transformers) could be applied if temporal ordering of items is introduced.

    ...

COVID19_datasets

COVID-19 datasets obtained from github.com/nytimes/covid-19-data/ and cdc sites

Description


**6 – Case Counts & Transmission Level** Source: This open-source dataset contains seven data items that describe community transmission levels across all counties. It provides the same numbers used for the transmission maps on the COVID Data Tracker and reports daily transmission levels at the county level. The dataset is updated every day to include the most current day's data; calculation procedures classify the transmission level as low, moderate, substantial, or high.
Description: US state and county case counts and transmission level from 16-Aug-2021 to 03-Feb-2022. URL: https://data.cdc.gov/Public-Health-Surveillance/United-States-COVID-19-County-Level-of-Community-T/8396-v7yb Data Size: 550,702 rows and 7 columns

**7 - World Cases & Vaccination Counts** Source: This is an open-source dataset collected and maintained by Our World in Data (OWID), which publishes research and data to make progress against the world's largest problems.
Description: This dataset includes vaccinations, tests and positivity, hospital and ICU occupancy, confirmed cases, confirmed deaths, reproduction rate, policy responses, and other variables of interest. URL: https://github.com/owid/covid-19-data/tree/master/public/data Data Size: 157,000 rows and 67 columns

**8 - COVID-19 Data in the European Union** Source: This is an open-source dataset collected and maintained by the European Centre for Disease Prevention and Control (ECDC), an EU agency aimed at strengthening Europe's defenses against infectious diseases.
Description: This dataset co...
