34 datasets found
  1. COVID19_datasets

    • kaggle.com
    zip
    Updated Apr 2, 2022
    Cite
    Suradech Kongkiatpaiboon (2022). COVID19_datasets [Dataset]. https://www.kaggle.com/datasets/suradechk/covid19-datasets/discussion
    Available download formats: zip (136322570 bytes)
    Dataset updated
    Apr 2, 2022
    Authors
    Suradech Kongkiatpaiboon
    Description

    Collected COVID-19 datasets from various sources as part of the DAAN-888 course, Penn State, Spring 2022. Collaborators: Mohamed Abdelgayed, Heather Beckwith, Mayank Sharma, Suradech Kongkiatpaiboon, and Alex Stroud

    **1 - COVID-19 Data in the United States** Source: The data is collected from multiple official public health sources by NY Times journalists and compiled into a single file. Description: Daily count of new COVID-19 cases and deaths for each state. Data is updated daily and runs from 1/21/2020 to 2/4/2022. URL: https://github.com/nytimes/covid-19-data/blob/master/us-states.csv Data size: 38,814 rows and 5 columns.
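    As a quick start, a minimal pandas sketch for loading this file (illustrative only): the GitHub blob URL above is swapped for its raw.githubusercontent.com counterpart, and the cases/deaths columns are assumed to be cumulative totals, as in the NYT repository.

    ```python
    import pandas as pd

    # Raw-file counterpart of the GitHub blob link above (an assumption about
    # how the file is fetched, not part of the dataset description)
    url = ("https://raw.githubusercontent.com/nytimes/covid-19-data/"
           "master/us-states.csv")

    df = pd.read_csv(url, parse_dates=["date"])

    # Derive daily new cases per state from the cumulative counts
    df = df.sort_values(["state", "date"])
    df["new_cases"] = df.groupby("state")["cases"].diff()
    print(df.shape)
    ```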

    **2 - Mask-Wearing Survey Data** Source: The New York Times is releasing estimates of mask usage by county in the United States. Description: This data comes from a large number of interviews conducted online by the global data and survey firm Dynata, at the request of The New York Times. The firm asked a question about mask usage to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level. URL: https://github.com/nytimes/covid-19-data/blob/master/mask-use/mask-use-by-county.csv Data size: 3,142 rows and 6 columns

    **3a - Vaccine Data – Global** Source: This data comes from the US Centers for Disease Control and Prevention (CDC), Our World in Data (OWiD) and the World Health Organization (WHO). Description: Time series data of vaccine doses administered and the number of fully and partially vaccinated people by country. This data was last updated on February 3, 2022. URL: https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/global_data/time_series_covid19_vaccine_global.csv
    Data Size: 162,521 rows and 8 columns

    **3b - Vaccine Data – United States** Source: The data comprises individual states' public dashboards and data from the US Centers for Disease Control and Prevention (CDC). Description: Time series data of the total vaccine doses shipped and administered, by manufacturer and dose number (first or second), by state. This data was last updated on February 3, 2022. URL: https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/us_data/time_series/vaccine_data_us_timeline.csv
    Data Size: 141,503 rows and 13 columns

    **4 - Testing Data** Source: The data comprises individual states' public dashboards and data from the U.S. Department of Health & Human Services. Description: Time series data of total tests administered by county and state. This data was last updated on January 25, 2022. URL: https://github.com/govex/COVID-19/blob/master/data_tables/testing_data/county_time_series_covid19_US.csv
    Data size: 322,154 rows and 8 columns

    **5 – US State and Territorial Public Mask Mandates** Source: Data from state and territory executive orders, administrative orders, resolutions, and proclamations is gathered from government websites and cataloged and coded by one coder using Microsoft Excel, with quality checking provided by one or more other coders. Description: US State and Territorial Public Mask Mandates from April 10, 2020 through August 15, 2021, by county by day. URL: https://data.cdc.gov/Policy-Surveillance/U-S-State-and-Territorial-Public-Mask-Mandates-Fro/62d6-pm5i Data Size: 1,593,869 rows and 10 columns

    **6 – Case Counts & Transmission Level** Source: This open-source dataset contains seven data items that describe community transmission levels across all counties. This dataset provides the same numbers used to show transmission maps on the COVID Data Tracker and contains reported daily transmission levels at the county level. The dataset is updated every day to include the most current day's data. Defined calculation procedures are used to classify the transmission level as low, moderate, substantial, or high.
    Description: US State and County case counts and transmission level from 16-Aug-2021 to 03-Feb-2022 URL: https://data.cdc.gov/Public-Health-Surveillance/United-States-COVID-19-County-Level-of-Community-T/8396-v7yb Data Size: 550,702 rows and 7 columns

    **7 - World Cases & Vaccination Counts** Source: This is an open-source dataset collected and maintained by Our World in Data (OWID). OWID provides research and data to make progress against the world's largest problems.
    Description: This dataset includes vaccinations, tests & positivity, hospital & ICU, confirmed cases, confirmed deaths, reproduction rate, policy responses and other variables of interest. URL: https://github.com/owid/covid-19-data/tree/master/public/data Data Size: 157,000 rows and 67 columns

    **8 - COVID-19 Data in the European Union** Source: This is an open-source dataset collected and maintained by ECDC. It is an EU agency aimed at strengthening Europe's defenses against infectious diseases.
    Description: This dataset co...

  2. Experiment 4: perceived size of test and reference arrays with lines only...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated May 31, 2023
    Cite
    J Daniel McCarthy; Colin Kupitz; Gideon P Caplovitz (2023). Experiment 4: perceived size of test and reference arrays with lines only present either within the interior of the elements or connecting the elements [Dataset]. http://doi.org/10.6084/m9.figshare.157060.v2
    Available download formats: xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    J Daniel McCarthy; Colin Kupitz; Gideon P Caplovitz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In experiment 4a, the first five rows of data indicate the proportion of times participants perceived an unbound test array as being larger than a fixed unbound reference array. The second five rows indicate the proportion of times participants perceived an array with a line connecting the local elements as being larger than a fixed unbound reference array. Each proportion was calculated from 20 trials.

    In experiment 4b, the first five rows of data indicate the proportion of times participants perceived an unbound test array as being larger than a fixed unbound reference array. The second five rows indicate the proportion of times participants perceived an array with a line intersecting only the interiors of the elements as being larger than a fixed unbound reference array. Each proportion was calculated from 20 trials.

  3. Data Center Construction Market Analysis, Size, and Forecast 2025-2029:...

    • technavio.com
    pdf
    Updated Aug 9, 2025
    Cite
    Technavio (2025). Data Center Construction Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, Italy, and UK), APAC (China, Japan, and South Korea), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/data-center-construction-market-size-industry-analysis
    Available download formats: pdf
    Dataset updated
    Aug 9, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Canada, United States
    Description


    Data Center Construction Market Size 2025-2029

    The data center construction market size is projected to increase by USD 41 billion at a CAGR of 8.8% from 2024 to 2029. Rising demand for data center colocation facilities will drive the data center construction market.

    Major Market Trends & Insights

    Europe dominated the market and is expected to account for 32% of the market's growth during the forecast period.
    By Application - Enterprise segment was valued at USD 23.20 billion in 2023
    By Type - Electrical construction segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 70.71 billion
    Market Future Opportunities: USD 41.00 billion
    CAGR : 8.8%
    Europe: Largest market in 2023
    

    Market Summary

    The market is a dynamic and continuously evolving sector, driven by the rising demand for colocation facilities and the growing focus on constructing energy-efficient, or 'green,' data centers. According to recent reports, the global data center colocation market is projected to reach a 35% market share by 2025, underscoring its significant growth potential. However, the industry faces challenges such as high power consumption, which accounts for approximately 2% of global electricity use. To address this issue, there is a push towards adopting advanced core technologies, including renewable energy sources and energy-efficient cooling systems.
    Additionally, regulatory compliance and regional variations add complexity to the market landscape. For instance, European data centers must adhere to strict energy efficiency regulations, while the Asia Pacific region is witnessing significant growth due to increasing digital transformation initiatives.
    

    What will be the Size of the Data Center Construction Market during the forecast period?


    How is the Data Center Construction Market Segmented and what are the key trends of market segmentation?

    The data center construction industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Application
    
      Enterprise
      Cloud
      Colocation
      Hyperscale
    
    
    Type
    
      Electrical construction
      Mechanical construction
      General construction
    
    
    Geography
    
      North America
    
        US
        Canada
    
    
      Europe
    
        France
        Germany
        Italy
        UK
    
    
      APAC
    
        China
        Japan
        South Korea
    
    
      South America
    
        Brazil
    
    
      Rest of World (ROW)
    

    By Application Insights

    The enterprise segment is estimated to witness significant growth during the forecast period.

    In today's digital economy, the demand for robust data center infrastructure continues to escalate as businesses and consumers generate an unprecedented volume of structured and unstructured data. Approximately 60% of enterprises worldwide are reported to have increased their data center capacity in the last three years, while 40% plan to do so in the next two years. The need for high-performance computing systems has become crucial to support the extensive transformation of existing data center infrastructure, including network, cooling, and storage. Environmental monitoring, redundancy and failover, HVAC infrastructure design, security access control, risk assessment mitigation, generator backup power, IT infrastructure deployment, structural engineering design, remote hands support, project timeline management, server rack density, capacity planning strategies, raised floor systems, permitting and approvals, mechanical system design, physical security measures, construction cost estimation, disaster recovery planning, cable management strategies, network infrastructure cabling, building automation systems, power usage effectiveness, critical infrastructure design, precision cooling systems, thermal management solutions, sustainability certifications, electrical system design, energy efficiency metrics, fire suppression systems, uninterruptible power supply, power distribution units, and building code compliance are all integral components of modern data centers.


    The Enterprise segment was valued at USD 23.20 billion in 2019 and showed a gradual increase during the forecast period.

    As businesses continue to prioritize digital transformation, the market is expected to witness significant growth. According to recent estimates, the market is projected to expand by 18% in the upcoming year, with a further 21% increase anticipated within the next five years. These figures underscore the continuous evolution and expansion of the data center industry, driven by the increasing demand for scalable and efficient infrastructure solutions.


    Regional Analysis

    Europe is estimated to contribute 32% to the growth of the global market.

  4. Data from: Variability, plot size and border effect in lettuce trials in...

    • datasetcatalog.nlm.nih.gov
    • scielo.figshare.com
    Updated Mar 14, 2018
    Cite
    Lopes, Sidinei José; Lúcio, Alessandro Dal’Col; Filho, Alberto Cargnelutti; Olivoto, Tiago; Santos, Daniel (2018). Variability, plot size and border effect in lettuce trials in protected environment [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000614350
    Dataset updated
    Mar 14, 2018
    Authors
    Lopes, Sidinei José; Lúcio, Alessandro Dal’Col; Filho, Alberto Cargnelutti; Olivoto, Tiago; Santos, Daniel
    Description

    ABSTRACT The variability within cultivation rows may reduce the accuracy of experiments conducted in a complete randomized block design if the rows are considered as blocks; however, little is known about this variability in protected environments. Thus, our aim was to study the variability of the fresh mass of lettuce shoots growing in a protected environment, and to verify the border effect and the size of the experimental unit in minimizing the productive variability. Data from two uniformity trials carried out in a greenhouse in autumn and spring growing seasons were used. The statistical analyses considered cultivation rows parallel to the lateral openings of the greenhouse and columns perpendicular to these openings. Different scenarios were simulated by excluding rows and columns to generate several border arrangements and to use different sizes of the experimental unit. For each scenario, a homogeneity test of variances between the remaining rows and columns was performed, and the variance and coefficient of variation were calculated. There is variability among rows in trials with lettuce in plastic greenhouses, and the use of borders does not bring benefits in terms of reducing the coefficient of variation or minimizing the cases of heterogeneous variances among rows. In experiments with lettuce in a plastic greenhouse, the use of an experimental unit size greater than or equal to two plants provides homogeneity of variances among rows and columns and, therefore, allows the use of a completely randomized design.

  5. 2022 Bikeshare Data -Reduced File Size -All Months

    • kaggle.com
    zip
    Updated Mar 8, 2023
    Cite
    Kendall Marie (2023). 2022 Bikeshare Data -Reduced File Size -All Months [Dataset]. https://www.kaggle.com/datasets/kendallmarie/2022-bikeshare-data-all-months-combined
    Available download formats: zip (98884 bytes)
    Dataset updated
    Mar 8, 2023
    Authors
    Kendall Marie
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This is a condensed version of the raw data obtained through the Google Data Analytics Course, made available by Lyft and the City of Chicago under this license (https://ride.divvybikes.com/data-license-agreement).

    I originally did my study in another platform, and the original files were too large to upload to Posit Cloud in full. Each of the 12 monthly files contained anywhere from 100k to 800k rows. Therefore, I decided to reduce the number of rows drastically by performing grouping, summaries, and thoughtful omissions in Excel for each csv file. What I have uploaded here is the result of that process.

    Data is grouped by: month, day, rider_type, bike_type, and time_of_day. total_rides represents the count of rides in each grouping, which is also the number of original rows that were combined to make the new summarized row; avg_ride_length is the calculated average of all data in each grouping.

    Be sure that you use weighted averages if you want to calculate the mean of avg_ride_length for different subgroups as the values in this file are already averages of the summarized groups. You can include the total_rides value in your weighted average calculation to weigh properly.
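    For example, a minimal pandas sketch of that weighted-average calculation (the file name is a placeholder; column names are taken from the description above):

    ```python
    import pandas as pd

    df = pd.read_csv("2022_bikeshare_summary.csv")  # placeholder file name

    # Each row's avg_ride_length is already a group average, so weight it by
    # total_rides when averaging across groups
    overall = (df["avg_ride_length"] * df["total_rides"]).sum() / df["total_rides"].sum()

    # The same weighted mean, per rider_type
    by_rider = (
        df.assign(weighted=df["avg_ride_length"] * df["total_rides"])
          .groupby("rider_type")
          .agg(weighted_sum=("weighted", "sum"), rides=("total_rides", "sum"))
    )
    by_rider["mean_ride_length"] = by_rider["weighted_sum"] / by_rider["rides"]
    print(overall, by_rider["mean_ride_length"], sep="\n")
    ```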

    9 Columns:

    • date - year, month, and day in date format; includes all days in 2022
    • day_of_week - actual day of week as character. Set up a new sort order if needed.
    • rider_type - values are either 'casual', those who pay per ride, or 'member', for riders who have annual memberships.
    • bike_type - values are 'classic' (non-electric, traditional bikes) or 'electric' (e-bikes).
    • time_of_day - divides the day into 6 equal time frames, 4 hours each, starting at 12AM. Each individual ride was placed into one of these time frames using the time they STARTED their rides, even if the ride was long enough to end in a later time frame. This column was added to help summarize the original dataset.
    • total_rides - count of all individual rides in each grouping (row). This column was added to help summarize the original dataset.
    • avg_ride_length - the calculated average of all rides in each grouping (row). Look to total_rides to know how many original ride length values were included in this average. This column was added to help summarize the original dataset.
    • min_ride_length - minimum ride length of all rides in each grouping (row). This column was added to help summarize the original dataset.
    • max_ride_length - maximum ride length of all rides in each grouping (row). This column was added to help summarize the original dataset.

    Please note: the time_of_day column has inconsistent spacing. Use mutate(time_of_day = gsub(" ", "", time_of_day)) to remove all spaces.

    Revisions

    Below is the list of revisions I made in Excel before uploading the final csv files to the R environment:

    • Deleted station location columns and lat/long as much of this data was already missing.

    • Deleted ride id column since each observation was unique and I would not be joining with another table on this variable.

    • Deleted rows pertaining to "docked bikes" since there were no member entries for this type and I could not compare member vs casual rider data. I also received no information in the project details about what constitutes a "docked" bike.

    • Used ride start time and end time to calculate a new column called ride_length (by subtracting), and deleted all rows with 0 and 1 minute results, which were explained in the project outline as being related to staff tasks rather than users. An example would be taking a bike out of rotation for maintenance.

    • Placed start time into a range of times (time_of_day) in order to group more observations while maintaining general time data. time_of_day now represents a time frame when the bike ride BEGAN. I created six 4-hour time frames, beginning at 12AM (see the sketch after this list).

    • Added a Day of Week column, with Sunday = 1 and Saturday = 7, then changed from numbers to the actual day names.

    • Used pivot tables to group total_rides, avg_ride_length, min_ride_length, and max_ride_length by date, rider_type, bike_type, and time_of_day.

    • Combined into one csv file with all months, containing fewer than 9,000 rows (instead of several million).
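    As referenced above, a small Python sketch of that time_of_day bucketing (illustrative only; the exact label format in the published file may differ):

    ```python
    from datetime import datetime

    def hour_label(h: int) -> str:
        """Format an hour (0-24) in 12AM/4AM/.../8PM style."""
        return f"{(h - 1) % 12 + 1}{'AM' if h % 24 < 12 else 'PM'}"

    def time_of_day(start: datetime) -> str:
        """Bucket a ride's START time into one of six 4-hour frames from 12AM."""
        frame = start.hour // 4  # 0..5
        return f"{hour_label(frame * 4)}-{hour_label(frame * 4 + 4)}"

    print(time_of_day(datetime(2022, 7, 4, 14, 30)))  # "12PM-4PM"
    ```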

  6. Enterprise Data Platform Asset Total and Size

    • mydata.iowa.gov
    • data.iowa.gov
    csv, xlsx, xml
    Updated Nov 18, 2025
    Cite
    (2025). Enterprise Data Platform Asset Total and Size [Dataset]. https://mydata.iowa.gov/w/v4nh-jbad/default?cur=jLt_9ZSUFnx&from=JdiBSQYXv5R
    Available download formats: csv, xml, xlsx
    Dataset updated
    Nov 18, 2025
    Description

    This query returns the total number of assets published on the Enterprise Data Platform and the total number of rows, columns and values published in datasets.

  7. Training and testing XRD dataset for crystallite size and microstrain...

    • entrepot.recherche.data.gouv.fr
    image/x-silx-numpy +1
    Updated Nov 20, 2025
    Cite
    Alexandre BOULLE; Arthur SOUESME (2025). Training and testing XRD dataset for crystallite size and microstrain determination using deep neural networks [Dataset]. http://doi.org/10.57745/SVQART
    Available download formats: text/markdown (1068), image/x-silx-numpy (6059958836), image/x-silx-numpy (673347924)
    Dataset updated
    Nov 20, 2025
    Dataset provided by
    Recherche Data Gouv
    Authors
    Alexandre BOULLE; Arthur SOUESME
    License

    https://spdx.org/licenses/etalab-2.0.html

    Time period covered
    Oct 1, 2023 - Oct 1, 2026
    Dataset funded by
    Région Nouvelle Aquitaine
    Description

    Numpy tensors to train and test a convolutional neural network dedicated to determining crystallite size and/or microstrain from X-ray diffraction (XRD) data:

    • train_size.npz: training dataset with only crystallite size
    • test_size.npz: testing dataset with only crystallite size
    • train_size_strain.npz: training dataset with crystallite size and microstrain
    • test_size_strain.npz: testing dataset with crystallite size and microstrain

    Each dataset contains the XRD data and the labels ("ground truth") in the form of 2D tensors, with 10501 data points (columns) for the XRD data and 24 labels (columns) for the labels. Training data contain 71971 rows; testing data contain 7997 rows.

    Example python script to read the data:

    ```python
    import numpy as np

    train = np.load("train_size.npz")
    train_data, train_label = train["train_data"], train["train_label"]
    print(f"Train data shape: {train_data.shape}, Train labels shape: {train_label.shape}")
    ```

    Jupyter notebooks to train and test a neural network can be found here: https://github.com/aboulle/LPA-NN

  8. Company Data: Company Size, Address, Contact Details and Business Scope

    • datarade.ai
    .csv, .xls
    Updated Mar 22, 2023
    Cite
    C Insights Africa (2023). Company Data: Company Size, Address, Contact Details and Business Scope [Dataset]. https://datarade.ai/data-products/company-data-company-size-address-contact-details-and-busi-c-insights-africa
    Available download formats: .csv, .xls
    Dataset updated
    Mar 22, 2023
    Dataset authored and provided by
    C Insights Africa
    Area covered
    Nigeria
    Description

    C Insights Africa's company database contains details of more than 10,000 organizations in Nigeria, ranging from large corporates to mid-sized and small companies. Our database contains attributes such as company size, address(es), contact details, type of business and related companies (where applicable). Marketing and sales executives can enrich their pipeline with our database, while business development teams or C-suite executives interested in finding new partners/frontiers are sure to find this database invaluable.

  9. Data from: Trade-offs between growth rate, tree size and lifespan of...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated May 26, 2016
    Cite
    Christof Bigler (2016). Trade-offs between growth rate, tree size and lifespan of mountain pine (Pinus montana) in the Swiss National Park [Dataset]. http://doi.org/10.5061/dryad.d2680
    Available download formats: zip
    Dataset updated
    May 26, 2016
    Dataset provided by
    Dryad
    Authors
    Christof Bigler
    Time period covered
    Jul 6, 2015
    Area covered
    Swiss National Park, canton of Grisons, Switzerland
    Description

    A within-species trade-off between growth rates and lifespan has been observed across different taxa of trees; however, there is some uncertainty whether this trade-off also applies to shade-intolerant tree species. The main objective of this study was to investigate the relationships between radial growth, tree size and lifespan of shade-intolerant mountain pines. For 200 dead standing mountain pines (Pinus montana) located along gradients of aspect, slope steepness and elevation in the Swiss National Park, radial annual growth rates and lifespan were reconstructed. While early growth (i.e. mean tree-ring width over the first 50 years) correlated positively with diameter at the time of tree death, a negative correlation resulted with lifespan, i.e. rapidly growing mountain pines face a trade-off of reaching a large diameter at the cost of early tree death. Slowly growing mountain pines may reach a large diameter and a long lifespan, but risk dying young at a small size. Early gro...

  10. Bioinformatics Services Market Size and Forecast (2025 - 2035), Global and...

    • wemarketresearch.com
    csv, pdf
    Updated May 20, 2025
    Cite
    We Market Research (2025). Bioinformatics Services Market Size and Forecast (2025 - 2035), Global and Regional Growth, Trend, Share and Industry Analysis Report Coverage: By Service Type (Data Analysis & Interpretation, Sequencing Services, Data Management Services, Software & Tool Development, Consulting Services, Outsourcing Services, Others); Application (Genomics, Proteomics, Transcriptomics, Pharmacogenomics, Clinical Diagnostics, Personalized Medicine and Others) End-user (Pharmaceutical & Biotechnology Companies, Academic & Research Institutes, Hospitals & Healthcare Institutions, Contract Research Organizations (CROs) and Others) and Geography. [Dataset]. https://wemarketresearch.com/reports/bioinformatics-services-market/1735
    Available download formats: pdf, csv
    Dataset updated
    May 20, 2025
    Dataset authored and provided by
    We Market Research
    License

    https://wemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2035
    Area covered
    Worldwide
    Description

    The Bioinformatics Services Market will grow from $4.3B in 2025 to $15.7B by 2035, at a CAGR of 12.6%, driven by rising demand for biologics and biosimilars.

    Market Size in 2025: USD 4.3 Billion
    Market Forecast in 2035: USD 15.7 Billion
    CAGR % 2025-2035: 12.6%
    Base Year: 2024
    Historic Data: 2020-2024
    Forecast Period: 2025-2035
    Report USP: Production, Consumption, company share, company heatmap, company production capacity, growth factors and more
    Segments Covered: By Service Type, By Application, By End-user
    Regional Scope: North America, Europe, APAC, Latin America, Middle East and Africa
    Country Scope: U.S., Canada, U.K., Germany, France, Italy, Spain, Benelux, Nordic Countries, Russia, China, India, Japan, South Korea, Australia, Indonesia, Thailand, Mexico, Brazil, Argentina, Saudi Arabia, UAE, Egypt, South Africa, Nigeria
  11. Visualizing Chicago Crime Data

    • kaggle.com
    zip
    Updated Jul 1, 2022
    Cite
    Elijah Toumoua (2022). Visualizing Chicago Crime Data [Dataset]. https://www.kaggle.com/datasets/elijahtoumoua/chicago-analysis-of-crime-data-dashboard
    Available download formats: zip (94861784 bytes)
    Dataset updated
    Jul 1, 2022
    Authors
    Elijah Toumoua
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Chicago
    Description

    Prelude

    This dataset is a cleaned version of the Chicago Crime Dataset, which can be found here. All rights for the dataset go to the original owners. The purpose of this dataset is to display my skills in visualizations and creating dashboards. To be specific, I will attempt to create a dashboard that will allow users to see metrics for a specific crime within a given year using filters and metrics. Due to this, there will not be much of a focus on the analysis of the data, but there will be portions discussing the validity of the dataset, the steps I took to clean the data, and how I organized it. The cleaned datasets can be found below, the Query (which utilized BigQuery) can be found here and the Tableau dashboard can be found here.

    About the Dataset

    Important Facts

    The dataset comes directly from the City of Chicago's website under the page "City Data Catalog." The data is gathered directly from the Chicago Police's CLEAR (Citizen Law Enforcement Analysis and Reporting) system and is updated daily to present the information accurately. This means that a crime on a specific date may be changed later to better reflect the case. The dataset covers crimes from 2001 up to seven days prior to today's date.

    Reliability

    Using the ROCCC method, we can see that:

    • The data has high reliability: The data covers the entirety of Chicago over a little more than two decades. It covers all the wards within Chicago and even gives the street names. While we may not have an idea of how big the sample size is, I do believe that the dataset has high reliability since it geographically covers the entirety of Chicago.
    • The data has high originality: The dataset was gained directly from the Chicago Police Dept. using their database, so we can say this dataset is original.
    • The data is somewhat comprehensive: While we do have important information such as the types of crimes committed and their geographic location, I do not think this gives us proper insights as to why these crimes take place. We can pinpoint the location of the crime, but we are limited by the information we have. How hot was the day of the crime? Did the crime take place in a neighborhood with low income? I believe that these key factors prevent us from getting proper insights as to why these crimes take place, so I would say that this dataset is subpar in how comprehensive it is.
    • The data is current: The dataset is updated frequently to display crimes that took place seven days prior to today's date and may even update past crimes as more information comes to light. Due to the frequent updates, I do believe the data is current.
    • The data is cited: As mentioned prior, the data is collected directly from the police's CLEAR system, so we can say that the data is cited.

    Processing the Data

    Cleaning the Dataset

    The purpose of this step is to clean the dataset such that there are no outliers in the dashboard. To do this, we are going to do the following: * Check for any null values and determine whether we should remove them. * Update any values where there may be typos. * Check for outliers and determine if we should remove them.

    The following steps will be explained in the code segments below. (I used BigQuery for this, so the code follows BigQuery's syntax.)

    ```sql
    -- Examining the dataset
    -- There are over 7.5 million rows of data
    -- Putting a limit so it does not take a long time to run
    SELECT *
    FROM `portfolioproject-350601.ChicagoCrime.Crime`
    LIMIT 1000;

    -- Seeing which points are null
    -- There are 85,000 null points, so we can exclude them; that's not a
    -- significant amount since it is only ~1.3% of the dataset
    -- Most of the null points are in the lat and long, which we will need later
    -- Because we don't have the full address, we can't estimate the lat and
    -- long in SQL, so we will have to delete the rows with null data
    SELECT *
    FROM `portfolioproject-350601.ChicagoCrime.Crime`
    WHERE unique_key IS NULL OR case_number IS NULL OR date IS NULL
       OR primary_type IS NULL OR location_description IS NULL OR arrest IS NULL
       OR longitude IS NULL OR latitude IS NULL;

    -- Deleting all null rows
    DELETE FROM `portfolioproject-350601.ChicagoCrime.Crime`
    WHERE unique_key IS NULL OR case_number IS NULL OR date IS NULL
       OR primary_type IS NULL OR location_description IS NULL OR arrest IS NULL
       OR longitude IS NULL OR latitude IS NULL;

    -- Checking for any duplicates in the unique keys
    -- None to be found
    SELECT unique_key, COUNT(unique_key) FROM `portfolioproject-350601.ChicagoCrime....
    ```

  12. Study of Data Orchestration Tool Market by Cloud based and...

    • futuremarketinsights.com
    html, pdf
    Updated Apr 12, 2024
    Cite
    Sudip Saha (2024). Study of Data Orchestration Tool Market by Cloud based and Telecommunications from 2024 to 2034 [Dataset]. https://www.futuremarketinsights.com/reports/data-orchestration-tool-market
    Available download formats: html, pdf
    Dataset updated
    Apr 12, 2024
    Authors
    Sudip Saha
    License

    https://www.futuremarketinsights.com/privacy-policy

    Time period covered
    2024 - 2034
    Area covered
    Worldwide
    Description

    Organizations have been overwhelmed with vast amounts of data generated from various sources, such as enterprise applications, IoT devices, social media platforms, as well as cloud services. Effectively harnessing this data to drive business insights and innovation has become a critical imperative for organizations seeking to maintain competitiveness and relevance in their respective industries.

    Data Orchestration Tool Market Estimated Size in 2024: US$ 1.3 billion
    Projected Market Value in 2034: US$ 4.3 billion
    Value-based CAGR from 2024 to 2034: 12.1%

    Country-wise Insights

    CAGR through 2034 by country:
    • The United States: 8.1%
    • Germany: 5.3%
    • China: 12.6%
    • Japan: 4.2%
    • Australia and New Zealand: 7.3%

    Category-wise Insights

    Segment shares in 2024:
    • Cloud Based: 62.3%
    • Telecommunications: 24.2%

    Report Scope

    Estimated Market Size in 2024: US$ 1.3 billion
    Projected Market Valuation in 2034: US$ 4.3 billion
    Value-based CAGR 2024 to 2034: 12.1%
    Forecast Period: 2024 to 2034
    Historical Data Available for: 2019 to 2023
    Market Analysis: Value in US$ Billion
    Key Regions Covered
    • North America
    • Latin America
    • Western Europe
    • Eastern Europe
    • South Asia and Pacific
    • East Asia
    • The Middle East & Africa
    Key Market Segments Covered
    • Deployment Model
    • Industry
    • Region
    Key Countries Profiled
    • The United States
    • Canada
    • Brazil
    • Mexico
    • Germany
    • France
    • Spain
    • Italy
    • Russia
    • Poland
    • Czech Republic
    • Romania
    • India
    • Bangladesh
    • Australia
    • New Zealand
    • China
    • Japan
    • South Korea
    • GCC countries
    • South Africa
    • Israel
    Key Companies Profiled
    • AWS
    • Google
    • SAP
    • Microsoft
    • Prefect
    • Dagster
    • Luigi
    • Metaflow

  13. A Litopenaeus vannamei shrimp dataset with images and corresponding...

    • data.mendeley.com
    Updated Jul 1, 2024
    Cite
    Fernando Joaquín Ramírez-Coronel (2024). A Litopenaeus vannamei shrimp dataset with images and corresponding weight-size measurements for the development of artificial intelligence-based biomass estimation and organism detection algorithms [Dataset]. http://doi.org/10.17632/h8tcn6ykky.2
    Dataset updated
    Jul 1, 2024
    Authors
    Fernando Joaquín Ramírez-Coronel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was compiled with the ultimate goal of developing non-invasive computer vision algorithms for assessing shrimp biometrics and biomass estimation. The main folder, labeled "DATASET," contains five sub-folders—DB1, DB2, DB3, DB4, and DB5—each filled with images of shrimps. Additionally, each sub-folder is accompanied by an Excel file that includes manually measured data for the shrimps pictured. The files are named respectively: DB1_INDUSTRIAL_FARM_1, DB2_INDUSTRIAL_FARM_2_C1, DB3_INDUSTRIAL_FARM_2_C2, DB4_ACADEMIC_POND_S1, and DB5_ACADEMIC_POND_S2.

    Here’s a detailed description of the contents of each sub-folder and its corresponding Excel file:

    1) DB1 includes 490 PNG images of 22 shrimps taken from one pond at an industrial farm. The associated Excel file, DB1_INDUSTRIAL_FARM_1, contains columns for: SAMPLE: Reflecting the number of individual shrimps (22 entries or rows). LENGTH (cm): Measuring from the rostrum (near the eyes) to the start of the tail. WEIGHT (g): Recorded using a scale. COMPLETE SHRIMP IMAGES: Indicates if at least one full-body image is available (1) or not (0).

    2) DB2 consists of 2002 PNG images of 58 shrimps. The Excel file, DB2_INDUSTRIAL_FARM_2_C1, includes: SAMPLE: Number of shrimps (58 entries or rows). CEPHALOTHORAX (cm): Total length of the cephalothorax. LENGTH (cm) and WEIGHT (g): Similar measurements as DB1. COMPLETE SHRIMP IMAGES: Presence (1) or absence (0) of full-body images.

    3) DB3 contains 1719 PNG images of 50 shrimps, with its Excel file, DB3_INDUSTRIAL_FARM_2_C2, documenting: SAMPLE: Number of shrimps (50 entries or rows). Measurements and categories identical to DB2.

    4) DB4 encompasses 635 PNG images of 20 shrimps, detailed in the Excel file DB4_ACADEMIC_POND_S1. This includes: SAMPLE: Number of shrimps (20 entries or rows). CEPHALOTHORAX (cm), LENGTH (cm), WEIGHT (g), and COMPLETE SHRIMP IMAGES: Documented as in other datasets.

    5) DB5 includes 661 PNG images of 20 shrimps, with DB5_ACADEMIC_POND_S2 as the corresponding Excel file. The file mirrors the structure and measurements of DB4.

    The images in each folder are named "sm_n", where m is the shrimp sample number and n is the number of the picture of that shrimp. This carefully structured dataset provides comprehensive biometric data on shrimps, facilitating the development of algorithms aimed at non-invasive measurement techniques. This will likely be pivotal in enhancing the precision of biomass estimation in aquaculture farming, utilizing advanced statistical morphology analysis and machine learning techniques.
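    A minimal Python sketch of indexing images by sample under that naming convention (the folder path is a placeholder, and the underscore separator is taken from the description above):

    ```python
    import re
    from pathlib import Path

    # Match names like "s3_12.png": shrimp sample 3, picture 12
    PATTERN = re.compile(r"s(\d+)_(\d+)\.png$", re.IGNORECASE)

    def index_images(folder: str) -> dict[int, list[Path]]:
        """Map shrimp sample number -> list of image paths in one DB sub-folder."""
        index: dict[int, list[Path]] = {}
        for path in Path(folder).glob("*.png"):
            match = PATTERN.search(path.name)
            if match:
                index.setdefault(int(match.group(1)), []).append(path)
        return index

    images = index_images("DATASET/DB1")  # placeholder local path
    ```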

    CHANGES FROM VERSION 1:

    The cephalothorax metric is the length rather than the width. That was an error in the first version. The name in the columns also had a typo, which has been corrected (from CEPHALOTORAX to CEPHALOTHORAX).

  14. Data Center Cooling Market Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Mar 18, 2025
    Cite
    Market Report Analytics (2025). Data Center Cooling Market Report [Dataset]. https://www.marketreportanalytics.com/reports/data-center-cooling-market-10650
    Available download formats: doc, ppt, pdf
    Dataset updated
    Mar 18, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Data Center Cooling market is experiencing robust growth, projected to reach $1452.12 million in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 6.78% from 2025 to 2033. This expansion is fueled by several key factors. The increasing density of data centers, driven by the exponential growth of data generated globally, necessitates advanced cooling solutions to prevent overheating and ensure optimal performance. Furthermore, rising energy costs and growing concerns about environmental sustainability are pushing the adoption of energy-efficient cooling technologies like liquid cooling and adiabatic cooling systems. The market is segmented by cooling type, with room-cooling, rack-cooling, and row-cooling solutions catering to diverse data center needs and sizes. Leading companies are aggressively pursuing innovative strategies, including mergers and acquisitions, strategic partnerships, and research and development investments, to strengthen their market positions and capitalize on this burgeoning market. Geographic expansion, particularly in rapidly developing economies in Asia-Pacific and other regions with increasing data center deployments, presents significant growth opportunities. However, challenges such as high initial investment costs associated with advanced cooling systems and the need for skilled professionals to manage and maintain these complex technologies may act as restraints. The competitive landscape is marked by the presence of both established players and emerging technology companies. Major players like 3M, Daikin, Schneider Electric, and Vertiv are leveraging their technological expertise and extensive distribution networks to maintain their dominance. Meanwhile, smaller, innovative companies are introducing niche solutions and challenging the incumbents. The market's future growth trajectory hinges on technological advancements, the evolution of data center designs, and the ongoing demand for environmentally sustainable cooling solutions. The consistent need for reliable, energy-efficient, and scalable cooling infrastructure will be the primary driver of this market's continued expansion throughout the forecast period.

  15. Mobile Payment Data Protection Market Analysis by Contactless and Remote...

    • futuremarketinsights.com
    html, pdf
    Updated Jun 14, 2024
    Cite
    Sudip Saha (2024). Mobile Payment Data Protection Market Analysis by Contactless and Remote Tokenization through 2034 [Dataset]. https://www.futuremarketinsights.com/reports/mobile-payment-data-protection-market
    Available download formats: html, pdf
    Dataset updated
    Jun 14, 2024
    Authors
    Sudip Saha
    License

    https://www.futuremarketinsights.com/privacy-policy

    Time period covered
    2024 - 2034
    Area covered
    Worldwide
    Description

    The global mobile payment data protection market size is anticipated to grow notably, to USD 730,843.3 million in 2024 from USD 659,096.1 million in 2023. The industry is expected to sustain this expansion, reaching USD 2,366,892.7 million by 2034 at a CAGR of 12.5% through 2034.

    Estimated Global Mobile Payment Data Protection Market Size, 2024: USD 730,843.3 million
    Projected Global Mobile Payment Data Protection Market Size, 2034: USD 2,366,892.7 million
    Value-based CAGR (2024 to 2034): 12.5%

    Semi Annual Market Update

    H1 (2023 to 2033): 9.8%
    H2 (2023 to 2033): 10.2%
    H1 (2024 to 2034): 10%
    H2 (2024 to 2034): 10.2%

    Country-wise Insights

    CAGR from 2024 to 2034 by country:
    • Australia: 16%
    • China: 13%
    • United States: 9.3%
    • Germany: 7.9%
    • Japan: 7.2%

    Category-wise Insights

    • Contactless Tokenisation (Product): 56.2% value share (2024)
    • Banking and Financial Service (End User): 33.7% value share (2024)
  16. Synthetic datasets reflecting the shRNA-seq knockdown ENCODE data for HepG2...

    • zenodo.org
    txt
    Updated Jun 28, 2024
    Cite
    Erik Sonnhammer lab; Claudia Kutter lab (2024). Synthetic datasets reflecting the shRNA-seq knockdown ENCODE data for HepG2 and K562 with coresponding GRN [Dataset]. http://doi.org/10.5281/zenodo.12165429
    Available download formats: txt
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Erik Sonnhammer lab; Claudia Kutter lab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 19, 2024
    Description

    Synthetic data correspond to the ENCODE data for cell lines HepG2 (https://www.encodeproject.org/biosamples/ENCBS282XVK/) and K562 (https://www.encodeproject.org/biosamples/ENCBS023XVB/). The data and networks were generated using GeneSPIDER (publicly available at https://bitbucket.org/sonnhammergrni/genespider/).

    Table 1. Description of the files

    data_HepG2like_SNR_L=0.0054699_diff=1.6188e-05.txt: Synthetic gene expression knockdown (shRNA-seq) data imitating the ENCODE data for the HepG2 cell line. Data size: 232 RBPs vs 464 experiments (2 replicates). SNR_L is the signal-to-noise ratio. The difference (diff) value gives the difference between the replicate correlation coefficients of the real and synthetic ENCODE data. Columns represent experiments, rows represent genes.
    data_K562like_SNR_L=0.0028692_diff=0.00017339.txt: Synthetic gene expression knockdown (shRNA-seq) data imitating the ENCODE data for the K562 cell line. Data size: 232 RBPs vs 464 experiments (2 replicates). SNR_L is the signal-to-noise ratio. The difference (diff) value gives the difference between the replicate correlation coefficients of the real and synthetic ENCODE data. Columns represent experiments, rows represent genes.
    network_HEPG2like_sparsity4.txt: Synthetic scale-free gene regulatory network compatible with data_HepG2like_SNR_L=0.0054699_diff=1.6188e-05.txt. Sparsity (average node degree) is 4, including self-loops. Direction should be read from columns to rows.
    network_K562like_sparsity4.txt: Synthetic scale-free gene regulatory network compatible with data_K562like_SNR_L=0.0028692_diff=0.00017339.txt. Sparsity (average node degree) is 4, including self-loops. Direction should be read from columns to rows.
    perturbations_HepG2&K562_2replicates.txt: Perturbation matrix with information about the knocked-down RBPs. Data size: 232 RBPs vs 464 experiments (2 replicates).
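    A minimal numpy sketch for loading these matrices, assuming plain whitespace-delimited numeric text with no header row (the record does not state the exact layout):

    ```python
    import numpy as np

    # Synthetic expression matrix: expect 232 gene rows x 464 experiment columns
    expr = np.loadtxt("data_HepG2like_SNR_L=0.0054699_diff=1.6188e-05.txt")
    print(expr.shape)

    # Network matrix: direction reads from columns to rows, i.e. entry (i, j)
    # is the regulatory effect of gene j on gene i
    grn = np.loadtxt("network_HEPG2like_sparsity4.txt")
    ```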

    Created by Garbulowski et al. (2024) as a part of the work entitled "Comprehensive analysis of the RBP regulome reveals functional modules and drug candidates in liver cancer"

  17. House Price Regression Dataset

    • kaggle.com
    zip
    Updated Sep 6, 2024
    Cite
    Prokshitha Polemoni (2024). House Price Regression Dataset [Dataset]. https://www.kaggle.com/datasets/prokshitha/home-value-insights
    Available download formats: zip (27045 bytes)
    Dataset updated
    Sep 6, 2024
    Authors
    Prokshitha Polemoni
    Description

    Home Value Insights: A Beginner's Regression Dataset

    This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.

    Features:

    1. Square_Footage: The size of the house in square feet. Larger homes typically have higher prices.
    2. Num_Bedrooms: The number of bedrooms in the house. More bedrooms generally increase the value of a home.
    3. Num_Bathrooms: The number of bathrooms in the house. Houses with more bathrooms are typically priced higher.
    4. Year_Built: The year the house was built. Older houses may be priced lower due to wear and tear.
    5. Lot_Size: The size of the lot the house is built on, measured in acres. Larger lots tend to add value to a property.
    6. Garage_Size: The number of cars that can fit in the garage. Houses with larger garages are usually more expensive.
    7. Neighborhood_Quality: A rating of the neighborhood’s quality on a scale of 1-10, where 10 indicates a high-quality neighborhood. Better neighborhoods usually command higher prices.
    8. House_Price (Target Variable): The price of the house, which is the dependent variable you aim to predict.

    Potential Uses:

    1. Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.

    2. Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.

    3. Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.

    4. Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.
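    To make uses (1) and (4) concrete, a minimal scikit-learn sketch (the CSV file name is a placeholder; column names follow the feature list above):

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("home_value_insights.csv")  # placeholder file name

    X = df.drop(columns=["House_Price"])
    y = df["House_Price"]

    # 5-fold cross-validated R^2 for a baseline linear regression
    scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    print(f"R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
    ```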

    Versatility:

    • The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.

    • It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.

    • This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.

  18. Mineral spectral refractive index and bulk optical property dataset for...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Dec 5, 2024
    Cite
    Zhang, Yuheng; Saito, Masanori; Yang, Ping; Schuster, Gregory; Trepte, Charles (2024). Mineral spectral refractive index and bulk optical property dataset for aerosol studies [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8144788
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Texas A&M University
    NASA Langley Research Center
    Authors
    Zhang, Yuheng; Saito, Masanori; Yang, Ping; Schuster, Gregory; Trepte, Charles
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Version 1.3, updated 11/15/2024.

    Added a file, 'NewRegionalSamples.xlsx', with mineral composition information for 27 regional dust samples, along with the refractive index data.

    All refractive index files here have 127 rows (wavelengths) and 27 columns (samples):

    'kall27_coarse.dat' is the imaginary part of the coarse mode.

    'kall27_fine.dat' is the imaginary part of the fine mode.

    'nall27_coarse.dat' is the real part of the coarse mode.

    'nall27_fine.dat' is the real part of the fine mode.

    Version 1.2, updated 04/23/2024. Major changes: changed all the data file names to a new format, "mix"+{property name}+{number}, and rearranged the number of mixing samples.

    Updated all the bulk optical property data. This version uses constant values of standard deviation in the lognormal size distribution settings for the coarse mode and the fine mode, respectively.

    The phase matrices are separated from the other bulk properties due to their large file sizes. The readme file is updated correspondingly. The information on scattering angles (498 angles in total) is uploaded as "TAMUdust2020_Angle.dat".

    Added supplemental file data in 'Supplemental.tar.gz'.

    Additional refractive indices are zipped in 'AdditionalRefInd.tar.gz'

    Version 1.1, updated 03/14/2024. Major changes:

    • Added mixed bulk properties for "0 (99% coarse + 1% fine)" and "11 (2.0 µm coarse + 0.4 µm fine)".
    • Added "reff.dat" in 'BulkProperties.tar.gz'. The data include four columns: fine mode fraction, bulk projected area, bulk volume, effective radius r_eff. The information is for mixed sample numbers 0 to 11, each corresponding to one row.
    • Added refractive indices for chlorite, mica, smectite, pyroxene, vermiculite and pyroxenes. These groups can be applied in some other models.

    Version 1.0, uploaded 01/02/2024.

    This database includes supplemental data and files for the following publication:

    Sensitivities of Spectral Optical Properties of Dust Aerosols to their Mineralogical and Microphysical Properties. Yuheng Zhang, M. Saito, P. Yang, G. L. Schuster, and C. R. Trepte, J. Geophys. Res. Atmos. 2024.

    The supplemental data include:

    1) 'GroupRefInd.tar.gz': Mineral (group) refractive index files. E.g., 1All_Illite.dat contains the complex refractive index file of the illite group. Format (from left to right columns): Wavelength (unit: µm), Real part (n), Imaginary part (k), standard deviation of n, standard deviation of k.

    The file 'fine_log.dat' includes the mean and standard deviation values of n and k for all the generated fine mode dust samples at 11,044 wavelengths from 0.2 to 50 micron.

    The file 'fine_log127.dat' only includes the values at 127 wavelengths from 0.2 to 50 micron (defined in 'swav.txt' and 'lwav.txt'), and is used for the bulk property computations.

    The files 'coarse_log.dat' and 'coarse_log127.dat' are for the coarse mode dust samples.

    2) 'CompositionFraction.xlsx': Mineral composition data sources/references and composition data (mean and standard deviation values of each group).

    'Vlog_coarse.dat': Randomly generated VOLUME FRACTION of 9 mineral groups for the coarse mode dust. Left to right: Illite, Kaolinite, Montmorillonite (Other clays), Quartz, Feldspar, Carbonate, Gypsum (Sulphate), Hematite, Goethite.

    'Vlog_fine.dat': For the fine mode dust.

    3) 'RefSources.xlsx': The data source references of mineral refractive indices. We didn't include the olivine, other silicates, soot and titanium-rich minerals in the paper, but the refractive indices are available for those who are interested. The chlorite, mica and vermiculite groups are mentioned in some studies, and we included the refractive indices for these minerals as well.

    4) 'DustSamples.tar.gz': Dust sample refractive index files. The files are enclosed in four folders: fine_sw/, fine_lw/, coarse_sw/ and coarse_lw/.

    fine: fine mode. coarse: coarse mode.

    'sw' means shortwave (< 4 µm, in total 76 wavelengths defined in 'swav.txt') while 'lw' means longwave (>= 4 µm, in total 51 wavelengths defined in 'lwav.txt').

    All files start with 'rdn', which means that they are computed based on randomly generated composition (data given in sheet 2 of 'CompositionFraction.xlsx').

    The four-digit number after 'rdn' is the index of each dust sample; there are 5,000 samples in total. The sample composition is the same for the same sample index within the same size mode (fine/coarse). Data file format (columns, left to right): real part, imaginary part.
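    Based on the naming scheme above, one way to load a given sample in Python is sketched below (the '.dat' extension and exact directory layout are assumptions):

    ```python
    import numpy as np

    def load_dust_sample(mode: str, band: str, index: int) -> np.ndarray:
        """Load one dust sample refractive index file.

        mode: 'fine' or 'coarse'; band: 'sw' or 'lw'; index: 1..5000.
        Assumes names like fine_sw/rdn0001.dat with columns: real, imaginary.
        """
        return np.loadtxt(f"{mode}_{band}/rdn{index:04d}.dat")

    nk = load_dust_sample("fine", "sw", 1)
    n, k = nk[:, 0], nk[:, 1]  # real and imaginary parts per wavelength
    ```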

    5) 'BulkProperties.tar.gz': Bulk property files (excluding phase matrices).

    'mixqx.dat' file format (columns, left to right): Extinction efficiency (Qext), Scattering efficiency (Qsca), Backscattering efficiency (Qbck), and Asymmetry coefficient (Qasy). To obtain the asymmetry factor, use Qasy/Qsca.
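    For example, a minimal Python sketch for computing the asymmetry factor from one of these files (the file name is illustrative; columns as listed above):

    ```python
    import numpy as np

    # Columns as described: Qext, Qsca, Qbck, Qasy.
    qext, qsca, qbck, qasy = np.loadtxt("BulkProperties/mixq100.dat").T

    g = qasy / qsca  # asymmetry factor, per the note above
    ```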

    'mixbkx.dat' files format (from left to right columns): P11(pi) P12(pi) P22(pi) P33(pi) P34(pi) P44(pi).

    'x' refers to the number at the end of the file name. It ranges from 100 to 112; each value represents one setting of coarse and fine mode effective radius and volume fraction (see details in "reff.dat").

    'reff.dat' contains the effective radius information of the mixture. It has 7 columns: File number "x", Fine mode volume fraction, Fine mode effective radius (µm), Coarse mode effective radius (µm), Bulk projected area (µm^2), Bulk volume (µm^3), Bulk effective radius (µm).

    6) 'PhaseMatrices.tar.gz': Phase matrix data.

    'mixphswx.dat' files contain phase matrix results at 532 nm (shortwave). Columns (left to right): P11, P12, P22, P33, P34, P44.

    'mixphlwx.dat' files contain phase matrix results at 10.5 µm (longwave).

    Each data file has 635,000 rows (127 wavelengths × 5,000 samples). Rows 1–127 are sample 1, rows 128–254 are sample 2, and so on. We suggest using the MATLAB function 'reshape(property, 127, 5000)' on each column when processing the data.
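    For users working in Python rather than MATLAB, a rough NumPy equivalent might look as follows (the file name is illustrative; note that order='F' reproduces MATLAB's column-major reshape):

    ```python
    import numpy as np

    # One column of a 635,000-row file: 127 wavelengths x 5,000 samples,
    # with rows 1-127 belonging to sample 1, rows 128-254 to sample 2, etc.
    prop = np.loadtxt("PhaseMatrices/mixphsw100.dat")[:, 0]

    prop_2d = prop.reshape(127, 5000, order="F")  # [wavelength, sample]
    sample_2 = prop_2d[:, 1]                      # rows 128-254 of the flat file
    ```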

    7) 'Supplemental.tar.gz'

    The archive contains the data files mentioned in the supplemental file of the paper, including the adjusted source data files for the nine mineral groups.

    The supplemental bulk property files are named based on the figure number.

    8) 'AdditionalRefInd.tar.gz'

    We also include additional refractive indices for chlorite, smectite, vermiculite, mica, dolomite, titanium-rich minerals, pyroxenes and soot. These data can be useful in other models.

    For more detailed information and datasets, please contact: Yuheng Zhang, yuheng98@tamu.edu or yuhengz98@qq.com.

  19. Data from: Size of plots for experiments with cactus pear cv. Gigante

    • scielo.figshare.com
    jpeg
    Updated May 31, 2023
    Bruno V. C. Guimarães; Sérgio L. R. Donato; Ignacio Aspiazú; Alcinei M. Azevedo; Abner J. de Carvalho (2023). Size of plots for experiments with cactus pear cv. Gigante [Dataset]. http://doi.org/10.6084/m9.figshare.8092634.v1
    Explore at:
    jpeg (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Bruno V. C. Guimarães; Sérgio L. R. Donato; Ignacio Aspiazú; Alcinei M. Azevedo; Abner J. de Carvalho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The definition of experimental plot size is an essential tool to ensure precision in the statistical analysis of experiments. The objective of this study was to estimate the plot size for cactus pear cv. Gigante using the Modified Maximum Curvature Method under the semi-arid conditions of Northeastern Brazil. The uniformity test was conducted at the Federal Institute of Bahia, Guanambi Campus, Bahia state, Brazil, during the agricultural period from 2009 to 2011. The spatial arrangement was composed of ten rows with 50 plants each; the evaluated area was formed by the eight central rows with 48 plants per row, totaling 384 plants and an area of 153.60 m². The following variables were evaluated: plant height; length, width, and thickness of cladode; number of cladodes; total area of cladodes; cladode area; and green mass yield in the third production cycle. In the evaluations, each plant was considered a basic experimental unit (BEU) with an area of 0.4 m², comprising 384 basic units (BU); adjacent units were combined to form 15 pre-established plot sizes with rectangular shapes arranged along rows. The characteristics total area of cladodes and green mass yield require larger plot sizes to be evaluated with greater experimental accuracy. For experimental evaluation of cactus pear cv. Gigante, the plot size should be eight plants in the direction of the crop row.

  20. Retail Market Basket Transactions Dataset

    • kaggle.com
    Updated Aug 25, 2025
    Wasiq Ali (2025). Retail Market Basket Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/retail-market-basket-transactions-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Wasiq Ali
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    The Market_Basket_Optimisation dataset is a classic transactional dataset often used in association rule mining and market basket analysis.
    It consists of multiple transactions where each transaction represents the collection of items purchased together by a customer in a single shopping trip.

    • File Name: Market_Basket_Optimisation.csv
    • Format: CSV (Comma-Separated Values)
    • Structure: Each row corresponds to one shopping basket. Each column in that row contains an item purchased in that basket (see the loading sketch after this list).
    • Nature of Data: Transactional, categorical, sparse.
    • Primary Use Case: Discovering frequent itemsets and association rules to understand shopping patterns, product affinities, and to build recommender systems.
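    As referenced above, a minimal Python sketch for loading the file into a list of transactions (assuming a plain, headerless CSV with trailing empty cells):

    ```python
    import csv

    # Read each row as one basket, dropping empty trailing cells.
    with open("Market_Basket_Optimisation.csv", newline="") as f:
        transactions = [[item for item in row if item.strip()]
                        for row in csv.reader(f)]

    print(len(transactions))   # expected: 7,501 baskets
    print(transactions[0])     # items in the first basket
    ```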

    Detailed Information

    📊 Dataset Composition

    • Transactions: 7,501 (each row = one basket).
    • Items (unique): Around 120 distinct products (e.g., bread, mineral water, chocolate).
    • Columns per row: Up to 20 item slots (not fixed; most baskets fill far fewer).
    • Data Type: Purely categorical (no numerical or continuous features).
    • Missing Values: Present as empty cells (since most baskets do not fill all 20 columns).
    • Duplicates: Some baskets may appear more than once — this is acceptable in transactional data as multiple customers can buy the same set of items.

    🛒 Nature of Transactions

    • Basket Definition: Each row captures items bought together during a single visit to the store.
    • Variability: Basket size varies from 1 to 20 items. Some customers buy only one product, while others purchase a full set of groceries.
    • Sparsity: Since there are ~120 unique items but only a handful appear in each basket, the dataset is sparse; most entries in the one-hot encoded representation are zeros (see the encoding sketch below).
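    A sketch of that one-hot encoding, reusing the transactions list from the loading sketch above. It uses mlxtend's TransactionEncoder, which is one common choice, not the only one:

    ```python
    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder

    # One boolean column per item, one row per basket; most entries are False.
    te = TransactionEncoder()
    onehot = te.fit(transactions).transform(transactions)
    basket_df = pd.DataFrame(onehot, columns=te.columns_)

    print(basket_df.shape)   # roughly (7501, ~120)
    ```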

    🔎 Examples of Data

    Example transaction rows (simplified):

    Item 1          Item 2          Item 3      Item 4    ...
    Bread           Butter          Jam
    Mineral water   Chocolate       Eggs        Milk
    Spaghetti       Tomato sauce    Parmesan

    Here, empty cells mean no item was purchased in that slot.

    📈 Applications of This Dataset

    This dataset is frequently used in data mining, analytics, and recommendation systems. Common applications include:

    1. Association Rule Mining (Apriori, FP-Growth; see the mining sketch after this list):

      • Discover rules like {Bread, Butter} ⇒ {Jam} with high support and confidence.
      • Identify cross-selling opportunities.
    2. Product Affinity Analysis:

      • Understand which items tend to be purchased together.
      • Helps with store layout decisions (placing related items near each other).
    3. Recommendation Engines:

      • Build systems that suggest "You may also like" products.
      • Example: If a customer buys pasta and tomato sauce, recommend cheese.
    4. Marketing Campaigns:

      • Bundle promotions and discounts on frequently co-purchased products.
      • Personalized offers based on buying history.
    5. Inventory Management:

      • Anticipate demand for certain product combinations.
      • Prevent stockouts of items that drive the purchase of others.
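    A minimal mining sketch for application 1, using mlxtend's apriori and association_rules on the one-hot frame built in the encoding sketch above (the thresholds are illustrative and should be tuned):

    ```python
    from mlxtend.frequent_patterns import apriori, association_rules

    frequent = apriori(basket_df, min_support=0.01, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.2)

    # Inspect the strongest product affinities, ranked by lift.
    cols = ["antecedents", "consequents", "support", "confidence", "lift"]
    print(rules.sort_values("lift", ascending=False)[cols].head(10))
    ```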

    📌 Key Insights Potentially Hidden in the Dataset

    • Popular Items: Some items (like mineral water, eggs, spaghetti) occur far more frequently than others.
    • Product Pairs: Frequent pairs and triplets (e.g., pasta + sauce + cheese) reflect natural meal-prep combinations.
    • Basket Size Distribution: Most customers buy fewer than 5 items, but a small fraction buy 10+ items, showing long-tail behavior (see the sketch after this list).
    • Seasonality (if extended with timestamps): Certain items might show peaks in demand during weekends or holidays (though timestamps are not included in this dataset).
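    The popularity and basket-size claims are easy to check directly from the transactions list built in the loading sketch:

    ```python
    from collections import Counter

    sizes = Counter(len(t) for t in transactions)
    print(sorted(sizes.items()))       # basket-size distribution, 1..20 items

    items = Counter(item for t in transactions for item in t)
    print(items.most_common(5))        # the most popular items
    ```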

    📂 Dataset Limitations

    1. No Customer Identifiers:

      • We cannot track repeated purchases by the same customer.
      • Analysis is limited to basket-level insights.
    2. No Timestamps:

      • No temporal analysis (trends over time, seasonality) is possible.
    3. No Quantities or Prices:

      • We only know whether an item was purchased, not how many units or its cost.
    4. Sparse & Noisy:

      • Many baskets are small (1–2 items), which may produce weak or trivial rules.

    🔮 Potential Extensions

    • Synthetic Timestamps: Assign simulated timestamps to study temporal buying patterns.
    • Add Customer IDs: Merging with external data would enable personalized recommendations.
    • Price Data: Adding cost allows for profit-driven association rules (not just frequency-based).
    • Deep Learning Models: Sequence models (RNNs, Transformers) could be applied if temporal ordering of items is introduced.

    ...

COVID19_datasets

COVID-19 datasets obtained from github.com/nytimes/covid-19-data/ and cdc sites

Description


**6 – Case Counts & Transmission Level** Source: This open-source dataset contains seven data items that describe community transmission levels across all counties. It provides the same numbers used for the transmission maps on the COVID Data Tracker and reports daily transmission levels at the county level. The dataset is updated every day to include the most current day's data; calculation procedures classify the transmission level as low, moderate, substantial, or high.
Description: US state and county case counts and transmission level from 16-Aug-2021 to 03-Feb-2022. URL: https://data.cdc.gov/Public-Health-Surveillance/United-States-COVID-19-County-Level-of-Community-T/8396-v7yb Data Size: 550,702 rows and 7 columns

**7 - World Cases & Vaccination Counts** Source: This is an open-source dataset collected and maintained by Our World in Data (OWID), which publishes research and data to make progress against the world's largest problems.
Description: This dataset includes vaccinations, tests and positivity, hospital and ICU occupancy, confirmed cases, confirmed deaths, reproduction rate, policy responses, and other variables of interest. URL: https://github.com/owid/covid-19-data/tree/master/public/data Data Size: 157,000 rows and 67 columns

**8 - COVID-19 Data in the European Union** Source: This is an open-source dataset collected and maintained by the European Centre for Disease Prevention and Control (ECDC), an EU agency aimed at strengthening Europe's defenses against infectious diseases.
Description: This dataset co...
