License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets were used to validate and test the data pipeline deployment following the RADON approach. The dataset has a CSV file that contains around 32,000 Twitter tweets. From this single CSV file, 100 CSV files were created, each containing 320 tweets. Those 100 CSV files are used to validate and test (performance/load testing) the data pipeline components.
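As a rough illustration (not part of the original pipeline), the split of one CSV of roughly 32,000 tweets into 100 files of 320 rows each could be reproduced with pandas; the input and output file names below are assumptions.

```python
import pandas as pd

# Hypothetical file name for the single source CSV of ~32,000 tweets.
df = pd.read_csv("tweets.csv")

chunk_size = 320
for i in range(100):
    # Slice 320 consecutive rows and write them to their own CSV file.
    chunk = df.iloc[i * chunk_size:(i + 1) * chunk_size]
    chunk.to_csv(f"tweets_part_{i:03d}.csv", index=False)
```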
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A comparison of common data import methods including ODBC, CSV uploads, QuickBooks Integration API, and third-party apps, focusing on speed, flexibility, and data handling capabilities.
License: GNU GPL v2.0, https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
-----------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- **LICENSE** - text of GPL v3, under which this dataset is published
- **INSTALL.md** - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/__init__.py`, comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15-30 minutes.

- create a folder `
License: GNU LGPL v3.0, http://www.gnu.org/licenses/lgpl-3.0.html
On the official website, the dataset is available through a SQL Server instance (localhost) and CSVs intended for use with Power BI Desktop running in the Virtual Lab (virtual machine). The first two data-import steps were executed in the virtual lab, and the resulting Power BI tables were then exported to CSVs. Records through the year 2022 were added as required.
This dataset is helpful if you want to work offline with Adventure Works data in Power BI Desktop in order to follow the lab instructions in the training material on the official website, for example the Power BI Desktop Sales Analysis exercise from Microsoft's PL-300 learning path.
Download the CSV file(s) and import them into Power BI Desktop as tables. The CSVs are named after the tables created in the first two data-import steps of the PL-300 Microsoft Power BI Data Analyst exam lab.
This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.

Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created. These files are included:

- Drying2022.csv: drying rate data for the 2022 experimental run
- Weather2022.csv: weather data for the 2022 experimental run
- Drying2023.csv: drying rate data for the 2023 experimental run
- Weather2023.csv: weather data for the 2023 experimental run
- disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
- disinfestant_drying_analysis.html: rendered output of the notebook
- MS_figures.R: additional R code to create figures formatted for journal requirements
- fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This allows users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
- fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
- data_dictionary.xlsx: descriptions of each column in the CSV data files
Csv Marketing Export Import Data. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
License: GNU GPL v3.0, https://www.gnu.org/licenses/gpl-3.0.html
Data pulled from Traffy Fondue via the Traffy Fondue Open API, covering January 2022 to January 2025.
The following code pulled the data:
import os
import json
import requests
from datetime import datetime, timedelta
import time
class TraffyDataFetcher:
    def __init__(self, start_date, subfolder='traffyfonduedata'):
        self.url = "https://publicapi.traffy.in.th/share/teamchadchart/search"
        self.query = {'offset': '0'}
        self.payload = {}
        self.headers = {}
        self.start_date = datetime.strptime(start_date, '%Y-%m-%d')
        self.end_date = datetime.now()
        self.subfolder = subfolder
        self.max_requests_per_minute = 99
        if not os.path.exists(self.subfolder):
            os.makedirs(self.subfolder)

    def add_days_to_date(self, start_date_str, days_to_add):
        start_date = datetime.strptime(start_date_str, '%Y-%m-%d')
        new_date = start_date + timedelta(days=days_to_add)
        return new_date.strftime('%Y-%m-%d')

    def fetch_data(self):
        current_date = self.start_date
        index = 0
        while current_date <= self.end_date:
            start_time = datetime.now()
            # Query a 10-day window starting at current_date.
            self.query['start'] = current_date.strftime('%Y-%m-%d')
            new_date = self.add_days_to_date(self.query['start'], 10)
            self.query['end'] = new_date
            response = requests.request("GET", self.url, headers=self.headers, data=self.payload, params=self.query)
            print(f"offset: {index} response: {response.status_code}")
            # Save the raw JSON response for this window.
            filename = f"traffy_{current_date.strftime('%Y-%m-%d')}.json"
            file_path = os.path.join(self.subfolder, filename)
            with open(file_path, "w") as outfile:
                json_object = json.dumps(response.json(), indent=4)
                outfile.write(json_object)
            end_time = datetime.now()
            elapsed_time = (end_time - start_time).total_seconds()
            print(f"Elapsed time: {elapsed_time} s")
            index += 950
            current_date = datetime.strptime(new_date, '%Y-%m-%d') + timedelta(days=1)
            # Simple rate limiting: pause before the next batch of requests
            # (guard against a negative sleep if the request took longer than a minute).
            if index % self.max_requests_per_minute == 0:
                time.sleep(max(0, 60 - elapsed_time))


if __name__ == "__main__":
    fetcher = TraffyDataFetcher(start_date='2022-01-01')
    fetcher.fetch_data()
--
And the following code converted the JSON files to CSV files:
import os
import glob
import json
import pandas as pd
#import numpy as np
class TraffyJSONFixer:
    def __init__(self, path_to_json='*.json', subfolder='traffyfonduedata'):
        self.path_to_json = path_to_json
        self.subfolder = subfolder
        self.outputfolder = 'fixedjson'
        self.excelfolder = 'exceloutput'
        self.file_path = os.path.join(self.subfolder, self.path_to_json)
        self.json_files = glob.glob(self.file_path)
        # Ensure the subfolder exists
        if not os.path.exists(self.subfolder):
            os.makedirs(self.subfolder)
        # Ensure the outputfolder exists
        if not os.path.exists(self.outputfolder):
            os.makedirs(self.outputfolder)
        # Ensure the excelfolder exists
        if not os.path.exists(self.excelfolder):
            os.makedirs(self.excelfolder)
        # Debugging: print the current working directory and the list of JSON files
        print(f"Current working directory: {os.getcwd()}")
        print(f"Found JSON files: {self.json_files}")

    def fix_json_files(self):
        for count, ele in enumerate(self.json_files):
            new_file_name = os.path.join(self.outputfolder, f"data_{os.path.basename(ele)}")
            try:
                with open(ele, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                # Debugging: print the type of data
                print(f"Processing file: {ele}")
                print(f"Type of data: {type(data)}")
                # Handle different JSON structures
                if isinstance(data, dict) and "results" in data:
                    results = data["results"]
                elif isinstance(data, list):
                    results = data
                else:
                    print(f"Unexpected JSON structure in file: {ele}")
                    continue
                # Ensure results is a list or dict before writing
                if isinstance(results, (list, dict)):
                    with open(new_file_name, 'w', encoding='utf-8') as f:
                        f.write(json.dumps(results, indent=4))
                else:
                    print(f"Unexpected type for results in file: {ele}")
            except (json.JSONDecodeError, KeyError) as e:
                print(f"Error processing file {ele}: {e}")

    def jsontoexcel(self):
        jsonfile_path = os.path.join(self.out...
Terms of use: https://crawlfeeds.com/privacy_policy
The Dog Food Data Extracted from Chewy (USA) dataset contains 4,500 detailed records of dog food products sourced from one of the leading pet supply platforms in the United States, Chewy. This dataset is ideal for businesses, researchers, and data analysts who want to explore and analyze the dog food market, including product offerings, pricing strategies, brand diversity, and customer preferences within the USA.
The dataset includes essential information such as product names, brands, prices, ingredient details, product descriptions, weight options, and availability. Organized in a CSV format for easy integration into analytics tools, this dataset provides valuable insights for those looking to study the pet food market, develop marketing strategies, or train machine learning models.
Key Features:
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
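For instance, a minimal sketch (the file name is a placeholder; substitute one of the provided daily or hourly CSV files):

```python
import pandas as pd

# Placeholder file name; use one of the daily or hourly LifeSnaps CSV files you downloaded.
daily_df = pd.read_csv("lifesnaps_daily.csv")
print(daily_df.head())
print(daily_df.columns)
```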
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
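Once restored, the collections can also be queried from Python with pymongo. This is a minimal sketch and assumes the default local instance and the database and collection names used in the commands above:

```python
from pymongo import MongoClient

# Connect to the local MongoDB instance targeted by the mongorestore commands above.
client = MongoClient("mongodb://localhost:27017/")
db = client["rais_anonymized"]

# Count the documents and inspect one example record from the fitbit collection.
print(db.fitbit.count_documents({}))
print(db.fitbit.find_one())
```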
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:
{
_id:
This dataset was created by DINESH JATAV.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Project Description:
Title: Pandas Data Manipulation and File Conversion
Overview: This project aims to demonstrate the basic functionalities of Pandas, a powerful data manipulation library in Python. In this project, we will create a DataFrame, perform some data manipulation operations using Pandas, and then convert the DataFrame into both Excel and CSV formats.
Key Objectives:
Tools and Libraries Used:
Project Implementation:
DataFrame Creation:
Data Manipulation:
File Conversion:
- The DataFrame is exported to an Excel file using the to_excel() function.
- The DataFrame is exported to a CSV file using the to_csv() function.
Expected Outcome:
Upon completion of this project, you will have gained a fundamental understanding of how to work with Pandas DataFrames, perform basic data manipulation tasks, and convert DataFrames into different file formats. This knowledge will be valuable for data analysis, preprocessing, and data export tasks in various data science and analytics projects.
Conclusion:
The Pandas library offers powerful tools for data manipulation and file conversion in Python. By completing this project, you will have acquired essential skills that are widely applicable in the field of data science and analytics. You can further extend this project by exploring more advanced Pandas functionalities or integrating it into larger data processing pipelines. In this project, we add a number of records, build a DataFrame from them, save the DataFrame to a single Excel file as different sheets, and then convert that Excel file to CSV files.
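A minimal sketch of the workflow described above; the sample data, sheet names, and file names are illustrative only, not the project's actual data:

```python
import pandas as pd

# Illustrative sample data; the actual project builds its own DataFrames.
sales = pd.DataFrame({"item": ["A", "B"], "units": [10, 5]})
costs = pd.DataFrame({"item": ["A", "B"], "cost": [3.5, 2.0]})

# Save both DataFrames into a single Excel workbook as separate sheets
# (requires an Excel engine such as openpyxl).
with pd.ExcelWriter("project_data.xlsx") as writer:
    sales.to_excel(writer, sheet_name="sales", index=False)
    costs.to_excel(writer, sheet_name="costs", index=False)

# Convert each sheet of the Excel file into its own CSV file.
sheets = pd.read_excel("project_data.xlsx", sheet_name=None)
for name, frame in sheets.items():
    frame.to_csv(f"{name}.csv", index=False)
```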
Csv Investments Private Limited Export Import Data. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A large data set of go-arounds (GAs), also referred to as missed approaches. The data set supports the paper presented at the OpenSky Symposium on November 10th.
If you use this data for a scientific publication, please consider citing our paper.
The data set contains landings at 176 (mostly) large airports in 44 different countries. Each landing is labelled as performing a go-around (GA) or not. In total, the data set contains almost 9 million landings with more than 33,000 GAs. The data was collected from the OpenSky Network's historical database for the year 2019. The published data set contains multiple files:
go_arounds_minimal.csv.gz
Compressed CSV containing the minimal data set. It contains a row for each landing and a minimal amount of information about the landing, and if it was a GA. The data is structured in the following way:
| Column name | Type | Description |
|---|---|---|
| time | date time | UTC time of landing or first GA attempt |
| icao24 | string | Unique 24-bit (hexadecimal number) ICAO identifier of the aircraft concerned |
| callsign | string | Aircraft identifier in air-ground communications |
| airport | string | ICAO airport code where the aircraft is landing |
| runway | string | Runway designator on which the aircraft landed |
| has_ga | string | "True" if at least one GA was performed, otherwise "False" |
| n_approaches | integer | Number of approaches identified for this flight |
| n_rwy_approached | integer | Number of unique runways approached by this flight |
The last two columns, n_approaches and n_rwy_approached, are useful for filtering out training and calibration flights. These usually have a large number of approaches, so an easy way to exclude them is to drop flights with n_approaches > 2.
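For instance, a simple pandas filter along these lines could drop the suspected training and calibration flights (a sketch, not code from the dataset authors):

```python
import pandas as pd

df = pd.read_csv("go_arounds_minimal.csv.gz")
# Keep only flights with at most two identified approaches.
df_filtered = df[df["n_approaches"] <= 2]
```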
go_arounds_augmented.csv.gz
Compressed CSV containing the augmented data set. It contains a row for each landing and additional information about the landing, and if it was a GA. The data is structured in the following way:
| Column name | Type | Description |
|---|---|---|
| time | date time | UTC time of landing or first GA attempt |
| icao24 | string | Unique 24-bit (hexadecimal number) ICAO identifier of the aircraft concerned |
| callsign | string | Aircraft identifier in air-ground communications |
| airport | string | ICAO airport code where the aircraft is landing |
| runway | string | Runway designator on which the aircraft landed |
| has_ga | string | "True" if at least one GA was performed, otherwise "False" |
| n_approaches | integer | Number of approaches identified for this flight |
| n_rwy_approached | integer | Number of unique runways approached by this flight |
| registration | string | Aircraft registration |
| typecode | string | Aircraft ICAO typecode |
| icaoaircrafttype | string | ICAO aircraft type |
| wtc | string | ICAO wake turbulence category |
| glide_slope_angle | float | Angle of the ILS glide slope in degrees |
| has_intersection | string | Boolean that is true if the runway has another runway intersecting it, otherwise false |
| rwy_length | float | Length of the runway in kilometres |
| airport_country | string | ISO Alpha-3 country code of the airport |
| airport_region | string | Geographical region of the airport (either Europe, North America, South America, Asia, Africa, or Oceania) |
| operator_country | string | ISO Alpha-3 country code of the operator |
| operator_region | string | Geographical region of the operator of the aircraft (either Europe, North America, South America, Asia, Africa, or Oceania) |
| wind_speed_knts | integer | METAR, surface wind speed in knots |
| wind_dir_deg | integer | METAR, surface wind direction in degrees |
| wind_gust_knts | integer | METAR, surface wind gust speed in knots |
| visibility_m | float | METAR, visibility in m |
| temperature_deg | integer | METAR, temperature in degrees Celsius |
| press_sea_level_p | float | METAR, sea level pressure in hPa |
| press_p | float | METAR, QNH in hPa |
| weather_intensity | list | METAR, list of present weather codes: qualifier - intensity |
| weather_precipitation | list | METAR, list of present weather codes: weather phenomena - precipitation |
| weather_desc | list | METAR, list of present weather codes: qualifier - descriptor |
| weather_obscuration | list | METAR, list of present weather codes: weather phenomena - obscuration |
| weather_other | list | METAR, list of present weather codes: weather phenomena - other |
This data set is augmented with data from various public data sources. Aircraft-related data is mostly from the OpenSky Network's aircraft database, the METAR information is from Iowa State University, and the rest is mostly scraped from different websites. If you need help with the METAR information, you can consult the WMO's Aerodrome Reports and Forecasts handbook.
go_arounds_agg.csv.gz
Compressed CSV containing the aggregated data set. It contains a row for each airport-runway, i.e. every runway at every airport for which data is available. The data is structured in the following way:
| Column name | Type | Description |
|---|---|---|
| airport | string | ICAO airport code where the aircraft is landing |
| runway | string | Runway designator on which the aircraft landed |
| n_landings | integer | Total number of landings observed on this runway in 2019 |
| ga_rate | float | Go-around rate, per 1000 landings |
| glide_slope_angle | float | Angle of the ILS glide slope in degrees |
| has_intersection | string | Boolean that is true if the runway has another runway intersecting it, otherwise false |
| rwy_length | float | Length of the runway in kilometres |
| airport_country | string | ISO Alpha-3 country code of the airport |
| airport_region | string | Geographical region of the airport (either Europe, North America, South America, Asia, Africa, or Oceania) |
This aggregated data set is used in the paper for the generalized linear regression model.
Downloading the trajectories
Users of this data set with access to the OpenSky Network's Impala shell can download the historical trajectories from the historical database with a few lines of Python code. For example, suppose you want to get all the go-arounds of 4 January 2019 at London City Airport (EGLC). You can use the Traffic library for easy access to the database:
import datetime

from tqdm.auto import tqdm
import pandas as pd
from traffic.data import opensky
from traffic.core import Traffic

# Load the minimal data set and parse the landing times.
df = pd.read_csv("go_arounds_minimal.csv.gz", low_memory=False)
df["time"] = pd.to_datetime(df["time"])

# Select the airport and the day of interest (UTC).
airport = "EGLC"
start = datetime.datetime(year=2019, month=1, day=4).replace(tzinfo=datetime.timezone.utc)
stop = datetime.datetime(year=2019, month=1, day=5).replace(tzinfo=datetime.timezone.utc)

df_selection = df.query("airport==@airport & has_ga & (@start <= time <= @stop)")

flights = []
delta_time = pd.Timedelta(minutes=10)
for _, row in tqdm(df_selection.iterrows(), total=df_selection.shape[0]):
    # take at most 10 minutes before and 10 minutes after the landing or go-around
    start_time = row["time"] - delta_time
    stop_time = row["time"] + delta_time

    # fetch the data from OpenSky Network
    flights.append(
        opensky.history(
            start=start_time.strftime("%Y-%m-%d %H:%M:%S"),
            stop=stop_time.strftime("%Y-%m-%d %H:%M:%S"),
            callsign=row["callsign"],
            return_flight=True,
        )
    )

Traffic.from_flights(flights)
Additional files
Additional files are available to check the quality of the classification into GA/not GA and the selection of the landing runway. These are:
validation_table.xlsx: This Excel sheet was manually completed during the review of the samples for each runway in the data set. It provides an estimate of the false positive and false negative rate of the go-around classification. It also provides an estimate of the runway misclassification rate when the airport has two or more parallel runways. The columns with the headers highlighted in red were filled in manually, the rest is generated automatically.
validation_sample.zip: For each runway, 8 batches of 500 randomly selected trajectories (or as many as available, if fewer than 4000) classified as not having a GA and up to 8 batches of 10 random landings, classified as GA, are plotted. This allows the interested user to visually inspect a random sample of the landings and go-arounds easily.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
keep_default_na=False, na_values='')
Caveats
Data Set Information:
Diabetes patient records were obtained from two sources: an automatic electronic recording device and paper records. The automatic device had an internal clock to timestamp events, whereas the paper records only provided "logical time" slots (breakfast, lunch, dinner, bedtime). For paper records, fixed times were assigned to breakfast (08:00), lunch (12:00), dinner (18:00), and bedtime (22:00). Thus paper records have fictitious uniform recording times whereas electronic records have more realistic time stamps.
Diabetes files consist of four fields per record. Each field is separated by a tab and each record is separated by a newline.
File Names and format: (1) Date in MM-DD-YYYY format (2) Time in XX:YY format (3) Code (4) Value
The Code field is deciphered as follows:
33 = Regular insulin dose
34 = NPH insulin dose
35 = UltraLente insulin dose
48 = Unspecified blood glucose measurement
57 = Unspecified blood glucose measurement
58 = Pre-breakfast blood glucose measurement
59 = Post-breakfast blood glucose measurement
60 = Pre-lunch blood glucose measurement
61 = Post-lunch blood glucose measurement
62 = Pre-supper blood glucose measurement
63 = Post-supper blood glucose measurement
64 = Pre-snack blood glucose measurement
65 = Hypoglycemic symptoms
66 = Typical meal ingestion
67 = More-than-usual meal ingestion
68 = Less-than-usual meal ingestion
69 = Typical exercise activity
70 = More-than-usual exercise activity
71 = Less-than-usual exercise activity
72 = Unspecified special event
import pandas as pd
from pathlib import Path


def convert(file):
    # Parse one tab-separated data file into a DataFrame.
    rows = []
    with open(str(file), 'r') as data:
        for line in data.readlines():
            line = line.replace('\n', '')
            rows.append(line.split('\t'))
    df = pd.DataFrame(rows, columns=['date', 'time', 'code', 'value'])
    df.index = range(0, len(df))
    # Write the parsed records next to the original file as CSV.
    new_file = file.parent / f'{file.name}_csv.csv'
    df.to_csv(new_file)
    print(f"{file.name} was saved.")


path = 'C:/Users/Krittaphas/PycharmProjects/auto/Diabetes-Data'
for file in Path(path).iterdir():
    convert(file)
import pandas as pd
from pathlib import Path

path = 'C:/Users/Krittaphas/PycharmProjects/auto/Diabetes-Data'

# Collect the per-patient CSV files and tag each record with its patient id.
frames = []
for file in Path(path).iterdir():
    number = file.name[5:7]
    df = pd.read_csv(file)
    df['patient_id'] = number
    frames.append(df)

main_df = pd.concat(frames, ignore_index=True)
main_df.drop('Unnamed: 0', inplace=True, axis=1)
print(main_df)
main_df.to_csv(path + '/diabetes_data_all_patient.csv')
print('Complete')
This dataset was created from images of hand signs: the hand landmarks detected in each image were turned into the attributes of the dataset. It contains all 21 landmarks, each with its (x, y, z) coordinates, and 5 classes (1, 2, 3, 4, 5).
You can also add more classes to your dataset by running the following code. Make sure to create an empty DataFrame or append to the existing dataset, and set the file path correctly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mediapipe as mp
import cv2
import os

# Start from an empty DataFrame, or load the existing dataset here to append to it
# (set the file path to your own copy of the dataset).
df = pd.DataFrame()

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=1,
                       min_detection_confidence=0.8, min_tracking_confidence=0.8)

for t in range(1, 6):
    path = 'data/' + str(t) + '/'
    images = os.listdir(path)
    for i in images:
        image = cv2.imread(path + i)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image.flags.writeable = False
        results = hands.process(image)
        image.flags.writeable = True
        if results.multi_hand_landmarks:
            for hand_no, hand_landmarks in enumerate(results.multi_hand_landmarks):
                mp_draw.draw_landmarks(image=image, landmark_list=hand_landmarks,
                                       connections=mp_hands.HAND_CONNECTIONS)
                # Build one row per image: the class label plus x, y, z for all 21 landmarks.
                a = dict()
                a['label'] = t
                for lm in range(21):
                    s = ('x', 'y', 'z')
                    k = (hand_landmarks.landmark[lm].x,
                         hand_landmarks.landmark[lm].y,
                         hand_landmarks.landmark[lm].z)
                    for j in range(len(k)):
                        a[str(mp_hands.HandLandmark(lm).name) + '_' + str(s[j])] = k[j]
                df = pd.concat([df, pd.DataFrame([a])], ignore_index=True)
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The data was collected on 2024-04-05 and contains 3,492 problems. It was cleaned with the following script.
import json
import csv
from io import TextIOWrapper
def clean(data: dict):
    questions = data['data']['problemsetQuestionList']['questions']
    for q in questions:
        yield {
            'id': q['frontendQuestionId'],
            'difficulty': q['difficulty'],
            'title': q['title'],
            'titleCn': q['titleCn'],
            'titleSlug': q['titleSlug'],
            'paidOnly': q['paidOnly'],
            'acRate': round(q['acRate'], 3),
            'topicTags': [t['name'] for t in q['topicTags']],
        }


def out_jsonl(f: TextIOWrapper):
    for id in range(0, 35):
        with open(f'data/{id}.json', encoding='u8') as f2:
            data = json.load(f2)
            for q in clean(data):
                f.write(json.dumps(q, ensure_ascii=False))
                f.write('\n')


def out_json(f: TextIOWrapper):
    l = []
    for id in range(0, 35):
        with open(f'data/{id}.json', encoding='u8') as f2:
            data = json.load(f2)
            for q in clean(data):
                l.append(q)
    json.dump(l, f, ensure_ascii=False)


def out_csv(f: TextIOWrapper):
    writer = csv.DictWriter(f, fieldnames=[
        'id', 'difficulty', 'title', 'titleCn', 'titleSlug', 'paidOnly', 'acRate', 'topicTags'
    ])
    writer.writeheader()
    for id in range(0, 35):
        with open(f'data/{id}.json', encoding='u8') as f2:
            data = json.load(f2)
            writer.writerows(clean(data))


with open('data.jsonl', 'w', encoding='u8') as f:
    out_jsonl(f)
with open('data.json', 'w', encoding='u8') as f:
    out_json(f)
with open('data.csv', 'w', encoding='u8', newline='') as f:
    out_csv(f)
License: https://data.gov.tw/license
Provide "Statistics of Import and Export Trade Volume of Each Park" to let the public understand the import and export and its growth trend of each park. In addition to updating this information every month, CSV file format is also provided for free download and use by the public.The dataset includes statistics on the import and export trade volume of parks such as Nanzih, Kaohsiung, Taichung, Zhonggang, Pingtung, and other parks (Lingguang, Chenggong, Gaoruan), with main fields including "Park, Import and Export (This Month, Year-to-Date)", "Export (This Month, Year-to-Date)", "Import (This Month, Year-to-Date)", and other important information.
License: Open Database License (ODbL), https://choosealicense.com/licenses/odbl/
This is the initial dataset we scraped from OpenStreetMap (OSM). Be aware that this dataset has not been cleaned yet!
!pip install requests
Script:

import csv
import time
import requests
from urllib.parse import quote

OUT_CSV = "jabodetabek_sports_osm.csv"
BBOX = (-6.80, 106.30, -5.90, 107.20)
OVERPASS_URL = "https://overpass-api.de/api/interpreter"
WIKIDATA_ENTITY_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
FETCH_WIKIDATA_IMAGES =… See the full description on the dataset page: https://huggingface.co/datasets/Shiowo2/Initial-Data-FitMatrix.
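The script above is truncated. As a rough illustration only, a bounding-box Overpass API query might look like the sketch below; the queried tags (leisure=pitch, leisure=sports_centre) and the output columns are assumptions, not necessarily what the original script requests.

```python
import csv
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"
BBOX = (-6.80, 106.30, -5.90, 107.20)  # (south, west, north, east), roughly Jabodetabek

# Assumed tags; the original script may request a different set of sport-related features.
query = f"""
[out:json][timeout:60];
(
  node["leisure"~"pitch|sports_centre"]({BBOX[0]},{BBOX[1]},{BBOX[2]},{BBOX[3]});
  way["leisure"~"pitch|sports_centre"]({BBOX[0]},{BBOX[1]},{BBOX[2]},{BBOX[3]});
);
out center;
"""

response = requests.post(OVERPASS_URL, data={"data": query})
elements = response.json().get("elements", [])

# Write a small sample CSV; the column choice is illustrative only.
with open("jabodetabek_sports_osm_sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "type", "name", "lat", "lon"])
    for el in elements:
        tags = el.get("tags", {})
        lat = el.get("lat") or el.get("center", {}).get("lat")
        lon = el.get("lon") or el.get("center", {}).get("lon")
        writer.writerow([el["id"], el["type"], tags.get("name", ""), lat, lon])
```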
This dataset is a clean CSV file with the most recent estimates of the population of countries according to Worldometer. The data is taken from the following link: https://www.worldometers.info/world-population/population-by-country/
The data was generated by web scraping the aforementioned link on 16 August 2021. Below is the code used to create the CSV data in Python 3.8:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.worldometers.info/world-population/population-by-country/"

# Download the page and parse it.
r = requests.get(url)
soup = BeautifulSoup(r.content)

# Grab the population table and convert it to a DataFrame.
countries = soup.find_all("table")[0]
dataframe = pd.read_html(str(countries))[0]

# Save the table as a CSV file.
dataframe.to_csv("countries_by_population_2021.csv", index=False)
The creation of this dataset would not be possible without a team of Worldometers, a data aggregation website.