Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Definitions of all variables in the full data table for Webb & Mindel, Global Patterns of Extinction Risk in Marine and Non-marine Systems, Current Biology. Links to full data table and R code to generate figures and analyses.

Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Data using the definition of alcohol-specific deaths in addition to a count of deaths caused by chronic hepatitis and fibrosis and cirrhosis of the liver in the UK

Data tables are in two Excel worksheets. The first sheet, labeled 'Fitted Filtration Efficiency', has columns for subject, mask and condition (baseline or with clip), chamber relative humidity (%) and temperature (degrees Celsius), and the overall fitted filtration efficiency mean (across four exercises) and standard deviation. The second sheet, labeled 'Sex', has columns for the subject number and biological sex (F = Female, M = Male). This dataset is associated with the following publication: Pennington, E., J. Griffin, E. McInroe, W. Steinhardt, H. Chen, J. Samet, and S. Prince. Variation in the Fitted Filtration Efficiency of Disposable Face Masks by Sex. Journal of Exposure Science and Environmental Epidemiology. Nature Publishing Group, London, UK, s41370-024-00697-4, (2024).

Data includes consumption for a range of property characteristics such as age and type, as well as a range of household characteristics such as the number of adults and household income.
The content covers:

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contains the Excel workbook created to compile and analyze all position values registered during the stereo versions analysis. The first sheet contains an overview of the analysis of the 7 songs, and the following sheets present the individual table for each song. The last sheet also contains a comparison of mean and standard deviation values between the two formats analyzed, stereo and surround sound.

Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The Travel Time to Work indicator compares the mean, or average, commute time for Champaign County residents to the mean commute time for residents of Illinois and the United States as a whole. On its own, mean travel time of all commuters on all mode types could be reflective of a number of different conditions. Congestion, mode choice, changes in residential patterns, changes in the location of major employment centers, and changes in the transit network can all impact travel time in different and often conflicting ways. Since the onset of the COVID-19 pandemic in 2020, the workplace location (office vs. home) is another factor that can impact the mean travel time of an area. We don’t recommend trying to draw any conclusions about conditions in Champaign County, or anywhere else, based on mean travel time alone.
However, when combined with other indicators in the Mobility category (and other categories), mean travel time to work is a valuable measure of transportation behaviors in Champaign County.
Champaign County’s mean travel time to work is lower than the mean travel time to work in Illinois and the United States. Based on this figure, the state of Illinois has the longest commutes of the three analyzed areas.
The year-to-year fluctuations in mean travel time have been statistically significant in the United States since 2014, and in Illinois most recently in 2021 and 2022. Champaign County’s year-to-year fluctuation in mean travel time was statistically significant from 2021 to 2022, for the first time since tracking of this data began in 2005.
Mean travel time data was sourced from the U.S. Census Bureau’s American Community Survey (ACS) 1-Year Estimates, which are released annually.
As with any datasets that are estimates rather than exact counts, it is important to take into account the margins of error (listed in the column beside each figure) when drawing conclusions from the data.
Due to the impact of the COVID-19 pandemic, instead of providing the standard 1-year data products, the Census Bureau released experimental estimates from the 1-year data in 2020. This includes a limited number of data tables for the nation, states, and the District of Columbia. The Census Bureau states that the 2020 ACS 1-year experimental tables use an experimental estimation methodology and should not be compared with other ACS data. For these reasons, and because data is not available for Champaign County, no data for 2020 is included in this Indicator.
For interested data users, the 2020 ACS 1-Year Experimental data release includes a dataset on Travel Time to Work.
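
One way to connect the margins of error to the significance statements above is the two-estimate comparison commonly used with ACS data. ACS margins of error are published at the 90 percent confidence level, so each one can be converted to a standard error by dividing by 1.645, and the difference between two estimates is then compared with its combined standard error. The sketch below is illustrative only: the function name is hypothetical, and the travel times and margins in the usage lines are made-up placeholders, not values from Table S0801.

```python
import math

Z_90 = 1.645  # critical value for the 90 percent confidence level used by the ACS


def is_significant_change(est_a, moe_a, est_b, moe_b, z_crit=Z_90):
    """Return (z_score, significant) for the difference between two estimates."""
    # Convert each 90 percent margin of error to a standard error.
    se_a = moe_a / Z_90
    se_b = moe_b / Z_90
    # Standard error of the difference, treating the estimates as independent.
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z_score = (est_b - est_a) / se_diff
    return z_score, abs(z_score) > z_crit


# Hypothetical mean travel times (minutes) with 90 percent margins of error.
z, significant = is_significant_change(18.0, 0.9, 19.6, 0.8)
print(f"z = {z:.2f}, significant at the 90% level: {significant}")
```

A difference that fails this test may still be real; it only means the sample cannot distinguish it from sampling variability at the chosen confidence level.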
Sources:
U.S. Census Bureau; American Community Survey, 2024 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (18 November 2025).
U.S. Census Bureau; American Community Survey, 2023 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (16 October 2024).
U.S. Census Bureau; American Community Survey, 2022 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (10 October 2023).
U.S. Census Bureau; American Community Survey, 2021 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (17 October 2022).
U.S. Census Bureau; American Community Survey, 2019 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (29 March 2021).
U.S. Census Bureau; American Community Survey, 2018 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using data.census.gov; (29 March 2021).
U.S. Census Bureau; American Community Survey, 2017 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (13 September 2018).
U.S. Census Bureau; American Community Survey, 2016 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (14 September 2017).
U.S. Census Bureau; American Community Survey, 2015 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (19 September 2016).
U.S. Census Bureau; American Community Survey, 2014 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2013 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2012 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2011 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2010 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2009 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2008 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2007 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2006 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2005 American Community Survey 1-Year Estimates, Table S0801; generated by CCRPC staff; using American FactFinder; (16 March 2016).

https://data.mfe.govt.nz/license/attribution-3-0-new-zealand/
The ocean waters surrounding New Zealand vary in temperature from north to south. They interact with heat and moisture in the atmosphere and affect our weather. Sea surface temperature changes with climate drivers such as El Niño, and will change with climate change. The sea surface temperature anomaly provides an indication of the heat change in the ocean. Long-term changes and short-term variability in sea surface temperatures can affect marine processes, habitats, and species. Some species may find it hard to survive in changing environmental conditions. The oceanic sea surface temperature data comes from the NIWA Sea surface temperature Archive (NSA). There are 2 datasets, NSA Annual Means and NSA Annual Anomalies, covering the Tasman, subtropical (STW) and Southern Antarctic (SAW) areas and the total area. The data is available from 1993 to 2013 and the unit of measure is degrees Celsius. For more information please see: Uddstrom, MJ (2015) Sea Surface Temperature Data and Analysis for the 2015 Synthesis Report. For Ministry for the Environment. Available at https://data.mfe.govt.nz/x/hRbGUJ on the Ministry for the Environment dataservice (https://data.mfe.govt.nz). Trend results can be found in the Excel file "Sea surface temperature trend statistics" at https://data.mfe.govt.nz/x/DGXFS6. This dataset relates to the "Sea surface temperature" measure on the Environmental Indicators, Te taiao Aotearoa website.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset extracted from the post 10 Important Questions on Fundamental Analysis of Stocks – Meaning, Parameters, and Step-by-Step Guide on Smart Investello.

The "SPM LTDS Appendix 2 Transformer Data (Table 2) 33 to 11kV" data table provides the parameters for each group of transformers on the SP Manweb (SPM) system.
Click here to access our full Long Term Development Statements for both SP Distribution (SPD) & SP Manweb (SPM).
The table gives the following information:
- Node 1 & 2 per substation group
- Positive sequence impedance R & X per substation group
- Zero sequence reactance per substation group
- Minimum and maximum tap percentage per substation group
- Transformer rating
- Reverse power capability
For additional information on column definitions, please click on the Dataset schema link below.
Disclaimer
Whilst all reasonable care has been taken in the preparation of this data, SP Energy Networks does not accept any responsibility or liability for the accuracy or completeness of this data, and is not liable for any loss that may be attributed to the use of this data. For the avoidance of doubt, this data should not be used for safety critical purposes without the use of appropriate safety checks and services e.g. LineSearchBeforeUDig etc. Please raise any potential issues with the data which you have received via the feedback form available at the Feedback tab above (must be logged in to see this).
Data Triage
As part of our commitment to enhancing the transparency and accessibility of the data we share, we publish the results of our Data Triage process. Our Data Triage documentation includes our Risk Assessments, detailing any controls we have implemented to prevent exposure of sensitive information. Click here to access the Data Triage documentation for the Long Term Development Statement dataset. To access our full suite of Data Triage documentation, visit the SP Energy Networks Data & Information page.

Objectives: Inborn errors of immunity (IEI) comprise a broad group of inherited immunological disorders whose clinical manifestations often overlap, which makes diagnosis challenging. Identifying disease-causing variants is the gold-standard approach to ascertain an IEI diagnosis. Efforts to increase the availability of clinically relevant genomic data for these disorders constitute an important improvement in the study of rare genetic disorders. This work aims to make available whole-exome sequencing (WES) data of Brazilian patients with suspected IEI and no genetic diagnosis. We foresee a broad use of this dataset by the scientific community in order to provide a more accurate diagnosis of IEI disorders.
Data description: Twenty unrelated singleton patients treated at four different hospitals in the state of Rio de Janeiro, Brazil were enrolled in our study. Half of the patients were male, with a mean age of 9±3 years, while the mean age of the female patients was 12±10 years. WES was performed on the Illumina NextSeq platform, with at least 90% of sequenced bases covered at a minimum depth of 30 reads. Each sample had an average of 20,274 variants, comprising 116 classified as rare pathogenic or likely pathogenic according to ACMG guidelines. The genotype-phenotype association was impaired by the lack of detailed clinical and laboratory information and by the unavailability of molecular and functional studies, which together constitute the limitations of this study. Overall, access to clinical exome sequencing data is limited, which hampers exploratory analyses and the understanding of the genetic mechanisms underlying disorders. Therefore, by making these data available, we aim to increase the amount of WES data from Brazilian samples as well as to contribute to the study of monogenic IEI disorders.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A compilation of data definitions for the Global Knowledge Base for Underutilised Crops.

These data arise from a field study of groundfish catch monitoring in Kodiak, AK trawl fisheries. Two monitoring components were included in the study: 1) at-sea sampling methods used by observers to sample species composition of catch and 2) shoreside sampling of delivered catch by observers to validate landings species composition reports. The at-sea portion of the study consisted of a side-by-side comparison (two observers) of a proposed new sampling method and the standard sampling method. Observer data were recorded at sea on paper and transferred to an Oracle database. The shoreside component of this study consisted of observer species composition sampling in plants for later comparison with landings data. The shoreside data were collected by observers in processing plants, recorded on paper and transferred to an Oracle database. Data collection started in April 2011 and continued through August 2011. Third-party landings data (NOAA Fisheries, Alaska Regional Office, Sustainable Fisheries Division) that were used in the analysis are stored in an Oracle database. Data for both project components (at-sea and shoreside) were collected during normal fishing activities onboard commercial trawl catcher vessels and during normal processing activities in shore-based processing plants.

The "SPD LTDS Appendix 1 Circuit Data (Table 1)" data table provides data that is derived from power system analysis software and, therefore, the circuit parameters detailed are based on the equipment between analytical node points. As some circuits may have intermediate node points, or a number of components, this aspect should be taken into consideration when assessing overall (end-to-end) circuit parameters. Those circuits labelled S/C, or short-circuit, represent circuit breakers, switches, or busbar connections of effectively zero impedance.
Click here to access our full Long Term Development Statements for both SP Distribution (SPD) & SP Manweb (SPM).
The table gives the following information:
- Resistance, reactance and susceptance per GSP
- Circuit from and to for each GSP
- Winter, summer, spring and autumn maximum continuous rating
For additional information on column definitions, please click on the Dataset schema link below.
Disclaimer
Whilst all reasonable care has been taken in the preparation of this data, SP Energy Networks does not accept any responsibility or liability for the accuracy or completeness of this data, and is not liable for any loss that may be attributed to the use of this data. For the avoidance of doubt, this data should not be used for safety critical purposes without the use of appropriate safety checks and services e.g. LineSearchBeforeUDig etc. Please raise any potential issues with the data which you have received via the feedback form available at the Feedback tab above (must be logged in to see this).
Data Triage
As part of our commitment to enhancing the transparency and accessibility of the data we share, we publish the results of our Data Triage process. Our Data Triage documentation includes our Risk Assessments, detailing any controls we have implemented to prevent exposure of sensitive information. Click here to access the Data Triage documentation for the Long Term Development Statement dataset. To access our full suite of Data Triage documentation, visit the SP Energy Networks Data & Information page.
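
The note about intermediate node points can be made concrete with a small sketch. It assumes, as a simplification, that the rows making up one circuit form a single series path, so resistance and reactance add directly and susceptance is summed as a lumped approximation. The `Segment` class, its field names, and the numbers are hypothetical and are not taken from the published table or its schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One table row between two analytical nodes (field names are hypothetical)."""
    from_node: str
    to_node: str
    resistance: float   # R
    reactance: float    # X
    susceptance: float  # B


def end_to_end(segments):
    """Sum R, X and B over segments that form one continuous series path."""
    # Check that each segment starts where the previous one ended.
    for prev, nxt in zip(segments, segments[1:]):
        if prev.to_node != nxt.from_node:
            raise ValueError(f"{prev.to_node} does not connect to {nxt.from_node}")
    return (
        sum(s.resistance for s in segments),
        sum(s.reactance for s in segments),
        sum(s.susceptance for s in segments),
    )


# Made-up values; the middle segment stands for an S/C row of zero impedance.
path = [
    Segment("A", "B", 0.25, 0.5, 0.125),
    Segment("B", "C", 0.0, 0.0, 0.0),
    Segment("C", "D", 0.5, 0.25, 0.25),
]
print(end_to_end(path))
```

Circuits with branches or parallel components would need a proper network calculation rather than a simple sum.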

CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Although the American Community Survey (ACS) produces population, demographic and housing unit estimates, the decennial census is the official source of population totals for April 1st of each decennial year. In between censuses, the Census Bureau's Population Estimates Program produces and disseminates the official estimates of the population for the nation, states, counties, cities, and towns and estimates of housing units and the group quarters population for states and counties.

Information about the American Community Survey (ACS) can be found on the ACS website. Supporting documentation including code lists, subject definitions, data accuracy, and statistical testing, and a full list of ACS tables and table shells (without estimates) can be found on the Technical Documentation section of the ACS website. Sample size and data quality measures (including coverage rates, allocation rates, and response rates) can be found on the American Community Survey website in the Methodology section.

Source: U.S. Census Bureau, 2023 American Community Survey 1-Year Estimates.

ACS data generally reflect the geographic boundaries of legal and statistical areas as of January 1 of the estimate year. For more information, see Geography Boundaries by Year.

Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted roughly as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see ACS Technical Documentation). The effect of nonsampling error is not represented in these tables.

Users must consider potential differences in geographic boundaries, questionnaire content or coding, or other methodological issues when comparing ACS data from different years. Statistically significant differences shown in ACS Comparison Profiles, or in data users' own analysis, may be the result of these differences and thus might not necessarily reflect changes to the social, economic, housing, or demographic characteristics being compared. For more information, see Comparing ACS Data.

The age dependency ratio is derived by dividing the combined under-18 and 65-and-over populations by the 18-to-64 population and multiplying by 100.

The old-age dependency ratio is derived by dividing the population 65 and over by the 18-to-64 population and multiplying by 100.

The child dependency ratio is derived by dividing the population under 18 by the 18-to-64 population and multiplying by 100.

When information is missing or inconsistent, the Census Bureau logically assigns an acceptable value using the response to a related question or questions. If a logical assignment is not possible, data are filled using a statistical process called allocation, which uses a similar individual or household to provide a donor value. The "Allocated" section is the number of respondents who received an allocated value for a particular subject.

Estimates of urban and rural populations, housing units, and characteristics reflect boundaries of urban areas defined based on 2020 Census data. As a result, data for urban and rural areas from the ACS do not necessarily reflect the results of ongoing urbanization.

Explanation of Symbols:
- `-` The estimate could not be computed because there were an insufficient number of sample observations. For a ratio of medians estimate, one or both of the median estimates falls in the lowest interval or highest interval of an open-ended distribution. For a 5-year median estimate, the margin of error associated with a median was larger than the median itself.
- `N` The estimate or margin of error cannot be displayed because there were an insufficient number of sample cases in the selected geographic area.
- `(X)` The estimate or margin of error is not applicable or not available.
- `median-` The median falls in the lowest interval of an open-ended distribution (for example "2,500-").
- `median+` The median falls in the highest interval of an open-ended distribution (for example "250,000+").
- `**` The margin of error could not be computed because there were an insufficient number of sample observations.
- `***` The margin of error could not be computed because the median falls in the lowest interval or highest interval of an open-ended distribution.
- `*****` A margin of error is not appropriate because the corresponding estimate is controlled to an independent population or housing estimate. Effectively, the corresponding estimate has no sampling error and the margin of error may be treated as zero.
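
The three dependency ratios and the confidence bounds described above reduce to a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical and the population counts are invented to keep the numbers easy to follow.

```python
def dependency_ratios(under_18, age_18_to_64, age_65_plus):
    """Compute the three dependency ratios per 100 people aged 18 to 64."""
    if age_18_to_64 <= 0:
        raise ValueError("the 18-to-64 population must be positive")
    return {
        "age": (under_18 + age_65_plus) / age_18_to_64 * 100,
        "old_age": age_65_plus / age_18_to_64 * 100,
        "child": under_18 / age_18_to_64 * 100,
    }


def confidence_bounds(estimate, margin_of_error):
    """Lower and upper bounds: estimate minus and plus the margin of error."""
    return estimate - margin_of_error, estimate + margin_of_error


# Hypothetical counts, not taken from any ACS table.
ratios = dependency_ratios(under_18=22_000, age_18_to_64=60_000, age_65_plus=18_000)
print(ratios)
print(confidence_bounds(60_000, 1_500))
```

By construction, the age dependency ratio equals the sum of the old-age and child ratios, which gives a quick consistency check on published figures.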

http://researchdatafinder.qut.edu.au/display/n39033
QUT Research Data Repository Dataset Resource available for download

Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Methane hydrates are present in marine seep systems and occur within the gas hydrate stability zone. Very little is known about their crystallite sizes and size distributions because they are notoriously difficult to measure. Crystal size distributions are usually considered as one of the key petrophysical parameters because they influence mechanical properties and possible compositional changes, which may occur with changing environmental conditions. Variations in grain size are relevant for gas substitution in natural hydrates by replacing CH4 with CO2 for the purpose of carbon dioxide sequestration. Here we show that crystallite sizes of gas hydrates from some locations in the Indian Ocean, Gulf of Mexico and Black Sea are in the range of 200–400 µm; larger values were obtained for deeper-buried samples from ODP Leg 204. The crystallite sizes show generally a log-normal distribution and appear to vary sometimes rapidly with location.

https://creativecommons.org/publicdomain/zero/1.0/
The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.
I am not affiliated or partnered with Kensho in any way; I just really like the dataset as something my agents can query easily.
Key Features:
- Contains over 5 million rows of data from English Wikipedia and Wikidata
- Stored in a portable SQLite database format for easy integration and querying
- Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base
- Ideal for NLP tasks, machine learning, data analysis, and research projects
The database consists of four main tables:
This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license. Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes. By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.
https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data
Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings
Usage with LIKE queries:

```
import aiosqlite
import asyncio


class KenshoDatasetQuery:
    """Async context manager that runs LIKE searches against the SQLite file."""

    def __init__(self, db_file):
        self.db_file = db_file

    async def __aenter__(self):
        # Open the connection when entering `async with`.
        self.conn = await aiosqlite.connect(self.db_file)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        # Close the connection on exit, even if the body raised.
        await self.conn.close()

    async def search_pages_by_title(self, title):
        # Substring match on the page title, joined to item and text tables.
        query = """
        SELECT pages.page_id, pages.item_id, pages.title, pages.views,
               items.labels AS item_labels, items.description AS item_description,
               link_annotated_text.sections
        FROM pages
        JOIN items ON pages.item_id = items.id
        JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id
        WHERE pages.title LIKE ?
        """
        async with self.conn.execute(query, (f"%{title}%",)) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label_or_description(self, keyword):
        # Substring match on either the label or the description.
        query = """
        SELECT id, labels, description
        FROM items
        WHERE labels LIKE ? OR description LIKE ?
        """
        async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label(self, label):
        # Substring match on the label only.
        query = """
        SELECT id, labels, description
        FROM items
        WHERE labels LIKE ?
        """
        async with self.conn.execute(query, (f"%{label}%",)) as cursor:
            return await cursor.fetchall()

    # async def search_properties_by_label_or_desc...
```
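
To see the parameterized LIKE pattern in isolation, here is a minimal synchronous sketch using only the standard-library `sqlite3` module. It does not touch the real database file: the in-memory tables, their reduced column sets, and the sample rows are hypothetical stand-ins that only borrow column names from the queries above.

```python
import sqlite3

# In-memory stand-in for the portable DB; schema and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE items (id INTEGER PRIMARY KEY, labels TEXT, description TEXT);
    CREATE TABLE pages (page_id INTEGER PRIMARY KEY, item_id INTEGER,
                        title TEXT, views INTEGER);
    INSERT INTO items VALUES (1, 'Ada Lovelace', 'English mathematician');
    INSERT INTO items VALUES (2, 'Lovelace', 'crater on the Moon');
    INSERT INTO pages VALUES (10, 1, 'Ada Lovelace', 5000);
    INSERT INTO pages VALUES (20, 2, 'Lovelace (crater)', 300);
    """
)


def search_pages_by_title(connection, title):
    """Return pages whose title contains `title`, joined to their item row."""
    query = """
        SELECT pages.page_id, pages.title, pages.views, items.description
        FROM pages
        JOIN items ON pages.item_id = items.id
        WHERE pages.title LIKE ?
        ORDER BY pages.views DESC
    """
    # The % wildcards go into the bound parameter, never into the SQL string.
    return connection.execute(query, (f"%{title}%",)).fetchall()


rows = search_pages_by_title(conn, "lovelace")
for row in rows:
    print(row)
```

SQLite's LIKE is case-insensitive for ASCII characters by default, so the lowercase search term still matches. A pattern with a leading `%` generally cannot use an ordinary index, so on a multi-million-row table such searches may be slow.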

MIT License https://opensource.org/licenses/MIT
License information was derived automatically
I found two datasets on Hugging Face for converting text with context to pandas code, but the challenge is in the context: it is structured differently in the two datasets, which hurts the model's results. First let's look at the data I found, and then at examples, the solution, and some other problems.
Rahima411/text-to-pandas:
The data is divided into Train with 57.5k and Test with 19.2k.
The data has two columns as you can see in the example:
| Input | Pandas Query |
|---|---|
| Table Name: head (age (object), head_id (object)) Table Name: management (head_id (object), temporary_acting (object)) What are the distinct ages of the heads who are acting? | result = management['head.age'].unique() |

hiltch/pandas-create-context:

| question | context | answer |
|---|---|---|
| What was the lowest # of total votes? | df = pd.DataFrame(columns=['_number_of_total_votes']) | df['_number_of_total_votes'].min() |
As you can see, the problem is that the two datasets do not share an input format and the context is structured differently. My solution to this problem was:
- Convert the context of the first dataset to match the second. I chose this direction because it is difficult to recover the column data types for the second dataset. It was easy to convert the context from the shape `Table Name: head (age (object), head_id (object))` to `head = pd.DataFrame(columns=['age','head_id'])` with the code that I wrote.
- Then separate the question from the context. This was easy because, if you look at the data, the context always ends with ")" followed by a blank and then the question.
You will find all of this in this code.
- You will also notice that a context can define more than one table, so more than one line can be returned for the context; this has been engineered into the code.
```py
import re


def extract_table_creation(text: str) -> tuple[str, str]:
    """
    Extracts DataFrame creation statements and questions from the given text.

    Args:
        text (str): The input text containing table definitions and questions.

    Returns:
        tuple: A tuple containing a concatenated DataFrame creation string and a question.
    """
    # Define patterns
    table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
    column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

    # Find all table names and column definitions
    matches = re.findall(table_pattern, text)

    # Initialize a list to hold DataFrame creation statements
    df_creations = []
    for table_name, columns_str in matches:
        # Extract column names
        columns = re.findall(column_pattern, columns_str)
        column_names = [col[0] for col in columns]
        # Format DataFrame creation statement
        df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
        df_creations.append(df_creation)

    # Concatenate all DataFrame creation statements, one per line
    df_creation_concat = '\n'.join(df_creations)

    # Extract and clean the question (everything after the last ')')
    question = text[text.rindex(')')+1:].strip()

    return df_creation_concat, question
```
After both datasets were similar in structure, they were merged into one set and divided into _72.8K_ train and _18.6K_ test. We analyzed this dataset and you can see it all through the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we found some problems in the dataset as well, such as
> - `Answer` : `df['Id'].count()` has been repeated, but this is possible, so we do not need to dispense with these rows.
> - `Context` : We see that it contains `147` rows that do not contain any text. We will see Through the experiment if this will affect the results negatively or positively.
> - `Question` : It is ...
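
The merge-and-split step described above can be sketched with the standard library alone. This is not the notebook's code: the row dictionaries are hypothetical miniatures of the two sources after conversion, and the 20% test fraction is an assumption chosen because it roughly matches the reported 72.8K / 18.6K split.

```python
import random

# Hypothetical miniature rows; the real data comes from the two Hugging Face
# datasets named above, after the first one's context has been converted.
converted_first = [
    {"question": "What are the distinct ages of the heads?",
     "context": "head = pd.DataFrame(columns=['age', 'head_id'])",
     "answer": "result = head['age'].unique()"},
]
second = [
    {"question": "What was the lowest # of total votes?",
     "context": "df = pd.DataFrame(columns=['_number_of_total_votes'])",
     "answer": "df['_number_of_total_votes'].min()"},
]


def merge_and_split(*datasets, test_fraction=0.2, seed=42):
    """Concatenate row lists, shuffle reproducibly, and split into train/test."""
    rows = [row for dataset in datasets for row in dataset]
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


# Repeat the sample rows to get ten rows in total for the demonstration.
train, test = merge_and_split(converted_first * 4, second * 6)
print(len(train), len(test))
```

Shuffling before the split keeps both sources represented in train and test; without it, the test set would be drawn entirely from whichever dataset was concatenated first.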

No description is available. Visit https://dataone.org/datasets/ed07bae5e056f183052d39a9c4dd53cf for complete metadata about this dataset.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset (WH_D1_4_meanmonthly.csv) contains mean monthly water table depth data for 211 point locations. The data were originally captured at a higher temporal resolution and were clipped to the temporal window (2015 onwards) of the available Earth Observations in the Sentinel-1 and Sentinel-2 archive. Links to higher-resolution or longer time series of these source data, where these are already in the public domain, have been identified in the data submission in case future data users require more detailed water table datasets.
Information on site co-ordinates, data period, condition class, and other details is provided in the associated metadata file (WH_D1_4_metadata.csv). Further links to 165 additional water table dynamics datasets have been provided for future users, but these were not summarised as monthly means in this data submission in case the source data are updated in future. Please refer to the README file for methodological details and important disclaimers.