Unit 2 - Transfer data CSV SQL DataFrame
CBSE Revision Notes
Class-11 Informatics Practices (New Syllabus)
Unit 2: Data Handling (DH-1)
Transfer data CSV SQL DataFrame
A DataFrame is a table much like in SQL or Excel. It’s similar in structure, too, making it possible to use similar operations such as aggregation, filtering, and pivoting. However, because DataFrames are built in Python, it’s possible to use Python to program more advanced operations and manipulations than SQL and Excel can offer. As a bonus, the creators of pandas have focused on making the DataFrame operate very quickly, even over large datasets.
DataFrames are particularly useful because powerful methods are built into them. In Python, methods are associated with objects, so you need your data to be in the DataFrame to use these methods. DataFrames can load data through a number of different data structures and files, including lists and dictionaries, csv files, excel files, and database records.
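For instance, a small DataFrame can be built directly from a Python dictionary of columns or a list of rows (the column names and values below are just illustrative):

```python
import pandas as pd

# Build a DataFrame from a dictionary: keys become column names.
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [16, 17]})

# Build the same table from a list of rows plus explicit column labels.
df2 = pd.DataFrame([['Alice', 16], ['Bob', 17]], columns=['name', 'age'])

print(df)
```

Both constructions produce the same two-row, two-column table.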
The Pandas library documentation defines a DataFrame as a “two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)”. In plain terms, think of a DataFrame as a table of data, i.e. a single set of formatted two-dimensional data, with the following characteristics:
- There can be multiple rows and columns in the data.
- Each row represents a sample of data.
- Each column contains a different variable that describes the samples (rows).
- The data in every column is usually the same type of data – e.g. numbers, strings, dates.
- Usually, unlike an Excel data set, a DataFrame avoids missing values: there are no gaps or empty cells between rows or columns.
By way of example, the following are data sets that would fit well in a Pandas DataFrame:
- In a school system DataFrame – each row could represent a single student in the school, and columns may represent the student's name (string), age (number), date of birth (date), and address (string).
- In an economics DataFrame, each row may represent a single city or geographical area, and columns might include the name of the area (string), the population (number), the average age of the population (number), the number of households (number), the number of schools in each area (number), etc.
- In a shop or e-commerce system DataFrame, each row in a DataFrame may be used to represent a customer, where there are columns for the number of items purchased (number), the date of original registration (date), and the credit card number (string).
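The school example above might look like this in code (the student names and values are made up purely for illustration):

```python
import pandas as pd

# Each row is one student; each column is one variable about the students.
students = pd.DataFrame({
    'name': ['Amit', 'Neha', 'Ravi'],                     # string
    'age': [16, 17, 16],                                  # number
    'date_of_birth': pd.to_datetime(
        ['2005-04-12', '2004-08-30', '2005-01-05']),      # date
    'address': ['Delhi', 'Mumbai', 'Pune'],               # string
})

# Every column holds a single type of data.
print(students.dtypes)
```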
pandas.DataFrame.to_sql
DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None)
Write records stored in a DataFrame to a SQL database.
Databases supported by SQLAlchemy are supported. Tables can be newly created, appended to, or overwritten.
Parameters:
- name : string – name of the SQL table
- con : sqlalchemy.engine.Engine or sqlite3.Connection
- schema : string, optional
- if_exists : {'fail', 'replace', 'append'}, default 'fail'
- index : boolean, default True – write the DataFrame index as a column
- index_label : string or sequence, default None
- chunksize : int, optional
- dtype : dict, optional

Raises:
- ValueError – when the table already exists and if_exists is 'fail' (the default)
Examples
Create an in-memory SQLite database.
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite://', echo=False)
Create a table from scratch with 3 rows.
>>> df = pd.DataFrame({'name' : ['User 1', 'User 2', 'User 3']})
>>> df
name
0 User 1
1 User 2
2 User 3
>>> df.to_sql('users', con=engine)
>>> engine.execute("SELECT * FROM users").fetchall()
[(0, 'User 1'), (1, 'User 2'), (2, 'User 3')]
>>> df1 = pd.DataFrame({'name' : ['User 4', 'User 5']})
>>> df1.to_sql('users', con=engine, if_exists='append')
>>> engine.execute("SELECT * FROM users").fetchall()
[(0, 'User 1'), (1, 'User 2'), (2, 'User 3'),
(0, 'User 4'), (1, 'User 5')]
Overwrite the table with just df1.
>>> df1.to_sql('users', con=engine, if_exists='replace',
... index_label='id')
>>> engine.execute("SELECT * FROM users").fetchall()
[(0, 'User 4'), (1, 'User 5')]
Specify the dtype (especially useful for integers with missing values). Notice that while pandas is forced to store the data as floating point, the database supports nullable integers. When fetching the data with Python, we get back integer scalars.
>>> df = pd.DataFrame({"A": [1, None, 2]})
>>> df
A
0 1.0
1 NaN
2 2.0
>>> from sqlalchemy.types import Integer
>>> df.to_sql('integers', con=engine, index=False,
... dtype={"A": Integer()})
>>> engine.execute("SELECT * FROM integers").fetchall()
[(1,), (None,), (2,)]
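The transfer also works in the other direction: pandas can read a SQL table back into a DataFrame with read_sql. A minimal sketch, using the same kind of in-memory SQLite engine as above:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database, as in the earlier examples.
engine = create_engine('sqlite://', echo=False)

# Write a small table, then read it back into a new DataFrame.
pd.DataFrame({'name': ['User 1', 'User 2']}).to_sql(
    'users', con=engine, index=False)
df = pd.read_sql('SELECT * FROM users', con=engine)
print(df)
```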
Moving data to SQL, CSV, Pandas etc.
CSV
This uses the standard library csv module:
"""Export to CSV."""
import sys
import csv
from dbfread import DBF
table = DBF('files/people.dbf')
writer = csv.writer(sys.stdout)
writer.writerow(table.field_names)
for record in table:
    writer.writerow(list(record.values()))
The output is:
NAME,BIRTHDATE
Alice,1987-03-01
Bob,1980-11-12
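pandas can make the same round trip without the csv module: to_csv writes a DataFrame out as CSV text and read_csv loads it back. A small sketch, using an in-memory string instead of a file so the data above is reproduced directly:

```python
import io
import pandas as pd

df = pd.DataFrame({'NAME': ['Alice', 'Bob'],
                   'BIRTHDATE': ['1987-03-01', '1980-11-12']})

# to_csv with index=False writes only the header row and the data rows.
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv parses the same text back into an equal DataFrame.
df2 = pd.read_csv(io.StringIO(csv_text))
```

With a real file you would pass a filename such as 'people.csv' instead of the StringIO buffer.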
Pandas Data Frames
"""
Load content of a DBF file into a Pandas data frame.
The iter() is required because Pandas doesn't detect that the DBF
object is iterable.
"""
from dbfread import DBF
from pandas import DataFrame
dbf = DBF('files/people.dbf')
frame = DataFrame(iter(dbf))
print(frame)
This will print:
BIRTHDATE NAME
0 1987-03-01 Alice
1 1980-11-12 Bob
The iter() is required. Without it Pandas will not realize that it can iterate over the table.
Pandas will create a new list internally before converting the records to data frames. This means they will all be loaded into memory. There seems to be no way around this at the moment.
dataset (SQL)
The dataset package makes it easy to move data to a modern database. Here's how you can insert the people table into an SQLite database:
"""
Convert a DBF file to an SQLite table.
Requires dataset: https://dataset.readthedocs.io/
"""
import dataset
from dbfread import DBF
# Change to "dataset.connect('sqlite:///people.sqlite')" if you want a file.
db = dataset.connect('sqlite:///:memory:')
table = db['people']
for record in DBF('files/people.dbf', lowernames=True):
    table.insert(record)
# Select and print a record just to show that it worked.
print(table.find_one(name='Alice'))
(This also creates the schema.)
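If the dataset package is not available, Python's built-in sqlite3 module can do the same insert, though you must create the table schema yourself. A minimal sketch; the table and column names mirror the example above, and the records are written out by hand rather than read from a DBF file:

```python
import sqlite3

# In-memory database; pass a filename instead to persist to disk.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (name TEXT, birthdate TEXT)')

# Insert the same records the DBF file would provide.
records = [('Alice', '1987-03-01'), ('Bob', '1980-11-12')]
conn.executemany('INSERT INTO people VALUES (?, ?)', records)

# Select and print a record just to show that it worked.
row = conn.execute(
    "SELECT * FROM people WHERE name = 'Alice'").fetchone()
print(row)
```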
dbf2sqlite
You can use the included example program dbf2sqlite to insert tables into an SQLite database:
dbf2sqlite -o example.sqlite table1.dbf table2.dbf
This will create one table for each DBF file. You can also omit the -o example.sqlite option to have the SQL printed directly to stdout.
If you get character encoding errors you can pass --encoding to override the encoding, for example:
dbf2sqlite --encoding=latin1 ...