Unit 2 - Transfer data between CSV, SQL and DataFrame

 CBSE Revision Notes

Class-11 Informatics Practices (New Syllabus)
Unit 2: Data Handling (DH-1)


Transfer data between CSV, SQL and DataFrame

A DataFrame is a table, much like a table in SQL or a worksheet in Excel. It is similar in structure too, making it possible to use similar operations such as aggregation, filtering, and pivoting. However, because DataFrames are built in Python, it is possible to use Python to program more advanced operations and manipulations than SQL and Excel can offer. As a bonus, the creators of pandas have focused on making the DataFrame operate very quickly, even over large datasets.

DataFrames are particularly useful because powerful methods are built into them. In Python, methods are associated with objects, so your data needs to be in a DataFrame to use these methods. DataFrames can load data from a number of different data structures and files, including lists and dictionaries, CSV files, Excel files, and database records.
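For example, a DataFrame can be created directly from a Python dictionary, where each key becomes a column name and each list becomes that column's values (a minimal sketch with made-up data):

>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [16, 17]})
>>> df
    name  age
0  Alice   16
1    Bob   17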

The Pandas library documentation defines a DataFrame as a “two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)”. In plain terms, think of a DataFrame as a table of data, i.e. a single set of formatted two-dimensional data, with the following characteristics:

  • There can be multiple rows and columns in the data.
  • Each row represents a sample of data.
  • Each column contains a different variable that describes the samples (rows).
  • The data in every column is usually of the same type – e.g. numbers, strings, dates.
  • Unlike an Excel data set, a DataFrame usually has no missing values: there are no gaps or empty cells between rows or columns.

By way of example, the following data sets would fit well in a Pandas DataFrame:

  • In a school system DataFrame, each row could represent a single student in the school, and columns might hold the student's name (string), age (number), date of birth (date), and address (string) – see the sketch after this list.
  • In an economics DataFrame, each row might represent a single city or geographical area, and columns might include the name of the area (string), the population (number), the average age of the population (number), the number of households (number), the number of schools in each area (number), etc.
  • In a shop or e-commerce system DataFrame, each row might represent a customer, with columns for the number of items purchased (number), the date of original registration (date), and the credit card number (string).
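A minimal sketch of the first scenario, using pandas directly (the student names and values here are invented for illustration):

import pandas as pd

# Each row is one student; each column holds one kind of data.
students = pd.DataFrame({
    'name': ['Amit', 'Priya'],                            # string
    'age': [16, 17],                                      # number
    'dob': pd.to_datetime(['2003-05-14', '2002-08-30']),  # date
    'address': ['12 Park Road', '8 Lake View'],           # string
})

print(students)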

pandas.DataFrame.to_sql

DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None)

Write records stored in a DataFrame to a SQL database.

Databases supported by SQLAlchemy are supported. Tables can be newly created, appended to, or overwritten.

Parameters:

name : string
Name of SQL table.

con : sqlalchemy.engine.Engine or sqlite3.Connection
Using SQLAlchemy makes it possible to use any DB supported by that library. Legacy support is provided for sqlite3.Connection objects.

schema : string, optional
Specify the schema (if database flavor supports this). If None, use default schema.

if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
How to behave if the table already exists.

  • fail: Raise a ValueError.
  • replace: Drop the table before inserting new values.
  • append: Insert new values to the existing table.

index : boolean, default True
Write DataFrame index as a column. Uses index_label as the column name in the table.

index_label : string or sequence, default None
Column label for index column(s). If None is given (default) and index is True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.

chunksize : int, optional
Rows will be written in batches of this size at a time. By default, all rows will be written at once.

dtype : dict, optional
Specifying the datatype for columns. The keys should be the column names and the values should be the SQLAlchemy types or strings for the sqlite3 legacy mode.

Raises:

ValueError
When the table already exists and if_exists is ‘fail’ (the default).

Examples

Create an in-memory SQLite database.

>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite://', echo=False)

Create a table from scratch with 3 rows.

>>> df = pd.DataFrame({'name' : ['User 1', 'User 2', 'User 3']})
>>> df
     name
0  User 1
1  User 2
2  User 3
>>> df.to_sql('users', con=engine)
>>> engine.execute("SELECT * FROM users").fetchall()
[(0, 'User 1'), (1, 'User 2'), (2, 'User 3')]
>>> df1 = pd.DataFrame({'name' : ['User 4', 'User 5']})
>>> df1.to_sql('users', con=engine, if_exists='append')
>>> engine.execute("SELECT * FROM users").fetchall()
[(0, 'User 1'), (1, 'User 2'), (2, 'User 3'),
 (0, 'User 4'), (1, 'User 5')]

Overwrite the table with just df1.

>>> df1.to_sql('users', con=engine, if_exists='replace',
... index_label='id')
>>> engine.execute("SELECT * FROM users").fetchall()
[(0, 'User 4'), (1, 'User 5')]

Specify the dtype (especially useful for integers with missing values). Notice that while pandas is forced to store the data as floating point, the database supports nullable integers. When fetching the data with Python, we get back integer scalars.

>>> df = pd.DataFrame({"A": [1, None, 2]})
>>> df
     A
0  1.0
1  NaN
2  2.0
>>> from sqlalchemy.types import Integer
>>> df.to_sql('integers', con=engine, index=False,
... dtype={"A": Integer()})
>>> engine.execute("SELECT * FROM integers").fetchall()
[(1,), (None,), (2,)]
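The transfer also works in the opposite direction: pandas.read_sql can pull a query result back into a DataFrame. Continuing with the engine and the integers table from above:

>>> pd.read_sql('SELECT * FROM integers', con=engine)
     A
0  1.0
1  NaN
2  2.0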

Moving data to SQL, CSV, Pandas etc.

CSV

This uses the standard library csv module:

"""Export to CSV."""
import sys
import csv
from dbfread import DBF

table = DBF('files/people.dbf')
writer = csv.writer(sys.stdout)

writer.writerow(table.field_names)
for record in table:
    writer.writerow(list(record.values()))

The output is:

NAME,BIRTHDATE
Alice,1987-03-01
Bob,1980-11-12
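Pandas can also read and write CSV files directly, which is often simpler once the data is (or needs to be) in a DataFrame. A minimal sketch; the file names here are illustrative:

import pandas as pd

# Read a CSV file straight into a DataFrame.
df = pd.read_csv('people.csv')

# Write the DataFrame back out, without the index column.
df.to_csv('people_copy.csv', index=False)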

Pandas Data Frames

"""
Load content of a DBF file into a Pandas data frame.

The iter() is required because Pandas doesn't detect that the DBF
object is iterable.
"""
from dbfread import DBF
from pandas import DataFrame

dbf = DBF('files/people.dbf')
frame = DataFrame(iter(dbf))

print(frame)

This will print:

    BIRTHDATE   NAME
0  1987-03-01  Alice
1  1980-11-12    Bob

The iter() is required. Without it Pandas will not realize that it can iterate over the table.

Pandas will create a new list internally before converting the records to data frames. This means they will all be loaded into memory. There seems to be no way around this at the moment.
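Once the records are in a DataFrame, the to_sql method described earlier can move them on into a database, completing a DBF-to-DataFrame-to-SQL transfer. A sketch, assuming the same people.dbf file and an in-memory SQLite engine:

from dbfread import DBF
from pandas import DataFrame
from sqlalchemy import create_engine

# Load the DBF records into a DataFrame, then write them to SQLite.
frame = DataFrame(iter(DBF('files/people.dbf')))
engine = create_engine('sqlite://')
frame.to_sql('people', con=engine, index=False)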

dataset (SQL)

The dataset package makes it easy to move data to a modern database. Here’s how you can insert the people table into an SQLite database:

"""
Convert a DBF file to an SQLite table.

Requires dataset: https://dataset.readthedocs.io/
"""
import dataset
from dbfread import DBF

# Change to "dataset.connect('people.sqlite')" if you want a file.
db = dataset.connect('sqlite:///:memory:')
table = db['people']

for record in DBF('files/people.dbf', lowernames=True):
    table.insert(record)

# Select and print a record just to show that it worked.
print(table.find_one(name='Alice'))

(This also creates the schema.)

dbf2sqlite

You can use the included example program dbf2sqlite to insert tables into an SQLite database:

dbf2sqlite -o example.sqlite table1.dbf table2.dbf

This will create one table for each DBF file. You can also omit the -o example.sqlite option to have the SQL printed directly to stdout.

If you get character encoding errors you can pass --encoding to override the encoding, for example:

dbf2sqlite --encoding=latin1 ...