
Why? What? How? #1

Open
nelsonic opened this issue Jan 9, 2017 · 21 comments

@nelsonic
Member

nelsonic commented Jan 9, 2017

@iteles Please add the photos you took of the sketches from today into this description of the issue (or send them to me!) thanks! ⭐️

[Photos of the sketches: img_0695, img_0698, img_0696, img_0697]

Note: @alexeyklyukin has good experience with PostgreSQL and gave a presentation on this problem at PGConf 2014: https://docs.google.com/presentation/d/1TV0bExFwVy-_d6C7A8Z2JL9Z9tvtkuZv3D58fkC3GWQ
It might be worth reaching out to him to understand what the "state of the tools" is in 2017...?

@nelsonic
Member Author

nelsonic commented Jan 9, 2017

@des-des could this be useful:

@des-des
Member

des-des commented Jan 11, 2017

OK, this exists for SQL Server: it finds the difference between two DBs and builds a migration script. This is the kind of thing we are interested in, but for PostgreSQL. https://opendbdiff.codeplex.com/

https://www.npmjs.com/package/dbdiff

@des-des
Member

des-des commented Jan 11, 2017

OK, this is some JS diffing two DBs: https://github.com/gimenete/dbdiff/blob/master/dbdiff.js.

It works for PostgreSQL!
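To make the idea concrete, here is a minimal sketch of what such a diff does under the hood. This is not dbdiff's actual API — it just compares column definitions via `information_schema` using the `pg` client; the connection strings and table name are placeholders:

```js
// Minimal illustration of schema diffing between two Postgres DBs.
const { Client } = require('pg')

// Read { column_name -> data_type } for one table from information_schema.
async function columnsOf (connectionString, table) {
  const client = new Client({ connectionString })
  await client.connect()
  const { rows } = await client.query(
    `SELECT column_name, data_type
       FROM information_schema.columns
      WHERE table_name = $1`, [table])
  await client.end()
  return new Map(rows.map(r => [r.column_name, r.data_type]))
}

// Report columns added, removed, or retyped between the two DBs.
async function diffTable (connA, connB, table) {
  const [a, b] = await Promise.all([columnsOf(connA, table), columnsOf(connB, table)])
  for (const [col, type] of a) {
    if (!b.has(col)) console.log(`- ${col} ${type} (removed)`)
    else if (b.get(col) !== type) console.log(`~ ${col}: ${type} -> ${b.get(col)}`)
  }
  for (const [col, type] of b) {
    if (!a.has(col)) console.log(`+ ${col} ${type} (added)`)
  }
}

diffTable('postgres://localhost/db_old', 'postgres://localhost/db_new', 'users')
```

A real tool (like dbdiff) also has to cover indexes, constraints, sequences, etc., but the shape of the problem is the same.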

@des-des
Member

des-des commented Jan 11, 2017

What are we proposing to build?

The title here is postgres-schema-migration-checker. To me this suggests that we are building a CI tool to check whether a schema migration will work.

I.e. given a migration script, can we apply it to the database without breaking anything? This seems different from my impression of our discussion, but makes more sense.

Given this outlook, I see the possible steps as:

  1. Create a dummy database with the existing schema. Fill it with either anonymised or generated dummy data.
  2. Attempt to apply the migration; if it works, we are okay to apply it to production (see the sketch below).
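A minimal sketch of that dry-run step, assuming the `pg` client, a dummy DB already populated, and a hypothetical migration file. Postgres DDL is transactional, so we can attempt the whole migration and roll it back:

```js
// Dry-run a migration against a dummy DB inside a transaction.
// If any statement fails, everything rolls back and we report the error.
const fs = require('fs')
const { Client } = require('pg')

async function dryRunMigration (connectionString, migrationFile) {
  const client = new Client({ connectionString })
  await client.connect()
  const sql = fs.readFileSync(migrationFile, 'utf8')
  try {
    await client.query('BEGIN')
    await client.query(sql)
    await client.query('ROLLBACK') // leave the dummy DB untouched either way
    console.log('migration applied cleanly; safe to try on production')
  } catch (err) {
    await client.query('ROLLBACK')
    console.error(`migration failed: ${err.message}`)
    process.exitCode = 1
  } finally {
    await client.end()
  }
}

dryRunMigration('postgres://localhost/dummy_db', 'migrations/001-add-column.sql')
```

One caveat: a few statements (e.g. `CREATE INDEX CONCURRENTLY`) cannot run inside a transaction, so those would need special-casing.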

To outline what I think @nelsonic was suggesting (the whole migration process):

  1. Push up a feature branch with the schema change.
  2. Build a new DB from the schema.
  3. Try to transfer anonymised / mock data into the new database.
  4. Run tests on the new, populated DB.
  5. Success.
  6. Merge.
  7. Build a new DB from the schema.
  8. Transfer data from the old DB to the new DB.
  9. Switch production databases.

In this process, I am not sure we can depend on the reliability of transactions that happen during the copy in step 8.
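For step 8, one option — a sketch, assuming writes to the old DB are paused for the duration of the copy, which sidesteps the in-flight-transaction worry — is to pipe a data-only dump into the new DB:

```js
// Sketch of step 8: copy data (not structure) from the old DB into the new one.
// pg_dump --data-only emits only the data; the new DB already has the schema.
const { execSync } = require('child_process')

execSync('pg_dump --data-only old_db | psql new_db', { stdio: 'inherit' })
```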

What problem are we trying to solve?

Oxford Abstracts have many migration scripts.
Each one is run on server startup (I assume this happens whenever new code is deployed).
These migration scripts need to work even if they have been run before, as they get applied many times to a production DB.

As the number of migration scripts grows, this process is becoming hard to manage. The DB schema becomes harder to understand, since working out the current schema means taking the effect of many migration scripts into account.

I think another problem here is testing time. Building a test DB is taking a long time, as all the migration scripts need to be applied.

@Conorc1000 @roryc89 Do you feel my description of your problem is accurate? Is there anything you can add?

@Conorc1000

@des-des I think this is a good description of the problem. (Yes, we currently run our migrations each time we deploy to Heroku.)

@roryc89
Member

roryc89 commented Jan 11, 2017

@des-des I agree that this is an accurate description of our most pressing schema migration issues. Thanks for helping out! :)

@des-des
Member

des-des commented Jan 18, 2017

@nelsonic It seems to me that normally you would apply a migration script to a DB, rather than build a script that moves data between two DBs with different schemas. I think this might be confusing me.

This is also related to my previous question: during the actual deployment step, how can we safely keep the client / server live while the copy is in progress? E.g. if we send a POST to both the live DB and the DB being created, can we be sure that the data will not get added twice?

@iteles Do you have the photos?

@iteles
Member

iteles commented Jan 19, 2017

@des-des So sorry, I remember uploading them and checking the preview to make sure they were there but must have forgotten to hit 'Update comment' after that 😭

I've updated the top comment with them now. Apologies again.

@nelsonic
Member Author

nelsonic commented Jan 26, 2017

Hi @des-des yes, "normally" a migration would be applied to the DB.
What we need to know is whether the migration will "break" any existing data in the database.

More "advanced" or "mature" schema migrators like Active Record will attempt to do this for you.
We need a similar process for our hand-written SQL table definitions.
It could be that there is already a tool out there that looks at two versions of a database
with slightly different schemas, e.g. changing a date field's type from varchar to date,
and then tests whether the existing data will be converted without corruption/loss.
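As a concrete sketch of that varchar-to-date check (table and column names are made up): attempt the type change inside a transaction against a copy of the data and roll back, so any unconvertible value surfaces as an error without touching the table:

```js
// Test whether existing varchar data survives conversion to a date column.
// Postgres DDL is transactional, so we can attempt the ALTER and roll back.
const { Client } = require('pg')

async function checkDateConversion (connectionString) {
  const client = new Client({ connectionString })
  await client.connect()
  try {
    await client.query('BEGIN')
    // USING tells Postgres how to convert each existing value; any value
    // that does not parse as a date makes this statement throw.
    await client.query(
      'ALTER TABLE users ALTER COLUMN signup_date TYPE date USING signup_date::date')
    await client.query('ROLLBACK')
    console.log('all existing values convert cleanly')
  } catch (err) {
    await client.query('ROLLBACK')
    console.error(`conversion would lose/corrupt data: ${err.message}`)
  } finally {
    await client.end()
  }
}

checkDateConversion('postgres://localhost/copy_of_prod')
```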

Our idea was to investigate the option of using an existing schema migration script/tool,
or, if we don't find one, to write one that does exactly what we need.

If we need to clarify the requirements further, I'm happy to do so.
I agree that knowing exactly what we are trying to achieve before we set out on the journey
will save huge amounts of time and frustration,
so thanks for asking questions and please keep them coming! 👍

Also, to be clear: if we can make this project/work reasonably generic,
i.e. it can be used to test whether data in any existing Postgres database
will be affected by a schema migration (a change in the structure of tables),
we will be able to re-use it for any project that has a PostgreSQL DB.
That includes many of our existing projects and several future ones,
and crucially there are tens of thousands of projects where this would be useful,
because Postgres is not "going away" any time soon... dwyl/how-to-choose-a-database#4

@des-des
Member

des-des commented Jan 27, 2017

@nelsonic awesome. That makes sense! Will try to put something together soon!

@des-des
Member

des-des commented Jan 29, 2017

OK, so I see there being three parts here.

1: Managing the application of migration scripts in production.

This is the focus of most of the tools @nelsonic has linked to. The main idea here is to ensure migration scripts are only applied once: the target DB keeps a table of previously applied migrations (a minimal sketch follows below).
@Conorc1000 @roryc89 This provides a partial solution to your problem. Deploying a tool like this would help you, since your migration scripts would not need to guard against being applied twice. You could have confidence that, for any DB, the migrations would be applied only once and in order.
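Something like this, assuming a `schema_migrations` bookkeeping table and migration files on disk (all names hypothetical):

```js
// Track applied migrations so each script runs exactly once, in order.
const fs = require('fs')
const { Client } = require('pg')

async function migrate (connectionString, files) {
  const client = new Client({ connectionString })
  await client.connect()
  await client.query(
    `CREATE TABLE IF NOT EXISTS schema_migrations (
       name text PRIMARY KEY,
       applied_at timestamptz NOT NULL DEFAULT now()
     )`)
  for (const file of files.sort()) { // lexicographic order = migration order
    const { rows } = await client.query(
      'SELECT 1 FROM schema_migrations WHERE name = $1', [file])
    if (rows.length) continue // already applied on this DB: skip
    await client.query('BEGIN')
    await client.query(fs.readFileSync(file, 'utf8'))
    await client.query('INSERT INTO schema_migrations (name) VALUES ($1)', [file])
    await client.query('COMMIT')
    console.log(`applied ${file}`)
  }
  await client.end()
}

migrate('postgres://localhost/mydb', ['001-create-users.sql', '002-add-email.sql'])
```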

2: Automatic creation of migration scripts

Given a pull request with a change of schema, can we automate the creation of a migration script?
Yes, we can use a tool like dbdiff.
Q: What is the UX when this fails?
Q: This tool does not provide a rollback; is this a problem?
Q: How can we ensure that

GetSchemaFromLiveDb(ApplyMigration(dbN-1, createMigrationFromDiff(schemaN-1, schemaN)))

and

GetSchemaFromLiveDb(createDbFromSchema(schemaN))

are the same? (One way to check this is sketched below.)
Q: Where are the migration scripts kept? Does the CI tool commit them to the repo? I think we just run the creation tool locally.
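A rough sketch of that equivalence check, assuming `createdb`, `psql` and `pg_dump` on the PATH, local scratch DBs, and hypothetical schema/migration file names:

```js
// Check that "old schema + generated migration" and "new schema from scratch"
// produce the same structure.
const { execSync } = require('child_process')

const run = (cmd) => execSync(cmd, { encoding: 'utf8' })

// 1. Build one DB from the previous schema, then apply the generated migration.
run('createdb scratch_migrated')
run('psql scratch_migrated -f schema_v1.sql')
run('psql scratch_migrated -f migration_v1_to_v2.sql')

// 2. Build another DB directly from the new schema.
run('createdb scratch_fresh')
run('psql scratch_fresh -f schema_v2.sql')

// 3. Dump only the structure of both and compare.
const migrated = run('pg_dump --schema-only scratch_migrated')
const fresh = run('pg_dump --schema-only scratch_fresh')

console.log(migrated === fresh
  ? 'migration reproduces the target schema'
  : 'schemas diverge -- inspect the dumps')
```

Note that `pg_dump --schema-only` output can differ in ordering even for structurally identical DBs, so a real check would normalise or structurally diff the dumps rather than compare strings.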

3: Testing that a migration does not lose data.

To me, although it may be tangential to the problem at hand, this project needs a way of being confident that migrations are not losing data before it leaves beta.

This is the problem Nelson's drawings are describing.
We can break this problem into two parts:

  1. Create a test DB as a playground to test the migration. We can do this in three ways:
    1. Copy the DB (may not keep client data secure).
    2. Create mock data by inspecting the schema.
    3. Anonymise data from the prod DB to use as mock data.
    Q: Our tool does not want to access the data for security reasons, which means we cannot access the data to anonymise it. Would we need to build two separate tools? One has access to the data and can spit out what we want; the other (the CI tool we are discussing) does not have access. I am not 100% sure of the benefit here.
  2. Run tests on the mock DB:
    1. Build a tool to automate the creation of these tests. Q: Is this the way we want to go? It seems like a hard problem; I could focus my research here.
    2. Run tests built by the user. I.e. we have a staging DB with lots of mock data in it; we have integration tests pulling data from this DB; we can check whether the migration makes them fail.

@nelsonic
Member Author

@des-des yes, this would be a good approach. 👍

@des-des
Member

des-des commented Jan 29, 2017

@nelsonic could you be more specific?

@nelsonic
Member Author

@des-des the steps you have described don't appear logical to me.
The UX when the migration step fails is: exit with a non-zero status code
and terminal output giving the reason the script failed.

Migration scripts would be automatically created by the tool on the developer's machine,
i.e. on localhost, and then preferably included in the commit/PR so that we can version-control them.

As for having access to production data: we can simulate valid records for both schemas based on the data types of the columns.
Then it should be straightforward to attempt to insert a row that is valid for one schema into the same table(s) in the revised schema and check whether it worked. 👍
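A simplified sketch of that idea (hypothetical table name, tiny type map; a fuller version would build the row from the *old* schema's column types and insert it into the migrated table):

```js
// Simulate a valid record from column types and try inserting it,
// inside a transaction so the check has no side effects.
const { Client } = require('pg')

// Very small data_type -> sample-value map; extend for the types you use.
const sampleFor = (dataType) => ({
  'integer': 42,
  'character varying': 'sample text',
  'text': 'sample text',
  'date': '2017-01-01',
  'boolean': true
}[dataType])

async function tryInsertRow (table) {
  const client = new Client() // connection details from PG* env vars
  await client.connect()
  try {
    const { rows } = await client.query(
      `SELECT column_name, data_type
         FROM information_schema.columns
        WHERE table_name = $1`, [table])

    const cols = rows.map(r => r.column_name)
    const vals = rows.map(r => sampleFor(r.data_type))
    const params = cols.map((_, i) => `$${i + 1}`)

    await client.query('BEGIN')
    await client.query(
      `INSERT INTO ${table} (${cols.join(', ')}) VALUES (${params.join(', ')})`,
      vals)
    await client.query('ROLLBACK')
    console.log(`a type-valid row still fits ${table} after the migration`)
  } catch (err) {
    console.error(`insert failed; the migration may break data: ${err.message}`)
  } finally {
    await client.end()
  }
}

tryInsertRow('users')
```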

@des-des
Member

des-des commented Mar 6, 2017

Sorry to have left this so long. FAC1N happened and kind of took over for a couple of weeks, then updating back here got lost in my todo list.

About a month ago I spoke to the OA team; I'll quickly summarise the outcome of that conversation. OA have actually gone ahead and solved part of their problem: their DB now holds a table of run migrations. This means that, for any instance of their DB, they can make sure migrations are run once and in order.

Anyway, I think we need to step back a little and think about this problem in the context of dwyl's new stack.

@iteles
Member

iteles commented Mar 6, 2017

@roryc89 @Conorc1000 @naazy Is this still required for your project?

Postgres is still part of our stack, so this functionality may still be required.

@iteles iteles removed their assignment Mar 6, 2017
@des-des
Member

des-des commented Mar 6, 2017

@iteles

@roryc89 @Conorc1000 @naazy and if the answer is yes, can you be explicit about what is needed?

@roryc89
Member

roryc89 commented Mar 6, 2017

@iteles and @des-des: as our main DB pain point (the running of migrations multiple times) has been addressed, I don't think we require a data migration tool in our project at the moment, although we may need it in the future.

@nelsonic
Member Author

nelsonic commented Mar 7, 2017

@roryc89 thanks for replying and confirming. 👍
If that's the case, and given that we are moving most of our other projects to "PETE", which has really good schema migrations, should we put this project on the "back burner" for now? 💭

@khairnarTK

Is it possible to write a Python script to migrate production DB data to a staging DB in PostgreSQL?

@nelsonic
Member Author

@khairnarTK it's definitely possible; Django has migrations: https://docs.djangoproject.com/en/2.2/topics/migrations/
